Tracing Distributed Apps on OCI

A simple application is easy to reason about because it does its work in one place. A distributed application is not, because a single request entering the front of the system fans out across many services, each calling others, before a response is assembled and returned. When such a request is slow or fails, the cause is somewhere in that web of calls, but the web is invisible. Each service knows only about its own part, no single service sees the whole journey, and so a problem that spans services is exactly the kind that is hardest to find. Distributed tracing exists to make the journey visible. It follows a single request across every service it touches, recording the path and the time spent at each step, so that the one slow or failing link stands out from the rest.

The problem tracing solves

Consider what happens without tracing when a request through a distributed system is slow. You know the overall request took too long, but you do not know where the time went. You could log into each service in turn and look at its metrics, but each service sees only its own work, and lining up its view with the others by hand, matching timestamps and guessing which log line corresponds to which request, is slow and error-prone. Worse, the services may all look individually healthy, because the problem is not that any one of them is broken but that one of them is slow only for this kind of request. The information needed to diagnose this does not exist in any single place. Tracing creates it by stitching together the view from every service into one coherent picture of the request.

No single service sees the whole journey. Tracing is what stitches the views together.

How a trace follows a request

The mechanism behind tracing is simple in concept. When a request enters the system, it is given a unique identifier. As the request moves from one service to the next, that identifier travels with it, passed along in the calls between services. Each service, as it does its part of the work, records what it did and how long it took, tagged with the request's identifier. Afterward, all these separate records, which share the same identifier, are gathered and assembled into a single trace, a complete timeline of the request's journey across every service. The key idea is the propagation of the identifier across service boundaries, because that shared thread is what lets records produced in different places be recognised as belonging to the same request. Without it, the records are just disconnected fragments.

Element	What it is	Why it matters
Trace identifier	A unique id given to each request	The thread that links records across services
Span	One unit of work in one service	Shows time spent at each step
Propagation	Passing the id along between services	Lets records be reassembled into one trace
Context	Extra data carried with the request	Adds meaning, such as which customer
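The mechanism can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular tracing product works: the header names, the three service functions, and the in-memory RECORDS list (standing in for a trace collector) are all hypothetical.

```python
import time
import uuid
from collections import defaultdict

# Hypothetical header names, chosen for this sketch only.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"

# Stand-in for a trace collector: every service appends its span records here.
RECORDS = []


def handle(service, headers, downstream=()):
    """Do one service's share of the work and record it as a span."""
    # Reuse the incoming identifier, or mint one if this is the entry point.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]
    started = time.perf_counter()

    # Propagation: every outgoing call carries the same trace identifier.
    outgoing = {TRACE_HEADER: trace_id, PARENT_HEADER: span_id}
    for call in downstream:
        call(outgoing)

    RECORDS.append({
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_id": headers.get(PARENT_HEADER),
        "service": service,
        "duration_s": time.perf_counter() - started,
    })
    return trace_id


def database(headers):
    return handle("database", headers)


def orders(headers):
    return handle("orders", headers, downstream=[database])


def frontend(headers):
    return handle("frontend", headers, downstream=[orders])


def assemble(records):
    """Group span records by trace identifier."""
    traces = defaultdict(list)
    for record in records:
        traces[record["trace_id"]].append(record)
    return dict(traces)


trace_id = frontend({})  # a request enters with no identifier yet
trace = assemble(RECORDS)[trace_id]
print(sorted(span["service"] for span in trace))
# prints ['database', 'frontend', 'orders']
```

The three records are written by three different functions, yet they can be reassembled because each carries the identifier it received. If any function dropped the headers instead of forwarding them, the next one would mint a fresh identifier and the trace would split in two.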

Reading a trace

The payoff of tracing is the assembled trace itself, which is usually shown as a timeline of the request broken into its constituent spans, each span being one unit of work in one service. Reading this timeline, you see the whole request laid out, with the duration of each span visible, and the slow part is immediately obvious because its span dominates the picture. A request that takes four seconds might reveal a single span, perhaps one database call deep inside one service, that accounts for three and a half of those seconds, and now you know exactly where to look. This is a transformation of the diagnostic task, from guessing across many services to reading a single picture, and it is why tracing is indispensable in any system of real complexity. The relationship between traces and the broader practice of watching applications is set out in application performance monitoring.
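What a trace viewer makes obvious to the eye can also be computed. The sketch below, in Python, uses the four-second request from the example above and finds the span with the most time of its own, meaning its duration minus the time spent in its direct children. The Span fields, the span names, and the timings are invented for illustration, and the subtraction assumes each span's children run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Time each span spent on its own work, excluding its direct children."""
    own = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id in own:
            own[s.parent_id] -= s.duration_ms
    return own


# Hypothetical trace: one request, three nested spans.
trace = [
    Span("a", None, "frontend: GET /checkout", 0, 4000),
    Span("b", "a", "orders: create order", 200, 3900),
    Span("c", "b", "database: insert order", 300, 3800),
]

own = self_times(trace)
slowest = max(trace, key=lambda s: own[s.span_id])
print(f"{slowest.name}: {own[slowest.span_id]} ms of {trace[0].duration_ms} ms")
# prints database: insert order: 3500 ms of 4000 ms
```

Ranking by own time rather than total duration matters here: the frontend span lasts the full four seconds, but almost all of that is spent waiting on the spans beneath it.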

A framework for adopting tracing

Tracing delivers value fastest when introduced in a sensible order rather than all at once.

1. Instrument the entry points. Start tracing where requests enter the system, so every request gets an identifier from the start.
2. Propagate across boundaries. Ensure the identifier is passed along every call between services, because a break in propagation means a broken trace.
3. Cover the slow paths first. Add detailed spans to the services and operations most likely to be where time goes, rather than instrumenting everything at once.
4. Add meaningful context. Tag traces with the data that makes them useful, so you can find the traces for a particular customer or operation.
5. Sample sensibly. Tracing every request can be heavy, so decide how much to capture, keeping enough to be representative and to catch the rare slow case.

This order gets useful traces flowing quickly and builds toward full coverage, while keeping the overhead of tracing under control through sensible sampling.
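Of these steps, broken propagation is the easiest to miss, because nothing fails: the request simply shows up as two unrelated traces. One rough check, sketched here in Python, is to look for traces whose root span sits in a service that requests should never enter through. It assumes a service that receives no identifier mints a new one, which depends on how the instrumentation is configured. The ENTRY_SERVICES set and the span records are hypothetical.

```python
# Hypothetical: the services through which requests are expected to enter.
ENTRY_SERVICES = {"frontend"}


def suspect_traces(spans):
    """Return ids of traces whose root span is not in an entry service."""
    suspects = set()
    for span in spans:
        is_root = span["parent_id"] is None
        if is_root and span["service"] not in ENTRY_SERVICES:
            suspects.add(span["trace_id"])
    return sorted(suspects)


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "frontend"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "orders"},
    # The orders service did not forward the identifier, so the database
    # minted a new one and its span became the root of a separate trace.
    {"trace_id": "t2", "span_id": "c", "parent_id": None, "service": "database"},
]

print(suspect_traces(spans))
# prints ['t2']
```

A trace flagged this way points at the caller of the rooted service as the place where the identifier was lost.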

The cost of tracing and how to manage it

Tracing is not free. Recording and transmitting data for every request across every service adds work, and capturing everything can become a meaningful overhead in a high-volume system. The usual answer is sampling: capturing a representative fraction of requests in full rather than all of them, which keeps the cost down while still giving a faithful picture. The subtlety is making sure the sampling does not throw away the interesting cases, the slow requests and the errors, which is why good sampling keeps those even when discarding ordinary fast requests. Managing this tradeoff well is part of keeping observability proportionate, a theme that runs through the wider monitoring and observability practice and connects to how data is collected and kept in centralised logging.
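A sampling rule of this kind might look like the following Python sketch. The threshold and the kept fraction are hypothetical values chosen for illustration. Because duration and error status are known only once a request has finished, the decision is made after the fact, an approach often called tail-based sampling.

```python
import hashlib

SLOW_MS = 1000        # hypothetical threshold for a "slow" request
KEEP_FRACTION = 0.10  # hypothetical share of ordinary requests to keep


def keep_trace(trace_id, duration_ms, had_error):
    """Decide, once a request has finished, whether to keep its trace."""
    # Always keep the interesting cases.
    if had_error or duration_ms >= SLOW_MS:
        return True
    # For the rest, hash the identifier into [0, 1) so that the decision
    # is stable: the same trace id always gets the same answer.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < KEEP_FRACTION


kept = sum(keep_trace(f"trace-{i}", 50, False) for i in range(10_000))
print(f"kept {kept} of 10000 fast, healthy traces")
print(keep_trace("trace-1", 3500, False))  # slow request: always kept
print(keep_trace("trace-1", 50, True))     # failed request: always kept
```

Hashing the identifier rather than drawing a random number means every component that applies the rule reaches the same verdict for a given trace, so a sampled trace is kept whole rather than in fragments.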

Seeing the system whole

The reason distributed tracing matters is that a distributed system cannot be understood from the inside of any one of its parts. Each service sees only its own work, and the behaviour that determines a user's experience lives in the spaces between services, in the path a request takes and the time it spends at each step. Tracing is what makes that path visible, turning a diffuse problem that spans many services into a specific slow span you can point to. It is the observability layer that reveals the structure of a distributed system and where its time and errors actually accumulate, and it belongs in the toolkit of anyone running such a system seriously. It is a core part of the practice described in the complete monitoring and observability guide. When your services are many and the slowness is hiding among them, our OCI monitoring and observability practice sets up tracing so the slow link shows itself.

About the author

Morten Andersen, Co-founder of OCI Specialists. 20 years of enterprise IT experience in OCI migration, security, networking, and 24/7 operations.
