System Design - Distributed Tracing

A user clicks "Place Order" and gets a timeout. You check the API gateway: 504. You check the order service: it called inventory. Inventory called pricing. Pricing was waiting on a 28-second database query caused by a missing index. That investigation took 45 minutes. With distributed tracing, it takes 30 seconds: you search for the trace ID and immediately spot the slow span.

Core concepts

Traces and spans

A trace is the complete record of a request's journey through your system. It has a unique trace ID (128-bit) that travels with the request everywhere.

A span represents a single operation within the trace: an HTTP request, a database query, a cache lookup. Each span has:

A trace ID: shared across all spans in the same request
A span ID: unique to this operation
A parent span ID: the span that triggered this one
A start time and duration
Attributes: metadata like http.method, db.statement, http.status_code
Status: OK, ERROR, or UNSET

Trace: abc-123
├── Span: API Gateway (2ms)
│   └── Span: Auth middleware (1ms)
├── Span: Order Service (350ms)
│   ├── Span: Validate cart (5ms)
│   ├── Span: Inventory check (120ms)
│   │   └── Span: DB query (115ms)
│   └── Span: Process payment (220ms)
│       ├── Span: Stripe API call (200ms)
│       └── Span: Save to DB (18ms)
└── Span: Send confirmation email (50ms)

This tree shows exactly where time was spent. If that DB query under "Inventory check" suddenly jumps from 115ms to 500ms, you spot it instantly.
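To make "where time was spent" concrete, here is a minimal sketch that computes each span's self time (its duration minus the time spent in its direct children) for the example tree above. The `Span` shape, the span IDs, and the `slowestBySelfTime` helper are hypothetical and only for illustration; the sketch also assumes children run sequentially.

```typescript
// Hypothetical span shape for illustration; real SDKs expose more fields.
interface Span {
  spanId: string;
  parentSpanId: string | null; // null marks a top-level span
  name: string;
  durationMs: number;
}

// Returns the span with the largest self time:
// its own duration minus the summed durations of its direct children.
function slowestBySelfTime(spans: Span[]): { name: string; selfMs: number } {
  const childTotal: Record<string, number> = {};
  for (const s of spans) {
    if (s.parentSpanId !== null) {
      childTotal[s.parentSpanId] = (childTotal[s.parentSpanId] || 0) + s.durationMs;
    }
  }
  let worst = { name: "", selfMs: -1 };
  for (const s of spans) {
    const selfMs = Math.max(0, s.durationMs - (childTotal[s.spanId] || 0));
    if (selfMs > worst.selfMs) worst = { name: s.name, selfMs };
  }
  return worst;
}

// The example trace from the tree above, flattened (IDs are made up).
const exampleTrace: Span[] = [
  { spanId: "1", parentSpanId: null, name: "API Gateway", durationMs: 2 },
  { spanId: "2", parentSpanId: "1", name: "Auth middleware", durationMs: 1 },
  { spanId: "3", parentSpanId: null, name: "Order Service", durationMs: 350 },
  { spanId: "4", parentSpanId: "3", name: "Validate cart", durationMs: 5 },
  { spanId: "5", parentSpanId: "3", name: "Inventory check", durationMs: 120 },
  { spanId: "6", parentSpanId: "5", name: "DB query", durationMs: 115 },
  { spanId: "7", parentSpanId: "3", name: "Process payment", durationMs: 220 },
  { spanId: "8", parentSpanId: "7", name: "Stripe API call", durationMs: 200 },
  { spanId: "9", parentSpanId: "7", name: "Save to DB", durationMs: 18 },
  { spanId: "10", parentSpanId: null, name: "Send confirmation email", durationMs: 50 },
];

const slowest = slowestBySelfTime(exampleTrace);
console.log(slowest.name, slowest.selfMs);
```

"Order Service" has the longest total duration, but almost all of it is spent waiting on children. Self time points at the Stripe API call instead, which is the kind of conclusion a waterfall view lets you reach by eye.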

Context propagation

For tracing to work across services, the trace ID must travel with the request via the W3C traceparent HTTP header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  trace ID (128-bit)               span ID (64-bit) flags
             version

When Service A calls Service B, it includes the traceparent header. Service B reads it, creates a child span with the same trace ID, and passes it along downstream. This is automatic with OpenTelemetry's HTTP instrumentation.

If one service in the chain does not propagate the context, the trace breaks and you get two disconnected traces. This is the most common tracing problem in practice.
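The propagation step described above can be sketched in a few lines. This is a minimal illustration of what instrumentation does with the header, assuming the four-field layout of version 00; `parseTraceparent`, `formatTraceparent`, and `childContext` are hypothetical helper names, not OpenTelemetry APIs.

```typescript
interface TraceContext {
  version: string;  // 2 hex chars
  traceId: string;  // 32 lowercase hex chars (128 bits)
  parentId: string; // 16 lowercase hex chars (64 bits): the caller's span ID
  flags: string;    // 2 hex chars; "01" means sampled
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that does not look like a valid header.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  // Version "ff" and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, flags };
}

function formatTraceparent(ctx: TraceContext): string {
  return [ctx.version, ctx.traceId, ctx.parentId, ctx.flags].join("-");
}

// What Service B does before calling downstream: keep the trace ID and flags,
// replace the parent ID with the ID of its own new span.
function childContext(incoming: TraceContext, newSpanId: string): TraceContext {
  return { ...incoming, parentId: newSpanId };
}

const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
const outgoing = incoming === null
  ? null
  : formatTraceparent(childContext(incoming, "b7ad6b7169203331"));
console.log(outgoing);
```

A service that forwards requests without doing the equivalent of `childContext` is the one that breaks the trace: its downstream calls arrive with no header, so the next service starts a new trace ID.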

Sampling strategies

Tracing every request generates enormous data. A service at 10,000 req/s with dozens of spans per trace makes full collection too expensive. Sampling keeps a representative subset.
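A rough back-of-the-envelope calculation shows the scale. Only the 10,000 req/s figure comes from the text; the 30 spans per trace and 500 bytes per span are assumptions chosen for illustration, and real numbers vary widely.

```typescript
const requestsPerSecond = 10000; // from the text
const spansPerTrace = 30;        // assumption: "dozens" taken as 30
const bytesPerSpan = 500;        // assumption: rough size of one stored span
const secondsPerDay = 86400;

const spansPerSecond = requestsPerSecond * spansPerTrace;
const spansPerDay = spansPerSecond * secondsPerDay;
const terabytesPerDay = (spansPerDay * bytesPerSpan) / 1e12;

// Daily volume if only a fraction of traces is kept.
function dailyTerabytes(sampleRatio: number): number {
  return terabytesPerDay * sampleRatio;
}

console.log(spansPerSecond);                 // 300000 spans per second
console.log(spansPerDay);                    // 25920000000 spans per day
console.log(terabytesPerDay.toFixed(2));     // "12.96" TB per day at 100%
console.log(dailyTerabytes(0.1).toFixed(2)); // "1.30" TB per day at 10%
```

Under these assumptions, full collection means roughly 13 TB of span data per day for a single service, which is why the strategies below all trade completeness for cost in some way.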

Strategy	How it works	Pros	Cons
Always-on (100%)	Trace every request	Complete data	Extremely expensive at scale
Head-based	Decide at entry point	Simple, low overhead	Might miss rare errors
Tail-based	Decide after trace is complete	Keeps all errors and slow requests	Complex, requires buffering
Rate-limited	Trace N requests/second	Predictable cost	Over/under-samples with traffic changes
Priority-based	Always trace certain types (e.g., payments)	Important flows always traced	Requires app-level config

Head-based sampling

At the entry point, a random decision is made: trace or skip. If "yes," every downstream service creates spans. If "no," no spans anywhere.

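import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Sample 10% of new traces; for requests that arrive with a trace context,
  // follow the caller's decision so a trace is never half-recorded.
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
});
sdk.start();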

The downside: you decide before seeing the outcome. A request that errors out has a 90% chance of not being traced.
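One way to see how a ratio-based decision can stay consistent across services is to derive it from the trace ID instead of rolling a fresh random number each time. The sketch below is a simplified illustration of that idea, not the OpenTelemetry SDK's exact algorithm; `shouldSample` is a hypothetical name, and it assumes trace IDs are uniformly random hex strings.

```typescript
// Maps the first 32 bits of the trace ID to a bucket and keeps the trace
// if the bucket falls below ratio * 2^32. The same trace ID and ratio
// always produce the same answer, in any service.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  return bucket < ratio * 0x100000000;
}

const traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
console.log(shouldSample(traceId, 0.1)); // false: 0x4bf92f35 is about 30% of the way through the range
console.log(shouldSample(traceId, 0.5)); // true
```

Nothing in this function looks at the request's outcome, which is the limitation described above: a failing request and a healthy one are kept or skipped with the same probability.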

Tail-based sampling

All spans are collected by the OTel Collector. After a trace completes, the collector decides whether to keep it based on rules:

# OTel Collector configuration for tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

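Tail-based sampling gives low storage costs with complete visibility into errors and slow requests. The tradeoff is operational complexity: the collector must buffer every span of a trace until the decision is made (decision_wait, 10s in this configuration), and all spans of a trace have to reach the same collector instance.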
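The three policies in the configuration above boil down to a small decision function. The sketch below mirrors them in plain TypeScript to show the logic only; it is not how the collector is implemented. The `FinishedSpan` shape and `decide` function are hypothetical, and the random source is injectable so the behavior can be checked deterministically.

```typescript
interface FinishedSpan {
  startMs: number;
  endMs: number;
  status: "OK" | "ERROR" | "UNSET";
}

type Decision = "errors" | "slow-requests" | "baseline" | "drop";

// Evaluates a complete trace against the three policies, in order.
// Returns the name of the first policy that keeps the trace, or "drop".
function decide(spans: FinishedSpan[], random: () => number = Math.random): Decision {
  if (spans.length === 0) return "drop";
  // errors: keep if any span failed
  if (spans.some((s) => s.status === "ERROR")) return "errors";
  // slow-requests: keep if the trace took longer than 2000 ms end to end
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  if (end - start > 2000) return "slow-requests";
  // baseline: keep 5% of everything else
  return random() < 0.05 ? "baseline" : "drop";
}

const failed: FinishedSpan[] = [
  { startMs: 0, endMs: 40, status: "OK" },
  { startMs: 5, endMs: 30, status: "ERROR" },
];
const slow: FinishedSpan[] = [
  { startMs: 0, endMs: 2500, status: "OK" },
  { startMs: 100, endMs: 2400, status: "UNSET" },
];
const normal: FinishedSpan[] = [{ startMs: 0, endMs: 120, status: "OK" }];

console.log(decide(failed));            // "errors"
console.log(decide(slow));              // "slow-requests"
console.log(decide(normal, () => 0.5)); // "drop"
```

The function needs every span of the trace before it can answer, which is exactly why the collector has to hold spans in memory for the full decision window.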

When tracing is overkill

Tracing adds latency, complexity, and cost. Skip it when:

Monolith: structured logs with request IDs give you everything. A trace with one span is a log entry with extra steps.
Very low traffic: at 10 req/min, you can log every request in detail.
Batch processing: tracing is for request-response flows, not CSV processing jobs.
Early-stage projects: with 1-2 services, invest in logging and metrics first. Add tracing at 3+ services.

Visualizing traces

Tools like Jaeger, Tempo, Zipkin, and Datadog APM show traces as waterfall diagrams: horizontal bars representing spans, arranged by time. You instantly see which service is the bottleneck, which calls are parallel vs sequential, and where errors occurred.

The key to getting value is not the tool; it is the discipline of giving spans meaningful names and attributes. A trace with "HTTP GET" and "DB Query" is barely useful. Spans named "validate-cart," "check-inventory-warehouse-east," and "charge-stripe-payment-intent" tell a story.

AI pitfall

AI will generate tracing code that creates a span for every function call, resulting in traces with 100+ spans for a simple request. What AI gets wrong: too many spans make the waterfall view unreadable and increase storage costs. Create spans only for operations that cross a boundary (network call, database query) or represent a meaningful business step (validate cart, charge payment). Skip spans for pure in-memory computation.

Good to know

Tail-based sampling is usually the most cost-effective tracing strategy at scale. Instead of deciding whether to trace a request at the start (head-based), you collect everything and keep only the interesting traces at the end: errors, slow responses, and a random sample of normal requests. Error traces that complete within the decision window are always kept, while storage costs stay under control.

Edge case

Trace context propagation often breaks when a request crosses a non-HTTP boundary. If your service publishes a message to a queue and a consumer picks it up later, the trace context must be injected into the message (typically as a message header) and extracted by the consumer, and this often has to be done manually. AI-generated code almost never handles this; it only propagates context through HTTP headers.
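A minimal sketch of the inject and extract steps, using an in-memory array as a stand-in for a real broker. The `QueueMessage` shape, `publish`, and `consume` are hypothetical names for illustration; in a real service the OpenTelemetry API's `propagation.inject` and `propagation.extract` play this role against the message's header map.

```typescript
interface QueueMessage {
  body: string;
  headers: Record<string, string>;
}

interface ConsumedMessage {
  body: string;
  traceId: string | null;      // null means no context arrived: a new trace starts here
  parentSpanId: string | null; // the producer's span ID
}

// In-memory stand-in for a message broker (assumption for this sketch).
const queue: QueueMessage[] = [];

// Producer side: inject the current trace context into the message headers.
function publish(body: string, traceparent: string): void {
  queue.push({ body, headers: { traceparent } });
}

// Consumer side: extract the context so the processing span joins the same trace.
function consume(): ConsumedMessage | undefined {
  const msg = queue.shift();
  if (msg === undefined) return undefined;
  const parts = (msg.headers["traceparent"] || "").split("-");
  if (parts.length !== 4) return { body: msg.body, traceId: null, parentSpanId: null };
  return { body: msg.body, traceId: parts[1], parentSpanId: parts[2] };
}

publish("order-created:42", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
const received = consume();
console.log(received);
```

The consumer's span uses the extracted trace ID and treats the producer's span as its parent, so the waterfall shows the queue hop as part of one trace even if the message sat in the queue for minutes.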
