A user clicks "Place Order" and gets a timeout. You check the API gateway: 504. You check the order service: it called inventory. Inventory called pricing. Pricing was waiting on a 28-second database query caused by a missing index. That investigation took 45 minutes. With distributed tracing, it takes 30 seconds: you search for the trace ID and immediately spot the slow span.
Core concepts
Traces and spans
A trace is the complete record of a request's journey through your system. It has a unique trace ID (128-bit) that travels with the request everywhere.
A span represents a single operation within the trace: an HTTP request, a database query, a cache lookup. Each span has:
- A trace ID: shared across all spans in the same request
- A span ID: unique to this operation
- A parent span ID: the span that triggered this one
- A start time and duration
- Attributes: metadata like http.method, db.statement, http.status_code
- Status: OK, ERROR, or UNSET
Trace: abc-123
├── Span: API Gateway (2ms)
│   └── Span: Auth middleware (1ms)
├── Span: Order Service (350ms)
│   ├── Span: Validate cart (5ms)
│   ├── Span: Inventory check (120ms)
│   │   └── Span: DB query (115ms)
│   └── Span: Process payment (220ms)
│       ├── Span: Stripe API call (200ms)
│       └── Span: Save to DB (18ms)
└── Span: Send confirmation email (50ms)
This tree shows exactly where time was spent. If that DB query under "Inventory check" suddenly jumps from 115ms to 500ms, you spot it instantly.
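To make the span fields concrete, here is a minimal TypeScript sketch that models the tree above as flat records and finds the span with the most "self time" (its duration minus the time spent in its direct children). The `SpanRecord` shape and the helpers `selfTimeMs`, `slowestBySelfTime`, and `makeSpan` are hypothetical names for illustration, not part of any tracing SDK, and the span IDs are invented. The sketch also assumes child spans run sequentially and never overlap.

```typescript
// Hypothetical span shape mirroring the fields listed above (not an SDK type).
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId: string | null; // null for spans directly under the trace root
  name: string;
  durationMs: number;
  status: "OK" | "ERROR" | "UNSET";
}

// Self time: a span's duration minus the time spent in its direct children.
// Assumes children run one after another and never overlap.
function selfTimeMs(span: SpanRecord, all: SpanRecord[]): number {
  let childTotal = 0;
  for (const other of all) {
    if (other.parentSpanId === span.spanId) {
      childTotal += other.durationMs;
    }
  }
  return Math.max(0, span.durationMs - childTotal);
}

// Returns the span that spent the most time doing its own work.
function slowestBySelfTime(all: SpanRecord[]): SpanRecord | null {
  let best: SpanRecord | null = null;
  let bestSelfMs = -1;
  for (const span of all) {
    const ownMs = selfTimeMs(span, all);
    if (ownMs > bestSelfMs) {
      best = span;
      bestSelfMs = ownMs;
    }
  }
  return best;
}

// Small helper so the sample data below stays readable.
function makeSpan(spanId: string, parentSpanId: string | null, name: string, durationMs: number): SpanRecord {
  return { traceId: "abc-123", spanId, parentSpanId, name, durationMs, status: "OK" };
}

// The example trace from the tree above; span IDs are made up.
const exampleTrace: SpanRecord[] = [
  makeSpan("s1", null, "API Gateway", 2),
  makeSpan("s2", "s1", "Auth middleware", 1),
  makeSpan("s3", null, "Order Service", 350),
  makeSpan("s4", "s3", "Validate cart", 5),
  makeSpan("s5", "s3", "Inventory check", 120),
  makeSpan("s6", "s5", "DB query", 115),
  makeSpan("s7", "s3", "Process payment", 220),
  makeSpan("s8", "s7", "Stripe API call", 200),
  makeSpan("s9", "s7", "Save to DB", 18),
  makeSpan("s10", null, "Send confirmation email", 50),
];

const bottleneck = slowestBySelfTime(exampleTrace);
// bottleneck is the "Stripe API call" span: 200ms with no children.
```

Ranking by self time rather than raw duration keeps parent spans such as "Order Service" from always looking like the culprit: most of its 350ms is spent waiting on its children.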
Context propagation
For tracing to work across services, the trace ID must travel with the request via the W3C traceparent HTTP header:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  trace ID (128-bit)               span ID (64-bit) flags
             version
When Service A calls Service B, it includes the traceparent header. Service B reads it, creates a child span with the same trace ID, and passes it along downstream. This is automatic with OpenTelemetry's HTTP instrumentation.
If one service in the chain does not propagate the context, the trace breaks: you get two disconnected traces instead of one. This is the most common tracing problem in practice.
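The propagation step can be sketched in a few lines of TypeScript. This is only an illustration of what an SDK propagator does for you, not a replacement for one: `TraceContext`, `parseTraceparent`, and `childTraceparent` are hypothetical names, the sketch accepts only the version `00` layout, and the new span ID is hard-coded where a real service would generate a random one. Real headers carry a 32-character hex trace ID and a 16-character hex span ID; the sample value is the one used in the W3C Trace Context specification.

```typescript
// Hypothetical helper types; real services should rely on their tracing
// SDK's propagator rather than hand-rolled parsing.
interface TraceContext {
  traceId: string;  // 32 lowercase hex characters (128 bits)
  spanId: string;   // 16 lowercase hex characters (64 bits)
  sampled: boolean; // lowest bit of the flags byte
}

// version - trace ID - span ID - flags, all lowercase hex.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parses a traceparent header; returns null if it is malformed.
// Simplification: only version "00" is accepted.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null || m[1] !== "00") return null;
  const traceId = m[2];
  const spanId = m[3];
  // All-zero IDs are invalid per the W3C Trace Context specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(m[4], 16) & 1) === 1 };
}

// Builds the header a service sends downstream: same trace ID,
// the service's own span ID, and the sampling flag carried over.
function childTraceparent(incoming: TraceContext, newSpanId: string): string {
  const flags = incoming.sampled ? "01" : "00";
  return "00-" + incoming.traceId + "-" + newSpanId + "-" + flags;
}

// Service B receives a header from Service A...
const incomingCtx = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
// ...and calls Service C with its own span ID (hard-coded here for illustration).
const outgoingHeader = incomingCtx === null ? null : childTraceparent(incomingCtx, "b7ad6b7169203331");
// outgoingHeader: "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01"
```

A service that drops the header, or fails to parse it and silently starts a fresh trace, is exactly where the "two disconnected traces" symptom comes from.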
Sampling strategies
Tracing every request generates an enormous amount of data. A service handling 10,000 req/s with dozens of spans per trace produces hundreds of thousands of spans per second, which makes full collection too expensive. Sampling keeps a representative subset.
| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Always-on (100%) | Trace every request | Complete data | Extremely expensive at scale |
| Head-based | Decide at entry point | Simple, low overhead | Might miss rare errors |
| Tail-based | Decide after trace is complete | Keeps all errors and slow requests | Complex, requires buffering |
| Rate-limited | Trace N requests/second | Predictable cost | Over/under-samples with traffic changes |
| Priority-based | Always trace certain types (e.g., payments) | Important flows always traced | Requires app-level config |
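As an illustration of the rate-limited row, here is a minimal sketch of a sampler that admits at most N traces per one-second window. `RateLimitedSampler` is a hypothetical class, not an OpenTelemetry API, and a fixed window is only one way to implement the idea (a token bucket would smooth out bursts). The clock is injectable so the behavior can be checked deterministically.

```typescript
// Hypothetical rate-limited sampler: allows at most `maxPerSecond` sampled
// traces in each one-second window. Not an OpenTelemetry API.
class RateLimitedSampler {
  private readonly maxPerSecond: number;
  private readonly nowMs: () => number;
  private windowStartMs: number = Number.NEGATIVE_INFINITY;
  private sampledInWindow: number = 0;

  constructor(maxPerSecond: number, nowMs: () => number = () => Date.now()) {
    this.maxPerSecond = maxPerSecond;
    this.nowMs = nowMs;
  }

  shouldSample(): boolean {
    const now = this.nowMs();
    // Start a fresh window once a full second has passed.
    if (now - this.windowStartMs >= 1000) {
      this.windowStartMs = now;
      this.sampledInWindow = 0;
    }
    if (this.sampledInWindow < this.maxPerSecond) {
      this.sampledInWindow++;
      return true;
    }
    return false;
  }
}

// Usage with a fake clock: two traces per second are allowed.
let fakeNowMs = 0;
const limiter = new RateLimitedSampler(2, () => fakeNowMs);
const firstSecond = [limiter.shouldSample(), limiter.shouldSample(), limiter.shouldSample()];
// firstSecond: [true, true, false]
fakeNowMs = 1000;
const nextSecond = limiter.shouldSample();
// nextSecond: true, because a new window has started
```

The cost is predictable because the cap is absolute, but the sampled fraction shrinks as traffic grows, which is the over/under-sampling drawback listed in the table.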
Head-based sampling
At the entry point, a random decision is made: trace or skip. If "yes," every downstream service creates spans. If "no," no spans anywhere.
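import { NodeSDK } from '@opentelemetry/sdk-node';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';
// Configure the SDK to keep roughly 10% of traces, decided at the entry point
const sdk = new NodeSDK({
  sampler: new TraceIdRatioBasedSampler(0.1), // Trace 10% of requests
});
sdk.start();
The downside: you decide before seeing the outcome. With a 10% ratio, a request that errors out has a 90% chance of not being traced.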
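One way to see why every service can reach the same trace-or-skip verdict is to derive the decision from the trace ID itself. The sketch below is a simplified illustration of that idea; it is not the exact algorithm used by OpenTelemetry's `TraceIdRatioBasedSampler`, and `shouldSampleTrace` is a hypothetical name. It assumes trace IDs are random hex strings, so their last 8 characters are roughly uniformly distributed.

```typescript
// Simplified ratio-based head sampling keyed on the trace ID.
// NOT the exact OpenTelemetry algorithm; an illustration of the principle.
function shouldSampleTrace(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex characters as a number in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace if it falls in the lowest `ratio` share of the range.
  // A malformed ID yields NaN, which compares false and is not sampled.
  return bucket < ratio * 0x100000000;
}

// The same trace ID yields the same answer in every service,
// because the decision depends only on the ID and the ratio.
const sampledAtGateway = shouldSampleTrace("4bf92f3577b34da6a3ce929d0e0e4736", 0.1);
const sampledDownstream = shouldSampleTrace("4bf92f3577b34da6a3ce929d0e0e4736", 0.1);
// Both are true: 0x0e0e4736 is below 10% of 2^32.

// A made-up ID at the top of the range is skipped at the same ratio.
const skippedEverywhere = shouldSampleTrace("aaaaaaaaaaaaaaaaaaaaaaaaffffffff", 0.1);
// skippedEverywhere: false
```

Nothing about the request's outcome feeds into this function, which is why head-based sampling cannot favor traces that later turn out to contain errors.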
Tail-based sampling
All spans are collected by the OTel Collector. After a trace completes, the collector decides whether to keep it based on rules:
# OTel Collector configuration for tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
Tail-based sampling gives low storage costs with complete visibility into errors and slow requests. The tradeoff is operational complexity: the collector must buffer every span of a trace until the decision_wait window (10s here) elapses before it can decide.
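The three policies can be read as a single keep-or-drop function evaluated once a trace is complete. The TypeScript below is a sketch of that logic under stated assumptions, not the Collector's implementation: `FinishedSpan`, `TailPolicyOptions`, and `keepTrace` are hypothetical names, trace latency is taken as the gap between the earliest start and the latest end, and the random source is injected so the example is deterministic.

```typescript
// Hypothetical shapes for illustration; not Collector or SDK types.
interface FinishedSpan {
  status: "OK" | "ERROR" | "UNSET";
  startMs: number;
  endMs: number;
}

interface TailPolicyOptions {
  latencyThresholdMs: number; // corresponds to threshold_ms in the config
  baselineRatio: number;      // corresponds to sampling_percentage / 100
  random: () => number;       // injectable so tests are deterministic
}

// Decides whether to keep a completed trace. A trace is kept as soon as
// any one policy matches.
function keepTrace(spans: FinishedSpan[], opts: TailPolicyOptions): boolean {
  if (spans.length === 0) return false;

  // Policy "errors": keep any trace containing a span with ERROR status.
  if (spans.some((s) => s.status === "ERROR")) return true;

  // Policy "slow-requests": keep traces whose end-to-end time exceeds the threshold.
  let earliestStart = spans[0].startMs;
  let latestEnd = spans[0].endMs;
  for (const s of spans) {
    earliestStart = Math.min(earliestStart, s.startMs);
    latestEnd = Math.max(latestEnd, s.endMs);
  }
  if (latestEnd - earliestStart > opts.latencyThresholdMs) return true;

  // Policy "baseline": keep a small random share of everything else.
  return opts.random() < opts.baselineRatio;
}

// A fixed "random" draw of 0.99 means the baseline policy never fires here.
const tailOpts: TailPolicyOptions = { latencyThresholdMs: 2000, baselineRatio: 0.05, random: () => 0.99 };

const keptFailed = keepTrace([{ status: "ERROR", startMs: 0, endMs: 40 }], tailOpts);
// keptFailed: true (error policy)
const keptSlow = keepTrace([{ status: "OK", startMs: 0, endMs: 2500 }], tailOpts);
// keptSlow: true (latency policy)
const keptOrdinary = keepTrace([{ status: "OK", startMs: 0, endMs: 120 }], tailOpts);
// keptOrdinary: false (no policy matched)
```

Because the function needs every span of a trace before it can answer, all of those spans have to be held somewhere until the trace is considered complete, which is the buffering cost described above.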
When tracing is overkill
Tracing adds latency, complexity, and cost. Skip it when:
- Monolith: structured logs with request IDs give you everything you need. A trace with one span is a log entry with extra steps.
- Very low traffic: at 10 req/min, you can log every request in detail.
- Batch processing: tracing is for request-response flows, not CSV processing jobs.
- Early-stage projects: with 1-2 services, invest in logging and metrics first. Add tracing at 3+ services.
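For the monolith case, the "structured logs with request IDs" alternative can be as small as the sketch below. `createRequestLogger` and `LogFields` are hypothetical names, the request ID `req-42` is made up, and the sink is injected so the output can be captured; a real application would more likely use an established logging library and write to stdout.

```typescript
// Hypothetical request-scoped structured logger; names are illustrative only.
type LogFields = Record<string, string | number | boolean>;

// Returns a log function that stamps every entry with the same request ID
// and emits one JSON object per line.
function createRequestLogger(requestId: string, sink: (line: string) => void) {
  return function log(message: string, fields: LogFields = {}): void {
    const entry = { requestId, message, ...fields };
    sink(JSON.stringify(entry));
  };
}

// Capture the lines in memory for demonstration.
const capturedLines: string[] = [];
const logLine = createRequestLogger("req-42", (line) => {
  capturedLines.push(line);
});

logLine("cart validated", { items: 3 });
logLine("payment charged", { provider: "stripe", durationMs: 200 });
// capturedLines[0]: {"requestId":"req-42","message":"cart validated","items":3}
```

Filtering the log stream by `requestId` then reconstructs the request's path through the process, which covers most of what a single-service trace would show.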
Visualizing traces
Tools like Jaeger, Tempo, Zipkin, and Datadog APM show traces as waterfall diagrams: horizontal bars representing spans, arranged by time. You instantly see which service is the bottleneck, which calls are parallel vs sequential, and where errors occurred.
The key to getting value is not the tool; it is the discipline of adding meaningful names and attributes to spans. A trace with "HTTP GET" and "DB Query" is barely useful. Spans named "validate-cart," "check-inventory-warehouse-east," and "charge-stripe-payment-intent" tell a story.