When a user clicks a button and something is slow, where do you look? In a monolith, you check one app's logs and metrics. In a distributed system, where that click might trigger your API, which calls an auth service, a database, a cache, and a third-party payment provider, finding the bottleneck is genuinely hard without the right tools.
Distributed tracing solves this by giving every request a unique ID (a trace ID) and recording every step of that request's journey as a timeline. Think of it like a detailed travel itinerary: instead of knowing only that a trip took 8 hours, you know you spent 30 minutes at the airport, 5 hours flying, 2 hours in customs, and 30 minutes getting a taxi.
Traces and spans
The mental model
A trace is the complete picture of one request, from the moment it enters your system to the moment a response goes back to the user. A span is one unit of work within that trace.
Trace: POST /checkout (450ms total)
├── Span: Auth middleware (12ms)
├── Span: Validate cart (8ms)
├── Span: Charge payment (380ms) ← bottleneck!
│   └── Span: Stripe API call (375ms)
└── Span: Save order to DB (15ms)
Every span records a start time, duration, status (success or error), and any attributes you add (like the SQL query that ran or the HTTP URL that was called).
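To make the trace tree above concrete, here is a minimal sketch that models the checkout trace as nested objects and finds the span with the most "self time" (its duration minus the time spent in its direct children). The `SpanNode` shape and both function names are hypothetical, invented for this illustration; real OpenTelemetry spans carry more fields, and the subtraction assumes child spans run one after another rather than in parallel.

```typescript
// Hypothetical, simplified span shape for illustration only.
interface SpanNode {
  name: string;
  durationMs: number;
  children: SpanNode[];
}

// Self time: the span's duration minus the time spent in its direct children.
// Assumes children run sequentially (no overlap).
function selfTimeMs(span: SpanNode): number {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  return Math.max(0, span.durationMs - childTotal);
}

// Walk the whole tree and return the span with the largest self time.
function findBottleneck(root: SpanNode): SpanNode {
  let worst = root;
  const stack: SpanNode[] = [root];
  while (stack.length > 0) {
    const span = stack.pop()!;
    if (selfTimeMs(span) > selfTimeMs(worst)) worst = span;
    stack.push(...span.children);
  }
  return worst;
}

// The POST /checkout trace from the diagram above.
const checkoutTrace: SpanNode = {
  name: 'POST /checkout',
  durationMs: 450,
  children: [
    { name: 'Auth middleware', durationMs: 12, children: [] },
    { name: 'Validate cart', durationMs: 8, children: [] },
    {
      name: 'Charge payment',
      durationMs: 380,
      children: [{ name: 'Stripe API call', durationMs: 375, children: [] }],
    },
    { name: 'Save order to DB', durationMs: 15, children: [] },
  ],
};

const bottleneck = findBottleneck(checkoutTrace);
console.log(`${bottleneck.name}: ${selfTimeMs(bottleneck)}ms self time`);
// prints "Stripe API call: 375ms self time"
```

Looking at self time rather than raw duration is what points at the Stripe call instead of its parent: "Charge payment" takes 380ms in total but only 5ms of that is its own work.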
OpenTelemetry setup
Why OpenTelemetry
OpenTelemetry (OTel) is the vendor-neutral standard for distributed tracing, metrics, and logs. Instead of committing to Datadog's or Honeycomb's proprietary SDK, you instrument your code once with OTel and can switch backends without changing application code.
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http
Initializing the SDK
Set up the OTel SDK before your application starts. This matters because the SDK needs to patch Node.js modules before your application imports them:
// tracing.ts - import this first, before everything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'checkout-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION,
  }),
  // With no explicit `url`, the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT
  // and appends /v1/traces to it. An explicit `url` is used as-is, so it
  // would have to be the full traces endpoint.
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [
    getNodeAutoInstrumentations(), // Auto-instruments HTTP, Express, databases
  ],
});

sdk.start();
// server.ts - import tracing first
import './tracing';
import express from 'express';
// ... rest of your app
The auto-instrumentation handles HTTP requests, database calls, and most popular libraries without you writing any tracing code.
Creating custom spans
Adding business context
Auto-instrumentation handles infrastructure. For business logic, create spans manually to capture what your code is actually doing:
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-service');

async function processCheckout(order: Order): Promise<OrderResult> {
  return tracer.startActiveSpan('processCheckout', async (span) => {
    try {
      span.setAttributes({
        'order.id': order.id,
        'order.items_count': order.items.length,
        'order.total': order.total,
        'customer.id': order.customerId,
      });
      const result = await submitOrder(order);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (err as Error).message,
      });
      span.recordException(err as Error);
      throw err;
    } finally {
      span.end();
    }
  });
}
Setting attributes on spans (like order.id and order.total) lets you search traces in your backend by those values, for example to find every trace related to a specific order.
Context propagation
Linking spans across services
Context propagation is what makes distributed tracing "distributed." When service A calls service B, it needs to pass the current trace ID so service B's spans appear in the same trace tree.
OpenTelemetry handles this automatically for HTTP calls when you use auto-instrumentation. It injects the W3C traceparent header:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  00                                version
  4bf92f3577b34da6a3ce929d0e0e4736  trace ID (128 bit)
  00f067aa0ba902b7                  span ID
  01                                flags
When the downstream service receives this header, it reads the trace ID and creates its spans as children of the incoming span. The result is a single, unified trace across all your services.
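To show what a downstream service does with that header, here is a minimal sketch that parses a traceparent value and builds the header for an outgoing call, keeping the trace ID and swapping in a new span ID. The `TraceParent` type and the `parseTraceparent` and `childTraceparent` functions are hypothetical names for this illustration, and the regex only handles the four-field version `00` layout. In a real service the OpenTelemetry propagator does this for you.

```typescript
// Hypothetical parsed form of a W3C traceparent header (version 00 layout).
interface TraceParent {
  version: string;
  traceId: string;      // 32 lowercase hex chars (128 bit)
  parentSpanId: string; // 16 lowercase hex chars (64 bit)
  sampled: boolean;     // lowest bit of the flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that does not match the expected layout.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // Version ff, an all-zero trace ID, and an all-zero span ID are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) {
    return null;
  }
  return {
    version,
    traceId,
    parentSpanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

// Build the header a service would send on its own outgoing call:
// same trace ID, but its own span ID becomes the parent for the next hop.
function childTraceparent(parent: TraceParent, newSpanId: string): string {
  const flags = parent.sampled ? '01' : '00';
  return `00-${parent.traceId}-${newSpanId}-${flags}`;
}

const incoming = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
if (incoming) {
  console.log(incoming.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
  console.log(childTraceparent(incoming, 'b7ad6b7169203331'));
  // 00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01
}
```

The trace ID never changes along the call chain; only the span ID does, which is how the backend reconstructs the parent-child tree.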
// Header injection happens automatically for HTTP clients covered by auto-instrumentation.
// If you call a service through a client that is not instrumented, propagate manually:
import { propagation, context } from '@opentelemetry/api';

async function callDownstreamService(url: string, data: unknown) {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
  };
  // Inject current trace context into headers
  propagation.inject(context.active(), headers);
  return fetch(url, {
    method: 'POST',
    headers,
    body: JSON.stringify(data),
  });
}
Sampling strategies
Not tracing everything
Tracing every request at high traffic is expensive. Sampling lets you collect a representative subset without breaking the bank:
| Strategy | How it works | Best for |
|---|---|---|
| Head sampling | Decide at the first span | Simple, low overhead |
| Tail sampling | Decide after the trace completes | Sample 100% of errors, 1% of successes |
| Rate-based | Trace N% of all requests | Predictable cost |
| Adaptive | Adjust rate based on traffic | Dynamic workloads |
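The tail-sampling row in the table above ("100% of errors, 1% of successes") can be sketched as a small decision function. This is only the decision logic under stated assumptions: the `FinishedTrace` and `TailSamplingPolicy` types and the `shouldKeep` function are hypothetical names, not part of any OpenTelemetry API, and in practice this kind of decision usually runs in a collector that buffers the spans of a trace until it completes, not in application code.

```typescript
// Hypothetical summary of a completed trace, for illustration only.
interface FinishedTrace {
  traceId: string;
  hasError: boolean;
}

// Hypothetical policy: always keep errors, keep a fraction of successes.
interface TailSamplingPolicy {
  successKeepRatio: number; // e.g. 0.01 keeps roughly 1% of successful traces
}

// `random` is injectable so the decision can be tested deterministically.
function shouldKeep(
  trace: FinishedTrace,
  policy: TailSamplingPolicy,
  random: () => number = Math.random
): boolean {
  if (trace.hasError) return true; // keep 100% of error traces
  return random() < policy.successKeepRatio; // keep a small share of the rest
}

const policy: TailSamplingPolicy = { successKeepRatio: 0.01 };

const failed: FinishedTrace = { traceId: 'a1', hasError: true };
const succeeded: FinishedTrace = { traceId: 'b2', hasError: false };

console.log(shouldKeep(failed, policy));               // true
console.log(shouldKeep(succeeded, policy, () => 0.5)); // false
```

The key difference from head sampling is that `hasError` is only known once the trace has finished, which is why this decision cannot be made at the first span.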
import { NodeSDK } from '@opentelemetry/sdk-node';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  // Keep 10% of traces in production, everything elsewhere
  sampler: new TraceIdRatioBasedSampler(
    process.env.NODE_ENV === 'production' ? 0.1 : 1.0
  ),
  // ... rest of config
});
Choosing a tracing backend
| Backend | Hosting | Best for |
|---|---|---|
| Jaeger | Self-hosted | Open source, full control |
| Grafana Tempo | Self-hosted or cloud | Cost-effective, integrates with Grafana |
| Honeycomb | SaaS | Best query experience, developer-focused |
| Datadog APM | SaaS | All-in-one observability |
| AWS X-Ray | SaaS (AWS) | AWS-native workloads |
All of these can ingest OpenTelemetry data, either natively over OTLP or through the OpenTelemetry Collector, so you can switch between them without changing your instrumentation code.
Quick reference
| Concept | Definition | Key detail |
|---|---|---|
| Trace | Full journey of one request | Has a unique 128-bit trace ID |
| Span | One unit of work in a trace | Has parent span ID (except root) |
| Context propagation | Passing trace ID between services | Via traceparent HTTP header |
| Auto-instrumentation | Automatic span creation | Covers HTTP, DB, queues |
| Custom span | Manually created span | For business logic |
| Sampling | Tracing a fraction of requests | Balance cost vs. coverage |