When a user clicks a button and the response is slow, where do you look? In a monolith, you check one application's logs and metrics. In a distributed system, that click might hit your API, which in turn calls an auth service, a database, a cache, and a third-party payment provider. Finding the bottleneck there is hard without the right tools.

Distributed tracing solves this by giving every request a unique ID (a trace ID) and recording every step of that request's journey as a timeline. Think of it like a detailed travel itinerary: instead of knowing only that a trip took 8 hours, you know you spent 30 minutes at the airport, 5 hours flying, 2 hours in customs, and 30 minutes getting a taxi.

Traces and spans

The mental model

A trace is the complete picture of one request, from the moment it enters your system to the moment a response goes back to the user. A span is one unit of work within that trace.

Trace: POST /checkout (450ms total)
├── Span: Auth middleware (12ms)
├── Span: Validate cart (8ms)
├── Span: Charge payment (380ms)      ← bottleneck!
│   └── Span: Stripe API call (375ms)
└── Span: Save order to DB (15ms)

Every span records a start time, duration, status (success or error), and any attributes you add (like the SQL query that ran or the HTTP URL that was called).

The Stripe API call taking 375ms out of a 450ms checkout is exactly the kind of insight that is hard to get from aggregate metrics or unstructured logs alone. You'd know checkout was slow; tracing tells you specifically why.
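To make the mental model concrete, here is a minimal sketch that represents the checkout trace above as a tree and looks for the span with the most "self time" (its duration minus the time spent in its direct children). The `SpanNode` shape and the `selfTimeMs` and `findBottleneck` helpers are hypothetical names for illustration, not OpenTelemetry types, and the sketch assumes child spans run one after another without overlapping.

```typescript
// Hypothetical, simplified span shape for illustration (not an OTel SDK type).
interface SpanNode {
  name: string;
  durationMs: number;
  status: 'ok' | 'error';
  attributes: Record<string, string | number>;
  children: SpanNode[];
}

// Self time: a span's duration minus the time spent in its direct children.
// Assumes children run sequentially and do not overlap.
function selfTimeMs(span: SpanNode): number {
  const childTotal = span.children.reduce((sum, child) => sum + child.durationMs, 0);
  return Math.max(0, span.durationMs - childTotal);
}

// Walk the whole tree and return the span with the largest self time.
function findBottleneck(root: SpanNode): SpanNode {
  let worst = root;
  const visit = (span: SpanNode): void => {
    if (selfTimeMs(span) > selfTimeMs(worst)) worst = span;
    span.children.forEach(visit);
  };
  visit(root);
  return worst;
}

// Small constructor helper to keep the example tree readable.
function span(name: string, durationMs: number, children: SpanNode[] = []): SpanNode {
  return { name, durationMs, status: 'ok', attributes: {}, children };
}

// The checkout trace from the diagram above.
const checkoutTrace = span('POST /checkout', 450, [
  span('Auth middleware', 12),
  span('Validate cart', 8),
  span('Charge payment', 380, [span('Stripe API call', 375)]),
  span('Save order to DB', 15),
]);

const bottleneck = findBottleneck(checkoutTrace);
console.log(`${bottleneck.name}: ${selfTimeMs(bottleneck)}ms self time`);
```

Tracing backends do this kind of analysis for you; the point of the sketch is only that a trace is a tree of timed spans, so "where did the time go" becomes a simple traversal.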

OpenTelemetry setup

Why OpenTelemetry

OpenTelemetry (OTel) is the vendor-neutral standard for distributed tracing, metrics, and logs. Instead of committing to Datadog's or Honeycomb's proprietary SDK, you instrument your code once with OTel and can switch backends without changing application code.

npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

Initializing the SDK

Set up the OTel SDK before your application starts. Order matters: the SDK has to patch Node.js modules before your application imports them.

// tracing.ts - import this first, before everything else
// Note: this uses the 1.x SDK API. In 2.x releases, `new Resource(...)` is
// replaced by resourceFromAttributes(...) from '@opentelemetry/resources'.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  // Identifies this service in every span it emits
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'checkout-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION,
  }),
  // With no options, the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT (a base URL)
  // and appends /v1/traces; the default is http://localhost:4318/v1/traces.
  // If you pass `url` explicitly, it must include the /v1/traces path.
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [
    getNodeAutoInstrumentations(), // Auto-instruments HTTP, Express, databases
  ],
});

sdk.start();

// server.ts - import tracing first
import './tracing';
import express from 'express';
// ... rest of your app

The auto-instrumentation covers HTTP requests, database calls, and many popular libraries without you writing any tracing code.


Creating custom spans

Adding business context

Auto-instrumentation handles infrastructure. For business logic, create spans manually to capture what your code is actually doing:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-service');

async function processCheckout(order: Order): Promise<OrderResult> {
  return tracer.startActiveSpan('processCheckout', async (span) => {
    try {
      span.setAttributes({
        'order.id': order.id,
        'order.items_count': order.items.length,
        'order.total': order.total,
        'customer.id': order.customerId,
      });

      const result = await submitOrder(order);

      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (err as Error).message,
      });
      span.recordException(err as Error);
      throw err;
    } finally {
      span.end();
    }
  });
}

Setting attributes on spans (like order.id and order.total) lets you search traces in your backend by those values, for example, finding all traces related to a specific order.
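If several functions need the same try/catch/finally wrapper, it could be factored into a helper. The following is a minimal sketch under stated assumptions: `withSpan`, `SpanLike`, and `TracerLike` are hypothetical names, and the two interfaces are small stand-ins that only mirror the methods used above so the sketch runs without dependencies. In a real service you would likely type these parameters with `Tracer` and `Span` from `@opentelemetry/api` instead.

```typescript
// Hypothetical stand-ins mirroring the span and tracer methods used above.
interface SpanLike {
  setStatus(status: { code: number; message?: string }): void;
  recordException(err: Error): void;
  end(): void;
}

interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Numeric values of SpanStatusCode.OK and SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Runs `work` inside a span, records success or failure, and always ends the span.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  work: (span: SpanLike) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      const result = await work(span);
      span.setStatus({ code: STATUS_OK });
      return result;
    } catch (err) {
      // Normalize non-Error throwables so the span always gets a message.
      const error = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: STATUS_ERROR, message: error.message });
      span.recordException(error);
      throw err; // rethrow so callers still see the failure
    } finally {
      span.end(); // a span that is never ended is never exported
    }
  });
}
```

With a helper like this, a function such as `processCheckout` shrinks to setting its attributes and calling the business logic inside the callback, and every span gets the same error handling.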


Context propagation

Linking spans across services

Context propagation is what makes distributed tracing "distributed." When service A calls service B, it needs to pass the current trace ID so service B's spans appear in the same trace tree.

OpenTelemetry handles this automatically for HTTP calls when you use auto-instrumentation. It injects the W3C traceparent header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^  ^                                ^                ^
             |  trace ID (128 bit)               span ID (64 bit) flags
             version

When the downstream service receives this header, it reads the trace ID and creates its spans as children of the incoming span. The result is a single, unified trace across all your services.
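To show what "reads the trace ID" involves, here is a minimal sketch of parsing a traceparent header by hand and building the header a service would send on its own outgoing calls. The function and type names (`parseTraceParent`, `childTraceParent`, `TraceParent`) are hypothetical, and the sketch only handles the version 00 layout shown above. In practice the OpenTelemetry propagator does this for you.

```typescript
// Hypothetical helper type; the field layout follows the header shown above.
interface TraceParent {
  version: string;
  traceId: string;  // 32 hex chars = 128 bits
  spanId: string;   // 16 hex chars = 64 bits (the caller's span, i.e. our parent)
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed fields, or null if the header is malformed.
function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid in the W3C Trace Context format.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Builds the header for an outgoing call: same trace ID, new span ID,
// and the sampling decision carried over from the parent.
function childTraceParent(parent: TraceParent, newSpanId: string): string {
  const flags = parent.sampled ? '01' : '00';
  return `00-${parent.traceId}-${newSpanId}-${flags}`;
}

const incoming = parseTraceParent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
if (incoming) {
  // 'b7ad6b7169203331' stands in for a freshly generated span ID.
  console.log(childTraceParent(incoming, 'b7ad6b7169203331'));
}
```

The trace ID stays constant across every hop while the span ID changes at each one, which is how the backend reassembles the parent-child tree.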

// This header injection happens automatically with auto-instrumentation
// But if you're making raw fetch calls, you need to propagate manually:
import { propagation, context } from '@opentelemetry/api';

async function callDownstreamService(url: string, data: unknown) {
  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
  };

  // Inject current trace context into headers
  propagation.inject(context.active(), headers);

  return fetch(url, {
    method: 'POST',
    headers,
    body: JSON.stringify(data),
  });
}

Sampling strategies

Not tracing everything

Tracing every request at high traffic is expensive. Sampling lets you collect a representative subset without breaking the bank:

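| Strategy | How it works | Best for |
| --- | --- | --- |
| Head sampling | Decide at the first span | Simple, low overhead |
| Tail sampling | Decide after the trace completes | Sample 100% of errors, 1% of successes |
| Rate-based | Trace N% of all requests | Predictable cost |
| Adaptive | Adjust rate based on traffic | Dynamic workloads |

import { NodeSDK } from '@opentelemetry/sdk-node';
import { TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node';

const sdk = new NodeSDK({
  // Head sampling: keep 10% of traces in production, 100% everywhere else
  sampler: new TraceIdRatioBasedSampler(
    process.env.NODE_ENV === 'production' ? 0.1 : 1.0
  ),
  // ... rest of config
});

Aim to keep 100% of error traces, even if you sample only a small fraction of successful requests. A head sampler like the one above cannot do this, because it decides before the outcome of the request is known. That takes tail-based sampling, which tools such as the OpenTelemetry Collector's tail sampling processor (often run in front of backends such as Grafana Tempo) or Honeycomb's Refinery provide.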
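The difference between head and tail decisions can be sketched in a few lines. This is an illustration under stated assumptions, not the algorithm any particular SDK or collector uses: `headSample`, `tailSample`, and `CompletedTrace` are hypothetical names, and the bucketing on the last 8 hex characters of the trace ID is just one simple way to get a deterministic decision.

```typescript
// Head sampling: decide from the trace ID alone, before the request runs.
// Because the decision is a pure function of the trace ID, every service
// that sees the same ID reaches the same answer without coordination.
function headSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex chars as an unsigned 32-bit number in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * 2 ** 32;
}

// Hypothetical summary of a finished trace, as a tail sampler would see it.
interface CompletedTrace {
  traceId: string;
  hasError: boolean;
}

// Tail sampling: decide after the trace completes, when the outcome is known.
function tailSample(trace: CompletedTrace, successRatio: number): boolean {
  if (trace.hasError) return true; // keep every failed trace
  return headSample(trace.traceId, successRatio); // keep a fraction of the rest
}

const exampleTraceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(headSample(exampleTraceId, 0.1));
console.log(tailSample({ traceId: exampleTraceId, hasError: true }, 0.01));
```

The trade-off is that a tail sampler has to buffer every span of a trace until it completes, which is why it usually runs in a separate collector process rather than inside each service.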

Choosing a tracing backend

| Backend | Hosting | Best for |
| --- | --- | --- |
| Jaeger | Self-hosted | Open source, full control |
| Grafana Tempo | Self-hosted or cloud | Cost-effective, integrates with Grafana |
| Honeycomb | SaaS | Best query experience, developer-focused |
| Datadog APM | SaaS | All-in-one observability |
| AWS X-Ray | SaaS (AWS) | AWS-native workloads |

All of these accept OpenTelemetry data, either directly or through the OpenTelemetry Collector, so you can switch between them without changing your instrumentation code.


Quick reference

| Concept | Definition | Key detail |
| --- | --- | --- |
| Trace | Full journey of one request | Has a unique 128-bit trace ID |
| Span | One unit of work in a trace | Has parent span ID (except root) |
| Context propagation | Passing trace ID between services | Via traceparent HTTP header |
| Auto-instrumentation | Automatic span creation | Covers HTTP, DB, queues |
| Custom span | Manually created span | For business logic |
| Sampling | Tracing a fraction of requests | Balance cost vs. coverage |