Observability Pillars

Monitoring is a predefined set of checks: Is the server up? Is CPU above 80%? You decide in advance what to watch. Observability is different: it is the ability to ask arbitrary questions about your system's behavior after the fact, without having anticipated them. Monitoring tells you WHAT is broken; observability helps you understand WHY.

The three pillars explained

Observability rests on three types of telemetry data. Telemetry is the data your application automatically sends about its own performance and behavior: logs, metrics, and traces collected for monitoring. Each type answers different questions, and they work best together.

Logs

Logs are discrete events with a timestamp and a message. They tell you what happened at a specific moment.

json

{
  "timestamp": "2025-03-15T14:32:01.445Z",
  "level": "error",
  "service": "auth-service",
  "message": "Authentication failed",
  "userId": "user-42",
  "reason": "invalid_password",
  "ip": "203.0.113.42"
}

Raw logs become hard to use at scale; you need structure, aggregation, and search to make them actionable.
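A minimal sketch of producing structured log lines like the one above. The formatLogEntry helper is hypothetical (it is not part of any logging library), and the field names simply mirror the JSON example; a real project would more likely use an established logger.

```typescript
// Hypothetical helper for illustration; not part of any logging library.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEntry {
  timestamp: string;
  level: LogLevel;
  service: string;
  message: string;
  [field: string]: unknown; // extra structured context, e.g. userId or reason
}

// Build one log event as a single line of JSON so aggregators can parse it.
function formatLogEntry(
  level: LogLevel,
  service: string,
  message: string,
  context: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  // Context is spread first so it cannot overwrite the core fields.
  const entry: LogEntry = {
    ...context,
    timestamp: now.toISOString(),
    level,
    service,
    message,
  };
  return JSON.stringify(entry);
}

// Usage: write the line to stdout and let the platform collect it.
console.log(
  formatLogEntry("error", "auth-service", "Authentication failed", {
    userId: "user-42",
    reason: "invalid_password",
  }),
);
```

Because every event is one JSON object per line, a log aggregator can index fields such as level or userId instead of relying on full-text search.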

Metrics

Metrics are numeric values measured over time. They track "request count per second," "p99 latency," or "active database connections." Latency is the time delay between sending a request and receiving the first byte of the response, usually measured in milliseconds.

http_requests_total{method="GET", path="/api/users", status="200"} 14523
http_request_duration_seconds{method="GET", path="/api/users", quantile="0.99"} 0.45

Metrics are cheap to store and ideal for dashboards and alerts. They answer "how many?" and "how fast?" but not "why?"
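To make the label model concrete, here is a small in-memory sketch of a labeled counter. The LabeledCounter class is hypothetical and only stands in for what a real metrics client does; its purpose is to show that each distinct label combination becomes its own series.

```typescript
// Hypothetical in-memory counter for illustration; real apps would use a metrics client.
type Labels = Record<string, string>;

class LabeledCounter {
  private readonly name: string;
  private readonly series = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  // Serialize labels in sorted key order so the same label set always maps to one series.
  private key(labels: Labels): string {
    const pairs = Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`);
    return `${this.name}{${pairs.join(",")}}`;
  }

  // Increment the series identified by this exact label combination.
  inc(labels: Labels, amount = 1): void {
    const k = this.key(labels);
    this.series.set(k, (this.series.get(k) ?? 0) + amount);
  }

  // Number of distinct series created so far.
  seriesCount(): number {
    return this.series.size;
  }

  // One text line per series: name, labels, then the current value.
  render(): string[] {
    return Array.from(this.series.entries()).map(([k, v]) => `${k} ${v}`);
  }
}

const httpRequests = new LabeledCounter("http_requests_total");
httpRequests.inc({ method: "GET", status: "200" });
httpRequests.inc({ status: "200", method: "GET" }); // same series; key order does not matter
httpRequests.inc({ method: "POST", status: "500" }); // new label combination, new series
```

Three increments produce only two series here, which is why metrics stay compact as long as the label values are bounded.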

Traces

Traces follow a single request as it travels across multiple services. A trace for "user loads their dashboard" might show: API gateway (2ms) -> auth service (15ms) -> user service (8ms) -> database query (120ms) -> recommendation service (350ms). Now you can see the recommendation service is the bottleneck. (An API gateway is a single entry point that sits in front of multiple backend services, routing requests to the right one and handling shared concerns like authentication and rate limiting.)

Each trace contains spans, one span per operation. A span is one unit of work within a distributed trace, with a start time, duration, and optional attributes describing the operation. Spans are nested, showing the parent-child relationship between calls.
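The nesting can be sketched as a small tree. The Span shape and both functions below are hypothetical, the timings are the ones from the dashboard example, and the parent-child arrangement is an assumption, since the text lists the calls only in sequence.

```typescript
// Hypothetical span shape for illustration; real tracing SDKs record more fields.
interface Span {
  name: string;
  selfMs: number; // time spent in this operation, excluding child calls
  children: Span[];
}

// Total duration of a span, assuming its child calls run one after another.
function totalMs(span: Span): number {
  return span.children.reduce((sum, child) => sum + totalMs(child), span.selfMs);
}

// Walk the tree and return the span with the largest self time.
function findBottleneck(root: Span): Span {
  let worst = root;
  const stack: Span[] = [root];
  while (stack.length > 0) {
    const span = stack.pop()!;
    if (span.selfMs > worst.selfMs) worst = span;
    stack.push(...span.children);
  }
  return worst;
}

// The dashboard request from the text; the nesting shown here is an assumption.
const dashboardTrace: Span = {
  name: "API gateway",
  selfMs: 2,
  children: [
    { name: "auth service", selfMs: 15, children: [] },
    {
      name: "user service",
      selfMs: 8,
      children: [{ name: "database query", selfMs: 120, children: [] }],
    },
    { name: "recommendation service", selfMs: 350, children: [] },
  ],
};

const bottleneck = findBottleneck(dashboardTrace);
```

Here bottleneck.name is "recommendation service", and totalMs(dashboardTrace) adds up to 495 ms under the sequential-call assumption.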

Comparing the three pillars

Aspect	Logs	Metrics	Traces
Data type	Text events	Numeric time series	Request flow graphs
Best for	Debugging specific errors	Alerting and trends	Finding bottlenecks across services
Storage cost	High (verbose text)	Low (compact numbers)	Medium (structured spans)
Query speed	Slow (full-text search)	Fast (pre-aggregated)	Medium (indexed by trace ID)
Cardinality risk	Low (free-form text)	High (label explosion)	Medium (controlled spans)
Example question	"Why did this request fail?"	"What is the error rate this hour?"	"Where did this request spend its time?"
When to add	Day 1	Day 1	When you have 3+ services

The cardinality problem

Cardinality is the number of unique values a label can have. It is the single most important concept for controlling observability costs. Every unique combination of labels creates a new time series in your metrics database.

# Low cardinality - 5 time series. Fine.
http_requests_total{method="GET"}
http_requests_total{method="POST"}
...

# High cardinality - millions of time series. Expensive.
http_requests_total{method="GET", userId="user-1"}
http_requests_total{method="GET", userId="user-2"}
http_requests_total{method="GET", userId="user-3"}
... (millions more)

The rule: never use unbounded values (user IDs, email addresses, request IDs) as metric labels. If you need per-user analysis, use logs or traces. They handle high cardinality naturally because they are not pre-aggregated.

Label type	Example values	Cardinality	Safe for metrics?
HTTP method	GET, POST, PUT	~5	Yes
Status code	200, 404, 500	~20	Yes
Service name	auth, payments	~10-50	Yes
Environment	prod, staging	~3	Yes
User ID	user-1, user-2, ...	Millions	No, use logs/traces
Request ID	uuid-abc, uuid-def	Unlimited	No, use traces
Email address	alice@..., bob@...	Millions	No, never
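A rough way to reason about the table is to multiply the cardinalities of the labels on a metric. The estimateSeriesCount function below is a hypothetical sketch that computes that upper bound; the input numbers are taken from the table, with 1,000,000 used as an assumed stand-in for "Millions".

```typescript
// Upper bound on series for one metric: the product of each label's distinct value count.
// Real counts are usually lower because not every combination occurs in practice.
function estimateSeriesCount(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, n) => product * n, 1);
}

// Bounded labels only: 5 * 20 * 50 * 3 = 15000 series at most.
const boundedSeries = estimateSeriesCount({
  method: 5,
  status: 20,
  service: 50,
  environment: 3,
});

// Adding a user ID label: 5 * 1000000 = 5000000 series at most.
const unboundedSeries = estimateSeriesCount({
  method: 5,
  userId: 1000000, // assumed figure standing in for "Millions"
});
```

One unbounded label outweighs all the bounded ones combined, which is why the rule targets label choice rather than the number of labels.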

OpenTelemetry: the standard

Before OpenTelemetry (OTel), every vendor had its own SDK (a pre-built library that wraps the vendor's API in convenient functions), so switching from Datadog to Grafana meant rewriting all instrumentation. OTel is a vendor-neutral open standard and SDK for collecting logs, metrics, and traces. Your app uses the OTel SDK; the OTel Collector receives that telemetry and exports it to whatever backend you choose.

Your App (OTel SDK) --> OTel Collector --> Jaeger (traces)
                                       --> Prometheus (metrics)
                                       --> Loki (logs)

Instrument once, export anywhere. Switch vendors by changing the Collector's configuration, not your application code.

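import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  // Send traces to the Collector's OTLP/HTTP endpoint.
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // Export metrics periodically to the same Collector.
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
  }),
  // Enable auto-instrumentation for supported libraries.
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start before the rest of the app loads so libraries can be patched.
sdk.start();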

With auto-instrumentation registered (for Node.js, via the @opentelemetry/auto-instrumentations-node package), the SDK instruments common libraries such as Express, PostgreSQL, Redis, and HTTP clients, so you get traces and metrics out of the box. Add custom spans and metrics only for business-specific operations.

Observability is not optional

In a distributed system, when a request touches six services and fails, you need traces to find which one broke, metrics to understand if it is a pattern, and logs to read the actual error message. Start with structured logs on day one. Add metrics for dashboards and alerts. Add traces when you have multiple services calling each other.

AI pitfall

AI assistants often recommend setting up all three observability pillars (logs, metrics, traces) from day one, complete with Prometheus, Grafana, Jaeger, and an OpenTelemetry Collector. What they get wrong: for a small team with 1-2 services, this is massive overkill. Start with structured JSON logs to stdout and a managed service like Datadog or Grafana Cloud. Add custom metrics and traces when you actually need them.

Good to know

Cardinality is the silent killer of observability costs. A metric label like user_id creates a separate time series for every user. With 100K users and 10 metrics, that is 1 million time series, which can cost thousands of dollars per month in Prometheus/Grafana Cloud. Only use low-cardinality labels on metrics: status codes, service names, and endpoint paths expressed as route templates (for example /api/users/:id) rather than raw URLs that embed IDs.

Edge case

OpenTelemetry auto-instrumentation is powerful but can generate enormous volumes of trace data. A single HTTP request in a Node.js app can produce 10+ spans (Express middleware, DNS lookup, TCP connect, TLS handshake, database query). Without sampling, trace storage costs can exceed your entire hosting budget. Always configure sampling from the start.
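The idea behind ratio-based sampling can be sketched in a few lines. The shouldSample function below is a simplified, hypothetical illustration and not the OpenTelemetry SDK's implementation; in a real setup you would configure one of the samplers the SDK ships with instead of writing your own.

```typescript
// Simplified ratio sampler for illustration; NOT the OpenTelemetry SDK's implementation.
// The decision is derived from the trace ID, so every service that applies the same
// ratio reaches the same verdict for a given trace.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false; // keep nothing
  if (ratio >= 1) return true; // keep everything
  // Interpret the first 8 hex characters as an unsigned 32-bit integer.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  if (Number.isNaN(bucket)) return false; // malformed ID: drop rather than guess
  // Keep the trace if its bucket falls in the lowest `ratio` share of the 32-bit range.
  return bucket < ratio * 0x100000000;
}

// Usage: keep roughly 10% of traces.
const keepTrace = shouldSample("0af7651916cd43dd8448eb211c80319c", 0.1);
```

Because the verdict depends only on the trace ID and the ratio, a trace is either kept in full or dropped in full, rather than leaving gaps where some services sampled it and others did not.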
