Monitoring is a predefined set of checks: is the server up? Is CPU above 80%? You decide in advance what to watch. Observability is different: it is the ability to ask arbitrary questions about your system's behavior after the fact, without having anticipated them. Monitoring tells you WHAT is broken; observability helps you understand WHY.
The three pillars explained
Observability rests on three types of telemetry data. Each answers different questions, and they work best together.
Logs
Logs are discrete events with a timestamp and a message. They tell you what happened at a specific moment.
{
  "timestamp": "2025-03-15T14:32:01.445Z",
  "level": "error",
  "service": "auth-service",
  "message": "Authentication failed",
  "userId": "user-42",
  "reason": "invalid_password",
  "ip": "203.0.113.42"
}
Raw logs become useless at scale: you need structure, aggregation, and search to make them actionable.
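As a minimal sketch of why structure matters, the snippet below builds entries shaped like the example above, writes them as one JSON object per line, and then searches them by field. The helper names (`buildLogEntry`, `toLogLine`, `findByField`) and the second "Authentication succeeded" entry are hypothetical, chosen only for illustration; a real setup would normally use a logging library and a log aggregator.

```typescript
// Field names mirror the example entry above; the helpers are illustrative only.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogEntry {
  timestamp: string;
  level: LogLevel;
  service: string;
  message: string;
  [field: string]: unknown; // extra structured context, e.g. userId or reason
}

// Build one structured entry. `now` is injectable so output is reproducible.
// Reserved fields are written last so context cannot overwrite them.
function buildLogEntry(
  service: string,
  level: LogLevel,
  message: string,
  context: Record<string, unknown> = {},
  now: Date = new Date(),
): LogEntry {
  return { ...context, timestamp: now.toISOString(), level, service, message };
}

// Serialize as one JSON object per line so tools can parse each line on its own.
function toLogLine(entry: LogEntry): string {
  return JSON.stringify(entry);
}

// With structured fields, searching becomes filtering on a key.
function findByField(lines: string[], field: string, value: unknown): LogEntry[] {
  return lines
    .map((line) => JSON.parse(line) as LogEntry)
    .filter((entry) => entry[field] === value);
}

const lines: string[] = [
  toLogLine(
    buildLogEntry(
      "auth-service",
      "error",
      "Authentication failed",
      { userId: "user-42", reason: "invalid_password", ip: "203.0.113.42" },
      new Date("2025-03-15T14:32:01.445Z"),
    ),
  ),
  toLogLine(
    buildLogEntry(
      "auth-service",
      "info",
      "Authentication succeeded", // hypothetical second event
      { userId: "user-7" },
      new Date("2025-03-15T14:32:02.000Z"),
    ),
  ),
];

const failures = findByField(lines, "level", "error");
console.log(failures.length, failures[0]?.userId); // 1 user-42
```

Searching free-form text for the same information would need a regular expression per message format; here the query is a plain equality check on a named field.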
Metrics
Metrics are numeric values measured over time. They track "request count per second," "p99 latency," or "active database connections."
http_requests_total{method="GET", path="/api/users", status="200"} 14523
http_request_duration_seconds{method="GET", path="/api/users", quantile="0.99"} 0.45
Metrics are cheap to store and ideal for dashboards and alerts. They answer "how many?" and "how fast?" but not "why?"
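To make the idea of a labeled counter concrete, here is a small in-memory sketch. The `Counter` class is hypothetical and is not the API of any metrics library; it only illustrates that each distinct label set becomes its own series and that a metric stores a number rather than the events behind it.

```typescript
type Labels = Record<string, string>;

// A minimal in-memory counter: one numeric value per distinct label set.
class Counter {
  private readonly name: string;
  private readonly series = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  // Sort label keys so {a, b} and {b, a} address the same series.
  private key(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(", ");
  }

  inc(labels: Labels, amount: number = 1): void {
    const k = this.key(labels);
    this.series.set(k, (this.series.get(k) ?? 0) + amount);
  }

  // Render each series in the text form used by the samples above.
  render(): string[] {
    return Array.from(this.series.entries()).map(
      ([labels, value]) => `${this.name}{${labels}} ${value}`,
    );
  }
}

const httpRequestsTotal = new Counter("http_requests_total");
httpRequestsTotal.inc({ method: "GET", path: "/api/users", status: "200" });
httpRequestsTotal.inc({ status: "200", path: "/api/users", method: "GET" }); // same series
httpRequestsTotal.inc({ method: "POST", path: "/api/users", status: "201" });

console.log(httpRequestsTotal.render().join("\n"));
// http_requests_total{method="GET", path="/api/users", status="200"} 2
// http_requests_total{method="POST", path="/api/users", status="201"} 1
```

Three requests collapse into two numbers, which is why metrics are cheap to store and also why they cannot tell you what happened inside any single request.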
Traces
Traces follow a single request as it travels across multiple services. A trace for "user loads their dashboard" might show: API gateway (2ms) -> auth service (15ms) -> user service (8ms) -> database query (120ms) -> recommendation service (350ms). Now you can see the recommendation service is the bottleneck.
Each trace contains spans, one span per operation. Spans are nested, showing the parent-child relationship between calls.
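The parent-child structure can be sketched with plain data. The example below models the dashboard trace as a list of spans and computes each span's self time (its duration minus the time spent in its direct children). The nesting is an assumption made for illustration: the database query is treated as a child of the user service, the other services as children of the gateway, and children are assumed to run one after another. The `Span` shape and function names are hypothetical, not the OpenTelemetry data model.

```typescript
interface Span {
  spanId: string;
  parentId: string | null; // null marks the root span
  name: string;
  durationMs: number; // wall-clock time, including time spent in child spans
}

// Self time = a span's duration minus the durations of its direct children.
// Assumes children do not overlap in time, which real traces may violate.
function selfTimes(spans: Span[]): Map<string, number> {
  const result = new Map<string, number>();
  for (const span of spans) {
    result.set(span.spanId, span.durationMs);
  }
  for (const span of spans) {
    if (span.parentId !== null && result.has(span.parentId)) {
      result.set(span.parentId, (result.get(span.parentId) ?? 0) - span.durationMs);
    }
  }
  return result;
}

// The bottleneck is the span with the largest self time (expects a non-empty trace).
function bottleneck(spans: Span[]): Span {
  const self = selfTimes(spans);
  return spans.reduce((worst, span) =>
    (self.get(span.spanId) ?? 0) > (self.get(worst.spanId) ?? 0) ? span : worst,
  );
}

// Durations chosen so that self times match the figures in the text:
// gateway 495 = 2 (own work) + 15 + 128 + 350; user service 128 = 8 + 120.
const trace: Span[] = [
  { spanId: "gateway", parentId: null, name: "API gateway", durationMs: 495 },
  { spanId: "auth", parentId: "gateway", name: "auth service", durationMs: 15 },
  { spanId: "user", parentId: "gateway", name: "user service", durationMs: 128 },
  { spanId: "db", parentId: "user", name: "database query", durationMs: 120 },
  { spanId: "rec", parentId: "gateway", name: "recommendation service", durationMs: 350 },
];

console.log(bottleneck(trace).name); // recommendation service
```

Ranking by self time rather than total duration matters: the gateway span is the longest overall, but almost all of that time is spent waiting on its children.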
Comparing the three pillars
| Aspect | Logs | Metrics | Traces |
|---|---|---|---|
| Data type | Text events | Numeric time series | Request flow graphs |
| Best for | Debugging specific errors | Alerting and trends | Finding bottlenecks across services |
| Storage cost | High (verbose text) | Low (compact numbers) | Medium (structured spans) |
| Query speed | Slow (full-text search) | Fast (pre-aggregated) | Medium (indexed by trace ID) |
| Cardinality risk | Low (free-form text) | High (label explosion) | Medium (controlled spans) |
| Example question | "Why did this request fail?" | "What is the error rate this hour?" | "Where did this request spend its time?" |
| When to add | Day 1 | Day 1 | When you have 3+ services |
The cardinality problem
Cardinality is the number of unique values a label can have. It is the single most important concept for controlling observability costs. Every unique combination of labels creates a new time series in your metrics database.
# Low cardinality - 5 time series. Fine.
http_requests_total{method="GET"}
http_requests_total{method="POST"}
...
# High cardinality - millions of time series. Expensive.
http_requests_total{method="GET", userId="user-1"}
http_requests_total{method="GET", userId="user-2"}
http_requests_total{method="GET", userId="user-3"}
... (millions more)
The rule: never use unbounded values (user IDs, email addresses, request IDs) as metric labels. If you need per-user analysis, use logs or traces; they handle high cardinality naturally because they are not pre-aggregated.
| Label type | Example values | Cardinality | Safe for metrics? |
|---|---|---|---|
| HTTP method | GET, POST, PUT | ~5 | Yes |
| Status code | 200, 404, 500 | ~20 | Yes |
| Service name | auth, payments | ~10-50 | Yes |
| Environment | prod, staging | ~3 | Yes |
| User ID | user-1, user-2, ... | Millions | No, use logs/traces |
| Request ID | uuid-abc, uuid-def | Unlimited | No, use traces |
| Email address | alice@..., bob@... | Millions | No, never |
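The effect of one unbounded label can be estimated with simple arithmetic: the worst-case number of series for a metric is the product of the distinct values of each label. The sketch below uses rough cardinalities from the table; the function names, the figure of one million users, and the per-label budget of 100 are assumptions for illustration, not recommendations from any particular backend.

```typescript
// Worst-case series count for one metric: the product of each label's
// number of distinct values. Real counts are lower when not every
// combination actually occurs.
function estimateSeries(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((total, n) => total * n, 1);
}

// Hypothetical guard: list labels whose cardinality exceeds a per-label budget.
function unsafeLabels(
  labelCardinalities: Record<string, number>,
  maxPerLabel: number,
): string[] {
  return Object.entries(labelCardinalities)
    .filter(([, n]) => n > maxPerLabel)
    .map(([label]) => label);
}

const bounded: Record<string, number> = { method: 5, status: 20, environment: 3 };
const withUserId: Record<string, number> = { ...bounded, userId: 1000000 };

console.log(estimateSeries(bounded)); // 300
console.log(estimateSeries(withUserId)); // 300000000
console.log(unsafeLabels(withUserId, 100)); // [ 'userId' ]
```

A check like `unsafeLabels` could run in code review or CI to catch an unbounded label before it reaches the metrics backend.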
OpenTelemetry: the standard
Before OpenTelemetry (OTel), every vendor had its own SDK, so switching from Datadog to Grafana meant rewriting all instrumentation. OTel provides a single, vendor-neutral standard for collecting logs, metrics, and traces. Your app uses the OTel SDK; the OTel Collector receives that telemetry and exports it to whatever backend you choose.
Your App (OTel SDK) --> OTel Collector --> Jaeger (traces)
                                       --> Prometheus (metrics)
                                       --> Loki (logs)
Instrument once, export anywhere. Switch vendors by changing the Collector's configuration, not your application code.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  // Send traces to the Collector's OTLP/HTTP endpoint.
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  // Export metrics periodically to the same Collector.
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics',
    }),
  }),
  // Register auto-instrumentation; without this, no libraries are instrumented.
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
With the auto-instrumentations package registered, the SDK instruments common libraries (Express, PostgreSQL, Redis, HTTP clients), so you get traces and metrics out of the box. Add custom spans and metrics only for business-specific operations.
Observability is not optional
In a distributed system, when a request touches six services and fails, you need traces to find which one broke, metrics to understand if it is a pattern, and logs to read the actual error message. Start with structured logs on day one. Add metrics for dashboards and alerts. Add traces when you have multiple services calling each other.
A user_id label creates a separate time series for every user. With 100K users and 10 metrics, that is 1 million time series, which can cost thousands of dollars per month in Prometheus/Grafana Cloud. Only use low-cardinality labels (status codes, endpoint paths, service names) on metrics.