Logs tell you what happened. Metrics tell you how things are going. A single error log tells you one request failed. A metric tells you that the error rate jumped from 0.1% to 5% in the last ten minutes. Metrics are the backbone of alerting, dashboards, and capacity planning.
Metric types
Most metrics systems share three fundamental types. Understanding them is essential because choosing the wrong type leads to meaningless data.
Counter
A counter only goes up. It counts cumulative occurrences: total requests served, total errors, total bytes transferred. You never set a counter to a specific value; you only increment it.
import { Counter } from 'prom-client';

const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// In your request handler
app.use((req, res, next) => {
  // Count the request once the response is sent, so the status code is final
  res.on('finish', () => {
    httpRequests.inc({
      method: req.method,
      // Use the route pattern, not the raw URL, to keep label cardinality low
      path: req.route?.path || 'unknown',
      status: res.statusCode,
    });
  });
  next();
});

To get the request rate (requests per second), you use a query function like Prometheus's rate():

rate(http_requests_total[5m])

This computes the per-second rate over the last 5 minutes. Never read a counter's raw value on its own: it only reflects how long the process has been running, so it is meaningless without rate() or increase().
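To make the idea behind rate() concrete, here is a minimal sketch of the arithmetic, assuming samples arrive as simple `{ t, v }` pairs (a made-up shape for illustration, not a Prometheus API). It is deliberately simplified: the real rate() also extrapolates to the edges of the window, which this version does not attempt.

```javascript
// Simplified model of rate(): per-second increase between the first and
// last sample in a window, compensating for counter resets.
// `samples` is a hypothetical shape: [{ t: seconds, v: counterValue }, ...]
// sorted by time.
function perSecondRate(samples) {
  if (samples.length < 2) return null; // not enough data for a rate
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the process restarted and the counter began again at 0,
    // so the whole new value counts as increase.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : null;
}

const samples = [
  { t: 0, v: 1000 },
  { t: 60, v: 1300 },
  { t: 120, v: 50 }, // restart: counter reset, then 50 new requests
  { t: 180, v: 350 },
];
console.log(perSecondRate(samples)); // about 3.61 requests per second
```

The reset handling is the reason raw counter values are unreliable: the value drops to zero on every restart, while the computed rate stays continuous.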
Gauge
A gauge goes up and down. It measures a current value: active connections, queue depth, memory usage, temperature.
import { Gauge } from 'prom-client';

const activeConnections = new Gauge({
  name: 'db_active_connections',
  help: 'Number of active database connections',
});

// Set to current value
activeConnections.set(pool.activeCount);

// Or increment/decrement
activeConnections.inc(); // connection opened
activeConnections.dec(); // connection closed

Gauges are useful for utilization and saturation metrics: "how full is this resource right now?"
Histogram
A histogram measures the distribution of values, typically request durations or response sizes. Instead of just knowing the average, you can estimate the 50th, 95th, and 99th percentiles.
import { Histogram } from 'prom-client';

const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

app.use((req, res, next) => {
  // Start timing now; attach labels when the response finishes, because
  // req.route is only populated after Express has matched a route
  const end = requestDuration.startTimer();
  res.on('finish', () => {
    end({
      method: req.method,
      path: req.route?.path || 'unknown',
    });
  });
  next();
});

The buckets define the boundaries for counting, and each bucket is cumulative. With the buckets above, you will know how many requests took at most 10 ms, at most 50 ms, at most 100 ms, and so on. Choose buckets that match your SLO targets: if your SLO is "95% of requests under 250 ms", you need a boundary at exactly 0.25.
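The following sketch shows how cumulative buckets turn into a percentile estimate. It reuses the bucket boundaries from the histogram above; the function names and the sample durations are made up for illustration, and the interpolation is a simplified version of the idea behind histogram_quantile(), not Prometheus's actual implementation.

```javascript
// Upper bounds copied from the histogram above; Infinity plays the role
// of the +Inf bucket that catches everything else.
const bounds = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, Infinity];

// Cumulative counts: counts[i] = number of observations <= bounds[i].
function cumulativeCounts(observations) {
  return bounds.map((le) => observations.filter((v) => v <= le).length);
}

// Estimate quantile q (0..1) by linear interpolation inside the bucket
// that contains the target rank.
function estimateQuantile(q, counts) {
  const total = counts[counts.length - 1];
  if (total === 0) return NaN; // no observations yet
  const rank = q * total;
  const i = counts.findIndex((c) => c >= rank);
  // If the rank lands in the +Inf bucket, report the highest finite bound.
  if (bounds[i] === Infinity) return bounds[i - 1];
  const lower = i === 0 ? 0 : bounds[i - 1];
  const below = i === 0 ? 0 : counts[i - 1];
  const inBucket = counts[i] - below;
  if (inBucket === 0) return lower;
  return lower + (bounds[i] - lower) * ((rank - below) / inBucket);
}

// Hypothetical request durations in seconds.
const durations = [0.02, 0.03, 0.04, 0.08, 0.2, 0.3, 0.4, 0.7, 1.5, 3];
const counts = cumulativeCounts(durations);
console.log(counts);
console.log('p50 estimate:', estimateQuantile(0.5, counts));
console.log('p99 estimate:', estimateQuantile(0.99, counts));
```

Note that the estimate can only be as precise as the buckets allow. Here the slowest request took 3 seconds, but the p99 estimate lands between the 2.5 and 5 boundaries because the histogram no longer knows where in that bucket the observation fell.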
| Metric type | Direction | Use case | Example |
|---|---|---|---|
| Counter | Only up | Counting events | Total requests, total errors, bytes sent |
| Gauge | Up and down | Current state | Active connections, queue size, CPU % |
| Histogram | Distribution | Latency / size | Request duration, response size |
The RED method
RED stands for Rate, Errors, Duration. It was designed by Tom Wilkie for request-driven services (APIs, web servers, microservices). If you instrument nothing else, instrument these three.
| Signal | What it measures | Metric type | PromQL example |
|---|---|---|---|
| Rate | Requests per second | Counter + rate() | rate(http_requests_total[5m]) |
| Errors | Failed requests per second | Counter + rate() | rate(http_requests_total{status=~"5.."}[5m]) |
| Duration | Time per request | Histogram | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
RED answers the three most important questions for any service: "How busy is it?" (Rate), "Is it broken?" (Errors), "Is it slow?" (Duration). These three metrics, on a single dashboard, give you an instant health check for any service.
Building a RED dashboard
A practical RED dashboard has three panels per service:
- Request Rate: line chart of sum by (status) (rate(http_requests_total[5m])), split by status code. You see normal traffic patterns and spikes.
- Error Rate: line chart of sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100, the percentage of requests failing. The sum() on both sides matters: without it, PromQL divides each 5xx series by the series with identical labels, which yields 100% for every series.
- Latency: line chart with p50, p95, p99 lines. The gap between p50 and p99 tells you how consistent your performance is.
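The error-rate panel boils down to a ratio of counter increases. As a rough sketch of that arithmetic, assuming you already have the per-status increase over the window in a plain object (a hypothetical input shape, not something Prometheus returns in this form):

```javascript
// `increases` maps status code -> request-count increase over the window.
// This input shape is an assumption for illustration.
function errorPercentage(increases) {
  let total = 0;
  let errors = 0;
  for (const [status, count] of Object.entries(increases)) {
    total += count;
    // Same match as the PromQL selector status=~"5..", which is anchored.
    if (/^5..$/.test(status)) errors += count;
  }
  // With no traffic there is nothing to divide; this sketch reports 0,
  // whereas PromQL would produce NaN.
  return total === 0 ? 0 : (errors * 100) / total;
}

console.log(errorPercentage({ 200: 940, 404: 10, 500: 40, 503: 10 })); // 5
```

Client errors such as 404 count toward the total but not toward the errors, which matches the 5xx-only selector used in the panel.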
The USE method
USE stands for Utilization, Saturation, Errors. It was designed by Brendan Gregg for infrastructure resources (CPU, memory, disk, network). While RED covers your application layer, USE covers the underlying resources.
| Signal | What it measures | Question it answers |
|---|---|---|
| Utilization | % of resource in use | "How full is it?" |
| Saturation | Queued work waiting | "Is there a backlog?" |
| Errors | Resource errors | "Is it failing?" |
Apply USE to every resource in your system:
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | % busy time | Run queue length | Machine check exceptions |
| Memory | % used | Swap usage | OOM kills |
| Disk I/O | % busy time | I/O queue depth | Read/write errors |
| Network | Bandwidth % | Dropped packets | Interface errors |
| DB connections | Active / max pool | Waiting threads | Connection timeouts |
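As an illustration, the DB connections row of the table could be computed from a pool snapshot like this. The field names (`active`, `max`, `waiting`, `timeouts`) are hypothetical; map them to whatever your pool library actually exposes.

```javascript
// Derive the three USE signals for a connection pool from one snapshot.
// All field names here are assumptions for the sake of the example.
function poolUse({ active, max, waiting, timeouts }) {
  return {
    // Utilization: share of the pool currently in use (0..1).
    utilization: max > 0 ? active / max : 0,
    // Saturation: callers queued because no connection is free.
    saturation: waiting,
    // Errors: connection attempts that gave up.
    errors: timeouts,
  };
}

const snapshot = { active: 18, max: 20, waiting: 3, timeouts: 1 };
console.log(poolUse(snapshot));
```

Keeping the three signals separate is the point of USE: a pool at 90% utilization with nothing waiting is busy but healthy, while the same utilization with a growing wait queue means requests are already being delayed.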
Custom business metrics
Technical metrics tell you if the system is healthy. Business metrics tell you if the product is working. Both matter, but teams often forget the second kind.
import { Counter, Gauge, Histogram } from 'prom-client';

// Counter: orders only ever accumulate
const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['plan', 'country'],
});

// Histogram: checkout takes seconds to minutes, so buckets are much wider
// than for HTTP latency
const checkoutDuration = new Histogram({
  name: 'checkout_duration_seconds',
  help: 'Time from cart to order completion',
  buckets: [5, 10, 30, 60, 120, 300],
});

const cartValue = new Histogram({
  name: 'cart_value_dollars',
  help: 'Cart value at checkout',
  buckets: [10, 25, 50, 100, 250, 500, 1000],
});

// Gauge: subscriptions go up and down
const activeSubscriptions = new Gauge({
  name: 'active_subscriptions',
  help: 'Current active subscriptions',
  labelNames: ['plan'],
});

When orders-per-minute drops to zero at 2 PM on a Tuesday, that is a bigger signal than CPU at 90%. Business metrics catch problems that technical metrics miss: a silent deployment that broke the checkout flow, a third-party payment provider outage, or a pricing change that killed conversion.
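One way to turn "orders dropped to zero" into a check is sketched below. It assumes you have per-minute order counts (for example, the per-minute increase of orders_created_total) as a plain array; the function name and the default thresholds are illustrative, not recommendations.

```javascript
// Flag a stall: the most recent `quietMinutes` are all zero even though
// the earlier part of the window averaged at least `minBaseline` orders
// per minute. Names and defaults are assumptions for this sketch.
function ordersStalled(perMinute, quietMinutes = 5, minBaseline = 1) {
  if (perMinute.length <= quietMinutes) return false; // no baseline to compare
  const recent = perMinute.slice(-quietMinutes);
  const earlier = perMinute.slice(0, -quietMinutes);
  const baseline = earlier.reduce((a, b) => a + b, 0) / earlier.length;
  return baseline >= minBaseline && recent.every((n) => n === 0);
}

console.log(ordersStalled([12, 9, 11, 10, 0, 0, 0, 0, 0])); // true
console.log(ordersStalled([0, 0, 0, 0, 0, 0, 0])); // false
```

Comparing against a baseline keeps the check quiet during hours when zero orders is normal, which is why the second call does not flag anything.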
Prometheus and Grafana
Prometheus is a time-series database that scrapes metrics from your services via HTTP. Your application exposes a /metrics endpoint, and Prometheus pulls data from it, typically every 15 to 60 seconds.
Grafana is a visualization platform that queries Prometheus (and other data sources) and renders dashboards. Together, they form one of the most widely used open-source observability stacks.
The flow:
Your App (/metrics endpoint)
    |
    v
Prometheus (scrape every 15s, store time series, evaluate alert rules)
    |
    +--> Grafana (query PromQL, render dashboards)
    |
    +--> Alertmanager (route and send alerts when thresholds are crossed)

Prometheus uses pull-based collection: it reaches out to your services. This is the opposite of push-based systems (like StatsD) where your app sends metrics to a collector. The pull model means Prometheus controls the scrape interval, can detect when a target is down (scrape fails), and does not require your application to know where to send data.
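What Prometheus receives on each scrape is plain text. The sketch below renders a counter in the Prometheus text exposition format by hand, purely to show what the /metrics response body looks like; the helper names are made up, and in a real service prom-client generates this output for you.

```javascript
// Escape a label value per the text exposition format:
// backslash, double quote, and newline must be escaped.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render one counter with any number of labeled series.
// `series` is a hypothetical shape: [{ labels: {...}, value: number }, ...]
function renderCounter(name, help, series) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of series) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

console.log(
  renderCounter('http_requests_total', 'Total number of HTTP requests', [
    { labels: { method: 'GET', path: '/users', status: 200 }, value: 1027 },
    { labels: { method: 'POST', path: '/users', status: 500 }, value: 3 },
  ])
);
```

Because the endpoint only reports current values, the application stays stateless with respect to monitoring: Prometheus decides when to read it and stores the history itself.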