System Design - Metrics and Dashboards

Logs tell you what happened. Metrics tell you how things are going. A single error log tells you one request failed. A metric tells you that the error rate jumped from 0.1% to 5% in the last ten minutes. Metrics are the backbone of alerting, dashboards, and capacity planning.

Metric types

Most metrics systems are built on three fundamental types (Prometheus also offers a fourth, the summary, which is not covered here). Understanding them is essential because choosing the wrong type leads to meaningless data.

Counter

A counter only goes up. It counts cumulative occurrences: total requests served, total errors, total bytes transferred. You never set a counter to a specific value; you only increment it. The one exception is a process restart, which resets the counter to zero.

import { Counter } from 'prom-client';

const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// In your request handler
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequests.inc({
      method: req.method,
      path: req.route?.path || 'unknown',
      status: res.statusCode,
    });
  });
  next();
});

To get the request rate (requests per second), you use a query function like Prometheus's rate():

rate(http_requests_total[5m])

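This computes the average per-second rate over the last 5 minutes. A counter's raw value is rarely useful on its own, because it depends on how long the process has been running and drops back to zero on restart. Query it through rate() or increase() instead.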
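To make the idea behind rate() concrete, here is a minimal sketch of a per-second rate computed from raw counter samples. It is a simplification, not Prometheus's actual implementation: the real function also extrapolates to the edges of the window, which is omitted here. The `Sample` shape and the `perSecondRate` name are assumptions made for illustration.

```typescript
// Hypothetical sample shape: a counter value observed at a scrape time.
interface Sample {
  timestampSeconds: number;
  value: number;
}

// Simplified per-second rate over a window of counter samples.
// A drop in value is treated as a counter reset (process restart), so the
// post-reset value is counted as the increase since the restart.
function perSecondRate(samples: Sample[]): number {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const previous = samples[i - 1].value;
    const current = samples[i].value;
    increase += current >= previous ? current - previous : current;
  }
  const elapsedSeconds =
    samples[samples.length - 1].timestampSeconds - samples[0].timestampSeconds;
  return elapsedSeconds > 0 ? increase / elapsedSeconds : 0;
}

const scrapedSamples: Sample[] = [
  { timestampSeconds: 0, value: 1000 },
  { timestampSeconds: 15, value: 1150 },
  { timestampSeconds: 30, value: 40 }, // restart: the counter began again from 0
  { timestampSeconds: 45, value: 190 },
];

// Increase is 150 + 40 + 150 = 340 over 45 seconds, about 7.56 requests/second.
const requestsPerSecond = perSecondRate(scrapedSamples);
```

The raw values (1000, 1150, 40, 190) say little by themselves, while the computed rate stays meaningful across the restart.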

Gauge

A gauge goes up and down. It measures a current value: active connections, queue depth, memory usage, temperature.

import { Gauge } from 'prom-client';

const activeConnections = new Gauge({
  name: 'db_active_connections',
  help: 'Number of active database connections',
});

// Set to current value
activeConnections.set(pool.activeCount);

// Or increment/decrement
activeConnections.inc();  // connection opened
activeConnections.dec();  // connection closed

Gauges are useful for saturation metrics: "how full is this resource right now?"

Histogram

A histogram measures the distribution of values, typically request durations or response sizes. Instead of knowing only the average, you can estimate the 50th, 95th, and 99th percentiles from the bucket counts.

import { Histogram } from 'prom-client';

const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

app.use((req, res, next) => {
  const end = requestDuration.startTimer({
    method: req.method,
    path: req.route?.path || 'unknown',
  });
  res.on('finish', () => end());
  next();
});

The buckets define the boundaries for counting. Buckets are cumulative: with the buckets above, you will know how many requests took at most 10ms, at most 50ms, at most 100ms, and so on. Choose buckets that match your SLO targets, because percentile estimates are only as precise as the bucket boundaries around them.
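The sketch below illustrates how cumulative buckets turn into percentile estimates. It uses linear interpolation inside the bucket that contains the target rank, which is the general approach behind `histogram_quantile`, but it is a simplified illustration rather than Prometheus's code. The function names and the sample observations are made up for the example.

```typescript
// Upper bounds in seconds, copied from the histogram definition above.
const bucketBounds: number[] = [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// Helper to build illustrative data: `times` copies of `value`.
function repeat(value: number, times: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < times; i++) out.push(value);
  return out;
}

// Cumulative counts: each observation is counted in every bucket whose
// upper bound is >= the value. The final entry is the +Inf bucket.
function cumulativeCounts(bounds: number[], observations: number[]): number[] {
  return bounds.concat([Infinity]).map(
    (bound) => observations.filter((value) => value <= bound).length,
  );
}

// Estimate quantile q (0..1) by interpolating linearly inside the bucket
// that contains the target rank.
function estimateQuantile(q: number, bounds: number[], counts: number[]): number {
  const total = counts[counts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < bounds.length; i++) {
    if (counts[i] >= rank) {
      const lowerBound = i === 0 ? 0 : bounds[i - 1];
      const countBelow = i === 0 ? 0 : counts[i - 1];
      const inBucket = counts[i] - countBelow;
      if (inBucket === 0) return lowerBound;
      return lowerBound + (bounds[i] - lowerBound) * ((rank - countBelow) / inBucket);
    }
  }
  // The rank falls in the +Inf bucket: report the highest finite bound.
  return bounds[bounds.length - 1];
}

// 90 fast requests (30ms), 8 slower ones (200ms), 2 very slow ones (3s).
const observedDurations = repeat(0.03, 90).concat(repeat(0.2, 8), repeat(3, 2));
const bucketCounts = cumulativeCounts(bucketBounds, observedDurations);
const p50Estimate = estimateQuantile(0.5, bucketBounds, bucketCounts);
const p99Estimate = estimateQuantile(0.99, bucketBounds, bucketCounts);
```

With this data the p99 estimate comes out at 3.75 seconds even though the slow requests actually took 3 seconds, because all the histogram knows is that they fell somewhere between 2.5 and 5 seconds. Tighter buckets around the latencies you care about give tighter estimates.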

Metric type	Direction	Use case	Example
Counter	Only up	Counting events	Total requests, total errors, bytes sent
Gauge	Up and down	Current state	Active connections, queue size, CPU %
Histogram	Distribution	Latency / size	Request duration, response size

The RED method

RED stands for Rate, Errors, Duration. It was designed by Tom Wilkie for request-driven services (APIs, web servers, microservices). If you instrument nothing else, instrument these three.

Signal	What it measures	Metric type	PromQL example
Rate	Requests per second	Counter + rate()	`rate(http_requests_total[5m])`
Errors	Failed requests per second	Counter + rate()	`rate(http_requests_total{status=~"5.."}[5m])`
Duration	Time per request	Histogram	`histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`

RED answers the three most important questions for any service: "How busy is it?" (Rate), "Is it broken?" (Errors), "Is it slow?" (Duration). These three metrics, on a single dashboard, give you an instant health check for any service.

Building a RED dashboard

A practical RED dashboard has three panels per service:

Request Rate: line chart of rate(http_requests_total[5m]), split by status code. You see normal traffic patterns and spikes.
Error Rate: line chart of sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100. Percentage of requests failing. The sum() on both sides matters: without it, PromQL divides each 5xx series only by the series with identical labels, so every result is 100%.
Latency: line chart with p50, p95, p99 lines. The gap between p50 and p99 tells you how consistent your performance is.
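The error-rate panel is a ratio of two aggregates: the rate of failing requests summed across all label combinations, divided by the rate of all requests. A minimal sketch of that arithmetic follows. The `SeriesRate` shape and the sample numbers are hypothetical, standing in for the per-series results that rate(http_requests_total[5m]) would return.

```typescript
// Hypothetical shape: one per-second rate per label combination.
interface SeriesRate {
  labels: { method: string; path: string; status: string };
  perSecond: number;
}

// Sum across all series first, then divide. Dividing series by series
// would compare each 5xx series only with itself.
function errorPercentage(series: SeriesRate[]): number {
  let totalRate = 0;
  let errorRate = 0;
  for (const entry of series) {
    totalRate += entry.perSecond;
    // Mirrors the PromQL matcher status=~"5..": a 5 followed by two characters.
    if (/^5..$/.test(entry.labels.status)) errorRate += entry.perSecond;
  }
  return totalRate > 0 ? (errorRate / totalRate) * 100 : 0;
}

const currentRates: SeriesRate[] = [
  { labels: { method: 'GET', path: '/users', status: '200' }, perSecond: 95 },
  { labels: { method: 'GET', path: '/users', status: '404' }, perSecond: 3 },
  { labels: { method: 'POST', path: '/orders', status: '500' }, perSecond: 1.5 },
  { labels: { method: 'POST', path: '/orders', status: '503' }, perSecond: 0.5 },
];

// 2 failing requests/second out of 100 requests/second, about 2%.
const failingPercent = errorPercentage(currentRates);
```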

The USE method

USE stands for Utilization, Saturation, Errors. It was designed by Brendan Gregg for infrastructure resources (CPU, memory, disk, network). While RED covers your application layer, USE covers the underlying resources.

Signal	What it measures	Question it answers
Utilization	% of resource in use	"How full is it?"
Saturation	Queued work waiting	"Is there a backlog?"
Errors	Resource errors	"Is it failing?"

Apply USE to every resource in your system:

Resource	Utilization	Saturation	Errors
CPU	% busy time	Run queue length	Machine check exceptions
Memory	% used	Swap usage	OOM kills
Disk I/O	% busy time	I/O queue depth	Read/write errors
Network	Bandwidth %	Dropped packets	Interface errors
DB connections	Active / max pool	Waiting threads	Connection timeouts
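As a small worked example, the last row of the table can be turned into three numbers for a database connection pool. This is a sketch under assumptions: the `PoolSnapshot` fields are illustrative and do not correspond to any particular driver's API.

```typescript
// Hypothetical snapshot of a connection pool; field names are illustrative.
interface PoolSnapshot {
  activeConnections: number;
  maxConnections: number;
  waitingRequests: number;
  connectionTimeouts: number;
}

interface UseSignals {
  utilizationPercent: number; // "How full is it?"
  saturation: number;         // "Is there a backlog?"
  errors: number;             // "Is it failing?"
}

function poolUseSignals(snapshot: PoolSnapshot): UseSignals {
  const utilizationPercent =
    snapshot.maxConnections > 0
      ? (snapshot.activeConnections / snapshot.maxConnections) * 100
      : 0;
  return {
    utilizationPercent,
    saturation: snapshot.waitingRequests,
    errors: snapshot.connectionTimeouts,
  };
}

// 15 of 20 connections in use, 7 callers waiting, 2 timeouts so far.
const poolSignals = poolUseSignals({
  activeConnections: 15,
  maxConnections: 20,
  waitingRequests: 7,
  connectionTimeouts: 2,
});
```

Utilization and saturation fit naturally into gauges, while the timeout count is a counter that you would view through rate().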

Custom business metrics

Technical metrics tell you if the system is healthy. Business metrics tell you if the product is working. Both matter, but teams often forget the second kind.

const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['plan', 'country'],
});

const checkoutDuration = new Histogram({
  name: 'checkout_duration_seconds',
  help: 'Time from cart to order completion',
  buckets: [5, 10, 30, 60, 120, 300],
});

const cartValue = new Histogram({
  name: 'cart_value_dollars',
  help: 'Cart value at checkout',
  buckets: [10, 25, 50, 100, 250, 500, 1000],
});

const activeSubscriptions = new Gauge({
  name: 'active_subscriptions',
  help: 'Current active subscriptions',
  labelNames: ['plan'],
});

When orders-per-minute drops to zero at 2 PM on a Tuesday, that is a bigger signal than CPU at 90%. Business metrics catch problems that technical metrics miss: a silent deployment that broke the checkout flow, a third-party payment provider outage, or a pricing change that killed conversion.

Prometheus and Grafana

Prometheus is a time-series database that scrapes metrics from your services via HTTP. Your application exposes a /metrics endpoint, and Prometheus pulls data every 15-60 seconds.

Grafana is a visualization platform that queries Prometheus (and other data sources) and renders dashboards. Together, they form one of the most widely used open-source observability stacks.

The flow:

Your App (/metrics endpoint)
    |
    v
Prometheus (scrape every 15s, store time series, evaluate alerting rules)
    |
    +--> Grafana (query PromQL, render dashboards)
    |
    +--> Alertmanager (route and send notifications when alerting rules fire)

Prometheus uses pull-based collection: it reaches out to your services. This is the opposite of push-based systems (like StatsD) where your app sends metrics to a collector. The pull model means Prometheus controls the scrape interval, can detect when a target is down (scrape fails), and does not require your application to know where to send data.
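What Prometheus pulls from the /metrics endpoint is plain text, one line per time series. The sketch below renders a counter in that text format to show what a scrape response roughly looks like. In practice a client library such as prom-client produces this output for you; `renderCounter` and `CounterSeries` are hypothetical names used only for this illustration.

```typescript
// Hypothetical in-memory representation of one labelled counter series.
interface CounterSeries {
  labels: Record<string, string>;
  value: number;
}

// Escape backslashes, double quotes, and newlines inside label values.
function escapeLabelValue(raw: string): string {
  return raw.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render HELP and TYPE comment lines followed by one line per series.
function renderCounter(metricName: string, help: string, series: CounterSeries[]): string {
  const lines: string[] = [
    `# HELP ${metricName} ${help}`,
    `# TYPE ${metricName} counter`,
  ];
  for (const entry of series) {
    const labelText = Object.keys(entry.labels)
      .map((key) => `${key}="${escapeLabelValue(entry.labels[key])}"`)
      .join(',');
    lines.push(
      labelText.length > 0
        ? `${metricName}{${labelText}} ${entry.value}`
        : `${metricName} ${entry.value}`,
    );
  }
  return lines.join('\n') + '\n';
}

const scrapeBody = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', path: '/users', status: '200' }, value: 1027 },
  { labels: { method: 'GET', path: '/users', status: '500' }, value: 3 },
]);
// scrapeBody:
// # HELP http_requests_total Total number of HTTP requests
// # TYPE http_requests_total counter
// http_requests_total{method="GET",path="/users",status="200"} 1027
// http_requests_total{method="GET",path="/users",status="500"} 3
```

Because the application only has to answer an HTTP GET with this text, it needs no knowledge of where Prometheus runs.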

You do not need to self-host Prometheus and Grafana. Managed services like Grafana Cloud, Datadog, and New Relic can ingest Prometheus-format metrics and provide hosted dashboards. The concepts are the same regardless of where the data lives, although the query language may differ: PromQL carries over unchanged only to Prometheus-compatible backends.

AI pitfall

AI-generated Grafana dashboards tend to have 20+ panels that nobody looks at. The mistake is assuming that more panels mean more observability. The best dashboards have 4-6 panels that answer the question "is the system healthy?" at a glance. Start with the four golden signals (latency, traffic, errors, saturation) and add panels only when you have a specific question that existing panels cannot answer.

Good to know

Percentile metrics (p50, p95, p99) are far more useful than averages for understanding user experience. An average response time of 200ms might hide the fact that 1% of users experience 5-second response times. Always alert on p99 latency, not average latency.
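A quick worked example shows how an average hides tail latency. The numbers are invented for illustration, and the percentile uses the simple nearest-rank definition on raw values rather than a histogram estimate.

```typescript
// Arithmetic mean of a list of values.
function mean(values: number[]): number {
  if (values.length === 0) return NaN;
  let sum = 0;
  for (const value of values) sum += value;
  return sum / values.length;
}

// Nearest-rank percentile: the value at position ceil(p% of n) in sorted order.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Made-up traffic: 980 requests at 150ms and 20 requests at 5000ms.
const latenciesMs: number[] = [];
for (let i = 0; i < 980; i++) latenciesMs.push(150);
for (let i = 0; i < 20; i++) latenciesMs.push(5000);

const averageMs = mean(latenciesMs);     // 247
const p50Ms = percentile(latenciesMs, 50); // 150
const p99Ms = percentile(latenciesMs, 99); // 5000
```

An average of 247ms looks healthy, and so does the median, yet the p99 reveals that some users wait five seconds.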

Edge case

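A scrape interval such as 15 seconds (a common setting; Prometheus's built-in default is 1 minute) limits the resolution of what you can see. If your service gets a 5-second burst of 500 errors and recovers, a counter still records every error, but rate() averages the burst over the query window, so it may show up only as a small blip. A gauge that spikes and recovers between two scrapes is missed entirely. For critical services, consider a shorter scrape interval (5 seconds) or use push-based metrics for immediate visibility.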
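How much of a short burst survives a scrape depends on the metric type. The simulation below is a sketch with made-up numbers: one minute of traffic, a 5-second error burst, and a scrape every 15 seconds. It records the same burst both as a cumulative counter and as a gauge of errors in the current second.

```typescript
const scrapeIntervalSeconds = 15;

let errorsTotal = 0;          // counter: cumulative errors since start
let errorsInCurrentSecond = 0; // gauge: errors happening right now

const counterAtScrape: number[] = [];
const gaugeAtScrape: number[] = [];

for (let second = 0; second <= 60; second++) {
  // Burst: 10 errors per second during seconds 20 through 24.
  errorsInCurrentSecond = second >= 20 && second < 25 ? 10 : 0;
  errorsTotal += errorsInCurrentSecond;

  // Scrapes happen at seconds 0, 15, 30, 45, and 60.
  if (second % scrapeIntervalSeconds === 0) {
    counterAtScrape.push(errorsTotal);
    gaugeAtScrape.push(errorsInCurrentSecond);
  }
}

// counterAtScrape is [0, 0, 50, 50, 50]: all 50 errors are still accounted for.
// gaugeAtScrape is [0, 0, 0, 0, 0]: the burst fell between scrapes and is invisible.

// Average error rate between the scrapes at 15s and 30s: 50 / 15, about 3.3 per
// second, compared with a true peak of 10 per second.
const averagedBurstRate = (counterAtScrape[2] - counterAtScrape[1]) / scrapeIntervalSeconds;
```

The counter loses the shape of the burst but not the events, which is one reason to prefer counters for anything you alert on.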
