Production Engineering - Metrics & Dashboards

Logs tell you what happened. Metrics tell you how your system is behaving right now, and how that compares to yesterday, last week, or last deployment. Together they give you a complete picture of your application's health.

If error tracking is your smoke detector, metrics are your vital signs monitor. They give you a continuous, quantitative view of system health rather than alerting only when something breaks.

The four golden signals

Google's Site Reliability Engineering book identified four signals that matter most for any user-facing service. If you can only monitor four things, monitor these:

Signal	Definition	Example metric
Latency	How long requests take	p99 API response time
Traffic	How much demand exists	Requests per second
Errors	How often requests fail	HTTP 5xx error rate
Saturation	How full your resources are	CPU %, memory %, queue depth

Start by getting these four into a dashboard before adding anything else. A dashboard with 40 metrics you don't understand is less useful than four metrics you check every day.
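To make the table concrete, here is a minimal sketch that derives one number per signal from a window of raw request data. The `RequestRecord` shape, the `goldenSignals` function, and the queue figures are all hypothetical and exist only for illustration. In a real setup Prometheus would compute these from scraped metrics rather than from an in-memory array.

```typescript
// Hypothetical record shape; the field names are illustrative, not from any library.
interface RequestRecord {
  durationMs: number;
  statusCode: number;
}

interface GoldenSignals {
  p99LatencyMs: number;           // latency
  requestsPerSecond: number;      // traffic
  errorRatePercent: number;       // errors
  queueSaturationPercent: number; // saturation
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function goldenSignals(
  records: RequestRecord[],
  windowSeconds: number,
  queueDepth: number,
  queueCapacity: number,
): GoldenSignals {
  const failures = records.filter((r) => r.statusCode >= 500).length;
  return {
    p99LatencyMs: percentile(records.map((r) => r.durationMs), 99),
    requestsPerSecond: records.length / windowSeconds,
    errorRatePercent: records.length === 0 ? 0 : (failures / records.length) * 100,
    queueSaturationPercent: (queueDepth / queueCapacity) * 100,
  };
}

// Example window: 200 requests in 10 seconds, 4 of them slow 503s,
// with 30 of 120 queue slots in use.
const sample: RequestRecord[] = Array.from({ length: 200 }, (_, i) => ({
  durationMs: i < 196 ? 50 : 900,
  statusCode: i < 196 ? 200 : 503,
}));
const signals = goldenSignals(sample, 10, 30, 120);
console.log(signals);
```

For this window the sketch reports a p99 of 900 ms, 20 requests per second, a 2% error rate, and 25% queue saturation.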

Prometheus and Grafana

How Prometheus works

Prometheus is a pull-based metrics system. Instead of your app pushing metrics somewhere, Prometheus periodically scrapes an HTTP endpoint (/metrics) that your app exposes. Your app therefore never blocks on, or fails because of, the metrics backend: if Prometheus is briefly unavailable, counters keep accumulating in your app's memory and the next successful scrape picks up the current totals.

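import express from 'express';
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

// Express app that will serve /metrics (Express is assumed throughout this lesson)
const app = express();
const register = new Registry();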

// Count how many HTTP requests you've handled
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});

// Track how long requests take (distribution)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register],
});

// Expose the metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
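It can help to see roughly what a scrape of /metrics returns. The sketch below renders a counter in the Prometheus text exposition format by hand. It is not how prom-client is implemented, and `renderCounter`, `Sample`, and `escapeLabelValue` are hypothetical names used only here.

```typescript
// One time series: a set of label values plus the current counter value.
interface Sample {
  labels: Record<string, string>;
  value: number;
}

// Label values must escape backslashes, double quotes, and newlines.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render a counter as HELP and TYPE comment lines followed by one line per series.
function renderCounter(name: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels)
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const exposition = renderCounter('http_requests_total', 'Total number of HTTP requests', [
  { labels: { method: 'GET', route: '/api/users', status_code: '200' }, value: 42 },
  { labels: { method: 'POST', route: '/api/orders', status_code: '500' }, value: 3 },
]);
console.log(exposition);
```

Each distinct combination of label values becomes its own line, which is why labels with many possible values (user IDs, raw URLs) make the scrape payload and Prometheus's storage grow quickly.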

Instrumenting your routes

Wrap your route handlers with metric-recording middleware:

app.use((req, res, next) => {
  // Start timing now; attach labels later, once the route has been matched
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    // req.route is only set after routing, so read it here rather than up front.
    // A fixed fallback keeps unmatched URLs from creating unbounded label values.
    const labels = {
      method: req.method,
      route: req.route?.path ?? 'unmatched',
    };
    httpRequestsTotal.inc({ ...labels, status_code: res.statusCode });
    end(labels); // Records the duration with the final labels
  });

  next();
});

Use a Gauge for values that go up and down (active connections, queue size). Use a Counter for things that only increase (requests processed, errors). Use a Histogram for distributions (response times, file sizes).
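The difference between the three types is easiest to see in what each one allows you to do. The classes below are a simplified sketch of those semantics, not prom-client's implementation. The names `SketchCounter`, `SketchGauge`, and `SketchHistogram` are made up for this example.

```typescript
// Counter: can only go up.
class SketchCounter {
  private value = 0;
  inc(amount = 1): void {
    if (amount < 0) throw new Error('a counter can only increase');
    this.value += amount;
  }
  get(): number {
    return this.value;
  }
}

// Gauge: a current value that can move in either direction.
class SketchGauge {
  private value = 0;
  set(value: number): void {
    this.value = value;
  }
  inc(amount = 1): void {
    this.value += amount;
  }
  dec(amount = 1): void {
    this.value -= amount;
  }
  get(): number {
    return this.value;
  }
}

// Histogram: counts observations into cumulative buckets ("le" = less than or equal).
class SketchHistogram {
  private readonly upperBounds: number[];
  private readonly bucketCounts: number[];
  private sum = 0;
  private count = 0;

  constructor(upperBounds: number[]) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.bucketCounts = this.upperBounds.map(() => 0);
  }

  observe(value: number): void {
    this.sum += value;
    this.count += 1;
    // Cumulative: an observation increments every bucket whose bound it fits under.
    this.upperBounds.forEach((bound, i) => {
      if (value <= bound) this.bucketCounts[i] += 1;
    });
  }

  snapshot(): { buckets: Array<{ le: number; count: number }>; sum: number; count: number } {
    const buckets = this.upperBounds.map((le, i) => ({ le, count: this.bucketCounts[i] }));
    buckets.push({ le: Infinity, count: this.count }); // the +Inf bucket holds everything
    return { buckets, sum: this.sum, count: this.count };
  }
}

const requests = new SketchCounter();
requests.inc();
requests.inc(5);

const connections = new SketchGauge();
connections.inc();
connections.inc();
connections.dec();

const durations = new SketchHistogram([0.1, 0.5, 1]);
[0.03, 0.2, 0.2, 3].forEach((seconds) => durations.observe(seconds));

console.log(requests.get(), connections.get(), durations.snapshot().buckets);
```

The histogram stores only bucket counts, a sum, and a total, never the individual observations. That keeps it cheap to scrape, at the cost of percentiles being estimates bounded by your bucket boundaries.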

Business metrics

Beyond infrastructure

Technical metrics tell you if your servers are healthy. Business metrics tell you if your product is working. Both matter, and a good dashboard includes both:

const ordersCompleted = new Counter({
  name: 'orders_completed_total',
  help: 'Total number of completed orders',
  labelNames: ['payment_method'],
  registers: [register],
});

const activeUsers = new Gauge({
  name: 'active_users_current',
  help: 'Number of currently active users',
  registers: [register],
});

const checkoutDuration = new Histogram({
  name: 'checkout_duration_seconds',
  help: 'Time taken to complete checkout flow',
  buckets: [5, 10, 30, 60, 120, 300],
  registers: [register],
});

// Record when a checkout completes
async function processCheckout(order: Order) {
  const end = checkoutDuration.startTimer();
  await submitOrder(order);
  end();
  ordersCompleted.inc({ payment_method: order.paymentMethod });
}

When your p99 API latency spikes, check whether completed orders also dropped. If they did, users are being impacted. If they didn't, the slow requests may belong to a background job or another path that users don't hit directly.

Building useful dashboards

Dashboard design principles

A dashboard nobody looks at is worthless. Design dashboards for the questions you actually ask during incidents:

Question	Metric to show	Visualization
Is the app up?	Uptime / error rate	Status indicator
How fast is it?	p50, p95, p99 latency	Time series graph
How busy is it?	Requests per second	Time series graph
Are users succeeding?	Error rate, conversion	Time series + percentage
Is it running out of resources?	CPU %, memory %, disk	Gauge or time series

Grafana query examples

Once Prometheus is scraping your metrics, Grafana can visualize them with PromQL:

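# Request rate (per second, averaged over 5 minutes, summed across all label combinations)
sum(rate(http_requests_total[5m]))

# Error rate as a percentage
# Both sides need sum(): without it the division matches series label by label,
# only the 5xx series survive, and the result is always 100%.
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# 99th percentile response time across all routes (the le label must be preserved)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))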
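Because a histogram stores only bucket counts, the p99 that histogram_quantile returns is an estimate. The sketch below is a simplified model of the calculation for classic bucketed histograms: find the bucket that contains the target rank, then interpolate linearly inside it. The `estimateQuantile` function and the sample counts are hypothetical. It assumes 0 < q <= 1 and ignores edge cases that Prometheus handles.

```typescript
// Cumulative bucket: `count` observations were less than or equal to `le`.
interface Bucket {
  le: number;
  count: number;
}

// Buckets must be sorted by `le` and end with the +Inf bucket (le: Infinity).
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);

  // If the rank lands in the +Inf bucket, the best available answer
  // is the largest finite bucket boundary.
  if (buckets[i].le === Infinity) {
    return i === 0 ? NaN : buckets[i - 1].le;
  }

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const countInBucket = buckets[i].count - countBelow;

  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

// Made-up counts for the bucket boundaries used earlier in this lesson.
const observed: Bucket[] = [
  { le: 0.01, count: 100 },
  { le: 0.05, count: 400 },
  { le: 0.1, count: 700 },
  { le: 0.5, count: 950 },
  { le: 1, count: 980 },
  { le: 2, count: 1000 },
  { le: 5, count: 1000 },
  { le: Infinity, count: 1000 },
];

const p99Seconds = estimateQuantile(0.99, observed);
console.log(p99Seconds);
```

With these counts the 990th observation falls in the 1 to 2 second bucket, so the estimate comes out at about 1.5 seconds. The real value could be anywhere between 1 and 2. If you plan to alert at a 2 second threshold, it helps to have bucket boundaries close to that threshold.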

Alerting on p99 latency is almost always better than alerting on average latency. Averages hide the worst-case experience that the slowest 1% of requests are getting, and your heaviest users, who send the most requests, are the most likely to land in that 1%.
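A small worked example shows how far the two numbers can diverge. The latencies below are invented: 98 fast requests and 2 very slow ones. The `mean` and `percentile` helpers are illustrative, using the nearest-rank definition of a percentile.

```typescript
function mean(values: number[]): number {
  return values.reduce((total, v) => total + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// 98 requests at 100 ms and 2 requests at 5000 ms.
const latencies: number[] = [
  ...new Array<number>(98).fill(100),
  ...new Array<number>(2).fill(5000),
];

const averageMs = mean(latencies);
const p99Ms = percentile(latencies, 99);
console.log({ averageMs, p99Ms });
```

The average is 198 ms, which looks healthy, while the p99 is 5000 ms. An alert on the average with a 500 ms threshold would stay silent while some requests take five seconds.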

Alerting strategy

Setting meaningful thresholds

Alerts that fire too often train your team to ignore them. Alerts that never fire give you false confidence. Set thresholds based on real user impact:

# Prometheus alerting rule (alert.rules.yml)
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 2 minutes"

      - alert: SlowAPI
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2 seconds for 5 minutes"

The for: 2m clause prevents alerts from firing on a single spike: the condition must hold continuously for 2 minutes before the alert fires and pages anyone.
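The behavior of the for clause can be modelled as a small state machine. The sketch below is a simplified illustration, not Prometheus's actual code. The `ForClauseAlert` class, the 30-second evaluation interval, and the error ratios are assumptions made for the example.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

class ForClauseAlert {
  private readonly forSeconds: number;
  private activeSince: number | null = null;

  constructor(forSeconds: number) {
    this.forSeconds = forSeconds;
  }

  // Called once per rule evaluation with the current time and the expression result.
  evaluate(nowSeconds: number, conditionTrue: boolean): AlertState {
    if (!conditionTrue) {
      this.activeSince = null; // any false evaluation resets the clock
      return 'inactive';
    }
    if (this.activeSince === null) {
      this.activeSince = nowSeconds;
    }
    return nowSeconds - this.activeSince >= this.forSeconds ? 'firing' : 'pending';
  }
}

// Evaluate every 30 seconds against a 5% threshold with "for: 2m".
const highErrorRate = new ForClauseAlert(120);
const errorRatios = [0.08, 0.02, 0.07, 0.09, 0.06, 0.08, 0.07];
const states = errorRatios.map((ratio, i) => highErrorRate.evaluate(i * 30, ratio > 0.05));
console.log(states);
```

The first spike only reaches pending and is cleared by the next evaluation. The alert fires on the last evaluation, once the condition has held for a full 120 seconds.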

Quick reference

Tool	Role	When you need it
Prometheus	Metric collection + storage	Core of any metrics setup
Grafana	Visualization + alerting	Building dashboards
`prom-client`	Node.js Prometheus SDK	Instrumenting your app
Datadog	All-in-one (metrics + logs + traces)	Simpler setup, higher cost
CloudWatch	AWS-native metrics	Already on AWS
