Logs tell you what happened. Metrics tell you how your system is behaving right now, and how that compares to yesterday, last week, or the last deployment. Together they give you a far fuller picture of your application's health than either does alone.

If error tracking is your smoke detector, metrics are your vital signs monitor. They give you a continuous, quantitative view of system health rather than alerting only when something breaks.

The four golden signals

Google's Site Reliability Engineering book identified four signals that matter most for any user-facing service. If you can only monitor four things, monitor these:

Signal | Definition | Example metric
Latency | How long requests take | p99 API response time
Traffic | How much demand exists | Requests per second
Errors | How often requests fail | HTTP 5xx error rate
Saturation | How full your resources are | CPU %, memory %, queue depth

Start by getting these four into a dashboard before adding anything else. A dashboard with 40 metrics you don't understand is less useful than four metrics you check every day.
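
To make the four definitions concrete, here is a minimal TypeScript sketch that derives each signal from a window of raw request samples. The RequestSample shape, the goldenSignals function, and the choice of queue depth as the saturation measure are illustrative assumptions, not part of any library. In practice Prometheus computes these for you from the metrics your app exposes.

```typescript
// Hypothetical shape for one handled request; not from any library.
interface RequestSample {
  durationMs: number;
  status: number;
}

interface GoldenSignals {
  p99LatencyMs: number;      // Latency
  requestsPerSecond: number; // Traffic
  errorRatePercent: number;  // Errors
  saturationPercent: number; // Saturation
}

// Nearest-rank percentile over an array already sorted in ascending order.
function percentile(sortedAscending: number[], q: number): number {
  if (sortedAscending.length === 0) return 0;
  const rank = Math.ceil(q * sortedAscending.length);
  return sortedAscending[Math.max(0, rank - 1)] ?? 0;
}

// Summarize one observation window. Queue depth vs. capacity is used as the
// saturation measure here purely as an example; CPU or memory would also work.
function goldenSignals(
  samples: RequestSample[],
  windowSeconds: number,
  queueDepth: number,
  queueCapacity: number,
): GoldenSignals {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const failures = samples.filter((s) => s.status >= 500).length;
  return {
    p99LatencyMs: percentile(durations, 0.99),
    requestsPerSecond: samples.length / windowSeconds,
    errorRatePercent: samples.length === 0 ? 0 : (failures / samples.length) * 100,
    saturationPercent: (queueDepth / queueCapacity) * 100,
  };
}
```

Note that only three of the four signals come from request data. Saturation has to be measured on the resource itself, which is why it is passed in separately.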


Prometheus and Grafana

How Prometheus works

Prometheus is a pull-based metrics system. Instead of your app pushing metrics somewhere, Prometheus periodically scrapes an HTTP endpoint (/metrics) that your app exposes. Your app therefore never blocks on, or fails because of, the metrics backend: if Prometheus misses a few scrapes, your counters keep accumulating in memory and the next scrape picks up the running totals.

import express from 'express';
import { Registry, Counter, Histogram, Gauge } from 'prom-client';

const app = express();
const register = new Registry();

// Count how many HTTP requests you've handled
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});

// Track how long requests take (distribution)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register],
});

// Expose the metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
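
For a sense of what a scrape of /metrics returns, the sketch below renders a counter in the Prometheus text exposition format by hand. This is only an illustration: renderCounter and escapeLabelValue are hypothetical helpers, and in a real app register.metrics() from prom-client produces this output for you.

```typescript
type Labels = Record<string, string>;

interface CounterSample {
  labels: Labels;
  value: number;
}

// Label values escape backslash, double quote, and newline.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one counter family: a HELP line, a TYPE line, then one line per
// label combination. Assumes the help text contains no backslashes or newlines.
function renderCounter(name: string, help: string, samples: CounterSample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of samples) {
    const pairs = Object.entries(s.labels)
      .map(([k, v]) => `${k}="${escapeLabelValue(v)}"`)
      .join(",");
    lines.push(pairs ? `${name}{${pairs}} ${s.value}` : `${name} ${s.value}`);
  }
  return lines.join("\n") + "\n";
}

const body = renderCounter("http_requests_total", "Total number of HTTP requests", [
  { labels: { method: "GET", route: "/users/:id", status_code: "200" }, value: 42 },
]);
// body is:
// # HELP http_requests_total Total number of HTTP requests
// # TYPE http_requests_total counter
// http_requests_total{method="GET",route="/users/:id",status_code="200"} 42
```

Each distinct combination of label values becomes its own line, and its own time series in Prometheus, which is why label values should come from a small, fixed set.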

Instrumenting your routes

Record metrics for every request with a middleware function. Register it before your routes so that it runs for all of them:

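app.use((req, res, next) => {
  // Start timing now; labels are attached when the response finishes
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    // req.route is only populated once Express has matched a route, so read
    // it here. Fall back to a fixed value rather than req.path, which would
    // create one time series per distinct URL (/users/1, /users/2, ...).
    const route = req.route?.path ?? 'unmatched';

    httpRequestsTotal.inc({
      method: req.method,
      route,
      status_code: res.statusCode,
    });
    end({ method: req.method, route }); // Records the duration in seconds
  });

  next();
});

Use a Gauge for values that go up and down (active connections, queue size). Use a Counter for things that only increase (requests processed, errors). Use a Histogram for distributions (response times, file sizes).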
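
The difference between the three types is easiest to see in what each one stores. The classes below are a simplified, in-memory sketch written for illustration. They are not the prom-client implementations, and the Sketch* names are hypothetical.

```typescript
// Counter: a running total that can only go up.
class SketchCounter {
  private value = 0;
  inc(by: number = 1): void {
    if (by < 0) throw new Error("a counter can only increase");
    this.value += by;
  }
  get(): number {
    return this.value;
  }
}

// Gauge: a current reading that can move in either direction.
class SketchGauge {
  private value = 0;
  set(v: number): void {
    this.value = v;
  }
  inc(by: number = 1): void {
    this.value += by;
  }
  dec(by: number = 1): void {
    this.value -= by;
  }
  get(): number {
    return this.value;
  }
}

// Histogram: counts observations per bucket, plus a running sum and count.
class SketchHistogram {
  private readonly upperBounds: number[];
  private readonly bucketCounts: number[];
  private sum = 0;
  private count = 0;

  constructor(upperBounds: number[]) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.bucketCounts = this.upperBounds.map(() => 0);
  }

  observe(v: number): void {
    this.sum += v;
    this.count += 1;
    // Buckets are cumulative: a value counts toward every bucket whose
    // upper bound is greater than or equal to it.
    this.upperBounds.forEach((bound, i) => {
      if (v <= bound) this.bucketCounts[i] = (this.bucketCounts[i] ?? 0) + 1;
    });
  }

  snapshot(): { buckets: { le: number; count: number }[]; sum: number; count: number } {
    return {
      buckets: this.upperBounds.map((le, i) => ({ le, count: this.bucketCounts[i] ?? 0 })),
      sum: this.sum,
      count: this.count,
    };
  }
}
```

A histogram never stores individual observations, only bucket counts. That keeps it cheap, but it also means percentiles computed from it are estimates whose accuracy depends on how well the bucket boundaries match your real latencies.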

Business metrics

Beyond infrastructure

Technical metrics tell you if your servers are healthy. Business metrics tell you if your product is working. Both matter, and a good dashboard includes both:

const ordersCompleted = new Counter({
  name: 'orders_completed_total',
  help: 'Total number of completed orders',
  labelNames: ['payment_method'],
  registers: [register],
});

const activeUsers = new Gauge({
  name: 'active_users_current',
  help: 'Number of currently active users',
  registers: [register],
});

const checkoutDuration = new Histogram({
  name: 'checkout_duration_seconds',
  help: 'Time taken to complete checkout flow',
  buckets: [5, 10, 30, 60, 120, 300],
  registers: [register],
});

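// Record when a checkout completes.
// Order and submitOrder stand in for your own application code.
async function processCheckout(order: Order) {
  const end = checkoutDuration.startTimer();
  await submitOrder(order);
  end(); // Only reached on success, so failed checkouts are not timed
  ordersCompleted.inc({ payment_method: order.paymentMethod });
}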

When your p99 API latency spikes, check whether completed orders dropped at the same time. If they did, users are being affected. If they didn't, the slow requests may belong to a background job rather than a user-facing path.
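
That triage rule can be written down as a small function. The sketch below is only an illustration of the reasoning: the Snapshot shape, the assessImpact name, and the default thresholds (latency doubling, orders falling below 80% of baseline) are all assumptions you would tune for your own product.

```typescript
type Impact = "normal" | "user-facing" | "likely-background";

// Hypothetical pair of readings taken from your dashboard.
interface Snapshot {
  p99LatencySeconds: number;
  ordersPerMinute: number;
}

// Compare the current readings against a baseline (for example, the same
// hour yesterday). The threshold factors are illustrative defaults.
function assessImpact(
  baseline: Snapshot,
  current: Snapshot,
  latencySpikeFactor: number = 2,
  orderDropFactor: number = 0.8,
): Impact {
  const latencySpiked =
    current.p99LatencySeconds > baseline.p99LatencySeconds * latencySpikeFactor;
  if (!latencySpiked) return "normal";

  const ordersDropped =
    current.ordersPerMinute < baseline.ordersPerMinute * orderDropFactor;
  return ordersDropped ? "user-facing" : "likely-background";
}
```

The point is the pairing, not the numbers: a technical signal alone says something is slow, and the business signal next to it says whether anyone noticed.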


Building useful dashboards

Dashboard design principles

A dashboard nobody looks at is worthless. Design dashboards for the questions you actually ask during incidents:

Question | Metric to show | Visualization
Is the app up? | Uptime / error rate | Status indicator
How fast is it? | p50, p95, p99 latency | Time series graph
How busy is it? | Requests per second | Time series graph
Are users succeeding? | Error rate, conversion | Time series + percentage
Is it running out of resources? | CPU %, memory %, disk | Gauge or time series

Grafana query examples

Once Prometheus is scraping your metrics, Grafana can visualize them with PromQL:

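# Request rate (per second, averaged over 5 minutes; one series per label combination)
rate(http_requests_total[5m])

# Error rate as a percentage
# sum() is needed on both sides: without it, each 5xx series is divided
# only by itself and the result is always 100
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# 99th percentile response time (one series per method and route)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

Alerting on p99 latency is almost always better than alerting on average latency. Averages hide the experience of the slowest 1% of requests, and those requests often land on your heaviest users, because the more requests someone makes, the more likely they are to hit a slow one.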
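
The queries above lean on two PromQL functions, rate() and histogram_quantile(). The sketch below shows, in simplified form, the arithmetic each one performs. It is not Prometheus's implementation: the real rate() also extrapolates to the edges of the time window, and the CounterPoint and Bucket shapes are assumptions made for this example.

```typescript
interface CounterPoint {
  timestampSeconds: number;
  value: number;
}

// Simplified rate(): total increase across the samples divided by elapsed time.
// A drop in value is treated as a counter reset (the process restarted at 0).
function perSecondRate(points: CounterPoint[]): number {
  if (points.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < points.length; i++) {
    const prev = points[i - 1]!;
    const curr = points[i]!;
    increase += curr.value >= prev.value ? curr.value - prev.value : curr.value;
  }
  const elapsed =
    points[points.length - 1]!.timestampSeconds - points[0]!.timestampSeconds;
  return elapsed > 0 ? increase / elapsed : NaN;
}

// One cumulative histogram bucket; use le = Infinity for the +Inf bucket.
interface Bucket {
  le: number;
  cumulativeCount: number;
}

// Simplified histogram_quantile(): find the bucket containing the target rank,
// then interpolate linearly between that bucket's lower and upper bounds.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) return NaN;
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.cumulativeCount === 0) return NaN;

  const rank = q * last.cumulativeCount;
  const index = sorted.findIndex((b) => b.cumulativeCount >= rank);
  const bucket = sorted[index]!;
  const previous = index > 0 ? sorted[index - 1]! : undefined;

  // If the rank falls in the +Inf bucket, the best available answer is the
  // highest finite bound.
  if (bucket.le === Infinity) return previous ? previous.le : NaN;

  const lowerBound = previous ? previous.le : 0;
  const lowerCount = previous ? previous.cumulativeCount : 0;
  const fraction = (rank - lowerCount) / (bucket.cumulativeCount - lowerCount);
  return lowerBound + (bucket.le - lowerBound) * fraction;
}
```

Because the quantile is interpolated inside a bucket, its precision is limited by bucket width. With boundaries at 1 and 2 seconds, a reported p99 of 1.5 seconds only tells you the true value lies somewhere between 1 and 2.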

Alerting strategy

Setting meaningful thresholds

Alerts that fire too often train your team to ignore them. Alerts that never fire give you false confidence. Set thresholds based on real user impact:

# Prometheus alerting rule (alert.rules.yml)
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 2 minutes"

      - alert: SlowAPI
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2 seconds for 5 minutes"

The for: 2m clause prevents alerts from firing on a single spike: the condition must hold continuously for 2 minutes before the alert fires and pages anyone. Until then, Prometheus keeps the alert in a pending state.
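
The effect of the for clause can be sketched as a small state machine that runs once per rule evaluation. This is a simplified model for illustration, not Prometheus's code, and the Evaluation shape and alertStates function are hypothetical names.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// One rule evaluation: when it ran and whether the expression matched.
interface Evaluation {
  timestampSeconds: number;
  conditionTrue: boolean;
}

// Walk the evaluations in order. The alert only fires once the condition
// has been true without interruption for at least forSeconds.
function alertStates(evaluations: Evaluation[], forSeconds: number): AlertState[] {
  const states: AlertState[] = [];
  let activeSince: number | null = null;

  for (const e of evaluations) {
    if (!e.conditionTrue) {
      // Any gap resets the clock.
      activeSince = null;
      states.push("inactive");
      continue;
    }
    if (activeSince === null) activeSince = e.timestampSeconds;
    const heldFor = e.timestampSeconds - activeSince;
    states.push(heldFor >= forSeconds ? "firing" : "pending");
  }
  return states;
}
```

A single false evaluation in the middle of an incident resets the timer, so a longer for duration trades faster paging for fewer false alarms.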


Quick reference

Tool | Role | When you need it
Prometheus | Metric collection + storage | Core of any metrics setup
Grafana | Visualization + alerting | Building dashboards
prom-client | Node.js Prometheus SDK | Instrumenting your app
Datadog | All-in-one (metrics + logs + traces) | Simpler setup, higher cost
CloudWatch | AWS-native metrics | Already on AWS