Logs tell you what happened. Metrics tell you how your system is behaving right now, and how that compares to yesterday, last week, or last deployment. Together they give you a complete picture of your application's health.
If error tracking is your smoke detector, metrics are your vital signs monitor. They give you a continuous, quantitative view of system health rather than alerting only when something breaks.
The four golden signals
Google's Site Reliability Engineering book identified four signals that matter most for any user-facing service. If you can only monitor four things, monitor these:
| Signal | Definition | Example metric |
|---|---|---|
| Latency | How long requests take | p99 API response time |
| Traffic | How much demand exists | Requests per second |
| Errors | How often requests fail | HTTP 5xx error rate |
| Saturation | How full your resources are | CPU %, memory %, queue depth |
Start by getting these four into a dashboard before adding anything else. A dashboard with 40 metrics you don't understand is less useful than four metrics you check every day.
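To make the four signals concrete, here is a minimal sketch that derives them from a window of raw request records. The `RequestSample` shape, the `summarize` function, and the nearest-rank percentile are assumptions made for illustration; in a real setup Prometheus derives these values from counters and histograms rather than from raw samples.

```typescript
// Hypothetical record of one handled request (not part of any library).
interface RequestSample {
  durationMs: number;
  statusCode: number;
}

interface GoldenSignals {
  p99LatencyMs: number;      // Latency
  requestsPerSecond: number; // Traffic
  errorRatePercent: number;  // Errors
  cpuPercent: number;        // Saturation (passed in; it cannot be derived from requests)
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function summarize(
  samples: RequestSample[],
  windowSeconds: number,
  cpuPercent: number,
): GoldenSignals {
  // Treat HTTP 5xx responses as failures, matching the table above.
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  return {
    p99LatencyMs: percentile(samples.map((s) => s.durationMs), 99),
    requestsPerSecond: samples.length / windowSeconds,
    errorRatePercent: samples.length === 0 ? 0 : (errors * 100) / samples.length,
    cpuPercent,
  };
}
```

Saturation is passed in separately because it describes the host, not the requests; that split is a design choice of this sketch, not a rule.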
Prometheus and Grafana
How Prometheus works
Prometheus is a pull-based metrics system. Instead of your app pushing metrics somewhere, Prometheus periodically scrapes an HTTP endpoint (/metrics) that your app exposes. This means your app keeps recording metrics even if Prometheus itself has a hiccup; scraping simply resumes on the next successful pull.
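The body returned by that endpoint is plain text in the Prometheus exposition format. The sketch below hand-renders a single counter to show roughly what a scrape sees. `renderCounter`, `CounterSeries`, and `escapeLabelValue` are hypothetical names used only for illustration; in a real app prom-client's `register.metrics()` produces this output for you.

```typescript
// One labelled time series of a counter: its label set and current value.
interface CounterSeries {
  labels: Record<string, string>;
  value: number;
}

// Label values must escape backslash, double quote, and newline.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Render one counter as HELP and TYPE comment lines followed by one line per series.
function renderCounter(name: string, help: string, series: CounterSeries[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of series) {
    const pairs = Object.keys(labels)
      .map((key) => `${key}="${escapeLabelValue(labels[key])}"`)
      .join(',');
    lines.push(pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`);
  }
  return lines.join('\n') + '\n';
}
```

Rendering a series with labels `method="GET"` and `status_code="200"` and a value of 42 yields a sample line of the form `http_requests_total{method="GET",status_code="200"} 42`, which is what Prometheus parses on each scrape.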
import express from 'express';
import { Registry, Counter, Histogram, Gauge } from 'prom-client';
// Assumes an Express app; reuse your existing instance if you already have one
const app = express();
const register = new Registry();
// Count how many HTTP requests you've handled
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register],
});
// Track how long requests take (distribution)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [register],
});
// Expose the metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
Instrumenting your routes
Wrap your route handlers with metric-recording middleware:
app.use((req, res, next) => {
  // Start timing now, but attach labels when the response finishes:
  // req.route is only populated after Express has matched a route.
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const route = req.route?.path || req.path;
    httpRequestsTotal.inc({
      method: req.method,
      route,
      status_code: res.statusCode,
    });
    end({ method: req.method, route }); // Records the duration
  });
  next();
});
Use a Gauge for values that go up and down (active connections, queue size). Use a Counter for things that only increase (requests processed, errors). Use a Histogram for distributions (response times, file sizes).
Business metrics
Beyond infrastructure
Technical metrics tell you if your servers are healthy. Business metrics tell you if your product is working. Both matter, and a good dashboard includes both:
const ordersCompleted = new Counter({
  name: 'orders_completed_total',
  help: 'Total number of completed orders',
  labelNames: ['payment_method'],
  registers: [register],
});
const activeUsers = new Gauge({
  name: 'active_users_current',
  help: 'Number of currently active users',
  registers: [register],
});
const checkoutDuration = new Histogram({
  name: 'checkout_duration_seconds',
  help: 'Time taken to complete checkout flow',
  buckets: [5, 10, 30, 60, 120, 300],
  registers: [register],
});
// Record when a checkout completes
async function processCheckout(order: Order) {
  const end = checkoutDuration.startTimer();
  await submitOrder(order);
  end();
  ordersCompleted.inc({ payment_method: order.paymentMethod });
}
When your p99 API latency spikes, check whether completed orders also dropped. If they did, users are being impacted. If they didn't, the latency may come from a background job rather than a user-facing path.
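That triage rule can be written down as a small decision function. This is only a sketch: the `Snapshot` shape, the function name, and both thresholds (a spike means p99 at least doubles, a drop means orders fall below 80% of baseline) are assumptions chosen for illustration, not values from any tool.

```typescript
// Hypothetical snapshot of the two metrics being compared.
interface Snapshot {
  p99LatencySeconds: number;
  ordersPerMinute: number;
}

type Assessment = 'no-spike' | 'user-impacting' | 'likely-background';

// Assumed thresholds; tune them against your own traffic.
const LATENCY_SPIKE_FACTOR = 2;
const ORDER_DROP_FACTOR = 0.8;

function assessLatencySpike(baseline: Snapshot, current: Snapshot): Assessment {
  const spiked =
    current.p99LatencySeconds >= baseline.p99LatencySeconds * LATENCY_SPIKE_FACTOR;
  if (!spiked) return 'no-spike';
  // A latency spike only counts as user-impacting if the business metric moved too.
  const ordersDropped =
    current.ordersPerMinute < baseline.ordersPerMinute * ORDER_DROP_FACTOR;
  return ordersDropped ? 'user-impacting' : 'likely-background';
}
```

The point of the sketch is the ordering of the checks: the technical metric raises the question, and the business metric answers it.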
Building useful dashboards
Dashboard design principles
A dashboard nobody looks at is worthless. Design dashboards for the questions you actually ask during incidents:
| Question | Metric to show | Visualization |
|---|---|---|
| Is the app up? | Uptime / error rate | Status indicator |
| How fast is it? | p50, p95, p99 latency | Time series graph |
| How busy is it? | Requests per second | Time series graph |
| Are users succeeding? | Error rate, conversion | Time series + percentage |
| Is it running out of resources? | CPU %, memory %, disk | Gauge or time series |
Grafana query examples
Once Prometheus is scraping your metrics, Grafana can visualize them with PromQL:
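# Request rate (per second, averaged over 5 minutes, one series per label combination;
# wrap in sum() for a single total)
rate(http_requests_total[5m])
# Error rate as a percentage
# (sum() is required: without it each 5xx series is divided by itself and yields 100)
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
# 99th percentile response time (per method and route)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
Alerting strategy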
Setting meaningful thresholds
Alerts that fire too often train your team to ignore them. Alerts that never fire give you false confidence. Set thresholds based on real user impact:
# Prometheus alerting rule (alert.rules.yml)
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        # sum() on both sides gives one overall ratio instead of matching series one-to-one
        expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 2 minutes"
      - alert: SlowAPI
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2 seconds for 5 minutes"
The for: 2m clause prevents alerts from firing on a single spike: the condition must be true continuously for 2 minutes before it pages anyone.
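The effect of the for clause can be modelled as a small state machine. This is a simplified sketch, not Prometheus's actual implementation: `createAlertRule` is a hypothetical helper, and it assumes the caller supplies the evaluation time and the result of the expression on each tick.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// Returns an evaluate() function for one rule with the given `for` duration.
function createAlertRule(forSeconds: number) {
  // Timestamp at which the condition first became true, or null if it is currently false.
  let activeSince: number | null = null;

  return function evaluate(nowSeconds: number, conditionTrue: boolean): AlertState {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the clock
      return 'inactive';
    }
    if (activeSince === null) activeSince = nowSeconds;
    // The alert fires only once the condition has held for the full duration.
    return nowSeconds - activeSince >= forSeconds ? 'firing' : 'pending';
  };
}
```

With `createAlertRule(120)`, a condition that is true at 0 s and 60 s stays pending, becomes firing at 120 s, and drops back to inactive the moment one evaluation comes back false.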
Quick reference
| Tool | Role | When you need it |
|---|---|---|
| Prometheus | Metric collection + storage | Core of any metrics setup |
| Grafana | Visualization + alerting | Building dashboards |
| prom-client | Node.js Prometheus SDK | Instrumenting your app |
| Datadog | All-in-one (metrics + logs + traces) | Simpler setup, higher cost |
| CloudWatch | AWS-native metrics | Already on AWS |