A bare api_requests_total counter hides which specific partner is failing. Always label metrics with partner, endpoint, and status; the extra cardinality is worth it for integration debugging.

Logs tell you what happened on a single request. Metrics tell you what is happening across all requests right now. For integrations, metrics answer the questions that matter most: "Is the Stripe integration slow today?", "How many Twilio errors did we have this hour?", "Which partner has the worst availability?"
This lesson covers integration-specific metrics, not general application metrics: the counters and histograms you need to monitor external API calls, per-partner health, and integration reliability.
The RED method for integrations
The RED method gives you three numbers for every integration:
- Rate: How many requests per second are you making to this partner?
- Errors: What percentage of those requests are failing?
- Duration: How long are those requests taking?
If you monitor only these three things per partner, you will catch 90% of integration issues before your users report them.
import { Registry, Counter, Histogram } from 'prom-client';

const register = new Registry();

// Rate + Errors: a single counter with status labels
const integrationRequests = new Counter({
  name: 'integration_requests_total',
  help: 'Total number of external API requests',
  labelNames: ['partner', 'endpoint', 'method', 'status'],
  registers: [register],
});

// Duration: histogram for response time distribution
const integrationDuration = new Histogram({
  name: 'integration_request_duration_seconds',
  help: 'External API request duration in seconds',
  labelNames: ['partner', 'endpoint', 'method'],
  // Buckets tuned for typical API latencies
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

Instrumenting an integration call
Wrap every external API call with metric recording:
async function callExternalAPI(
  partner: string,
  endpoint: string,
  method: string,
  options: RequestInit
): Promise<Response> {
  const timer = integrationDuration.startTimer({
    partner,
    endpoint,
    method,
  });
  try {
    const response = await fetch(endpoint, { method, ...options });
    // Record the request with a bucketed status label (error_4xx, error_5xx)
    // rather than one label value per status code
    const status = response.ok
      ? 'success'
      : `error_${Math.floor(response.status / 100)}xx`;
    integrationRequests.inc({ partner, endpoint, method, status });
    timer(); // Records duration
    return response;
  } catch (error) {
    // Network errors, timeouts, DNS failures
    integrationRequests.inc({
      partner,
      endpoint,
      method,
      status: 'network_error',
    });
    timer();
    throw error;
  }
}

Now every external call automatically tracks rate, errors, and duration. You do not need to remember to add metrics; the wrapper handles it.
Essential integration metrics
Beyond RED, there are integration-specific metrics that catch problems logs alone cannot:
| Metric | Type | Labels | What it catches |
|---|---|---|---|
| integration_requests_total | Counter | partner, endpoint, status | Traffic volume, error rates per partner |
| integration_request_duration_seconds | Histogram | partner, endpoint | Latency spikes, degraded partner APIs |
| integration_retries_total | Counter | partner, attempt_number | Partners requiring excessive retries |
| integration_circuit_breaker_state | Gauge | partner, state | Partners in degraded/open circuit state |
| webhook_received_total | Counter | partner, event_type | Webhook traffic and event distribution |
| webhook_processing_duration_seconds | Histogram | partner, event_type | Slow webhook processing |
| webhook_processing_errors_total | Counter | partner, event_type, error_type | Failed webhook processing |
| integration_auth_failures_total | Counter | partner | Expired tokens, invalid credentials |
| rate_limit_hits_total | Counter | partner | Approaching or exceeding partner rate limits |
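The retry counter in this table is easiest to keep accurate when the retry loop itself reports attempts. Below is a minimal sketch with the metric hook injected as a callback, so the helper stays decoupled from prom-client; `withRetries` and `recordRetry` are hypothetical names, not part of any library.

```typescript
// Generic retry wrapper that reports every retry attempt to a metrics hook.
async function withRetries<T>(
  partner: string,
  fn: () => Promise<T>,
  recordRetry: (partner: string, attempt: number) => void,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only count an attempt as a retry if another attempt will follow
      if (attempt < maxAttempts) recordRetry(partner, attempt);
    }
  }
  throw lastError;
}
```

A real hook might be `(partner, attempt) => integrationRetries.inc({ partner, attempt_number: String(attempt) })`; injecting it keeps the retry logic unit-testable without a metrics registry.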
Per-partner vs. global metrics
This is the key difference between integration observability and general application monitoring. A global error rate of 2% might look fine, but if 100% of your Stripe calls are failing and 0% of everything else is, that 2% global rate hides a complete payment outage.
// Bad: global metric hides partner-specific issues
const apiErrors = new Counter({
  name: 'api_errors_total',
  help: 'Total API errors',
  registers: [register],
});

// Good: per-partner metric reveals exactly who is failing
const integrationErrors = new Counter({
  name: 'integration_errors_total',
  help: 'Integration errors by partner',
  labelNames: ['partner', 'error_type'],
  registers: [register],
});

Always label by partner. The extra cardinality is worth it; integration debugging without partner-level granularity is guesswork.
Prometheus histogram buckets
Default histogram buckets are designed for general HTTP servers. Integration calls have different latency profiles: an internal microservice responds in 5ms, but a payment API might take 2 seconds. Choose buckets that match your SLO targets:
// For fast internal APIs (< 100ms expected)
const internalBuckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5];
// For typical external APIs (100ms - 2s expected)
const externalBuckets = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];
// For slow external APIs (payment, KYC verification)
const slowExternalBuckets = [0.5, 1, 2.5, 5, 10, 15, 30, 60];

If your SLO says "99% of Stripe calls must complete in under 2 seconds", you need buckets around that 2-second mark to calculate the actual percentile accurately. Buckets that jump from 1s to 5s will not give you useful data at the 2-second boundary.
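One way to catch a bucket/SLO mismatch early is a startup assertion that each SLO threshold appears as an exact bucket boundary. A hypothetical helper:

```typescript
// Fail fast at startup if an SLO threshold has no matching bucket boundary,
// since histogram_quantile can only interpolate between existing buckets.
function assertSloBuckets(buckets: number[], sloSeconds: number[]): void {
  for (const slo of sloSeconds) {
    if (!buckets.includes(slo)) {
      throw new Error(
        `No histogram bucket at SLO boundary ${slo}s; add it to the bucket list`,
      );
    }
  }
}
```

With the externalBuckets above and a 2-second SLO, this check fails, telling you to add a 2 to the list before the histogram ships.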
Dashboard design for integration health
A good integration dashboard is not a wall of graphs; it is a decision-making tool. Design it to answer specific questions in a specific order.
The integration health dashboard layout
| Row | Panel | PromQL | Purpose |
|---|---|---|---|
| 1 | Partner status overview | One panel per partner showing error rate | Instant: who is healthy, who is not |
| 2 | Request rate by partner | sum by (partner) (rate(integration_requests_total[5m])) | Traffic patterns, detect drops |
| 3 | Error rate by partner | sum by (partner) (rate(integration_requests_total{status=~"error.*"}[5m])) / sum by (partner) (rate(integration_requests_total[5m])) | Which partner is failing |
| 4 | p50/p95/p99 latency | histogram_quantile(0.99, sum by (partner, le) (rate(integration_request_duration_seconds_bucket[5m]))) | Is a partner getting slower |
| 5 | Retry rate | sum by (partner) (rate(integration_retries_total[5m])) | Early warning of degradation |
| 6 | Circuit breaker states | integration_circuit_breaker_state | Partners in fallback mode |
PromQL examples for integration dashboards
# Error rate per partner (as a percentage)
100 * (
  sum by (partner) (rate(integration_requests_total{status=~"error.*"}[5m]))
  / sum by (partner) (rate(integration_requests_total[5m]))
)

# p99 latency per partner
histogram_quantile(0.99,
  sum by (partner, le) (rate(integration_request_duration_seconds_bucket[5m]))
)

# Requests per second to each partner
sum by (partner) (rate(integration_requests_total[5m]))

# Retry ratio (retries / total requests)
sum by (partner) (rate(integration_retries_total[5m]))
/ sum by (partner) (rate(integration_requests_total[5m]))

Avoiding high-cardinality explosions
Prometheus stores every unique combination of labels as a separate time series. Adding a userId label to your integration metrics creates one time series per user per partner per endpoint: millions of series that will crash your Prometheus server.
| Safe labels | Dangerous labels |
|---|---|
| partner (tens of values) | userId (millions of values) |
| endpoint (tens of values) | requestId (unbounded) |
| method (GET, POST, etc.) | correlationId (unbounded) |
| status (success, error_4xx, error_5xx) | statusCode (specific codes like 401, 403, 404...) |
| region (us-east, eu-west) | ip_address (unbounded) |
Group status codes into buckets (success, client_error, server_error, timeout) rather than tracking each individual code. Use logs for per-request detail; metrics are for aggregates.
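A small helper keeps that grouping consistent everywhere a status label is recorded. A sketch (the timeout case assumes your HTTP client surfaces timeouts separately from status codes):

```typescript
type StatusBucket = 'success' | 'client_error' | 'server_error' | 'timeout';

// Collapse raw HTTP status codes into four low-cardinality label values.
function statusBucket(code: number | 'timeout'): StatusBucket {
  if (code === 'timeout') return 'timeout';
  if (code >= 500) return 'server_error';
  if (code >= 400) return 'client_error';
  return 'success';
}
```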
Exposing the metrics endpoint
Wire up the Prometheus metrics endpoint so your Prometheus server can scrape it:
import express from 'express';
import { register } from './metrics';

const app = express();

// Prometheus scrape endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

In production, protect this endpoint; it should not be publicly accessible. Use a separate internal port or restrict access by IP.
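One lightweight option is an allowlist middleware in front of the route. A sketch with a hypothetical hardcoded allowlist (in practice you would load the scraper IPs from config), typed structurally so the guard logic is testable without Express:

```typescript
// Hypothetical allowlist of Prometheus scraper addresses.
const ALLOWED_SCRAPERS = new Set(['127.0.0.1', '::1']);

// Express-style middleware: let allowlisted scrapers through, reject the rest.
function metricsGuard(
  req: { ip?: string },
  res: { status: (code: number) => { end: () => void } },
  next: () => void,
): void {
  if (req.ip && ALLOWED_SCRAPERS.has(req.ip)) {
    next();
  } else {
    res.status(403).end();
  }
}
```

Mount it before the handler, e.g. `app.get('/metrics', metricsGuard, metricsHandler)`.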
Quick reference
| Concept | Tool | When to use |
|---|---|---|
| Counter | prom-client Counter | Things that only go up: requests, errors |
| Histogram | prom-client Histogram | Distributions: latency, payload sizes |
| Gauge | prom-client Gauge | Values that go up and down: queue depth, active connections |
| RED method | Rate, Errors, Duration | Default metrics for every integration |
| Labels | Prometheus labels | Group by partner, endpoint, status |
| Buckets | Histogram buckets | Match to your SLO thresholds |
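The Gauge row raises one practical detail for the circuit breaker metric: gauges hold numbers, so string states need a stable numeric encoding. A sketch of one common convention (the mapping itself is a choice, not a prom-client requirement):

```typescript
type BreakerState = 'closed' | 'half_open' | 'open';

// Stable numeric encoding so dashboards can threshold on the gauge value.
const BREAKER_STATE_VALUES: Record<BreakerState, number> = {
  closed: 0,    // healthy: requests flow normally
  half_open: 1, // probing the partner after a cooldown
  open: 2,      // failing fast: fallback in use
};

function breakerGaugeValue(state: BreakerState): number {
  return BREAKER_STATE_VALUES[state];
}
```

A hypothetical call site would be `circuitBreakerState.set({ partner: 'stripe' }, breakerGaugeValue('open'))`, where circuitBreakerState is the integration_circuit_breaker_state gauge from the table above.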