AI pitfall
AI-generated metrics code almost always creates counters without labels. A single api_requests_total counter hides which specific partner is failing. Always label metrics with partner, endpoint, and status; the extra cardinality is worth it for integration debugging.

Logs tell you what happened on a single request. Metrics tell you what is happening across all requests right now. For integrations, metrics answer the questions that matter most: "Is the Stripe integration slow today?", "How many Twilio errors did we have this hour?", "Which partner has the worst availability?"

This lesson covers integration-specific metrics, not general application metrics: the counters and histograms you need to monitor external API calls, per-partner health, and integration reliability.

Good to know
If you monitor only three things per partner (Rate, Errors, and Duration: the RED method), you will catch 90% of integration issues before your users report them. Everything else is refinement on top of these three signals.

The RED method for integrations

The RED method gives you three numbers for every integration:

  • Rate: How many requests per second are you making to this partner?
  • Errors: What percentage of those requests are failing?
  • Duration: How long are those requests taking?

If you monitor only these three things per partner, you will catch 90% of integration issues before your users report them.

import { Registry, Counter, Histogram } from 'prom-client';

const register = new Registry();

// Rate + Errors: a single counter with status labels
const integrationRequests = new Counter({
  name: 'integration_requests_total',
  help: 'Total number of external API requests',
  labelNames: ['partner', 'endpoint', 'method', 'status'],
  registers: [register],
});

// Duration: histogram for response time distribution
const integrationDuration = new Histogram({
  name: 'integration_request_duration_seconds',
  help: 'External API request duration in seconds',
  labelNames: ['partner', 'endpoint', 'method'],
  // Buckets tuned for typical API latencies
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

Instrumenting an integration call

Wrap every external API call with metric recording:

async function callExternalAPI(
  partner: string,
  endpoint: string,
  method: string,
  options: RequestInit
): Promise<Response> {
  const timer = integrationDuration.startTimer({
    partner,
    endpoint,
    method,
  });

  try {
    const response = await fetch(endpoint, { method, ...options });

    // Record the request with a bounded status category
    // (grouped rather than per-code, to keep label cardinality low)
    const status = response.ok
      ? 'success'
      : response.status >= 500
        ? 'error_5xx'
        : 'error_4xx';
    integrationRequests.inc({ partner, endpoint, method, status });

    timer(); // Records duration
    return response;
  } catch (error) {
    // Network errors, timeouts, DNS failures
    integrationRequests.inc({
      partner,
      endpoint,
      method,
      status: 'network_error',
    });
    timer();
    throw error;
  }
}

Now every external call automatically tracks rate, errors, and duration. You do not need to remember to add metrics; the wrapper handles it.

02

Essential integration metrics

Beyond RED, there are integration-specific metrics that catch problems logs alone cannot:

| Metric | Type | Labels | What it catches |
| --- | --- | --- | --- |
| integration_requests_total | Counter | partner, endpoint, status | Traffic volume, error rates per partner |
| integration_request_duration_seconds | Histogram | partner, endpoint | Latency spikes, degraded partner APIs |
| integration_retries_total | Counter | partner, attempt_number | Partners requiring excessive retries |
| integration_circuit_breaker_state | Gauge | partner, state | Partners in degraded/open circuit state |
| webhook_received_total | Counter | partner, event_type | Webhook traffic and event distribution |
| webhook_processing_duration_seconds | Histogram | partner, event_type | Slow webhook processing |
| webhook_processing_errors_total | Counter | partner, event_type, error_type | Failed webhook processing |
| integration_auth_failures_total | Counter | partner | Expired tokens, invalid credentials |
| rate_limit_hits_total | Counter | partner | Approaching or exceeding partner rate limits |
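A gauge can only store a number, so circuit breaker states need a numeric encoding before they are set on integration_circuit_breaker_state. A minimal sketch of one possible mapping (the state names and numeric values here are an assumption, not a standard):

```typescript
// Hypothetical numeric encoding for circuit breaker states.
// Gauges hold numbers, so each state maps to a fixed value.
type CircuitState = 'closed' | 'half_open' | 'open';

const CIRCUIT_STATE_VALUES: Record<CircuitState, number> = {
  closed: 0,    // healthy: requests flow normally
  half_open: 1, // probing: limited requests allowed through
  open: 2,      // failing: requests short-circuit to fallback
};

function circuitStateValue(state: CircuitState): number {
  return CIRCUIT_STATE_VALUES[state];
}

// With prom-client, this value would be set on the gauge, e.g.:
// circuitBreakerState.set({ partner: 'stripe', state }, circuitStateValue(state));
```

A fixed encoding like this makes dashboard thresholds trivial: any series with value 2 is a partner in fallback mode.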

Per-partner vs. global metrics

This is the key difference between integration observability and general application monitoring. A global error rate of 2% might look fine, but if 100% of your Stripe calls are failing and 0% of everything else is, that 2% global rate hides a complete payment outage.

// Bad: global metric hides partner-specific issues
const apiErrors = new Counter({
  name: 'api_errors_total',
  help: 'Total API errors',
  registers: [register],
});

// Good: per-partner metric reveals exactly who is failing
const integrationErrors = new Counter({
  name: 'integration_errors_total',
  help: 'Integration errors by partner',
  labelNames: ['partner', 'error_type'],
  registers: [register],
});

Always label by partner. The extra cardinality is worth it; integration debugging without partner-level granularity is guesswork.

Edge case
A global error rate of 2% might look healthy, but if 100% of your Stripe calls are failing and 0% of everything else is, that 2% hides a complete payment outage. Always track and alert on per-partner error rates, not just global aggregates.
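The arithmetic behind this callout is easy to sketch. With hypothetical request counts for three partners, the global rate looks healthy while one partner is completely down:

```typescript
// Hypothetical per-partner request and error counts.
const counts = [
  { partner: 'stripe', requests: 200, errors: 200 }, // 100% failing
  { partner: 'twilio', requests: 4800, errors: 0 },
  { partner: 'sendgrid', requests: 5000, errors: 0 },
];

// Global error rate: total errors / total requests.
const totalRequests = counts.reduce((sum, c) => sum + c.requests, 0);
const totalErrors = counts.reduce((sum, c) => sum + c.errors, 0);
const globalRate = (100 * totalErrors) / totalRequests; // 2%

// Per-partner error rates reveal the outage the global number hides.
const perPartner = counts.map((c) => ({
  partner: c.partner,
  errorRate: (100 * c.errors) / c.requests, // stripe: 100%
}));
```

200 failing requests out of 10,000 total is a 2% global rate, yet every single payment call is failing. Only the per-partner breakdown surfaces that.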
03

Prometheus histogram buckets

Default histogram buckets are designed for general HTTP servers. Integration calls have different latency profiles: an internal microservice responds in 5ms, but a payment API might take 2 seconds. Choose buckets that match your SLO targets:

// For fast internal APIs (< 100ms expected)
const internalBuckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5];

// For typical external APIs (100ms - 2s expected)
const externalBuckets = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// For slow external APIs (payment, KYC verification)
const slowExternalBuckets = [0.5, 1, 2.5, 5, 10, 15, 30, 60];

If your SLO says "99% of Stripe calls must complete in under 2 seconds", you need buckets around that 2-second mark to calculate the actual percentile accurately. Buckets that jump from 1s to 5s will not give you useful data at the 2-second boundary.
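To see why bucket placement matters, here is a simplified model of the linear interpolation that histogram_quantile performs over cumulative bucket counts (a sketch of the idea, not the actual Prometheus implementation):

```typescript
// Simplified model of histogram_quantile: estimate a quantile from
// cumulative bucket counts via linear interpolation within a bucket.
function estimateQuantile(
  q: number,
  upperBounds: number[],      // bucket upper bounds, ascending
  cumulativeCounts: number[], // observations <= each bound
): number {
  const total = cumulativeCounts[cumulativeCounts.length - 1];
  const rank = q * total;
  for (let i = 0; i < upperBounds.length; i++) {
    if (cumulativeCounts[i] >= rank) {
      const lowerBound = i === 0 ? 0 : upperBounds[i - 1];
      const countBelow = i === 0 ? 0 : cumulativeCounts[i - 1];
      const countInBucket = cumulativeCounts[i] - countBelow;
      // Interpolate linearly between the bucket's bounds.
      return (
        lowerBound +
        ((upperBounds[i] - lowerBound) * (rank - countBelow)) / countInBucket
      );
    }
  }
  return upperBounds[upperBounds.length - 1];
}

// Buckets that jump from 1s to 5s smear a true p99 near 2s across
// the whole 1s-5s bucket, producing a coarse estimate.
const coarse = estimateQuantile(0.99, [1, 5, 10], [90, 100, 100]);
// Adding a 2.5s bound resolves the same distribution much closer to 2s.
const fine = estimateQuantile(0.99, [1, 2.5, 5, 10], [90, 100, 100, 100]);
```

The quantile estimate can never be more precise than the bucket it lands in, which is why you want a bucket boundary near your SLO threshold.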

04

Dashboard design for integration health

A good integration dashboard is not a wall of graphs; it is a decision-making tool. Design it to answer specific questions in a specific order.

The integration health dashboard layout

| Row | Panel | PromQL | Purpose |
| --- | --- | --- | --- |
| 1 | Partner status overview | One panel per partner showing error rate | Instant: who is healthy, who is not |
| 2 | Request rate by partner | sum by (partner) (rate(integration_requests_total[5m])) | Traffic patterns, detect drops |
| 3 | Error rate by partner | rate(integration_requests_total{status=~"error.*"}[5m]) / rate(integration_requests_total[5m]) | Which partner is failing |
| 4 | p50/p95/p99 latency | histogram_quantile(0.99, rate(integration_request_duration_seconds_bucket[5m])) | Is a partner getting slower |
| 5 | Retry rate | sum by (partner) (rate(integration_retries_total[5m])) | Early warning of degradation |
| 6 | Circuit breaker states | integration_circuit_breaker_state | Partners in fallback mode |

PromQL examples for integration dashboards

# Error rate per partner (as percentage)
100 * (
  rate(integration_requests_total{status=~"error.*"}[5m])
  / rate(integration_requests_total[5m])
)

# p99 latency per partner
histogram_quantile(0.99,
  rate(integration_request_duration_seconds_bucket[5m])
)

# Requests per second to each partner
sum by (partner) (rate(integration_requests_total[5m]))

# Retry ratio (retries / total requests)
rate(integration_retries_total[5m])
  / rate(integration_requests_total[5m])

Avoiding high-cardinality explosions

Prometheus stores every unique combination of labels as a separate time series. Adding a userId label to your integration metrics creates one time series per user per partner per endpoint: millions of series that will crash your Prometheus server.

| Safe labels | Dangerous labels |
| --- | --- |
| partner (tens of values) | userId (millions of values) |
| endpoint (tens of values) | requestId (unbounded) |
| method (GET, POST, etc.) | correlationId (unbounded) |
| status (success, error_4xx, error_5xx) | statusCode (specific codes like 401, 403, 404...) |
| region (us-east, eu-west) | ip_address (unbounded) |

Group status codes into buckets (success, client_error, server_error, timeout) rather than tracking each individual code. Use logs for the per-request detail; metrics are for aggregates.
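A small helper keeps the status label bounded no matter which specific codes a partner returns; a minimal sketch using the bucket names above:

```typescript
// Map an HTTP status code (or a non-HTTP failure) to a bounded label
// value, so the status label's cardinality stays fixed.
function statusLabel(statusCode: number | null, timedOut = false): string {
  if (timedOut) return 'timeout';
  if (statusCode === null) return 'network_error'; // DNS failure, reset, etc.
  if (statusCode >= 500) return 'server_error';
  if (statusCode >= 400) return 'client_error';
  return 'success';
}
```

Used inside the instrumentation wrapper, this replaces ad-hoc label construction with five possible values per partner, while the full status code still goes to the logs.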

05

Exposing the metrics endpoint

Wire up the Prometheus metrics endpoint so your Prometheus server can scrape it:

import express from 'express';
import { register } from './metrics';

const app = express();

// Prometheus scrape endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

In production, protect this endpoint; it should not be publicly accessible. Use a separate internal port or restrict access by IP.
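One way to restrict by IP is a small guard in front of the route. A sketch with a hypothetical allowlist, simplified to prefix matching rather than real CIDR parsing (in practice, prefer a proper CIDR library or enforce this at the load balancer or network layer):

```typescript
// Hypothetical allowlist of internal network prefixes.
const INTERNAL_PREFIXES = ['10.', '192.168.', '127.'];

function isInternalAddress(ip: string): boolean {
  // Normalize IPv4-mapped IPv6 addresses like ::ffff:10.0.0.5
  const normalized = ip.replace(/^::ffff:/, '');
  return INTERNAL_PREFIXES.some((prefix) => normalized.startsWith(prefix));
}

// Express-style guard in front of the scrape endpoint:
// app.get('/metrics', (req, res, next) => {
//   if (!isInternalAddress(req.socket.remoteAddress ?? '')) {
//     return res.status(403).end();
//   }
//   next();
// }, metricsHandler);
```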

06

Quick reference

| Concept | Tool | When to use |
| --- | --- | --- |
| Counter | prom-client Counter | Things that only go up: requests, errors |
| Histogram | prom-client Histogram | Distributions: latency, payload sizes |
| Gauge | prom-client Gauge | Values that go up and down: queue depth, active connections |
| RED method | Rate, Errors, Duration | Default metrics for every integration |
| Labels | Prometheus labels | Group by partner, endpoint, status |
| Buckets | Histogram buckets | Match to your SLO thresholds |