A bare api_requests_total counter hides which specific partner is failing. Always label metrics with partner, endpoint, and status; the extra cardinality is worth it for integration debugging.

Logs tell you what happened on a single request. Metrics tell you what is happening across all requests right now. For integrations, metrics answer the questions that matter most: "Is the Stripe integration slow today?", "How many Twilio errors did we have this hour?", "Which partner has the worst availability?"
This lesson covers integration-specific metrics, not general application metrics: the counters and histograms you need to monitor external API calls, per-partner health, and integration reliability.
The RED method for integrations
The RED method gives you three numbers for every integration:
- Rate: How many requests per second are you making to this partner?
- Errors: What percentage of those requests are failing?
- Duration: How long are those requests taking?
If you monitor only these three things per partner, you will catch 90% of integration issues before your users report them.
import { Registry, Counter, Histogram } from 'prom-client';

const register = new Registry();

// Rate + Errors: a single counter with status labels
const integrationRequests = new Counter({
  name: 'integration_requests_total',
  help: 'Total number of external API requests',
  labelNames: ['partner', 'endpoint', 'method', 'status'],
  registers: [register],
});

// Duration: histogram for response time distribution
const integrationDuration = new Histogram({
  name: 'integration_request_duration_seconds',
  help: 'External API request duration in seconds',
  labelNames: ['partner', 'endpoint', 'method'],
  // Buckets tuned for typical API latencies
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

Instrumenting an integration call
Wrap every external API call with metric recording:
async function callExternalAPI(
  partner: string,
  endpoint: string,
  method: string,
  options: RequestInit
): Promise<Response> {
  const timer = integrationDuration.startTimer({
    partner,
    endpoint,
    method,
  });
  try {
    const response = await fetch(endpoint, { method, ...options });
    // Record the request with a bucketed status label (error_4xx, error_5xx)
    // rather than one label value per status code
    const status = response.ok
      ? 'success'
      : `error_${Math.floor(response.status / 100)}xx`;
    integrationRequests.inc({ partner, endpoint, method, status });
    timer(); // Records duration
    return response;
  } catch (error) {
    // Network errors, timeouts, DNS failures
    integrationRequests.inc({
      partner,
      endpoint,
      method,
      status: 'network_error',
    });
    timer();
    throw error;
  }
}

Now every external call automatically tracks rate, errors, and duration. You do not need to remember to add metrics; the wrapper handles it.
Essential integration metrics
Beyond RED, there are integration-specific metrics that catch problems logs alone cannot:
| Metric | Type | Labels | What it catches |
|---|---|---|---|
| integration_requests_total | Counter | partner, endpoint, status | Traffic volume, error rates per partner |
| integration_request_duration_seconds | Histogram | partner, endpoint | Latency spikes, degraded partner APIs |
| integration_retries_total | Counter | partner, attempt_number | Partners requiring excessive retries |
| integration_circuit_breaker_state | Gauge | partner, state | Partners in degraded/open circuit state |
| webhook_received_total | Counter | partner, event_type | Webhook traffic and event distribution |
| webhook_processing_duration_seconds | Histogram | partner, event_type | Slow webhook processing |
| webhook_processing_errors_total | Counter | partner, event_type, error_type | Failed webhook processing |
| integration_auth_failures_total | Counter | partner | Expired tokens, invalid credentials |
| rate_limit_hits_total | Counter | partner | Approaching or exceeding partner rate limits |
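The retry counter in this table is easiest to keep accurate when the retry loop itself reports attempts. Below is a minimal sketch with the metric hook injected as a callback, so the helper stays decoupled from prom-client; `withRetries` and `recordRetry` are hypothetical names, not part of any library.

```typescript
// Generic retry wrapper that reports every retry attempt to a metrics hook.
async function withRetries<T>(
  partner: string,
  fn: () => Promise<T>,
  recordRetry: (partner: string, attempt: number) => void,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Only count an attempt as a retry if another attempt will follow
      if (attempt < maxAttempts) recordRetry(partner, attempt);
    }
  }
  throw lastError;
}
```

A real hook might be `(partner, attempt) => integrationRetries.inc({ partner, attempt_number: String(attempt) })`; injecting it keeps the retry logic unit-testable without a metrics registry.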
Per-partner vs. global metrics
This is the key difference between integration observability and general application monitoring. A global error rate of 2% might look fine, but if 100% of your Stripe calls are failing and 0% of everything else is, that 2% global rate hides a complete payment outage.
// Bad: global metric hides partner-specific issues
const apiErrors = new Counter({
  name: 'api_errors_total',
  help: 'Total API errors',
  registers: [register],
});

// Good: per-partner metric reveals exactly who is failing
const integrationErrors = new Counter({
  name: 'integration_errors_total',
  help: 'Integration errors by partner',
  labelNames: ['partner', 'error_type'],
  registers: [register],
});

Always label by partner. The extra cardinality is worth it; integration debugging without partner-level granularity is guesswork.
Prometheus histogram buckets
Default histogram buckets are designed for general HTTP servers. Integration calls have different latency profiles: an internal microservice responds in 5ms, but a payment API might take 2 seconds. Choose buckets that match your SLO targets:
// For fast internal APIs (< 100ms expected)
const internalBuckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5];
// For typical external APIs (100ms - 2s expected)
const externalBuckets = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];
// For slow external APIs (payment, KYC verification)
const slowExternalBuckets = [0.5, 1, 2.5, 5, 10, 15, 30, 60];

If your SLO says "99% of Stripe calls must complete in under 2 seconds", you need buckets around that 2-second mark to calculate the actual percentile accurately. Buckets that jump from 1s to 5s will not give you useful data at the 2-second boundary.
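One way to catch a bucket/SLO mismatch early is a startup assertion that each SLO threshold appears as an exact bucket boundary. A hypothetical helper:

```typescript
// Fail fast at startup if an SLO threshold has no matching bucket boundary,
// since histogram_quantile can only interpolate between existing buckets.
function assertSloBuckets(buckets: number[], sloSeconds: number[]): void {
  for (const slo of sloSeconds) {
    if (!buckets.includes(slo)) {
      throw new Error(
        `No histogram bucket at SLO boundary ${slo}s; add it to the bucket list`,
      );
    }
  }
}
```

With the externalBuckets above and a 2-second SLO, this check fails, telling you to add a 2 to the list before the histogram ships.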
Dashboard design for integration health
A good integration dashboard is not a wall of graphs; it is a decision-making tool. Design it to answer specific questions in a specific order.
The integration health dashboard layout
| Row | Panel | PromQL | Purpose |
|---|---|---|---|
| 1 | Partner status overview | One panel per partner showing error rate | Instant: who is healthy, who is not |
| 2 | Request rate by partner | sum by (partner) (rate(integration_requests_total[5m])) | Traffic patterns, detect drops |
| 3 | Error rate by partner | sum by (partner) (rate(integration_requests_total{status=~"error.*"}[5m])) / sum by (partner) (rate(integration_requests_total[5m])) | Which partner is failing |
| 4 | p50/p95/p99 latency | histogram_quantile(0.99, sum by (partner, le) (rate(integration_request_duration_seconds_bucket[5m]))) | Is a partner getting slower |
| 5 | Retry rate | sum by (partner) (rate(integration_retries_total[5m])) | Early warning of degradation |
| 6 | Circuit breaker states | integration_circuit_breaker_state | Partners in fallback mode |
PromQL examples for integration dashboards
# Error rate per partner (as a percentage)
100 * (
  sum by (partner) (rate(integration_requests_total{status=~"error.*"}[5m]))
  / sum by (partner) (rate(integration_requests_total[5m]))
)

# p99 latency per partner
histogram_quantile(0.99,
  sum by (partner, le) (rate(integration_request_duration_seconds_bucket[5m]))
)

# Requests per second to each partner
sum by (partner) (rate(integration_requests_total[5m]))

# Retry ratio (retries / total requests)
sum by (partner) (rate(integration_retries_total[5m]))
/ sum by (partner) (rate(integration_requests_total[5m]))

Avoiding high-cardinality explosions
Prometheus stores every unique combination of labels as a separate time series. Adding a userId label to your integration metrics creates one time series per user per partner per endpoint: millions of series that will crash your Prometheus server.
| Safe labels | Dangerous labels |
|---|---|
| partner (tens of values) | userId (millions of values) |
| endpoint (tens of values) | requestId (unbounded) |
| method (GET, POST, etc.) | correlationId (unbounded) |
| status (success, error_4xx, error_5xx) | statusCode (specific codes like 401, 403, 404...) |
| region (us-east, eu-west) | ip_address (unbounded) |
Group status codes into buckets (success, client_error, server_error, timeout) rather than tracking each individual code. Use logs for per-request detail; metrics are for aggregates.
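A small helper keeps that grouping consistent everywhere a status label is recorded. A sketch (the timeout case assumes your HTTP client surfaces timeouts separately from status codes):

```typescript
type StatusBucket = 'success' | 'client_error' | 'server_error' | 'timeout';

// Collapse raw HTTP status codes into four low-cardinality label values.
function statusBucket(code: number | 'timeout'): StatusBucket {
  if (code === 'timeout') return 'timeout';
  if (code >= 500) return 'server_error';
  if (code >= 400) return 'client_error';
  return 'success';
}
```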
Exposing the metrics endpoint
Wire up the Prometheus metrics endpoint so your Prometheus server can scrape it:
import express from 'express';
import { register } from './metrics';

const app = express();

// Prometheus scrape endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

In production, protect this endpoint; it should not be publicly accessible. Use a separate internal port or restrict access by IP.
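One lightweight option is an allowlist middleware in front of the route. A sketch with a hypothetical hardcoded allowlist (in practice you would load the scraper IPs from config), typed structurally so the guard logic is testable without Express:

```typescript
// Hypothetical allowlist of Prometheus scraper addresses.
const ALLOWED_SCRAPERS = new Set(['127.0.0.1', '::1']);

// Express-style middleware: let allowlisted scrapers through, reject the rest.
function metricsGuard(
  req: { ip?: string },
  res: { status: (code: number) => { end: () => void } },
  next: () => void,
): void {
  if (req.ip && ALLOWED_SCRAPERS.has(req.ip)) {
    next();
  } else {
    res.status(403).end();
  }
}
```

Mount it before the handler, e.g. `app.get('/metrics', metricsGuard, metricsHandler)`.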
Quick reference
| Concept | Tool | When to use |
|---|---|---|
| Counter | prom-client Counter | Things that only go up: requests, errors |
| Histogram | prom-client Histogram | Distributions: latency, payload sizes |
| Gauge | prom-client Gauge | Values that go up and down: queue depth, active connections |
| RED method | Rate, Errors, Duration | Default metrics for every integration |
| Labels | Prometheus labels | Group by partner, endpoint, status |
| Buckets | Histogram buckets | Match to your SLO thresholds |
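The Gauge row raises one practical detail for the circuit breaker metric: gauges hold numbers, so string states need a stable numeric encoding. A sketch of one common convention (the mapping itself is a choice, not a prom-client requirement):

```typescript
type BreakerState = 'closed' | 'half_open' | 'open';

// Stable numeric encoding so dashboards can threshold on the gauge value.
const BREAKER_STATE_VALUES: Record<BreakerState, number> = {
  closed: 0,    // healthy: requests flow normally
  half_open: 1, // probing the partner after a cooldown
  open: 2,      // failing fast: fallback in use
};

function breakerGaugeValue(state: BreakerState): number {
  return BREAKER_STATE_VALUES[state];
}
```

A hypothetical call site would be `circuitBreakerState.set({ partner: 'stripe' }, breakerGaugeValue('open'))`, where circuitBreakerState is the integration_circuit_breaker_state gauge from the table above.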