AI pitfall
AI-generated alerting rules often fire on every single error: one 503 from Stripe at 3 AM triggers a page. This is the fastest path to alert fatigue; within a week your team stops reading alerts. Alert on error rate over a sustained window (5+ minutes), not on individual failures.

Alerting pushes information to you when something needs attention. Too many alerts and your team ignores them all. Too few and you miss real outages. The key is tying alerts to SLOs: concrete targets that define what "healthy" means for each integration.

Good to know
An SLO is an internal target you set for yourself. An SLA is a contractual guarantee with financial consequences. Your SLO should always be stricter than your SLA so you have margin.

Integration SLOs

Each integration gets its own SLO because each partner has different characteristics:

| Partner | SLO metric | Target | Measurement window |
|---|---|---|---|
| Stripe (payments) | Success rate | 99.9% | 30-day rolling |
| Stripe (payments) | p99 latency | < 2 seconds | 30-day rolling |
| Twilio (SMS) | Delivery rate | 99.5% | 30-day rolling |
| Twilio (SMS) | p99 latency | < 5 seconds | 30-day rolling |
| Fulfillment API | Success rate | 99.0% | 30-day rolling |
| Fulfillment API | p99 latency | < 10 seconds | 30-day rolling |
| Internal auth service | Success rate | 99.99% | 30-day rolling |
| Internal auth service | p99 latency | < 200ms | 30-day rolling |

To choose a target, start from the partner's published SLA (a formal commitment to a minimum uptime or performance level, usually expressed as a percentage like 99.9%) and set your SLO slightly lower: your integration code, network, and configuration add failure points. Then factor in business impact: a payment outage blocks revenue, an email outage delays receipts.
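As an illustration, a small helper can encode this rule of thumb. The margin value and the partner SLA figure here are assumptions for the sketch, not published numbers:

```python
def integration_slo_target(partner_sla: float, margin: float = 0.0005) -> float:
    """Target slightly below the partner's SLA, because our own code,
    network, and configuration add failure points on top of theirs.
    The 0.05-percentage-point margin is a hypothetical default."""
    return round(partner_sla - margin, 6)

# If a partner publishes a 99.95% SLA, we target 99.9% for the integration.
print(integration_slo_target(0.9995))  # → 0.999
```

Business impact then adjusts the result up or down: a revenue-blocking payment path deserves a tighter target than a receipt-email path.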


Error budgets

An error budget is the flip side of an SLO. A 99.5% success rate SLO means you have a 0.5% error budget.

Error budget = 1 - SLO target

Example: 99.5% SLO over 30 days
- Error budget: 0.5%
- If you make 1,000,000 requests/month to this partner:
  Budget = 1,000,000 * 0.005 = 5,000 allowed failures
- If you make 10,000 requests/month:
  Budget = 10,000 * 0.005 = 50 allowed failures
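The arithmetic above can be wrapped in a small helper (a sketch; the request volumes are the example figures from the text):

```python
def error_budget(slo_target: float, requests_per_window: int) -> int:
    """Allowed failures in one measurement window: (1 - SLO) * volume."""
    return int(requests_per_window * (1 - slo_target))

print(error_budget(0.995, 1_000_000))  # → 5000 allowed failures
print(error_budget(0.995, 10_000))     # → 50 allowed failures
```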

Error budget burn rate

The burn rate tells you how fast you are consuming your error budget. A burn rate of 1x means you will exactly exhaust the budget by the end of the window. A burn rate of 10x means you will exhaust it in 1/10th of the time.

promql
# PromQL: error budget burn rate for Stripe
(
  rate(integration_requests_total{partner="stripe", status=~"error.*"}[1h])
  / rate(integration_requests_total{partner="stripe"}[1h])
) / 0.001
# Dividing by 0.001 (the error budget for 99.9% SLO) gives the burn rate
# Burn rate > 1 means you're consuming budget faster than planned

Edge case
A 99.5% SLO means different things at different traffic levels. At 1,000,000 requests per month your budget is 5,000 failures. At 1,000 requests per month your budget is only 5; a single bad hour can exhaust it. Always consider traffic volume when setting SLO targets.
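To make the relationship between burn rate and window length concrete, here is a sketch of the arithmetic (no monitoring stack assumed, just the definitions above):

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Burn rate = observed error rate / error budget (1 - SLO).
    1.0 exhausts the budget exactly at the end of the window;
    10.0 exhausts it in a tenth of the window."""
    return observed_error_rate / (1 - slo_target)

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, the budget lasts window / rate."""
    return window_days * 24 / rate

# A sustained 1% error rate against a 99.9% SLO is a 10x burn:
r = round(burn_rate(0.01, 0.999), 3)   # → 10.0
print(hours_to_exhaustion(r))          # → 72.0 hours, not 720
```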

Alert fatigue

When your team receives 50 alerts a day and most are noise, they stop reading them. Then the real critical alert fires and nobody notices.

Common causes of alert fatigue in integrations

| Cause | Example | Fix |
|---|---|---|
| Alerting on single errors | One 503 from Stripe triggers a page | Alert on error rate over a window, not single events |
| Same SLO for all partners | Flaky partner triggers constant alerts | Set realistic SLOs per partner based on their actual reliability |
| No deduplication | Same issue triggers 5 different alerts | Group related alerts, deduplicate by root cause |
| Missing `for` duration | Brief spike triggers alert then resolves | Require condition to persist for 2-5 minutes |
| Alerting on symptoms and causes | "High latency" AND "Stripe slow" fire together | Alert on user-facing symptoms, use causes for investigation |
| No severity levels | Everything is "critical" | Tier alerts: critical (page), warning (Slack), info (dashboard) |

The rule: if an alert fires and the on-call person does not need to do anything, it should not be an alert. It should be a dashboard observation.

yaml
# Good: alerts on SLO burn rate (burn through budget too fast)
groups:
  - name: integration_slo_alerts
    rules:
      - alert: StripeHighBurnRate
        expr: |
          (
            rate(integration_requests_total{partner="stripe", status=~"error.*"}[1h])
            / rate(integration_requests_total{partner="stripe"}[1h])
          ) / 0.001 > 10
        for: 5m
        labels:
          severity: critical
          partner: stripe
        annotations:
          summary: "Stripe error budget burning 10x faster than sustainable"
          runbook: "https://wiki.internal/runbooks/stripe-high-burn-rate"

      - alert: FulfillmentLatencyDegraded
        expr: |
          histogram_quantile(0.99,
            rate(integration_request_duration_seconds_bucket{partner="fulfillment"}[5m])
          ) > 10
        for: 10m
        labels:
          severity: warning
          partner: fulfillment
        annotations:
          summary: "Fulfillment API p99 latency above 10s for 10 minutes"
          runbook: "https://wiki.internal/runbooks/fulfillment-slow"

Alert types and severity

| Alert type | Severity | Response time | Example | Notification |
|---|---|---|---|---|
| Error budget burn > 10x | Critical | Immediately (page) | Stripe 50% error rate for 5 minutes | PagerDuty, phone call |
| Error budget burn > 2x | Warning | Within 1 hour | Stripe 1% error rate sustained | Slack channel |
| p99 latency above SLO | Warning | Within 1 hour | Partner p99 > 5s for 10 minutes | Slack channel |
| Circuit breaker opened | Critical | Immediately | Partner completely unreachable | PagerDuty |
| Auth token expiring | Warning | Within 24 hours | Partner OAuth token expires in 48h | Slack + ticket |
| Rate limit approaching | Info | Next business day | At 80% of partner rate limit | Dashboard only |
| Certificate expiring | Warning | Within 1 week | mTLS cert expires in 14 days | Slack + ticket |
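The severity-to-channel mapping above can be encoded in a small routing layer. This is a sketch with hypothetical channel names, not a real Alertmanager configuration:

```python
# Hypothetical severity-to-channel routing, mirroring the tiers above.
ROUTES = {
    "critical": ["pagerduty"],               # page immediately
    "warning": ["slack:#integration-alerts"],  # respond within hours
    "info": ["dashboard"],                   # review next business day
}

def route_alert(severity: str) -> list[str]:
    """Return notification channels for a severity level.
    Unknown severities fall back to the dashboard rather than paging."""
    return ROUTES.get(severity, ["dashboard"])

print(route_alert("critical"))  # → ['pagerduty']
```

The useful property of an explicit table like this is that adding a new alert forces a severity decision up front, instead of defaulting everything to "critical".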

Runbooks: making alerts actionable

An alert without a runbook (a documented, step-by-step guide for responding to a specific production incident) is just a notification. A runbook tells the on-call person exactly what to do.

markdown
# Stripe High Error Rate Runbook

## What this alert means
More than 5% of Stripe API calls are failing over a 5-minute window.

## Impact
Users cannot complete purchases. Revenue is directly affected.

## Diagnostic steps
1. Check Stripe status page: https://status.stripe.com
2. Check our dashboard: https://grafana.internal/d/stripe-health
3. Look for specific error codes in logs:
   Query: partner=stripe AND level=error | last 30 minutes
4. Check if the error is on our side (auth, payload) or Stripe's (5xx)

## Common causes and fixes
| Cause | Evidence | Fix |
|---|---|---|
| Stripe outage | status.stripe.com shows incident | Wait. Enable fallback if available |
| Expired API key | 401 errors in logs | Rotate key in secrets manager, restart |
| Invalid payload | 400 errors with validation messages | Check recent code deploys for breaking changes |
| Rate limited | 429 errors | Reduce traffic, implement backoff |

## Escalation
If not resolved in 15 minutes, escalate to #payments-team Slack channel.

Always include the runbook URL in the alert annotation: when someone gets paged at 3 AM, they should not have to search for documentation. The YAML example above shows the pattern: every alert rule has a runbook annotation linking directly to its runbook.