AI pitfall
AI-generated alerting rules often fire on every single error: one 503 from Stripe at 3 AM triggers a page. This is the fastest path to alert fatigue; within a week your team stops reading alerts. Alert on error rate over a sustained window (5+ minutes), not on individual failures.

Alerting pushes information to you when something needs attention. Too many alerts and your team ignores them all. Too few and you miss real outages. The key is tying alerts to SLOs: concrete targets that define what "healthy" means for each integration.

Good to know
An SLO is an internal target you set for yourself. An SLA is a contractual guarantee with financial consequences. Your SLO should always be stricter than your SLA so you have margin.

Integration SLOs

Each integration gets its own SLO because each partner has different characteristics:

| Partner | SLO metric | Target | Measurement window |
|---|---|---|---|
| Stripe (payments) | Success rate | 99.9% | 30-day rolling |
| Stripe (payments) | p99 latency | < 2 seconds | 30-day rolling |
| Twilio (SMS) | Delivery rate | 99.5% | 30-day rolling |
| Twilio (SMS) | p99 latency | < 5 seconds | 30-day rolling |
| Fulfillment API | Success rate | 99.0% | 30-day rolling |
| Fulfillment API | p99 latency | < 10 seconds | 30-day rolling |
| Internal auth service | Success rate | 99.99% | 30-day rolling |
| Internal auth service | p99 latency | < 200ms | 30-day rolling |

To choose a target, start from the partner's published SLA (a formal commitment to a minimum uptime or performance level, usually expressed as a percentage like 99.9%) and set your SLO slightly lower: your integration code, network, and configuration add failure points. Then factor in business impact: a payment outage blocks revenue, an email outage delays receipts.
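As an illustration, a small helper can encode this rule of thumb. The margin value and the partner SLA figure here are assumptions for the sketch, not published numbers:

```python
def integration_slo_target(partner_sla: float, margin: float = 0.0005) -> float:
    """Target slightly below the partner's SLA, because our own code,
    network, and configuration add failure points on top of theirs.
    The 0.05-percentage-point margin is a hypothetical default."""
    return round(partner_sla - margin, 6)

# If a partner publishes a 99.95% SLA, we target 99.9% for the integration.
print(integration_slo_target(0.9995))  # → 0.999
```

Business impact then adjusts the result up or down: a revenue-blocking payment path deserves a tighter target than a receipt-email path.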


Error budgets

An error budget is the flip side of an SLO. A 99.5% success rate SLO means you have a 0.5% error budget.

Error budget = 1 - SLO target

Example: 99.5% SLO over 30 days
- Error budget: 0.5%
- If you make 1,000,000 requests/month to this partner:
  Budget = 1,000,000 * 0.005 = 5,000 allowed failures
- If you make 10,000 requests/month:
  Budget = 10,000 * 0.005 = 50 allowed failures
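The arithmetic above can be wrapped in a small helper (a sketch; the request volumes are the example figures from the text):

```python
def error_budget(slo_target: float, requests_per_window: int) -> int:
    """Allowed failures in one measurement window: (1 - SLO) * volume."""
    return int(requests_per_window * (1 - slo_target))

print(error_budget(0.995, 1_000_000))  # → 5000 allowed failures
print(error_budget(0.995, 10_000))     # → 50 allowed failures
```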

Error budget burn rate

The burn rate tells you how fast you are consuming your error budget. A burn rate of 1x means you will exactly exhaust the budget by the end of the window. A burn rate of 10x means you will exhaust it in 1/10th of the time.

promql
# PromQL: error budget burn rate for Stripe
(
  rate(integration_requests_total{partner="stripe", status=~"error.*"}[1h])
  / rate(integration_requests_total{partner="stripe"}[1h])
) / 0.001
# Dividing by 0.001 (the error budget for 99.9% SLO) gives the burn rate
# Burn rate > 1 means you're consuming budget faster than planned

Edge case
A 99.5% SLO means different things at different traffic levels. At 1,000,000 requests per month your budget is 5,000 failures. At 1,000 requests per month your budget is only 5; a single bad hour can exhaust it. Always consider traffic volume when setting SLO targets.
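To make the relationship between burn rate and window length concrete, here is a sketch of the arithmetic (no monitoring stack assumed, just the definitions above):

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Burn rate = observed error rate / error budget (1 - SLO).
    1.0 exhausts the budget exactly at the end of the window;
    10.0 exhausts it in a tenth of the window."""
    return observed_error_rate / (1 - slo_target)

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, the budget lasts window / rate."""
    return window_days * 24 / rate

# A sustained 1% error rate against a 99.9% SLO is a 10x burn:
r = round(burn_rate(0.01, 0.999), 3)   # → 10.0
print(hours_to_exhaustion(r))          # → 72.0 hours, not 720
```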

Alert fatigue

When your team receives 50 alerts a day and most are noise, they stop reading them. Then the real critical alert fires and nobody notices.

Common causes of alert fatigue in integrations

| Cause | Example | Fix |
|---|---|---|
| Alerting on single errors | One 503 from Stripe triggers a page | Alert on error rate over a window, not single events |
| Same SLO for all partners | Flaky partner triggers constant alerts | Set realistic SLOs per partner based on their actual reliability |
| No deduplication | Same issue triggers 5 different alerts | Group related alerts, deduplicate by root cause |
| Missing `for` duration | Brief spike triggers alert then resolves | Require condition to persist for 2-5 minutes |
| Alerting on symptoms and causes | "High latency" AND "Stripe slow" fire together | Alert on user-facing symptoms, use causes for investigation |
| No severity levels | Everything is "critical" | Tier alerts: critical (page), warning (Slack), info (dashboard) |

The rule: if an alert fires and the on-call person does not need to do anything, it should not be an alert. It should be a dashboard observation.

yaml
# Good: alerts on SLO burn rate (burn through budget too fast)
groups:
  - name: integration_slo_alerts
    rules:
      - alert: StripeHighBurnRate
        expr: |
          (
            rate(integration_requests_total{partner="stripe", status=~"error.*"}[1h])
            / rate(integration_requests_total{partner="stripe"}[1h])
          ) / 0.001 > 10
        for: 5m
        labels:
          severity: critical
          partner: stripe
        annotations:
          summary: "Stripe error budget burning 10x faster than sustainable"
          runbook: "https://wiki.internal/runbooks/stripe-high-burn-rate"

      - alert: FulfillmentLatencyDegraded
        expr: |
          histogram_quantile(0.99,
            rate(integration_request_duration_seconds_bucket{partner="fulfillment"}[5m])
          ) > 10
        for: 10m
        labels:
          severity: warning
          partner: fulfillment
        annotations:
          summary: "Fulfillment API p99 latency above 10s for 10 minutes"
          runbook: "https://wiki.internal/runbooks/fulfillment-slow"

Alert types and severity

| Alert type | Severity | Response time | Example | Notification |
|---|---|---|---|---|
| Error budget burn > 10x | Critical | Immediately (page) | Stripe 50% error rate for 5 minutes | PagerDuty, phone call |
| Error budget burn > 2x | Warning | Within 1 hour | Stripe 1% error rate sustained | Slack channel |
| p99 latency above SLO | Warning | Within 1 hour | Partner p99 > 5s for 10 minutes | Slack channel |
| Circuit breaker opened | Critical | Immediately | Partner completely unreachable | PagerDuty |
| Auth token expiring | Warning | Within 24 hours | Partner OAuth token expires in 48h | Slack + ticket |
| Rate limit approaching | Info | Next business day | At 80% of partner rate limit | Dashboard only |
| Certificate expiring | Warning | Within 1 week | mTLS cert expires in 14 days | Slack + ticket |
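The severity-to-channel mapping above can be encoded in a small routing layer. This is a sketch with hypothetical channel names, not a real Alertmanager configuration:

```python
# Hypothetical severity-to-channel routing, mirroring the tiers above.
ROUTES = {
    "critical": ["pagerduty"],               # page immediately
    "warning": ["slack:#integration-alerts"],  # respond within hours
    "info": ["dashboard"],                   # review next business day
}

def route_alert(severity: str) -> list[str]:
    """Return notification channels for a severity level.
    Unknown severities fall back to the dashboard rather than paging."""
    return ROUTES.get(severity, ["dashboard"])

print(route_alert("critical"))  # → ['pagerduty']
```

The useful property of an explicit table like this is that adding a new alert forces a severity decision up front, instead of defaulting everything to "critical".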

Runbooks: making alerts actionable

An alert without a runbook (a documented, step-by-step guide for responding to a specific production incident) is just a notification. A runbook tells the on-call person exactly what to do.

markdown
# Stripe High Error Rate Runbook

## What this alert means
More than 5% of Stripe API calls are failing over a 5-minute window.

## Impact
Users cannot complete purchases. Revenue is directly affected.

## Diagnostic steps
1. Check Stripe status page: https://status.stripe.com
2. Check our dashboard: https://grafana.internal/d/stripe-health
3. Look for specific error codes in logs:
   Query: partner=stripe AND level=error | last 30 minutes
4. Check if the error is on our side (auth, payload) or Stripe's (5xx)

## Common causes and fixes
| Cause | Evidence | Fix |
|---|---|---|
| Stripe outage | status.stripe.com shows incident | Wait. Enable fallback if available |
| Expired API key | 401 errors in logs | Rotate key in secrets manager, restart |
| Invalid payload | 400 errors with validation messages | Check recent code deploys for breaking changes |
| Rate limited | 429 errors | Reduce traffic, implement backoff |

## Escalation
If not resolved in 15 minutes, escalate to #payments-team Slack channel.

Always include the runbook URL in the alert annotation: when someone gets paged at 3 AM, they should not have to search for documentation. The YAML example above shows the pattern: every alert rule has a runbook annotation linking directly to its runbook.