Alerting pushes information to you when something needs attention. Too many alerts and your team ignores them all. Too few and you miss real outages. The key is tying alerts to SLOs: concrete targets that define what "healthy" means for each integration.
Integration SLOs
Each integration gets its own SLO because each partner has different characteristics:
| Partner | SLO metric | Target | Measurement window |
|---|---|---|---|
| Stripe (payments) | Success rate | 99.9% | 30-day rolling |
| Stripe (payments) | p99 latency | < 2 seconds | 30-day rolling |
| Twilio (SMS) | Delivery rate | 99.5% | 30-day rolling |
| Twilio (SMS) | p99 latency | < 5 seconds | 30-day rolling |
| Fulfillment API | Success rate | 99.0% | 30-day rolling |
| Fulfillment API | p99 latency | < 10 seconds | 30-day rolling |
| Internal auth service | Success rate | 99.99% | 30-day rolling |
| Internal auth service | p99 latency | < 200ms | 30-day rolling |
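As a sketch, these targets can be encoded as data and checked against measured values. The `Slo` type, the `SLOS` table, and `meets_slo` are illustrative names, not an existing library; the numbers come from the table above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Slo:
    """One partner's SLO targets (illustrative type)."""
    partner: str
    success_rate: float    # minimum fraction of successful requests
    p99_latency_s: float   # maximum allowed p99 latency, in seconds

SLOS = {
    "stripe": Slo("stripe", 0.999, 2.0),
    "twilio": Slo("twilio", 0.995, 5.0),
    "fulfillment": Slo("fulfillment", 0.990, 10.0),
    "auth": Slo("auth", 0.9999, 0.2),
}

def meets_slo(partner: str, success_rate: float, p99_latency_s: float) -> bool:
    """True if the measured values satisfy both targets for the partner."""
    slo = SLOS[partner]
    return success_rate >= slo.success_rate and p99_latency_s <= slo.p99_latency_s
```

Keeping targets as data, rather than hard-coding them into alert rules, makes it easy to review and adjust them per partner.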
To choose a target, start from the partner's published SLA (their formal commitment to a minimum uptime or performance level, usually expressed as a percentage like 99.9%) and set your SLO slightly lower: your integration code, network, and configuration add failure points on top of the partner's own. Then factor in business impact: a payment outage blocks revenue, an email outage delays receipts.
Error budgets
An error budget is the flip side of an SLO. A 99.5% success rate SLO means you have a 0.5% error budget.
Error budget = 1 - SLO target
Example: 99.5% SLO over 30 days
- Error budget: 0.5%
- If you make 1,000,000 requests/month to this partner:
Budget = 1,000,000 * 0.005 = 5,000 allowed failures
- If you make 10,000 requests/month:
Budget = 10,000 * 0.005 = 50 allowed failures

Error budget burn rate
The burn rate tells you how fast you are consuming your error budget. A burn rate of 1x means you will exactly exhaust the budget by the end of the window. A burn rate of 10x means you will exhaust it in 1/10th of the time.
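A minimal sketch of that arithmetic (the function names are illustrative):

```python
def error_budget(slo_target: float, requests: int) -> float:
    """Allowed failures in the window: requests * (1 - SLO target)."""
    return requests * (1 - slo_target)

def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent.
    1.0 means the budget is exactly exhausted at the end of the window."""
    return observed_error_rate / (1 - slo_target)
```

For example, a 5% observed error rate against a 99.9% SLO is a 50x burn rate: the month's budget is gone in well under a day.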
# PromQL: error budget burn rate for Stripe
(
  rate(integration_requests_total{partner="stripe", status=~"error.*"}[1h])
  / rate(integration_requests_total{partner="stripe"}[1h])
) / 0.001
# Dividing by 0.001 (the error budget for a 99.9% SLO) gives the burn rate
# Burn rate > 1 means you're consuming budget faster than planned

Alert fatigue
When your team receives 50 alerts a day and most are noise, they stop reading them. Then the real critical alert fires and nobody notices.
Common causes of alert fatigue in integrations
| Cause | Example | Fix |
|---|---|---|
| Alerting on single errors | One 503 from Stripe triggers a page | Alert on error rate over a window, not single events |
| Same SLO for all partners | Flaky partner triggers constant alerts | Set realistic SLOs per partner based on their actual reliability |
| No deduplication | Same issue triggers 5 different alerts | Group related alerts, deduplicate by root cause |
| Missing `for` duration | Brief spike triggers alert then resolves | Require condition to persist for 2-5 minutes |
| Alerting on symptoms and causes | "High latency" AND "Stripe slow" fire together | Alert on user-facing symptoms, use causes for investigation |
| No severity levels | Everything is "critical" | Tier alerts: critical (page), warning (Slack), info (dashboard) |
The rule: if an alert fires and the on-call person does not need to do anything, it should not be an alert. It should be a dashboard observation.
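The "require the condition to persist" fix from the table can be sketched as a small gate. This is a hand-rolled stand-in for Prometheus's `for:` clause, not its actual implementation:

```python
import time

class PersistenceGate:
    """Fire only after the condition has been continuously true for hold_s seconds."""

    def __init__(self, hold_s: float, clock=time.monotonic):
        self.hold_s = hold_s
        self.clock = clock
        self._since = None  # when the condition first became true

    def update(self, condition: bool) -> bool:
        now = self.clock()
        if not condition:
            self._since = None  # any healthy sample resets the timer
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.hold_s
```

A brief spike resets the timer, so only sustained breaches reach the on-call person.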
# Good: alert on SLO burn rate (burning through the budget too fast)
groups:
  - name: integration_slo_alerts
    rules:
      - alert: StripeHighBurnRate
        expr: |
          (
            rate(integration_requests_total{partner="stripe", status=~"error.*"}[1h])
            / rate(integration_requests_total{partner="stripe"}[1h])
          ) / 0.001 > 10
        for: 5m
        labels:
          severity: critical
          partner: stripe
        annotations:
          summary: "Stripe error budget burning 10x faster than sustainable"
          runbook: "https://wiki.internal/runbooks/stripe-high-burn-rate"
      - alert: FulfillmentLatencyDegraded
        expr: |
          histogram_quantile(0.99,
            rate(integration_request_duration_seconds_bucket{partner="fulfillment"}[5m])
          ) > 10
        for: 10m
        labels:
          severity: warning
          partner: fulfillment
        annotations:
          summary: "Fulfillment API p99 latency above 10s for 10 minutes"
          runbook: "https://wiki.internal/runbooks/fulfillment-slow"

Alert types and severity
| Alert type | Severity | Response time | Example | Notification |
|---|---|---|---|---|
| Error budget burn > 10x | Critical | Immediately (page) | Stripe 50% error rate for 5 minutes | PagerDuty, phone call |
| Error budget burn > 2x | Warning | Within 1 hour | Stripe 1% error rate sustained | Slack channel |
| p99 latency above SLO | Warning | Within 1 hour | Partner p99 > 5s for 10 minutes | Slack channel |
| Circuit breaker opened | Critical | Immediately | Partner completely unreachable | PagerDuty |
| Auth token expiring | Warning | Within 24 hours | Partner OAuth token expires in 48h | Slack + ticket |
| Rate limit approaching | Info | Next business day | At 80% of partner rate limit | Dashboard only |
| Certificate expiring | Warning | Within 1 week | mTLS cert expires in 14 days | Slack + ticket |
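That routing can be expressed as a simple lookup, following the table above. The channel names are illustrative; in practice they would map to PagerDuty and Slack API calls.

```python
# Severity -> notification channels, per the table above (illustrative names).
ROUTES = {
    "critical": ["pagerduty", "phone"],
    "warning": ["slack"],
    "info": ["dashboard"],
}

def route(severity: str) -> list[str]:
    """Return notification channels for a severity.
    Unknown severities page, so a typo in a rule fails loudly rather than silently."""
    return ROUTES.get(severity, ["pagerduty"])
```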
Runbooks: making alerts actionable
An alert without a runbook (a documented, step-by-step guide for responding to a specific production incident) is just a notification. A runbook tells the on-call person exactly what to do.
# Stripe High Error Rate Runbook
## What this alert means
More than 5% of Stripe API calls are failing over a 5-minute window.
## Impact
Users cannot complete purchases. Revenue is directly affected.
## Diagnostic steps
1. Check Stripe status page: https://status.stripe.com
2. Check our dashboard: https://grafana.internal/d/stripe-health
3. Look for specific error codes in logs:
Query: partner=stripe AND level=error | last 30 minutes
4. Check if the error is on our side (auth, payload) or Stripe's (5xx)
## Common causes and fixes
| Cause | Evidence | Fix |
|---|---|---|
| Stripe outage | status.stripe.com shows incident | Wait. Enable fallback if available |
| Expired API key | 401 errors in logs | Rotate key in secrets manager, restart |
| Invalid payload | 400 errors with validation messages | Check recent code deploys for breaking changes |
| Rate limited | 429 errors | Reduce traffic, implement backoff |
## Escalation
If not resolved in 15 minutes, escalate to #payments-team Slack channel.

Always include the runbook URL in the alert annotation: when someone gets paged at 3 AM, they should not have to search for documentation. The YAML example above shows the pattern: every alert rule has a runbook annotation linking directly to its runbook.
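One way to enforce that convention is a small lint over the rules file in CI. This sketch assumes Prometheus-style rule groups as parsed from YAML (e.g. by `yaml.safe_load`); the function name is illustrative:

```python
def missing_runbooks(groups: list[dict]) -> list[str]:
    """Return names of alert rules that lack a runbook annotation."""
    bad = []
    for group in groups:
        for rule in group.get("rules", []):
            # Recording rules have no "alert" key and are skipped.
            if "alert" in rule and not rule.get("annotations", {}).get("runbook"):
                bad.append(rule["alert"])
    return bad
```

Failing the build when this returns a non-empty list keeps every pageable alert tied to its documentation.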