
When something breaks in production, and it will, you want a plan, not a panic. Disaster recovery (DR) planning is the work you do before the emergency so that when it arrives, you're executing a rehearsed procedure instead of improvising under pressure. This lesson covers how to build a DR plan that actually works.

What counts as a disaster

"Disaster" doesn't just mean the data center burned down. In practice, the incidents you'll face are more mundane:

  • Database corruption from a bad migration
  • Accidental deletion of production data
  • A bad deployment that takes your app offline
  • A DDoS attack overwhelming your servers
  • A cloud provider region outage
  • A compromised API key used to exfiltrate data

Each scenario needs a different response, which is why a one-size-fits-all plan isn't enough. You need scenario-specific runbooks.

| Incident type | Likely cause | Recovery approach |
|---|---|---|
| App down (crash loop) | Bad deployment, config error | Rollback deployment |
| Data corruption | Bad migration, application bug | Restore from backup |
| Data loss (accidental) | DROP TABLE, human error | Point-in-time recovery |
| Security breach | Leaked credentials, vulnerability | Rotate secrets, patch, audit logs |
| Region outage | Cloud provider failure | Failover to secondary region |
| DDoS attack | Malicious traffic flood | Enable CDN protection, rate limiting |
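
One way to make the table above actionable is to encode it as a lookup that an alert handler or chat bot can query. This is a minimal sketch: the incident keys, the runbook file paths, and the fallback entry are all hypothetical placeholders, not part of any real tool.

```javascript
// Hypothetical mapping from incident type to recovery approach and runbook.
// The keys and file paths are illustrative placeholders.
const RUNBOOKS = new Map([
  ['app-down', { recovery: 'Rollback deployment', runbook: 'runbooks/app-down.md' }],
  ['data-corruption', { recovery: 'Restore from backup', runbook: 'runbooks/data-corruption.md' }],
  ['data-loss', { recovery: 'Point-in-time recovery', runbook: 'runbooks/data-loss.md' }],
  ['security-breach', { recovery: 'Rotate secrets, patch, audit logs', runbook: 'runbooks/security-breach.md' }],
  ['region-outage', { recovery: 'Failover to secondary region', runbook: 'runbooks/region-outage.md' }],
  ['ddos', { recovery: 'Enable CDN protection, rate limiting', runbook: 'runbooks/ddos.md' }],
]);

// Fallback for incidents with no scenario-specific runbook (assumed policy).
const UNKNOWN = { recovery: 'Escalate to on-call lead', runbook: 'runbooks/unknown-incident.md' };

function runbookFor(incidentType) {
  return RUNBOOKS.get(incidentType) ?? UNKNOWN;
}

console.log(runbookFor('data-loss').recovery);
```

Using a `Map` rather than a plain object avoids accidental matches on inherited property names, so an unrecognized incident type always lands on the fallback entry.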

Writing runbooks

A runbook is a step-by-step guide for responding to a specific incident. It's written in advance, by someone who isn't in a panic, and it includes everything an engineer needs to resolve the situation, including commands to run, services to check, and escalation paths.

Think of it like a recipe: someone unfamiliar with your system should be able to follow it and get the right result.

Runbook template

```markdown
# Runbook: Database unresponsive

## Symptoms
- API returning 500 errors on all database-dependent routes
- Grafana shows no database connections
- Health check endpoint returning unhealthy

## Severity: P1 (critical)

## Immediate actions (0-5 minutes)
1. Check if the database process is running:
   ssh deploy@db-host "systemctl status postgresql"

2. Check recent logs:
   ssh deploy@db-host "journalctl -u postgresql -n 100"

3. Attempt a restart if process is dead:
   ssh deploy@db-host "sudo systemctl restart postgresql"

## If restart fails (5-15 minutes)
4. Check disk space (full disk is a common cause):
   ssh deploy@db-host "df -h"

5. If disk is full, identify large files and clear old logs:
   ssh deploy@db-host "du -sh /var/log/postgresql/*"

## If database cannot be recovered (15+ minutes)
6. Initiate failover to read replica:
   ./scripts/failover.sh --promote-replica db-replica-01

7. Update DNS to point to replica:
   ./scripts/update-dns.sh --target db-replica-01

## Communication
- Notify #incidents Slack channel immediately
- Post update every 15 minutes
- Escalate to CTO if not resolved in 30 minutes

## Post-recovery
- Document what happened
- Schedule a post-mortem within 48 hours
```

A runbook that lives only in one engineer's head is not a runbook. It must be written, accessible to the whole team, and kept up to date. A runbook that's 6 months out of date is dangerous: it might direct you to servers that no longer exist.
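
Staleness is easy to check mechanically if each runbook records when it was last reviewed. The sketch below assumes a hypothetical `lastReviewed` date on each runbook record and treats 180 days (roughly the six months mentioned above) as the cutoff; both are illustrative choices rather than a standard.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the names of runbooks whose last review is older than maxAgeDays.
// `runbooks` is an assumed shape: [{ name, lastReviewed: 'YYYY-MM-DD' }, ...]
function findStaleRunbooks(runbooks, now = new Date(), maxAgeDays = 180) {
  return runbooks
    .filter((rb) => (now - new Date(rb.lastReviewed)) / DAY_MS > maxAgeDays)
    .map((rb) => rb.name);
}

const stale = findStaleRunbooks(
  [
    { name: 'database-unresponsive', lastReviewed: '2023-11-01' },
    { name: 'bad-deployment', lastReviewed: '2024-05-01' },
  ],
  new Date('2024-07-01')
);
console.log(stale);
```

Running a check like this on a schedule and posting the result to the team channel turns "kept up to date" from an intention into a recurring task.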

Implementing failover

Failover is the process of switching from a failed system to a healthy backup. It can be automatic (triggered by health checks) or manual (triggered by an operator).

Health checks and load balancer failover

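```javascript
// Express health check endpoint
const express = require('express');
const app = express();

// `db` is assumed to be your application's database client, created elsewhere.
// The prepare().first() call matches clients such as Cloudflare D1;
// substitute your own driver's query method.

app.get('/health', async (req, res) => {
  const checks = {
    database: false,
    cache: false, // not checked in this example, so always reported as false
    timestamp: new Date().toISOString(),
  };

  try {
    // Test database connectivity
    await db.prepare('SELECT 1').first();
    checks.database = true;
  } catch {
    // Database is down - don't throw, just mark unhealthy
  }

  // Only the database check decides overall health here
  const isHealthy = checks.database;
  const statusCode = isHealthy ? 200 : 503;

  res.status(statusCode).json({
    status: isHealthy ? 'healthy' : 'unhealthy',
    checks,
  });
});
```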

Load balancers like Cloudflare, AWS ALB, or Nginx use this endpoint to decide whether to send traffic to a server. If the health check returns a non-200 status (usually for several consecutive checks), the server is taken out of rotation automatically.
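
A health endpoint has one subtle failure mode: if a dependency hangs instead of failing, the health check hangs with it, and the load balancer sees a timeout rather than a clean 503. The sketch below shows one way to bound each dependency check with a timeout. The names `withTimeout` and `runChecks` and the 2-second default are assumptions for illustration, not part of Express or any library.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `checks` maps a dependency name to a function returning a promise.
// A check passes if its promise resolves in time; it fails on rejection,
// a synchronous throw, or a timeout.
async function runChecks(checks, timeoutMs = 2000) {
  const names = Object.keys(checks);
  const results = await Promise.all(
    names.map((name) =>
      withTimeout(Promise.resolve().then(() => checks[name]()), timeoutMs)
        .then(() => true, () => false)
    )
  );
  return {
    healthy: results.every(Boolean),
    checks: Object.fromEntries(names.map((name, i) => [name, results[i]])),
  };
}
```

Inside a route handler, the result of `runChecks` can drive the same 200/503 decision as before, with the guarantee that the response arrives within roughly `timeoutMs` even when a dependency is unresponsive.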

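```nginx
# Nginx upstream with active health checks.
# Note: the `check` directives come from the third-party
# nginx_upstream_check_module (bundled with Tengine). Stock open-source
# Nginx only does passive checks via max_fails/fail_timeout.
upstream api_backend {
  server app1.example.com:3000;
  server app2.example.com:3000 backup;  # used only when app1 is unavailable

  # Probe every 10 s; mark up after 2 passes, down after 3 failures; 3 s timeout
  check interval=10000 rise=2 fall=3 timeout=3000 type=http;
  check_http_send "GET /health HTTP/1.0\r\n\r\n";
  check_http_expect_alive http_2xx;
}
```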
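
The `rise` and `fall` parameters in the configuration above describe a small state machine: a server leaves rotation only after `fall` consecutive failed checks and returns only after `rise` consecutive passes. The class below is an illustrative sketch of that logic, not the module's actual implementation; `HealthTracker` is a made-up name.

```javascript
// Tracks whether a server should receive traffic, based on consecutive
// health check results. Defaults mirror rise=2 fall=3 from the config above.
class HealthTracker {
  constructor({ rise = 2, fall = 3 } = {}) {
    this.rise = rise;
    this.fall = fall;
    this.inRotation = true;
    this.streak = 0; // consecutive results that contradict the current state
  }

  // Record one health check result and return the updated rotation state.
  record(passed) {
    const contradicts = passed !== this.inRotation;
    this.streak = contradicts ? this.streak + 1 : 0;
    const threshold = this.inRotation ? this.fall : this.rise;
    if (this.streak >= threshold) {
      this.inRotation = !this.inRotation;
      this.streak = 0;
    }
    return this.inRotation;
  }
}

const tracker = new HealthTracker();
[false, false, false, true, true].forEach((result) => {
  console.log(result, '->', tracker.record(result));
});
```

Requiring several consecutive results in either direction keeps a single slow response from ejecting a healthy server, and keeps a flapping server from rejoining too early.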

Testing your DR plan

An untested DR plan is just a theory. You must practice it before you need it.

Types of DR tests

| Test type | What it involves | How often |
|---|---|---|
| Tabletop exercise | Team walks through a scenario verbally | Quarterly |
| Partial test | Restore a subset of data or one service | Monthly |
| Full DR test | Fail over entire system to secondary | Annually |
| Chaos engineering | Deliberately inject failures in production-like env | Continuously |

```bash
# Simple chaos test: kill the app process and verify recovery
# (run this in staging, not production)
kill -9 $(pgrep -f "node server.js")

# Watch your monitoring to confirm alerts fire
# and automatic recovery kicks in within your RTO
```

Netflix pioneered "chaos engineering" with their Chaos Monkey tool, which randomly terminates production instances to verify that the system can handle failures gracefully. The idea is that if failure will happen eventually anyway, you're better off discovering weaknesses on your own terms.
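
To turn "recovery kicks in within your RTO" (recovery time objective) into a pass/fail result, a test can poll the service after injecting the failure and time how long it takes to come back. The sketch below is one possible shape: `measureRecovery` and its options are hypothetical names, and `isHealthy` is whatever probe you supply.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls `isHealthy` until it returns true or the RTO budget is spent.
// A probe that throws or rejects counts as "not healthy yet".
async function measureRecovery(isHealthy, { rtoMs, intervalMs = 1000 }) {
  const start = Date.now();
  while (Date.now() - start <= rtoMs) {
    let ok = false;
    try {
      ok = await isHealthy();
    } catch {
      ok = false;
    }
    if (ok) {
      return { recovered: true, elapsedMs: Date.now() - start };
    }
    await sleep(intervalMs);
  }
  return { recovered: false, elapsedMs: Date.now() - start };
}
```

With Node 18 or later, the probe could be something like `() => fetch('https://staging.example.com/health').then((r) => r.ok)`, where the URL is a placeholder for your own staging health endpoint. A result of `recovered: false` means the drill exceeded the RTO and the plan needs work.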

Post-mortems

After every significant incident, hold a blameless post-mortem. The goal is not to find someone to blame; it's to understand what happened, why it happened, and how to prevent it from happening again.

Post-mortem structure

```markdown
# Post-mortem: API outage 2024-03-15

## Summary
The API was unavailable for 47 minutes from 14:23 to 15:10 UTC.
Root cause: a database migration added an index without CONCURRENTLY,
blocking writes to the users table.

## Timeline
- 14:23 - Migration deployed, table lock acquired
- 14:26 - First alerts fire for elevated error rates
- 14:31 - On-call engineer paged
- 14:45 - Root cause identified
- 14:55 - Migration rolled back, table lock released
- 15:10 - System fully recovered, error rates normal

## Root cause
The migration script used CREATE INDEX instead of CREATE INDEX CONCURRENTLY.
On a large table, this blocks all writes (reads still succeed) for the
duration of the index build, so write requests queued up and timed out.

## Impact
- 47 minutes of API downtime
- ~3,200 failed requests
- No data loss

## Action items
1. Add a CI check that rejects migrations using CREATE INDEX without CONCURRENTLY
2. Add a runbook for migration-related outages
3. Test migrations against a production-sized dataset in staging
```
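
The first action item can be prototyped in a few lines. This is a deliberately naive sketch, assuming migrations are plain SQL text: it strips comments, splits on semicolons, and flags any `CREATE INDEX` statement that lacks `CONCURRENTLY`. The function name is made up, and a semicolon inside a string literal would confuse it, so treat it as a starting point rather than a complete linter.

```javascript
// Returns the CREATE INDEX statements in `sql` that do not use CONCURRENTLY.
function findBlockingIndexStatements(sql) {
  const stripped = sql
    .replace(/\/\*[\s\S]*?\*\//g, '') // remove /* block comments */
    .replace(/--.*$/gm, '');          // remove -- line comments
  return stripped
    .split(';')
    .map((statement) => statement.trim().replace(/\s+/g, ' '))
    .filter(
      (statement) =>
        /^CREATE (UNIQUE )?INDEX\b/i.test(statement) &&
        !/^CREATE (UNIQUE )?INDEX CONCURRENTLY\b/i.test(statement)
    );
}

const migration = `
  -- add lookup index
  CREATE INDEX idx_users_email ON users (email);
  CREATE INDEX CONCURRENTLY idx_users_name ON users (name);
`;
console.log(findBlockingIndexStatements(migration));
```

In CI, a wrapper script would read each migration file, call this function, and exit non-zero if the returned list is non-empty. Note that PostgreSQL does not allow `CREATE INDEX CONCURRENTLY` inside a transaction block, so migrations that use it must run outside the migration tool's wrapping transaction.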

Quick reference

| DR component | Purpose | Tools/approaches |
|---|---|---|
| Health checks | Detect failures automatically | /health endpoint, uptime monitors |
| Runbooks | Document recovery procedures | Notion, Confluence, GitHub wiki |
| Alerting | Notify on-call engineer | PagerDuty, Opsgenie, Grafana |
| Failover | Route traffic to healthy systems | Load balancer, DNS failover |
| DR testing | Verify the plan actually works | Tabletop, chaos engineering |
| Post-mortems | Learn and improve from incidents | Structured retrospective |