Production Engineering - Disaster Recovery Planning

When something breaks in production, and it will, you want a plan, not a panic. Disaster recovery (DR) planning is the work you do before the emergency so that when it arrives, you're executing a rehearsed procedure instead of improvising under pressure. This lesson covers how to build a DR plan that actually works.

What counts as a disaster

"Disaster" doesn't just mean the data center burned down. In practice, the incidents you'll face are more mundane:

Database corruption from a bad migration
Accidental deletion of production data
A bad deployment that takes your app offline
A DDoS attack overwhelming your servers
A cloud provider region outage
A compromised API key used to exfiltrate data

Each scenario needs a different response, which is why a one-size-fits-all plan isn't enough. You need scenario-specific runbooks.

Incident type	Likely cause	Recovery approach
App down (crash loop)	Bad deployment, config error	Rollback deployment
Data corruption	Bad migration, application bug	Restore from backup
Data loss (accidental)	`DROP TABLE`, human error	Point-in-time recovery
Security breach	Leaked credentials, vulnerability	Rotate secrets, patch, audit logs
Region outage	Cloud provider failure	Failover to secondary region
DDoS attack	Malicious traffic flood	Enable CDN protection, rate limiting
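
One way to make the table above usable by tooling (a chat bot, an alert annotation) is to encode it as a lookup that points responders to the right runbook. The following is only a sketch: the incident keys, the runbook paths, and the generic fallback are all hypothetical names chosen for illustration.

```javascript
// Hypothetical incident keys and runbook paths, mirroring the table above.
const RUNBOOKS = new Map([
  ['app-down', { approach: 'Rollback deployment', runbook: 'runbooks/app-down.md' }],
  ['data-corruption', { approach: 'Restore from backup', runbook: 'runbooks/data-corruption.md' }],
  ['data-loss', { approach: 'Point-in-time recovery', runbook: 'runbooks/data-loss.md' }],
  ['security-breach', { approach: 'Rotate secrets, patch, audit logs', runbook: 'runbooks/security-breach.md' }],
  ['region-outage', { approach: 'Failover to secondary region', runbook: 'runbooks/region-outage.md' }],
  ['ddos', { approach: 'Enable CDN protection, rate limiting', runbook: 'runbooks/ddos.md' }],
]);

// Fallback for incidents nobody planned for (also hypothetical).
const GENERIC = { approach: 'Triage and escalate', runbook: 'runbooks/generic-triage.md' };

// Look up the recovery approach and runbook for an incident type.
function runbookFor(incidentType) {
  return RUNBOOKS.get(incidentType) ?? GENERIC;
}

console.log(runbookFor('data-loss').approach);
```

Falling back to a generic triage runbook rather than throwing means an unrecognized incident still gives the responder somewhere to start.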

Writing runbooks

A runbook is a step-by-step guide for responding to a specific incident. It's written in advance, by someone who isn't in a panic, and it contains everything an engineer needs to resolve the situation: commands to run, services to check, and escalation paths.

Think of it like a recipe: someone unfamiliar with your system should be able to follow it and get the right result.

Runbook template

# Runbook: Database unresponsive

## Symptoms
- API returning 500 errors on all database-dependent routes
- Grafana shows no database connections
- Health check endpoint returning unhealthy

## Severity: P1 (critical)

## Immediate actions (0-5 minutes)
1. Check if the database process is running:
   ssh deploy@db-host "systemctl status postgresql"

2. Check recent logs:
   ssh deploy@db-host "journalctl -u postgresql -n 100"

3. Attempt a restart if process is dead:
   ssh deploy@db-host "sudo systemctl restart postgresql"

## If restart fails (5-15 minutes)
4. Check disk space (full disk is a common cause):
   ssh deploy@db-host "df -h"

5. If disk is full, identify large files and clear old logs:
   ssh deploy@db-host "du -sh /var/log/postgresql/*"

## If database cannot be recovered (15+ minutes)
6. Initiate failover to read replica:
   ./scripts/failover.sh --promote-replica db-replica-01

7. Update DNS to point to replica:
   ./scripts/update-dns.sh --target db-replica-01

## Communication
- Notify #incidents Slack channel immediately
- Post update every 15 minutes
- Escalate to CTO if not resolved in 30 minutes

## Post-recovery
- Document what happened
- Schedule a post-mortem within 48 hours

A runbook that lives only in one engineer's head is not a runbook. It must be written down, accessible to the whole team, and kept up to date. A runbook that's six months out of date is dangerous: it might direct you to servers that no longer exist.
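
Staleness is easy to check mechanically if each runbook records when it was last reviewed. The sketch below assumes a hypothetical list of `{ name, lastReviewed }` records (how you store that metadata is up to you) and flags anything older than a threshold, here about six months.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the names of runbooks whose last review is older than maxAgeDays.
// `runbooks` is a hypothetical array of { name, lastReviewed } records,
// where lastReviewed is an ISO 8601 date string.
function findStaleRunbooks(runbooks, now = new Date(), maxAgeDays = 180) {
  return runbooks
    .filter((rb) => {
      const ageDays = (now.getTime() - new Date(rb.lastReviewed).getTime()) / DAY_MS;
      return ageDays > maxAgeDays;
    })
    .map((rb) => rb.name);
}

// Example with made-up review dates.
const stale = findStaleRunbooks(
  [
    { name: 'db-unresponsive', lastReviewed: '2024-01-10' },
    { name: 'region-failover', lastReviewed: '2023-06-01' },
  ],
  new Date('2024-03-15'),
);
console.log(stale);
```

Running a check like this on a schedule and posting the result to the team channel is one way to keep reviews from silently lapsing.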

Implementing failover

Failover is the process of switching from a failed system to a healthy backup. It can be automatic (triggered by health checks) or manual (triggered by an operator).

Health checks and load balancer failover

// Express health check endpoint
app.get('/health', async (req, res) => {
  const checks = {
    database: false,
    cache: false,
    timestamp: new Date().toISOString(),
  };

  try {
    // Test database connectivity
    await db.prepare('SELECT 1').first();
    checks.database = true;
  } catch {
    // Database is down - don't throw, just mark unhealthy
  }

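  try {
    // Test cache connectivity. `cache` is an assumed client exposing ping(),
    // for example a Redis client; adapt this to whatever cache you run.
    await cache.ping();
    checks.cache = true;
  } catch {
    // Cache is down - report it, but keep serving
  }

  // Only the database gates the overall status
  const isHealthy = checks.database;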
  const statusCode = isHealthy ? 200 : 503;

  res.status(statusCode).json({
    status: isHealthy ? 'healthy' : 'unhealthy',
    checks,
  });
});

Load balancers such as Cloudflare, AWS ALB, or Nginx poll this endpoint to decide whether to send traffic to a server. When the health check fails, for example by returning a non-200 status or timing out, the load balancer takes the server out of rotation automatically, usually after a configured number of consecutive failures.
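
To make the load balancer's side of this concrete, here is a minimal sketch of a single probe. It is not how any particular load balancer is implemented. The `probeHealth` name, the 3-second default timeout, and the injectable `fetchImpl` parameter are assumptions for illustration; the default relies on the global `fetch` available in Node 18+.

```javascript
// Probes a health endpoint and reports whether the server should receive traffic.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function probeHealth(url, { timeoutMs = 3000, fetchImpl = fetch } = {}) {
  const controller = new AbortController();
  // Abort the request if the server does not answer in time.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { signal: controller.signal });
    // Anything other than 200 (for example the 503 from /health) counts as unhealthy.
    return { healthy: res.status === 200, status: res.status };
  } catch {
    // Timeouts and connection errors are failures too.
    return { healthy: false, status: null };
  } finally {
    clearTimeout(timer);
  }
}
```

A monitor would call `probeHealth('https://app1.example.com/health')` on an interval and feed the results into whatever logic decides when to pull a server from rotation.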

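# Nginx upstream with active health checks
# Note: the `check` directives come from the third-party
# nginx_upstream_check_module (bundled with Tengine), not stock Nginx.
# Open-source Nginx alone only does passive checks (max_fails / fail_timeout).
upstream api_backend {
  server app1.example.com:3000;
  server app2.example.com:3000 backup;

  # Probe GET /health every 10 seconds:
  # 2 consecutive passes mark a server up, 3 consecutive failures mark it down
  check interval=10000 rise=2 fall=3 timeout=3000 type=http;
  check_http_send "GET /health HTTP/1.0\r\n\r\n";
  check_http_expect_alive http_2xx;
}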
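
The `rise` and `fall` thresholds exist so that a single slow response does not flap a server in and out of rotation. The sketch below illustrates that idea under the assumption that state changes only after a run of consecutive contradicting results. `HealthTracker` is a hypothetical class for illustration, not the module's actual implementation.

```javascript
// Tracks probe results for one server using rise/fall semantics:
// `fall` consecutive failures mark it down, `rise` consecutive successes mark it up.
class HealthTracker {
  constructor({ rise = 2, fall = 3 } = {}) {
    this.rise = rise;
    this.fall = fall;
    this.up = true;   // assume healthy until proven otherwise
    this.streak = 0;  // consecutive results that contradict the current state
  }

  // Record one probe result; returns whether the server is in rotation afterwards.
  record(success) {
    if (success === this.up) {
      this.streak = 0; // result agrees with the current state, so reset the counter
    } else {
      this.streak += 1;
      const threshold = this.up ? this.fall : this.rise;
      if (this.streak >= threshold) {
        this.up = !this.up;
        this.streak = 0;
      }
    }
    return this.up;
  }
}

const tracker = new HealthTracker({ rise: 2, fall: 3 });
const states = [false, false, false, true, true].map((ok) => tracker.record(ok));
console.log(states); // [ true, true, false, false, true ]
```

With these values the server leaves rotation only on the third straight failure and returns on the second straight success, so an isolated failed probe changes nothing.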

Testing your DR plan

An untested DR plan is just a theory. You must practice it before you need it.

Types of DR tests

Test type	What it involves	How often
Tabletop exercise	Team walks through a scenario verbally	Quarterly
Partial test	Restore a subset of data or one service	Monthly
Full DR test	Fail over entire system to secondary	Annually
Chaos engineering	Deliberately inject failures in a production-like environment	Continuously

# Simple chaos: kill the app server process and verify recovery
# (run this in staging, not production)
kill -9 $(pgrep -f "node server.js")

# Watch your monitoring to confirm alerts fire
# and automatic recovery kicks in within your recovery time objective (RTO)
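
Rather than eyeballing a dashboard, a test like this can measure recovery time directly. The sketch below polls a health function until it succeeds and compares the elapsed time against an RTO. The `measureRecovery` name, the 60-second default RTO, and the injectable `sleep` and `now` helpers are all assumptions for illustration.

```javascript
// Polls `isHealthy` until it returns true and reports how long recovery took.
// `sleep` and `now` are injectable so the loop can be tested without real waiting.
async function measureRecovery(isHealthy, {
  rtoMs = 60000,
  intervalMs = 1000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  now = Date.now,
} = {}) {
  const start = now();
  // Give up after twice the RTO so a failed recovery does not poll forever.
  while (now() - start <= 2 * rtoMs) {
    if (await isHealthy()) {
      const recoveryMs = now() - start;
      return { recovered: true, recoveryMs, withinRto: recoveryMs <= rtoMs };
    }
    await sleep(intervalMs);
  }
  return { recovered: false, recoveryMs: null, withinRto: false };
}
```

In a staging drill, `isHealthy` could wrap a request to the `/health` endpoint, and the returned `recoveryMs` gives you a number to record for each test run.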

Netflix pioneered "chaos engineering" with their Chaos Monkey tool, which randomly terminates production instances to verify that the system can handle failures gracefully. The idea is that if failure will happen eventually anyway, you're better off discovering weaknesses on your own terms.

Post-mortems

After every significant incident, hold a blameless post-mortem. The goal is not to find someone to blame; it's to understand what happened, why it happened, and how to prevent it from happening again.

Post-mortem structure

# Post-mortem: API outage 2024-03-15

## Summary
The API was unavailable for 47 minutes from 14:23 to 15:10 UTC.
Root cause: a database migration added an index without CONCURRENTLY,
blocking writes to the users table.

## Timeline
- 14:23 - Migration deployed, table lock acquired
- 14:26 - First alerts fire for elevated error rates
- 14:31 - On-call engineer paged
- 14:45 - Root cause identified
- 14:55 - Migration rolled back, table lock released
- 15:10 - System fully recovered, error rates normal

## Root cause
The migration script used CREATE INDEX instead of CREATE INDEX CONCURRENTLY.
In PostgreSQL, a plain CREATE INDEX blocks all writes to the table (reads
still work) for the duration of the build, which is slow on a large table.

## Impact
- 47 minutes of API downtime
- ~3,200 failed requests
- No data loss

## Action items
1. Add a CI check that rejects migrations using CREATE INDEX without CONCURRENTLY
2. Add a runbook for migration-related outages
3. Test migrations against a production-sized dataset in staging
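
Action item 1 in the example post-mortem can start as a very small script. The sketch below is a deliberately simple, regex-based check rather than a SQL parser, so it can misfire on statements inside comments or string literals. The function name is hypothetical.

```javascript
// Returns the CREATE INDEX statements in a migration that omit CONCURRENTLY.
// Splits on semicolons and pattern-matches each statement; it does not parse SQL.
function findBlockingIndexStatements(sql) {
  const createIndex = /^CREATE\s+(UNIQUE\s+)?INDEX\b/i;
  const concurrent = /^CREATE\s+(UNIQUE\s+)?INDEX\s+CONCURRENTLY\b/i;
  return sql
    .split(';')
    .map((stmt) => stmt.trim())
    .filter((stmt) => createIndex.test(stmt) && !concurrent.test(stmt));
}

const migration = `
  CREATE INDEX idx_users_email ON users (email);
  CREATE INDEX CONCURRENTLY idx_users_name ON users (name);
  ALTER TABLE users ADD COLUMN age int;
`;
console.log(findBlockingIndexStatements(migration));
```

A CI step would run this over each new migration file and fail the build when the returned list is non-empty. Keep in mind that PostgreSQL does not allow CREATE INDEX CONCURRENTLY inside a transaction block, so migrations that use it usually need to opt out of the migration tool's automatic transaction.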

Quick reference

DR component	Purpose	Tools/approaches
Health checks	Detect failures automatically	`/health` endpoint, uptime monitors
Runbooks	Document recovery procedures	Notion, Confluence, GitHub wiki
Alerting	Notify on-call engineer	PagerDuty, Opsgenie, Grafana
Failover	Route traffic to healthy systems	Load balancer, DNS failover
DR testing	Verify the plan actually works	Tabletop, chaos engineering
Post-mortems	Learn and improve from incidents	Structured retrospective
