
Data loss is a question of "when," not "if." Databases get corrupted. Cloud providers have outages. Engineers accidentally run DROP TABLE on production. The difference between a minor incident and a catastrophe is whether you have reliable, tested backups. This lesson gives you the mental model and practical techniques to build a solid backup strategy.

The 3-2-1 rule

The 3-2-1 rule is the backbone of any sensible backup strategy. It's been around for decades because it works.

  • 3 copies of your data (the original + 2 backups)
  • 2 different storage media or services
  • 1 copy offsite (in a different physical location or cloud region)

Think of it like this: if your database is on a server in AWS us-east-1, a backup in another folder on the same server doesn't help when the whole machine fails. With one backup in us-west-2 and another in Backblaze B2, the primary and both backups would all have to fail at the same time before you lose data.

Storage type        Examples                                Protects against
Primary (online)    RDS, D1, Postgres on EC2                Normal reads/writes
Secondary (cloud)   S3, R2, GCS                             Server failure, accidental deletion
Offsite             Different cloud provider, cold storage  Region outage, provider failure
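
As a rough illustration of the rule, the sketch below pushes one backup file to a second region and to a second provider through an S3-compatible endpoint. The helper name replicate_backup, the bucket names, the region, and the endpoint URL are all placeholders chosen for this example, not values from a real setup.

```shell
#!/usr/bin/env bash
# Sketch: push one backup file to two independent destinations (3-2-1).
# Buckets, region, and endpoint below are hypothetical placeholders.

SECONDARY_BUCKET="${SECONDARY_BUCKET:-s3://my-app-backups-usw2/db}"
SECONDARY_REGION="${SECONDARY_REGION:-us-west-2}"
OFFSITE_BUCKET="${OFFSITE_BUCKET:-s3://my-app-backups-offsite/db}"
OFFSITE_ENDPOINT="${OFFSITE_ENDPOINT:-https://s3.example-offsite-provider.com}"

replicate_backup() {
  # Usage: replicate_backup <local-backup-file>
  local file="$1" name rc=0
  name="$(basename "$file")"

  # Copy 2: same provider, different region from the primary database.
  aws s3 cp "$file" "${SECONDARY_BUCKET}/${name}" --region "$SECONDARY_REGION" || rc=1

  # Copy 3: different provider, reached through its S3-compatible endpoint.
  aws s3 cp "$file" "${OFFSITE_BUCKET}/${name}" --endpoint-url "$OFFSITE_ENDPOINT" || rc=1

  # Non-zero if either upload failed, so the caller can alert on it.
  return "$rc"
}
```

Both uploads are attempted even if the first one fails, so a single outage does not block the other copy. In practice the second provider would likely need its own credentials, for example through a separate AWS CLI profile.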

Backup types

Not all backups work the same way. Understanding the tradeoffs helps you pick the right strategy for each part of your system.

Full vs incremental vs differential

A full backup copies everything. An incremental backup copies only what changed since the last backup (of any kind). A differential backup copies what changed since the last full backup.

Full backup:         All data (slow, large, self-contained)
Incremental backup:  Changes since last backup (fast, small, requires chain)
Differential backup: Changes since last full (medium speed/size, simpler restore)

Incremental backups are efficient for storage, but restoring from them means replaying every backup in the chain since the last full backup. If any backup in the chain is corrupted, you can't restore beyond that point. For critical databases, consider keeping weekly full backups plus daily incrementals.
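
To make the restore chain concrete, here is a small sketch that works out which backups a restore needs from an ordered history. The restore_chain helper and the "type:name" entry format are invented for this illustration; real backup tools track this for you.

```shell
#!/usr/bin/env bash
# Sketch: list the backups needed to restore to a given point in the history.
# Entries are "<type>:<name>", where type is full, incr, or diff (hypothetical format).

restore_chain() {
  # Usage: restore_chain <target-index> <entry>...
  local target="$1"; shift
  local -a history=("$@")
  local chain="" skip_to_full=0 found_full=0 i type name

  # Walk backwards from the target until the most recent full backup.
  for (( i = target; i >= 0; i-- )); do
    type="${history[i]%%:*}"
    name="${history[i]#*:}"
    if [[ "$type" == "full" ]]; then
      chain="$name $chain"
      found_full=1
      break
    fi
    # After a differential, everything back to the last full can be skipped.
    if (( skip_to_full )); then
      continue
    fi
    chain="$name $chain"
    if [[ "$type" == "diff" ]]; then
      skip_to_full=1
    fi
  done

  # Without a full backup at the start, the chain cannot be restored.
  if (( ! found_full )); then
    return 1
  fi
  echo "${chain% }"
}

restore_chain 3 full:sun incr:mon incr:tue incr:wed   # prints: sun mon tue wed
restore_chain 3 full:sun diff:mon diff:tue diff:wed   # prints: sun wed
```

The incremental history needs all four backups to reach Wednesday, while the differential history needs only two, which is the "simpler restore" tradeoff described above.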

Database-specific backups

Most databases have purpose-built backup tools that understand transaction consistency, something file-level copies can't guarantee. (A transaction is a group of database operations that either all succeed together or all fail together, preventing partial updates.)

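# PostgreSQL: pg_dump takes a consistent snapshot, even while the database is in use
pg_dump -h localhost -U postgres -d mydb -F c -f "backup_$(date +%Y%m%d).dump"

# Restore a custom-format (-F c) dump into an existing database
pg_restore -h localhost -U postgres -d mydb backup_20240101.dump

# SQLite: use .backup, which is safe while the database is in use
# (a plain file copy is only safe when no writes are happening)
sqlite3 mydb.db ".backup 'backup_$(date +%Y%m%d).db'"

# MySQL: mysqldump; --single-transaction gives a consistent snapshot of InnoDB tables
mysqldump -u root -p --single-transaction mydb > "backup_$(date +%Y%m%d).sql"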
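
Since the goal is tested backups, a cheap sanity check right after each dump can catch obviously broken files early. The sketch below assumes the file extensions used in the commands above; verify_backup is a hypothetical helper name, and these checks do not replace a real restore test.

```shell
#!/usr/bin/env bash
# Sketch: quick sanity checks for a freshly written backup file.
# verify_backup is a hypothetical helper; extensions follow this lesson's examples.

verify_backup() {
  # Usage: verify_backup <backup-file>
  local file="$1"

  # An empty or missing file is never a valid backup.
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi

  case "$file" in
    *.dump)
      # pg_restore --list reads the archive's table of contents without restoring.
      pg_restore --list "$file" > /dev/null
      ;;
    *.db)
      # SQLite prints "ok" when the integrity check passes.
      [ "$(sqlite3 "$file" 'PRAGMA integrity_check;')" = "ok" ]
      ;;
    *)
      # Plain SQL dumps: only the non-empty check above applies.
      return 0
      ;;
  esac
}
```

A backup script could call verify_backup on the new file before uploading it and exit non-zero when the check fails, so the failure shows up in the cron log.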

Automating backups

A backup strategy that requires a human to remember to run it will fail. Automate everything.

Using cron jobs

# Edit your crontab
crontab -e

# Run a database backup every day at 2am
0 2 * * * /home/deploy/scripts/backup.sh >> /var/log/backup.log 2>&1

# Run a weekly full backup on Sundays at 3am
0 3 * * 0 /home/deploy/scripts/full-backup.sh >> /var/log/backup.log 2>&1
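
The backup.sh script called by the daily cron entry:

#!/bin/bash
# backup.sh - simple PostgreSQL backup to S3
# Expects DB_HOST and DB_USER in the environment. Cron does not load your
# shell profile, so set them in the crontab or source an env file here.

set -euo pipefail  # exit on errors, unset variables, and failed pipes

: "${DB_HOST:?DB_HOST is not set}"
: "${DB_USER:?DB_USER is not set}"

DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="myapp_production"
BACKUP_FILE="/tmp/backup_${DATE}.dump"
S3_BUCKET="s3://my-app-backups/db"

# Remove the local temp file on exit, even if a step fails
trap 'rm -f "$BACKUP_FILE"' EXIT

echo "Starting backup at ${DATE}"

# Create the backup
pg_dump -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -F c -f "$BACKUP_FILE"

# Upload to S3
aws s3 cp "$BACKUP_FILE" "${S3_BUCKET}/backup_${DATE}.dump"

echo "Backup completed successfully"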
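
One thing automation introduces is the risk of overlapping runs, for example when a slow backup is still going as the next cron slot fires. A minimal sketch of one way to guard against that, using a lock directory, follows; run_exclusive and the lock path are hypothetical names for this example.

```shell
#!/usr/bin/env bash
# Sketch: keep cron-launched backups from overlapping by taking a lock first.
# run_exclusive and the lock path are hypothetical, not part of any tool.

run_exclusive() {
  # Usage: run_exclusive <lock-dir> <command> [args...]
  local lock_dir="$1"; shift

  # mkdir is atomic: only one process can create the directory.
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "another run holds ${lock_dir}, skipping" >&2
    return 75
  fi

  # Run the wrapped command and remember its exit status.
  local status=0
  "$@" || status=$?

  # Release the lock whether the command succeeded or failed.
  rmdir "$lock_dir"
  return "$status"
}

# Example: run_exclusive /tmp/backup.lock /home/deploy/scripts/backup.sh
```

If the process is killed before it releases the lock, the directory stays behind and later runs are skipped, so a setup like this would also want an alert when backups stop appearing.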

Retention policies

Don't keep backups forever: storage costs add up, and you rarely need data from 3 years ago. A sensible default:

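Frequency   Keep for
Hourly      24 hours
Daily       30 days
Weekly      3 months
Monthly     1 year

# Delete S3 backups older than 30 days
# Assumes GNU date and grep, and object names that contain a YYYYMMDD date
aws s3 ls s3://my-app-backups/db/ \
  | awk '{print $4}' \
  | while read -r file; do
      stamp=$(grep -oP '\d{8}' <<< "$file" | head -n 1)
      [ -n "$stamp" ] || continue  # skip prefixes and names without a date
      age=$(( ( $(date +%s) - $(date -d "$stamp" +%s) ) / 86400 ))
      if [ "$age" -gt 30 ]; then
        aws s3 rm "s3://my-app-backups/db/$file"
      fi
    done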
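
The script above applies a single 30-day cutoff. A sketch of how the tiered table could be turned into a keep-or-delete decision follows. It assumes GNU date, ignores the hourly tier because it works on dates only, and treats Sunday backups as "weekly" and first-of-month backups as "monthly"; those conventions and the should_keep name are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a dated backup survives the tiered retention table.
# Assumes GNU date (-d). Sunday = weekly, 1st of month = monthly (assumed conventions).

should_keep() {
  # Usage: should_keep <backup-date YYYYMMDD> <today YYYYMMDD>
  local backup="$1" today="$2"
  local backup_s today_s age_days dow dom

  backup_s=$(date -u -d "$backup" +%s)
  today_s=$(date -u -d "$today" +%s)
  age_days=$(( (today_s - backup_s) / 86400 ))
  dow=$(date -u -d "$backup" +%u)   # 1 = Monday ... 7 = Sunday
  dom=$(date -u -d "$backup" +%d)   # day of month, zero-padded

  # Daily tier: keep everything for 30 days.
  if (( age_days <= 30 )); then return 0; fi
  # Weekly tier: keep Sunday backups for about 3 months.
  if (( dow == 7 && age_days <= 90 )); then return 0; fi
  # Monthly tier: keep first-of-month backups for 1 year.
  if [[ "$dom" == "01" ]] && (( age_days <= 365 )); then return 0; fi

  return 1
}
```

A cleanup loop could call should_keep with each object's date stamp and today's date, deleting only the objects for which it returns non-zero.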

RTO and RPO

Before you can evaluate a backup strategy, you need to know what you're optimizing for. Two metrics define your requirements.

RPO (Recovery Point Objective) is the maximum amount of data loss you can accept, expressed as time. If your RPO is 1 hour, you must take backups at least every hour. If you can't lose a single transaction, you need continuous replication.
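
One way to keep an eye on the RPO is to compare the age of the newest backup against it. The sketch below takes plain epoch timestamps so the arithmetic is easy to follow; rpo_met is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: check whether the newest backup is recent enough for a given RPO.
# rpo_met is a hypothetical helper; all arguments are plain numbers of seconds.

rpo_met() {
  # Usage: rpo_met <last-backup-epoch> <now-epoch> <rpo-seconds>
  local last="$1" now="$2" rpo="$3"
  local age=$(( now - last ))
  echo "newest backup is ${age}s old (RPO ${rpo}s)"
  # Succeeds only when the backup is no older than the RPO.
  (( age <= rpo ))
}

# Example with GNU stat, for a 1-hour RPO and a hypothetical latest.dump file:
#   rpo_met "$(stat -c %Y latest.dump)" "$(date +%s)" 3600
```

A monitoring job could run this every few minutes and alert when it fails, which catches a silently broken cron job before an incident does.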

RTO (Recovery Time Objective) is the maximum time it can take to restore service. If your RTO is 4 hours, you need to be able to restore everything (database, files, and configuration) within 4 hours.
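
The only reliable way to know whether you meet an RTO is to time an actual restore. A minimal sketch of a timed restore drill follows; restore_within_rto is a hypothetical helper, and the restore command is whatever script or command you would run during a real incident.

```shell
#!/usr/bin/env bash
# Sketch: time a restore drill and compare it against the RTO.
# restore_within_rto is a hypothetical helper name.

restore_within_rto() {
  # Usage: restore_within_rto <rto-seconds> <restore-command> [args...]
  local rto="$1"; shift
  local start=$SECONDS   # bash counter of whole seconds since shell start

  # A restore that fails outright is reported separately from a slow one.
  if ! "$@"; then
    echo "restore command failed" >&2
    return 2
  fi

  local elapsed=$(( SECONDS - start ))
  echo "restore took ${elapsed}s (RTO ${rto}s)"
  # Succeeds only when the restore finished within the RTO.
  (( elapsed <= rto ))
}

# Example for a 4-hour RTO, with a hypothetical restore script:
#   restore_within_rto 14400 /home/deploy/scripts/restore.sh
```

Running a drill like this on a schedule, against a scratch database, also doubles as a test that the backups can be restored at all.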

These are business decisions, not technical ones. Talk to your stakeholders. A startup might tolerate an RPO of 24 hours and an RTO of 8 hours. A payment processor might need an RPO of 0 (zero data loss) and an RTO of minutes. The tighter the requirements, the more expensive the solution.

Quick reference

Requirement          Approach
Low RPO (minutes)    Continuous replication, read replicas
Medium RPO (hours)   Hourly incremental backups
High RPO (days)      Daily full backups
Low RTO (minutes)    Hot standby, automatic failover
Medium RTO (hours)   Pre-configured restore scripts
High RTO (days)      Manual restore from cold storage