Your API is deployed and serving 50 users. Then a tweet goes viral, and suddenly 5,000 people hit your endpoint simultaneously. What happens next depends entirely on decisions you made before the traffic arrived. These are decisions AI never mentions, because they are not about code. They are about infrastructure.
Horizontal vs vertical scaling
There are two ways to handle more traffic:
| Approach | How it works | Example | Limit |
|---|---|---|---|
| Vertical | Bigger machine (more CPU, RAM) | Upgrade from 256MB to 2GB | Hardware ceiling |
| Horizontal | More machines running the same code | Run 5 instances instead of 1 | Practically unlimited |
Python APIs scale horizontally. As long as your FastAPI app is stateless (no data stored in memory between requests), you can run 10 copies of it behind a load balancer and each one handles a portion of the traffic.
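To see why statelessness matters, the sketch below simulates three instances behind a round-robin balancer. The `Instance` class, the `round_robin` helper, and the plain dict standing in for Redis or a database are hypothetical names used only for illustration, and real load balancers may use other distribution strategies.

```python
from itertools import cycle


class Instance:
    """Stand-in for one app process; `hits` lives in that process's memory."""

    def __init__(self, name):
        self.name = name
        self.hits = 0

    def handle(self, shared_store):
        self.hits += 1  # local state: invisible to the other instances
        shared_store["hits"] = shared_store.get("hits", 0) + 1  # shared state
        return self.name


def round_robin(instances, n_requests, shared_store):
    """Send n_requests to the instances in rotation, like a simple balancer."""
    rotation = cycle(instances)
    return [next(rotation).handle(shared_store) for _ in range(n_requests)]


instances = [Instance(f"instance-{i}") for i in (1, 2, 3)]
store = {}  # stands in for Redis or a database shared by all instances
served_by = round_robin(instances, 9, store)

local_counts = [instance.hits for instance in instances]
print(local_counts, store["hits"])
```

Each instance only sees a third of the requests in its own memory, while the shared store sees all nine. Anything that must be consistent across requests (sessions, counters, caches) belongs in the shared store, not in a module-level variable.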
```text
Load Balancer
├── Instance 1 (uvicorn) → handles requests A, D, G
├── Instance 2 (uvicorn) → handles requests B, E, H
└── Instance 3 (uvicorn) → handles requests C, F, I
```
Health checks that actually work
Every platform periodically pings your app to check if it is healthy. If the health check fails, the platform restarts the instance or removes it from the load balancer.
What AI generates
```python
@app.get("/health")
async def health():
    return {"status": "ok"}
```
This endpoint always returns 200. Even if the database is down. Even if Redis is unreachable. Even if the app is out of memory. The platform thinks everything is fine because the health check passes.
What production needs
```python
from fastapi import Depends
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.orm import Session

# `app`, `get_db`, and the async `redis` client are defined elsewhere in your app

@app.get("/health")
async def health(db: Session = Depends(get_db)):
    checks = {}

    # Check database connectivity
    try:
        db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "failed"

    # Check Redis connectivity
    try:
        await redis.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "failed"

    # Report 503 if any dependency failed so the platform can react
    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503
    return JSONResponse(
        content={"status": "healthy" if all_ok else "degraded", "checks": checks},
        status_code=status_code,
    )
```
Readiness vs liveness
Some platforms (Kubernetes, Fly.io) distinguish between two types of health checks:
| Type | Question it answers | What happens on failure |
|---|---|---|
| Liveness | "Is the process alive?" | Platform kills and restarts it |
| Readiness | "Can it handle traffic?" | Platform stops sending traffic (but keeps it alive) |
This distinction matters. If your app is alive but its database connection pool is exhausted, you want a readiness failure (stop traffic, let it recover), not a liveness failure (kill and restart, which makes things worse).
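One way to keep the two questions separate is to give each its own handler. The sketch below is framework-independent and uses only the standard library. The names `liveness`, `readiness`, and `run_checks`, the one-second timeout, and the fake `pool_exhausted` check are all assumptions made for illustration.

```python
import asyncio


async def run_checks(checks, timeout=1.0):
    """Run each dependency check with a timeout; map name -> 'ok' or 'failed'."""
    results = {}
    for name, check in checks.items():
        try:
            await asyncio.wait_for(check(), timeout)
            results[name] = "ok"
        except Exception:  # includes timeouts raised by wait_for
            results[name] = "failed"
    return results


def liveness():
    """Answers 'is the process alive?' and deliberately touches no dependencies."""
    return 200, {"status": "alive"}


async def readiness(checks, timeout=1.0):
    """Answers 'can it handle traffic?' by probing every dependency."""
    results = await run_checks(checks, timeout)
    ready = all(value == "ok" for value in results.values())
    body = {"status": "ready" if ready else "not ready", "checks": results}
    return (200 if ready else 503), body


async def pool_exhausted():
    # Hypothetical failing dependency check
    raise RuntimeError("connection pool exhausted")


live_status, _ = liveness()
ready_status, ready_body = asyncio.run(readiness({"database": pool_exhausted}))
print(live_status, ready_status, ready_body)
```

In a FastAPI app, each function would sit behind its own route (for example `/livez` and `/readyz`, a common naming convention rather than a requirement), and the platform's liveness and readiness probes would each point at one of them. With the pool exhausted, the process still reports alive, so it is taken out of rotation instead of being restarted.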
Zero-downtime deployment
When you deploy a new version, the platform needs to swap the old version for the new one without dropping requests.
Rolling updates
The standard approach:
- Start a new instance with the new code
- Wait for it to pass health checks
- Start routing traffic to the new instance
- Stop routing traffic to the old instance
- Send SIGTERM to the old instance
- Wait for in-flight requests to finish
- Kill the old instance
If any step fails, the platform rolls back and the old instance keeps running.
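The steps above can be sketched as a small simulation of swapping one instance. This is a toy model, not how any particular platform implements deploys. `rolling_update`, `start_new`, and `passes_health_check` are hypothetical names, and only the health-check failure path is modeled.

```python
def rolling_update(old, start_new, passes_health_check):
    """Swap one old instance for a new one; return (serving, log)."""
    log = ["start new instance"]
    new = start_new()

    log.append("wait for health checks")
    if not passes_health_check(new):
        # Rollback: the new instance never receives traffic
        return [old], log

    log += [
        "route traffic to new instance",
        "stop routing traffic to old instance",
        "send SIGTERM to old instance",
        "wait for in-flight requests to finish",
        "kill old instance",
    ]
    return [new], log


serving_ok, log_ok = rolling_update("v1", lambda: "v2", lambda inst: True)
serving_bad, log_bad = rolling_update("v1", lambda: "v2", lambda inst: False)
print(serving_ok, serving_bad)
```

The key property is the ordering: the old instance is only told to shut down after the new one has proven it can serve traffic, so a bad release never leaves you with zero healthy instances.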
Graceful shutdown in Python
When the platform sends SIGTERM, your app needs to finish current requests before exiting. Uvicorn handles this by default: it stops accepting new connections and waits for in-flight requests to finish. You can also cap how long it waits with an explicit grace period.
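To make the idea of a grace period concrete, here is a standard-library sketch of draining in-flight work with a deadline. It does not reproduce Uvicorn's internals. `handle_request`, `drain`, and the durations are hypothetical and chosen only so the example runs quickly.

```python
import asyncio


async def handle_request(request_id, duration, completed):
    """Pretend to serve a request that takes `duration` seconds."""
    await asyncio.sleep(duration)
    completed.append(request_id)


async def drain(in_flight, grace_period):
    """Wait up to grace_period seconds for in-flight tasks, then cancel the rest."""
    if not in_flight:
        return set(), set()
    done, pending = await asyncio.wait(in_flight, timeout=grace_period)
    for task in pending:
        task.cancel()
    # Give cancelled tasks a chance to process the cancellation
    await asyncio.gather(*pending, return_exceptions=True)
    return done, pending


async def main():
    completed = []
    tasks = {
        asyncio.create_task(handle_request("fast", 0.01, completed)),
        asyncio.create_task(handle_request("slow", 5.0, completed)),
    }
    done, cancelled = await drain(tasks, grace_period=0.2)
    return completed, len(done), len(cancelled)


completed, n_done, n_cancelled = asyncio.run(main())
print(completed, n_done, n_cancelled)
```

The fast request finishes inside the grace period; the slow one is cut off when the deadline passes. Choosing the grace period is a trade-off between losing long requests and delaying the deploy.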
```bash
# uvicorn handles SIGTERM gracefully by default:
# it stops accepting new connections and waits for in-flight requests.
# Start command with an explicit grace period (in seconds):
uvicorn main:app --host 0.0.0.0 --port $PORT --timeout-graceful-shutdown 30
```
For background tasks or cleanup:
```python
from contextlib import asynccontextmanager

from fastapi import FastAPI

# `database` and `redis` are your app's own client objects, created elsewhere

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    print("Starting up - connecting to databases")
    yield
    # Shutdown - runs when the server exits, including after SIGTERM
    print("Shutting down - closing connections")
    await database.disconnect()
    await redis.close()

app = FastAPI(lifespan=lifespan)
```
Cold starts and auto-scaling
Auto-scaling means the platform adds instances when traffic increases and removes them when it decreases. The challenge is cold start time: how long it takes to go from "spin up new instance" to "ready to serve requests."
| Component | Typical cold start time |
|---|---|
| Container creation | 2-5 seconds |
| Python startup | 1-2 seconds |
| Dependency imports | 1-5 seconds (depends on package size) |
| Database connection pool | 0.5-2 seconds |
| ML model loading | 10-60 seconds |
| Total (typical API) | 5-15 seconds |
During cold start, requests either queue (adding latency) or fail (returning errors). This is why auto-scaling is not a magic solution: if traffic spikes faster than your instances can start, users see errors.
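A rough back-of-the-envelope calculation shows how quickly a backlog builds. The function name and all the numbers below are assumptions for illustration; measure your own arrival rate, per-instance capacity, and cold start time.

```python
def requests_waiting_on_cold_start(arrival_rps, warm_capacity_rps, cold_start_s):
    """Requests that must queue or fail while new instances are still starting."""
    overflow_rps = max(0, arrival_rps - warm_capacity_rps)
    return overflow_rps * cold_start_s


# Assumed: 500 requests/s arrive, warm instances handle 100 requests/s,
# and a new instance needs 10 seconds before it can serve traffic.
backlog = requests_waiting_on_cold_start(
    arrival_rps=500, warm_capacity_rps=100, cold_start_s=10
)
no_spike = requests_waiting_on_cold_start(
    arrival_rps=80, warm_capacity_rps=100, cold_start_s=10
)
print(backlog, no_spike)  # 4000 0
```

Under these assumptions, 4,000 requests arrive before any new capacity is ready. Shortening the cold start or keeping more warm capacity are the two levers that shrink that number.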
Strategies to reduce cold start impact:
- Keep minimum instances running: most platforms let you set min_instances = 1 so there is always a warm instance
- Use slim Docker images: fewer layers, faster pull times
- Lazy-load heavy dependencies: do not load your ML model at module level
- Pre-warm connection pools: establish database connections during startup, not on first request
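As one example of the lazy-loading strategy, the sketch below defers an expensive load until the first call and reuses the result afterwards. `get_model`, `predict`, the fake model dict, and the `time.sleep` stand-in are hypothetical; a real loader would read weights from disk or a model registry.

```python
import functools
import time

load_calls = []  # only here to make the behaviour observable


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the heavy model on first use; later calls return the cached object."""
    load_calls.append(time.monotonic())
    time.sleep(0.05)  # stands in for a slow model load
    return {"name": "demo-model"}


def predict(text):
    model = get_model()  # nothing is loaded until the first prediction
    return f"{model['name']} scored {len(text)}"


first = predict("hello")
second = predict("hello again")
print(first, "| loads:", len(load_calls))
```

Importing this module is fast because nothing heavy runs at import time. The trade-off is that the first request pays the load cost, so for dependencies that every request needs (such as the database pool) pre-warming during startup is usually the better fit.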
Quick reference
| Concept | What to do | What to avoid |
|---|---|---|
| Scaling | Horizontal (more instances) | Global variables for shared state |
| Health checks | Check dependencies (DB, Redis) | Unconditional 200 OK |
| Deploys | Rolling updates with readiness checks | Stop-start with downtime |
| Shutdown | Handle SIGTERM, finish requests | Hard kill with lost requests |
| Cold starts | Min instances, slim images, lazy imports | Loading everything at module level |