Shipping Python APIs - Scaling, health checks, and zero-downtime

Your API is deployed and serving 50 users. Then a tweet goes viral, and suddenly 5,000 people hit your endpoint simultaneously. What happens next depends entirely on decisions you made before the traffic arrived, decisions AI never mentions because they are not about code. They are about infrastructure.

Horizontal vs vertical scaling

There are two ways to handle more traffic:

Approach	How it works	Example	Limit
Vertical	Bigger machine (more CPU, RAM)	Upgrade from 256MB to 2GB	Hardware ceiling
Horizontal	More machines running the same code	Run 5 instances instead of 1	Practically unlimited

Python APIs scale horizontally. As long as your FastAPI app is stateless (no data kept in memory between requests), you can run 10 copies of it behind a load balancer and each one handles a portion of the traffic.

Load Balancer
├── Instance 1 (uvicorn) → handles requests A, D, G
├── Instance 2 (uvicorn) → handles requests B, E, H
└── Instance 3 (uvicorn) → handles requests C, F, I
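
The distribution in the diagram can be reproduced with a small round-robin sketch. Round-robin is an assumption here: real load balancers may use other strategies (least connections, weighted routing), and the function name is purely illustrative.

```python
from itertools import cycle


def assign_round_robin(requests, instances):
    """Assign each request to the next instance in turn."""
    assignment = {name: [] for name in instances}
    next_instance = cycle(instances)
    for request in requests:
        assignment[next(next_instance)].append(request)
    return assignment


instances = ["Instance 1", "Instance 2", "Instance 3"]
result = assign_round_robin(list("ABCDEFGHI"), instances)

for name, handled in result.items():
    print(name, "->", ", ".join(handled))
```

Because no instance remembers anything about earlier requests, it does not matter which one a given request lands on.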

AI pitfall

AI sometimes stores state in global variables or module-level dictionaries (like an in-memory cache or rate limiter). This breaks with horizontal scaling because each instance has its own memory. If instance 1 caches a user's data, instance 2 does not have it. Use Redis or a database for shared state.
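
A minimal simulation of the pitfall, assuming two instances of the same app. The `Instance` class is hypothetical, and a plain dictionary stands in for Redis or a database so the sketch runs without any external service.

```python
class Instance:
    """One copy of the app with its own process memory."""

    def __init__(self, name, shared_store):
        self.name = name
        self.local_cache = {}             # lives only in this instance's memory
        self.shared_store = shared_store  # stand-in for Redis or a database

    def cache_locally(self, key, value):
        self.local_cache[key] = value

    def cache_shared(self, key, value):
        self.shared_store[key] = value

    def read_local(self, key):
        return self.local_cache.get(key)

    def read_shared(self, key):
        return self.shared_store.get(key)


shared_store = {}  # hypothetical stand-in for an external store
instance_1 = Instance("instance-1", shared_store)
instance_2 = Instance("instance-2", shared_store)

# Instance 1 caches the same user both ways.
instance_1.cache_locally("user:42", {"name": "Ada"})
instance_1.cache_shared("user:42", {"name": "Ada"})

# Instance 2 only sees what went through the shared store.
local_miss = instance_2.read_local("user:42")
shared_hit = instance_2.read_shared("user:42")
print(local_miss, shared_hit)
```

The same reasoning applies to an in-memory rate limiter: each instance would count only the requests it happened to receive.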

Health checks that actually work

Every platform periodically pings your app to check if it is healthy. If the health check fails, the platform restarts the instance or removes it from the load balancer.

What AI generates

@app.get("/health")
async def health():
    return {"status": "ok"}

This endpoint always returns 200. Even if the database is down. Even if Redis is unreachable. Even if the app is out of memory. The platform thinks everything is fine because the health check passes.

What production needs

from fastapi import Depends
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.orm import Session

# `app`, `get_db`, and the async `redis` client are defined elsewhere in your app.

@app.get("/health")
async def health(db: Session = Depends(get_db)):
    checks = {}

    # Check database connectivity with the cheapest possible query
    try:
        db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "failed"

    # Check Redis connectivity
    try:
        await redis.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "failed"

    # Any failed dependency turns the response into a 503
    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503

    return JSONResponse(
        content={"status": "healthy" if all_ok else "degraded", "checks": checks},
        status_code=status_code,
    )
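
One refinement worth considering is a time limit on each check, so a dependency that hangs cannot hang the health endpoint itself. The following is a framework-free sketch under that assumption: `run_checks` and the two fake pings are hypothetical names, and the fake pings stand in for real database and Redis calls.

```python
import asyncio


async def run_checks(checks, timeout_seconds=2.0):
    """Run each async check with a time limit and summarize the results."""
    results = {}
    for name, check in checks.items():
        try:
            await asyncio.wait_for(check(), timeout=timeout_seconds)
            results[name] = "ok"
        except Exception:
            # Timeouts and connection errors both count as a failed dependency.
            results[name] = "failed"
    all_ok = all(value == "ok" for value in results.values())
    status_code = 200 if all_ok else 503
    body = {"status": "healthy" if all_ok else "degraded", "checks": results}
    return status_code, body


# Hypothetical checks standing in for real database and Redis pings.
async def fake_database_ping():
    return True


async def fake_redis_ping():
    await asyncio.sleep(10)  # simulates a dependency that hangs


status_code, body = asyncio.run(
    run_checks(
        {"database": fake_database_ping, "redis": fake_redis_ping},
        timeout_seconds=0.1,
    )
)
print(status_code, body)
```

The timeout should stay well below the interval at which the platform polls the endpoint; the right value depends on your platform's settings.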

Readiness vs liveness

Some platforms (Kubernetes, Fly.io) distinguish between two types of health checks:

Type	Question it answers	What happens on failure
Liveness	"Is the process alive?"	Platform kills and restarts it
Readiness	"Can it handle traffic?"	Platform stops sending traffic (but keeps it alive)

This distinction matters. If your app is alive but its database connection pool is exhausted, you want a readiness failure (stop traffic, let it recover), not a liveness failure (kill and restart, which makes things worse).
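
The connection pool case can be sketched as two separate status functions. This is an illustration only: `PoolStats`, `liveness_status`, and `readiness_status` are hypothetical names, and a real app would read the numbers from its actual pool and expose each function on its own endpoint.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Hypothetical snapshot of a database connection pool."""
    in_use: int
    size: int


def liveness_status():
    """The process is running and able to answer, so report alive."""
    return 200


def readiness_status(pool):
    """Report 503 while no pooled connection is free to serve a request."""
    return 200 if pool.in_use < pool.size else 503


exhausted = PoolStats(in_use=10, size=10)
recovered = PoolStats(in_use=3, size=10)

# Exhausted pool: still alive, but not ready for traffic.
print(liveness_status(), readiness_status(exhausted))
# Once connections free up, the instance becomes ready again without a restart.
print(liveness_status(), readiness_status(recovered))
```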

Zero-downtime deployment

When you deploy a new version, the platform needs to swap the old version for the new one without dropping requests.

Rolling updates

The standard approach:

Start a new instance with the new code
Wait for it to pass health checks
Start routing traffic to the new instance
Stop routing traffic to the old instance
Send SIGTERM to the old instance
Wait for in-flight requests to finish
Kill the old instance

If the new instance never passes its health checks, the platform abandons the rollout and the old instance keeps running.
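
The sequence can be walked through in a toy simulation. Nothing here talks to a real platform: `rolling_update` is a hypothetical function that only records the steps, and the health check is passed in as a plain callable.

```python
def rolling_update(old_instance, new_instance, passes_health_check):
    """Walk the rollout steps and return (serving_instance, log)."""
    log = [f"start {new_instance}"]
    if not passes_health_check(new_instance):
        # Roll back: the new instance never receives traffic.
        log.append(f"health check failed, stop {new_instance}")
        return old_instance, log
    log.append(f"route traffic to {new_instance}")
    log.append(f"stop routing traffic to {old_instance}")
    log.append(f"send SIGTERM to {old_instance}")
    log.append(f"wait for in-flight requests on {old_instance}")
    log.append(f"kill {old_instance}")
    return new_instance, log


serving_ok, log_ok = rolling_update("v1", "v2", lambda name: True)
serving_bad, log_bad = rolling_update("v1", "v2", lambda name: False)

print(serving_ok, "|", log_ok[-1])
print(serving_bad, "|", log_bad[-1])
```

The key property is that the old instance is only touched after the new one has proven itself, which is why a meaningful health check matters so much for deploys.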

Graceful shutdown in Python

When the platform sends SIGTERM, your app needs to finish current requests before exiting. Uvicorn handles this by default: it stops accepting new connections and waits for in-flight requests to finish. You can cap how long it waits with --timeout-graceful-shutdown.

# uvicorn handles SIGTERM gracefully by default
# It stops accepting new connections and waits for in-flight requests

# Start command with explicit grace period
# uvicorn main:app --host 0.0.0.0 --port $PORT --timeout-graceful-shutdown 30

For background tasks or cleanup:

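from contextlib import asynccontextmanager

from fastapi import FastAPI

# `database` and `redis` are the client objects created elsewhere in your app.

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: runs once before the app accepts requests
    print("Starting up - connecting to databases")
    yield
    # Shutdown: runs after the server receives SIGTERM and stops serving
    print("Shutting down - closing connections")
    await database.disconnect()
    await redis.close()  # newer redis-py releases name this aclose()

app = FastAPI(lifespan=lifespan)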

Good to know

The default graceful shutdown timeout in most platforms is 10-30 seconds. If your request takes longer than that (e.g., a large file upload), the platform kills the process. For long-running operations, use background tasks or a task queue instead of blocking the request.
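
What a grace period does to in-flight work can be sketched with plain asyncio. This is not how Uvicorn or any platform implements shutdown. It is a simplified model with hypothetical names (`handle_request`, `drain`) and durations shortened so it runs quickly.

```python
import asyncio


async def handle_request(duration, completed, label):
    """Pretend to serve a request that takes `duration` seconds."""
    await asyncio.sleep(duration)
    completed.append(label)


async def drain(in_flight, grace_seconds):
    """Wait up to grace_seconds for in-flight tasks, then cancel the rest."""
    done, pending = await asyncio.wait(in_flight, timeout=grace_seconds)
    for task in pending:
        task.cancel()  # what a hard kill would do to a slow request
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def main():
    completed = []
    in_flight = [
        asyncio.create_task(handle_request(0.05, completed, "fast")),
        asyncio.create_task(handle_request(5.0, completed, "slow upload")),
    ]
    finished, cut_off = await drain(in_flight, grace_seconds=0.5)
    return completed, finished, cut_off


completed, finished, cut_off = asyncio.run(main())
print(completed, finished, cut_off)
```

The fast request finishes inside the grace period; the slow one is cut off, which is the case a task queue is meant to avoid.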

Cold starts and auto-scaling

Auto-scaling means the platform adds instances when traffic increases and removes them when it decreases. The challenge is cold start time: how long it takes to go from "spin up new instance" to "ready to serve requests."

Component	Typical cold start time
Container creation	2-5 seconds
Python startup	1-2 seconds
Dependency imports	1-5 seconds (depends on package size)
Database connection pool	0.5-2 seconds
ML model loading	10-60 seconds
Total (typical API)	5-15 seconds

During a cold start, requests either queue (adding latency) or fail (returning errors). This is why auto-scaling is not a magic solution: if traffic spikes faster than your instances can start, users see errors.
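
A back-of-the-envelope estimate makes the problem concrete. The numbers below are hypothetical, and the model is deliberately crude: it assumes a constant arrival rate and a fixed per-instance capacity.

```python
import math


def requests_waiting_during_cold_start(
    arrival_rate, capacity_per_instance, warm_instances, cold_start_seconds
):
    """Estimate requests that exceed warm capacity while new instances boot."""
    warm_capacity = capacity_per_instance * warm_instances
    excess_rate = max(0, arrival_rate - warm_capacity)
    return excess_rate * cold_start_seconds


def instances_needed(arrival_rate, capacity_per_instance):
    """Smallest number of instances that can absorb the arrival rate."""
    return math.ceil(arrival_rate / capacity_per_instance)


# Hypothetical numbers: 500 req/s arriving, each instance serves 100 req/s,
# one warm instance, 10-second cold start.
backlog = requests_waiting_during_cold_start(500, 100, 1, 10)
needed = instances_needed(500, 100)
print(backlog, needed)  # 4000 5
```

Under these assumptions, 4,000 requests are queued or rejected before the extra instances are ready, even though the platform reacted immediately.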

Strategies to reduce cold start impact:

Keep minimum instances running: most platforms let you set a minimum instance count (for example, min_instances = 1) so there is always a warm instance
Use slim Docker images: smaller images, faster pull times
Lazy-load heavy dependencies: do not load your ML model at module level
Pre-warm connection pools: establish database connections during startup, not on first request
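
Lazy loading can be sketched with a cached loader function. The model here is a placeholder dictionary and `get_model` / `predict` are hypothetical names; a real loader would read weights from disk or a model registry.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=1)
def get_model():
    """Load the model on first use and reuse it afterwards."""
    time.sleep(0.05)  # stand-in for an expensive load
    return {"name": "demo-model"}  # hypothetical model object


def predict(text):
    model = get_model()  # loads once, then returns the cached object
    return f"{model['name']} saw {len(text)} characters"


first = predict("hello")
second = predict("hello again")
loads = get_model.cache_info().misses  # how many times the loader really ran
print(first, "| loads:", loads)
```

The trade-off is that the first request that needs the model pays the load cost. If that is unacceptable, one option is to call the loader during startup instead, accepting a longer cold start in exchange for a fast first request.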

Quick reference

Concept	What to do	What to avoid
Scaling	Horizontal (more instances)	Global variables for shared state
Health checks	Check dependencies (DB, Redis)	Unconditional `200 OK`
Deploys	Rolling updates with readiness checks	Stop-start with downtime
Shutdown	Handle SIGTERM, finish requests	Hard kill with lost requests
Cold starts	Min instances, slim images, lazy imports	Loading everything at module level