Your API is deployed and serving 50 users. Then a tweet goes viral, and suddenly 5,000 people hit your endpoint simultaneously. What happens next depends entirely on decisions you made before the traffic arrived, decisions AI never mentions because they are not about code. They are about infrastructure.

Horizontal vs vertical scaling

There are two ways to handle more traffic:

| Approach | How it works | Example | Limit |
| --- | --- | --- | --- |
| Vertical | Bigger machine (more CPU, RAM) | Upgrade from 256 MB to 2 GB | Hardware ceiling |
| Horizontal | More machines running the same code | Run 5 instances instead of 1 | Practically unlimited |

Python APIs scale horizontally. As long as your FastAPI app is stateless (no data stored in memory between requests), you can run 10 copies of it behind a load balancer and each one handles a portion of the traffic.

Load Balancer
├── Instance 1 (uvicorn) → handles requests A, D, G
├── Instance 2 (uvicorn) → handles requests B, E, H
└── Instance 3 (uvicorn) → handles requests C, F, I
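The distribution in the diagram matches a round-robin strategy, which can be sketched in a few lines. This is an illustration only: the `round_robin` function and the instance names are hypothetical, and real load balancers may use other strategies.

```python
from itertools import cycle


def round_robin(requests, instances):
    """Assign each request to the next instance in turn."""
    assignment = {name: [] for name in instances}
    # cycle() repeats the instance names forever; zip() stops when requests run out.
    for request, name in zip(requests, cycle(instances)):
        assignment[name].append(request)
    return assignment


routes = round_robin("ABCDEFGHI", ["instance-1", "instance-2", "instance-3"])
print(routes["instance-1"])  # ['A', 'D', 'G']
```

Because any instance may receive any request, this only works if no instance relies on data left in memory by a previous request.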
AI pitfall
AI sometimes stores state in global variables or module-level dictionaries (like an in-memory cache or rate limiter). This breaks with horizontal scaling because each instance has its own memory. If instance 1 caches a user's data, instance 2 does not have it. Use Redis or a database for shared state.
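A small simulation makes the pitfall concrete. The `Instance` class is a hypothetical stand-in for one copy of the app, and a plain dict stands in for Redis or a database; a real shared store would live outside every process.

```python
class Instance:
    """One copy of the app with its own private memory (hypothetical stand-in)."""

    def __init__(self, shared_store):
        self.local_counts = {}            # like a module-level dict: one per process
        self.shared_store = shared_store  # stands in for Redis or a database

    def hit_local(self, user):
        self.local_counts[user] = self.local_counts.get(user, 0) + 1
        return self.local_counts[user]

    def hit_shared(self, user):
        self.shared_store[user] = self.shared_store.get(user, 0) + 1
        return self.shared_store[user]


shared = {}  # assumption: a dict playing the role of the external store
a, b = Instance(shared), Instance(shared)

# Per-instance counting: each instance sees only the requests it handled.
local_seen = [a.hit_local("alice"), b.hit_local("alice"), a.hit_local("alice")]
# Shared counting: every instance sees the same running total.
shared_seen = [a.hit_shared("alice"), b.hit_shared("alice"), a.hit_shared("alice")]

print(local_seen)   # [1, 1, 2]
print(shared_seen)  # [1, 2, 3]
```

A rate limiter built on the local counts would let this user through three times as often as intended with three instances. With a real shared store, the read-then-write shown here should be replaced by an atomic operation such as Redis `INCR`.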

Health checks that actually work

Every platform periodically pings your app to check if it is healthy. If the health check fails, the platform restarts the instance or removes it from the load balancer.

What AI generates

@app.get("/health")
async def health():
    return {"status": "ok"}

This endpoint always returns 200. Even if the database is down. Even if Redis is unreachable. Even if the app is out of memory. The platform thinks everything is fine because the health check passes.

What production needs

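from fastapi import Depends
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.orm import Session

# `app`, `get_db`, and the async `redis` client are assumed to be defined elsewhere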

@app.get("/health")
async def health(db: Session = Depends(get_db)):
    checks = {}

    # Check database connectivity
    try:
        db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "failed"

    # Check Redis connectivity
    try:
        await redis.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "failed"

    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503

    return JSONResponse(
        content={"status": "healthy" if all_ok else "degraded", "checks": checks},
        status_code=status_code,
    )

Readiness vs liveness

Some platforms (Kubernetes, Fly.io) distinguish between two types of health checks:

| Type | Question it answers | What happens on failure |
| --- | --- | --- |
| Liveness | "Is the process alive?" | Platform kills and restarts it |
| Readiness | "Can it handle traffic?" | Platform stops sending traffic (but keeps it alive) |

This distinction matters. If your app is alive but its database connection pool is exhausted, you want a readiness failure (stop traffic, let it recover), not a liveness failure (kill and restart, which makes things worse).
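The split can be sketched as two framework-agnostic functions that return a status code and a body. The names `liveness` and `readiness` and the shape of the `checks` dict are assumptions for illustration; in a real app each would back its own endpoint.

```python
def liveness():
    """Liveness: the process answered, so it is alive. No dependency checks."""
    return 200, {"status": "alive"}


def readiness(checks):
    """Readiness: 200 only if every dependency check passed, otherwise 503."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    if failed:
        return 503, {"status": "not ready", "failed": failed}
    return 200, {"status": "ready"}


# Hypothetical situation: the database pool is exhausted, Redis is fine.
pool_exhausted = {"database": False, "redis": True}
live_status, _ = liveness()
ready_status, ready_body = readiness(pool_exhausted)
print(live_status, ready_status, ready_body["failed"])  # 200 503 ['database']
```

Keeping dependency checks out of the liveness path is the point: a database outage should take the instance out of rotation, not trigger a restart loop.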


Zero-downtime deployment

When you deploy a new version, the platform needs to swap the old version for the new one without dropping requests.

Rolling updates

The standard approach:

  1. Start a new instance with the new code
  2. Wait for it to pass health checks
  3. Start routing traffic to the new instance
  4. Stop routing traffic to the old instance
  5. Send SIGTERM to the old instance
  6. Wait for in-flight requests to finish
  7. Kill the old instance

If any step fails, the platform rolls back and the old instance keeps running.
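The sequence above can be sketched as a small simulation for a single old/new pair. Everything here is hypothetical: `rolling_swap` and the `passes_health_check` callback stand in for what the platform does internally, and no real processes are started.

```python
def rolling_swap(old, new, passes_health_check):
    """Swap one old instance for a new one, keeping the old one on failure."""
    events = [f"start {new}"]                      # step 1
    if not passes_health_check(new):               # step 2
        # Rollback: the old instance was never touched, so it keeps serving.
        events.append(f"stop {new}")
        return old, events
    events += [
        f"route traffic to {new}",                 # step 3
        f"stop routing to {old}",                  # step 4
        f"SIGTERM {old}",                          # step 5
        f"drain {old}",                            # step 6
        f"kill {old}",                             # step 7
    ]
    return new, events


serving, log = rolling_swap("v1", "v2", lambda name: True)
print(serving, len(log))  # v2 6

serving_after_failure, failure_log = rolling_swap("v1", "v2", lambda name: False)
print(serving_after_failure, failure_log)  # v1 ['start v2', 'stop v2']
```

The old instance is only signalled after the new one is already receiving traffic, which is why a failed health check costs nothing but the aborted start.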

Graceful shutdown in Python

When the platform sends SIGTERM, your app needs to finish current requests before exiting. Uvicorn handles this by default: it stops accepting new connections and waits for in-flight requests to complete. You can cap how long it waits with an explicit grace period.

# uvicorn handles SIGTERM gracefully by default
# It stops accepting new connections and waits for in-flight requests

# Start command with explicit grace period
# uvicorn main:app --host 0.0.0.0 --port $PORT --timeout-graceful-shutdown 30
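To see what a grace period means in practice, here is a minimal asyncio sketch of draining in-flight work with a deadline. It is not Uvicorn's implementation; `drain` and `handle_request` are hypothetical names, and the timings are shortened so the example runs quickly.

```python
import asyncio


async def drain(in_flight, grace_seconds):
    """Wait up to grace_seconds for in-flight tasks, then cancel the rest."""
    if not in_flight:
        return 0, 0
    done, pending = await asyncio.wait(in_flight, timeout=grace_seconds)
    for task in pending:
        task.cancel()
    # Let cancelled tasks finish unwinding before the process exits.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)


async def handle_request(seconds):
    await asyncio.sleep(seconds)  # stands in for real request handling


async def main():
    tasks = {
        asyncio.create_task(handle_request(0.01)),  # finishes within the grace period
        asyncio.create_task(handle_request(5.0)),   # too slow, gets cut off
    }
    return await drain(tasks, grace_seconds=0.2)


finished, cancelled = asyncio.run(main())
print(finished, cancelled)  # 1 1
```

The second request is the one a user would see fail, which is why long-running work belongs outside the request path.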

For background tasks or cleanup:

from contextlib import asynccontextmanager

from fastapi import FastAPI

# `database` and `redis` are client objects assumed to be created elsewhere

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    print("Starting up - connecting to databases")
    yield
    # Shutdown - this runs on SIGTERM
    print("Shutting down - closing connections")
    await database.disconnect()
    await redis.close()

app = FastAPI(lifespan=lifespan)
Good to know
The default graceful shutdown timeout in most platforms is 10-30 seconds. If your request takes longer than that (e.g., a large file upload), the platform kills the process. For long-running operations, use background tasks or a task queue instead of blocking the request.

Cold starts and auto-scaling

Auto-scaling means the platform adds instances when traffic increases and removes them when it decreases. The challenge is cold start time: how long it takes to go from "spin up new instance" to "ready to serve requests."

| Component | Typical cold start time |
| --- | --- |
| Container creation | 2-5 seconds |
| Python startup | 1-2 seconds |
| Dependency imports | 1-5 seconds (depends on package size) |
| Database connection pool | 0.5-2 seconds |
| ML model loading | 10-60 seconds |
| Total (typical API) | 5-15 seconds |

During cold start, requests either queue (adding latency) or fail (returning errors). This is why auto-scaling is not a magic solution: if traffic spikes faster than your instances can start, users see errors.
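A rough back-of-the-envelope calculation shows how quickly a backlog builds. All numbers below are made up for illustration, and the model is deliberately simple: it assumes a constant spike rate and that no new instance serves anything until the cold start finishes.

```python
def backlog_during_cold_start(spike_rps, per_instance_rps, warm_instances, cold_start_seconds):
    """Requests arriving beyond warm capacity while new instances are still starting."""
    warm_capacity = per_instance_rps * warm_instances
    overflow_rps = max(0, spike_rps - warm_capacity)
    return overflow_rps * cold_start_seconds


# Hypothetical spike: 500 requests/s against one warm instance that handles 100/s,
# with a 10-second cold start for every additional instance.
backlog = backlog_during_cold_start(
    spike_rps=500, per_instance_rps=100, warm_instances=1, cold_start_seconds=10
)
print(backlog)  # 4000
```

Under these assumptions, 4,000 requests are queued or rejected before the first new instance is ready. Raising the number of warm instances or shortening the cold start both shrink that figure.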

Strategies to reduce cold start impact:

  • Keep minimum instances running: most platforms let you set min_instances = 1 so there is always a warm instance
  • Use slim Docker images: fewer layers, faster pull times
  • Lazy-load heavy dependencies: do not import your ML model at module level
  • Pre-warm connection pools: establish database connections during startup, not on first request
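The lazy-loading strategy can be sketched with a cached accessor. This is a minimal illustration: `get_model` and the dict it returns are placeholders for whatever expensive object your app loads, and `load_count` exists only to show how often the load runs.

```python
from functools import lru_cache

load_count = 0  # only here to show how many times the expensive load runs


@lru_cache(maxsize=1)
def get_model():
    """Load the model on first use and reuse the same object afterwards."""
    global load_count
    load_count += 1
    return {"name": "demo-model"}  # placeholder for a slow, real model load


# Importing this module costs nothing; the load happens on the first call.
first = get_model()
second = get_model()
print(load_count, first is second)  # 1 True
```

The trade-off is that the first request that needs the model pays the loading cost. For resources every request needs, such as the connection pool, loading during startup is usually the better fit.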


Quick reference

| Concept | What to do | What to avoid |
| --- | --- | --- |
| Scaling | Horizontal (more instances) | Global variables for shared state |
| Health checks | Check dependencies (DB, Redis) | Unconditional 200 OK |
| Deploys | Rolling updates with readiness checks | Stop-start with downtime |
| Shutdown | Handle SIGTERM, finish requests | Hard kill with lost requests |
| Cold starts | Min instances, slim images, lazy imports | Loading everything at module level |