Shipping Python APIs - Debugging in production

Everything in this module has been building toward this moment: something is broken in production and you need to fix it. Your structured logs, Sentry alerts, and Prometheus metrics are the tools you will use. But tools alone do not solve production problems; the debugging workflow is a skill that develops through practice and discipline.

This lesson is marked DEEP because production debugging requires synthesizing everything you have learned. It is not about memorizing steps; it is about developing the judgment to navigate ambiguity under pressure.

The production debugging workflow

When an alert fires, resist the urge to start reading code. The workflow is:

Alert fires (Sentry, PagerDuty, customer report)
     |
1. Assess impact (how many users? which endpoints? since when?)
     |
2. Check recent changes (deploys, config changes, dependency updates)
     |
3. Read logs (correlation ID → full request trace)
     |
4. Check metrics (latency spike? error rate? saturation?)
     |
5. Form hypothesis and reproduce locally
     |
6. Fix → Deploy → Verify
     |
7. Post-mortem (if severe)

Each step narrows the search space. Skipping steps wastes time: you end up reading random code instead of systematically eliminating possibilities.

Step 1: Assess the impact

Before debugging, understand the scope of the impact. This determines urgency and whether you need a hotfix or can take time for a proper investigation.

Question	Where to find the answer
How many users are affected?	Sentry issue → "Users" count
When did it start?	Sentry → "First seen" / Grafana → error rate graph
Which endpoints are affected?	Prometheus → error rate by endpoint
Is it getting worse?	Grafana → error rate trend over last hour
Is the service degraded or fully down?	Health check endpoint + external monitoring

If the service is fully down, skip to "Hotfixes" below. If it is degraded (intermittent errors, slow responses), continue the investigation.

Step 2: Check recent changes

Most production bugs are caused by something that changed. Start with the most likely suspects:

1. Recent deploy? → git log --since="2 hours ago" --oneline
2. Config change? → Check environment variable history
3. Dependency update? → Check requirements.txt / poetry.lock diff
4. Infrastructure change? → Check cloud provider status page
5. Traffic spike? → Check request rate metrics

If a deploy happened 30 minutes before the errors started, that deploy is the prime suspect. You do not need to prove it yet; you only need to narrow the search.
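
Once change events are collected in one place, the same narrowing can be done mechanically. The sketch below is only an illustration: the `Change` record, its fields, and the sample data are hypothetical and do not come from any real deploy tool. It returns the changes that landed in a window before the first error, newest first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A hypothetical record of something that changed in production."""
    kind: str          # e.g. "deploy", "config", "dependency"
    description: str
    at: datetime


def suspect_changes(changes, first_error_at, window=timedelta(hours=2)):
    """Return changes made within `window` before the first error, newest first."""
    earliest = first_error_at - window
    in_window = [c for c in changes if earliest <= c.at <= first_error_at]
    return sorted(in_window, key=lambda c: c.at, reverse=True)


# Invented sample data for the illustration.
first_error = datetime(2024, 1, 1, 14, 30)
changes = [
    Change("dependency", "bump HTTP client", datetime(2024, 1, 1, 9, 0)),
    Change("deploy", "release 42", datetime(2024, 1, 1, 14, 0)),
    Change("config", "raise worker count", datetime(2024, 1, 1, 13, 15)),
    Change("deploy", "release 43", datetime(2024, 1, 1, 15, 0)),  # after the error
]

suspects = suspect_changes(changes, first_error)
for change in suspects:
    print(change.kind, change.description, change.at.isoformat())
```

The newest change before the first error comes first because it is usually the strongest suspect; anything that happened after the error started is excluded.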

AI pitfall

If you paste a production error into an AI assistant and ask "what is wrong?", it will analyze the code in isolation. It does not know about your recent deploy, your traffic patterns, or your infrastructure. AI can help you understand code, but it cannot debug a system. The context (what changed, when, and what the metrics show) is yours to provide.

Step 3: Read the logs

This is where structured logging pays off. You have a Sentry error report with a stack trace and breadcrumbs. Now you need the full picture.

Tracing a request

If you have a correlation ID (from the error report or from the user):

# Search your log aggregator
request_id = "abc-123-def-456"

# You'll see every log entry from that request, in order:
14:32:01 INFO  Request received: POST /api/orders
14:32:01 INFO  User authenticated: user_789
14:32:01 INFO  Validating order items: 3 items
14:32:02 INFO  Payment initiated: $59.99
14:32:02 INFO  Payment successful: charge_xyz
14:32:02 INFO  Saving order to database...
14:32:07 ERROR Database query timeout after 5000ms
14:32:07 ERROR Order creation failed: TimeoutError

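The timeline tells the story: everything worked until the database write. The 5-second gap between "Saving order" and the timeout matches the 5000 ms query timeout, which points at the database call (a slow query, lock contention, or an exhausted pool) rather than the validation or payment logic.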
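
If your log aggregator can export JSON lines, the same trace can be rebuilt with a few lines of Python. This is a minimal sketch under assumptions: the field names (`ts`, `level`, `request_id`, `msg`) and the sample entries are hypothetical, so adjust them to whatever your logger actually emits. It filters by correlation ID and reports the longest pause between consecutive entries.

```python
import json
from datetime import datetime

# Hypothetical JSON log lines; the field names are assumptions for this sketch.
RAW_LOGS = """
{"ts": "2024-01-01T14:32:01", "level": "INFO", "request_id": "abc-123-def-456", "msg": "Request received: POST /api/orders"}
{"ts": "2024-01-01T14:32:01", "level": "INFO", "request_id": "zzz-999", "msg": "Request received: GET /api/health"}
{"ts": "2024-01-01T14:32:02", "level": "INFO", "request_id": "abc-123-def-456", "msg": "Saving order to database..."}
{"ts": "2024-01-01T14:32:07", "level": "ERROR", "request_id": "abc-123-def-456", "msg": "Database query timeout after 5000ms"}
"""


def trace_request(raw_logs, request_id):
    """Return the parsed entries for one request, ordered by timestamp."""
    entries = [json.loads(line) for line in raw_logs.splitlines() if line.strip()]
    matching = [e for e in entries if e["request_id"] == request_id]
    return sorted(matching, key=lambda e: datetime.fromisoformat(e["ts"]))


def largest_gap(entries):
    """Return (seconds, message_before_gap) for the longest pause between entries."""
    gaps = []
    for before, after in zip(entries, entries[1:]):
        delta = datetime.fromisoformat(after["ts"]) - datetime.fromisoformat(before["ts"])
        gaps.append((delta.total_seconds(), before["msg"]))
    return max(gaps) if gaps else (0.0, None)


trace = trace_request(RAW_LOGS, "abc-123-def-456")
gap_seconds, last_message = largest_gap(trace)
print(f"{len(trace)} entries; longest pause {gap_seconds:.0f}s after: {last_message}")
```

The message logged just before the longest pause tells you which call was in flight while the request was hanging.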

What to look for in logs

Pattern	What it suggests
Sudden increase in ERROR logs	Something broke, check recent changes
Gaps in the timeline	A call is hanging (network, database, external API)
Repeating pattern (every N requests)	Resource exhaustion (connection pool, memory leak)
Errors from a single user/IP	Client-side issue or abuse
Errors across all users simultaneously	Infrastructure or dependency failure
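
The last two rows of the table can be checked quickly from a list of error events. The sketch below assumes you can extract a user ID per error; the function name and the 80% threshold are illustrative choices, not a standard.

```python
from collections import Counter


def error_spread(error_user_ids, concentrated_share=0.8):
    """Classify errors as concentrated on one user or spread across many.

    `concentrated_share` is an arbitrary illustrative threshold.
    """
    if not error_user_ids:
        return "no errors"
    counts = Counter(error_user_ids)
    top_user, top_count = counts.most_common(1)[0]
    if top_count / len(error_user_ids) >= concentrated_share:
        return f"concentrated on {top_user}"
    return f"spread across {len(counts)} users"


print(error_spread(["user_789"] * 9 + ["user_12"]))
print(error_spread(["user_1", "user_2", "user_3", "user_4"]))
```

A concentrated result suggests looking at that user's data or client first; a spread result suggests looking at infrastructure or a shared dependency.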

Step 4: Check the metrics

Logs show individual events. Metrics show patterns over time. Overlay the error timeline on your Grafana dashboards:

Error rate: Did it spike suddenly (deploy/config change) or creep up gradually (resource leak)?
Latency: Are response times degrading before errors appear? This often indicates saturation.
Saturation: Is a connection pool filling up? Is memory growing? Is disk space running out?
Traffic: Did a traffic spike coincide with the errors?

Metrics often reveal the cause even before you read a single line of code. A memory graph that climbs linearly for 12 hours and then drops (process crash + restart) is a textbook memory leak. A connection pool gauge that hits 100% right when errors start is connection pool exhaustion.
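
The "sudden spike or gradual creep" question can be approximated from raw samples. The following is a rough sketch, not a real anomaly detector: it assumes evenly spaced samples, and the 50% threshold is an arbitrary choice made for this example.

```python
def classify_trend(samples, spike_share=0.5):
    """Label a rising series as a sudden 'spike' or a gradual 'creep'.

    If one step between consecutive samples accounts for at least
    `spike_share` of the total rise, the series is labeled a spike.
    """
    if len(samples) < 2:
        return "flat"
    total_rise = samples[-1] - samples[0]
    if total_rise <= 0:
        return "flat"
    biggest_step = max(b - a for a, b in zip(samples, samples[1:]))
    return "spike" if biggest_step / total_rise >= spike_share else "creep"


error_rate = [0.1, 0.1, 0.1, 5.0, 5.1]   # jumps in one step: deploy or config change?
memory_mb = [200, 260, 320, 380, 440]    # climbs evenly: resource leak?

print(classify_trend(error_rate))
print(classify_trend(memory_mb))
```

In practice you would read this off the Grafana graph by eye; the point of the sketch is only to make the distinction concrete.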

Step 5: Form a hypothesis and reproduce locally

By now you should have a hypothesis: "the error happens because X under condition Y." The next step is proving it.

Reproducing production conditions

Production bugs are hard to reproduce because the conditions are specific:

# The bug might only appear when:
# - More than 20 concurrent requests hit the same endpoint
# - A specific user has NULL in a field you assumed was never NULL
# - The database query takes longer than the connection timeout
# - Two requests modify the same row simultaneously

Strategies for local reproduction:

Strategy	When to use
Copy the exact request (curl from logs)	Input-dependent bugs
Load test locally (locust, wrk)	Concurrency/saturation bugs
Seed database with production-like data	Data-dependent bugs
Simulate slow dependencies (toxiproxy)	Timeout/network bugs

If you cannot reproduce locally, add more logging at the suspected location and deploy to staging. The additional logs will confirm or refute your hypothesis on the next occurrence.
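
Concurrency bugs in particular are easier to reason about with a small local model before reaching for a full load test. The sketch below is such a model, under stated assumptions: a plain semaphore stands in for a real connection pool, the function name and timings are invented for illustration, and a barrier forces all simulated requests to start at the same instant.

```python
import threading
import time


def run_concurrent_requests(num_requests, pool_size, hold_seconds=0.5, wait_seconds=0.05):
    """Fire `num_requests` simulated requests at once against a fixed-size pool.

    Returns (succeeded, exhausted). The semaphore is a stand-in for a real
    connection pool; the timings are arbitrary illustration values.
    """
    pool = threading.Semaphore(pool_size)
    start_together = threading.Barrier(num_requests)
    results = []
    results_lock = threading.Lock()

    def handle_request():
        start_together.wait()  # release every request at the same moment
        if not pool.acquire(timeout=wait_seconds):
            outcome = "exhausted"  # no free connection within the wait limit
        else:
            try:
                time.sleep(hold_seconds)  # pretend the query is slow
                outcome = "ok"
            finally:
                pool.release()
        with results_lock:
            results.append(outcome)

    threads = [threading.Thread(target=handle_request) for _ in range(num_requests)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results.count("ok"), results.count("exhausted")


succeeded, exhausted = run_concurrent_requests(num_requests=25, pool_size=20)
print(f"{succeeded} succeeded, {exhausted} hit pool exhaustion")
```

With fewer concurrent requests than pool slots, nothing fails, which is why this kind of bug rarely shows up when you test one request at a time.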

Hotfixes vs proper fixes

When the service is down, speed matters more than elegance.

Hotfix: stop the bleeding

A hotfix is the minimum change that stops the immediate impact. Common hotfixes:

- Roll back the most recent deploy
- Disable a feature flag
- Increase a resource limit (connection pool size, timeout)
- Add a rate limit to a hammered endpoint
- Redirect traffic away from a broken service

A hotfix does not need to be the "right" fix. It needs to be safe, small, and fast. Rolling back is almost always the safest hotfix because it returns the system to a known working state.
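
Disabling a feature flag only works as a hotfix if the code already checks one. Here is a minimal sketch of such a kill switch, assuming flags are read from environment variables; the `FEATURE_<NAME>` naming scheme and the `generate_report` handler are hypothetical, and a real project might use a flag service instead.

```python
import os


def feature_enabled(name, environ=os.environ):
    """Return False when the flag's environment variable is set to an 'off' value.

    The FEATURE_<NAME> naming scheme is a hypothetical convention for this sketch.
    Flags default to on when the variable is absent.
    """
    value = environ.get(f"FEATURE_{name.upper()}", "on")
    return value.strip().lower() not in {"off", "0", "false", "disabled"}


def generate_report(environ=os.environ):
    """Hypothetical handler that degrades gracefully when the flag is off."""
    if not feature_enabled("reports", environ):
        return {"status": 503, "detail": "Report generation is temporarily disabled"}
    return {"status": 200, "detail": "report generated"}


print(generate_report({"FEATURE_REPORTS": "off"}))
print(generate_report({}))
```

Passing the environment in as a parameter keeps the sketch testable without touching the real process environment.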

Proper fix: after the bleeding stops

Once the hotfix stabilizes the service, take time for the real fix. This means understanding the root cause, writing the code properly, testing it, and deploying with confidence.

Hotfix	Proper fix
"Increase connection pool to 50"	"Fix connection leak in error handling path"
"Add 10-second timeout"	"Optimize query that takes too long under load"
"Disable feature X"	"Fix the race condition in feature X"
"Roll back deploy"	"Fix the bug and re-deploy with the fix"

Post-mortems

For any incident that affected users for more than a few minutes, write a post-mortem. The goal is not blame; it is learning.

A post-mortem answers five questions:

1. What happened? Timeline of the incident from first alert to resolution.
2. What was the impact? How many users, how long, what was the user experience.
3. What was the root cause? The actual bug or failure, not "the deploy broke things."
4. How was it resolved? Hotfix and proper fix.
5. How do we prevent recurrence? Action items: better tests, monitoring, process changes.

The most valuable part is the fifth question. If the root cause was "a missing session.close() in an error path," the action item is not "remember to close sessions"; it is "add a linter rule or use a yield-based dependency so sessions are always closed automatically."

AI pitfall

You can ask AI to write a post-mortem template. You cannot ask it to fill in the template meaningfully. AI does not know your system architecture, your deploy history, or the actual impact on users. The post-mortem is a human document that requires the context of the person who investigated the incident.

A complete example: tracing an intermittent error

Let us walk through a realistic debugging scenario using everything from this module.

The alert

Sentry alert: DatabaseError: connection pool exhausted, 47 occurrences in the last hour, affecting 23 users. The error started 90 minutes ago.

The investigation

Recent changes? A deploy happened 2 hours ago. The diff shows that a new endpoint, /api/reports/generate, was added.

Metrics? The database connection gauge shows active connections climbing steadily from 5 to 20 (the pool max) in the 30 minutes after the deploy. Once it hit 20, the errors started.

Logs? Filtering by the new endpoint:

14:00:01 INFO  GET /api/reports/generate - started
14:00:02 INFO  Database session acquired
14:00:02 INFO  Querying report data...
14:00:03 ERROR ValueError: invalid date range
# No "session released" log - the connection leaked

Root cause? The new endpoint acquires a database session but does not release it when an exception occurs. The get_db dependency returns the session instead of yielding it, so no cleanup code runs:

# Bug: session never closed on error
def get_db():
    session = SessionLocal()
    return session        # No cleanup!

# Fix: use yield so cleanup always runs
def get_db():
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()   # Always runs, even on error
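
To convince yourself that the cleanup really runs on the error path, you can drive the generator by hand. The sketch below is a self-contained illustration, not the project's actual code: `FakeSession` and `run_endpoint` are hypothetical stand-ins, and `contextlib.contextmanager` is used here only to imitate how a framework with yield-style dependencies resumes the generator after the request finishes or fails.

```python
from contextlib import contextmanager


class FakeSession:
    """Stand-in for a database session; it only tracks whether close() was called."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def get_db(session_factory=FakeSession):
    session = session_factory()
    try:
        yield session
    finally:
        session.close()  # runs whether the request succeeded or raised


def run_endpoint(should_fail):
    """Drive get_db() roughly the way a framework might; return the session used."""
    used = []
    try:
        with contextmanager(get_db)() as session:
            used.append(session)
            if should_fail:
                raise ValueError("invalid date range")
    except ValueError:
        pass  # the request failed, but the finally block has already run
    return used[0]


print("closed after error:  ", run_endpoint(should_fail=True).closed)
print("closed after success:", run_endpoint(should_fail=False).closed)
```

A check like this also makes a reasonable regression test: it fails if someone later replaces the yield with a plain return.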

Hotfix? Increase the connection pool size from 20 to 50 to buy time, and restart the service to release leaked connections.

Proper fix? Change get_db() to use yield with a finally block. Put the active connection gauge on the standard dashboard and alert on it, so a leak like this is visible before it causes errors.

Post-mortem action items:

All database dependencies must use the yield pattern (add to code review checklist)
Add the active connection gauge to the standard dashboard
Alert when connections exceed 80% of pool size
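
The alert condition in the last action item is simple enough to express directly. This is a sketch of the rule only; in a real setup the check would normally live in your alerting system rather than in application code, and the function name and message format here are invented for illustration.

```python
def pool_alert(active, pool_size, threshold=0.8):
    """Return an alert message when pool usage exceeds the threshold, else None."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    usage = active / pool_size
    if usage > threshold:
        return f"connection pool at {usage:.0%} ({active}/{pool_size})"
    return None


print(pool_alert(active=5, pool_size=20))    # healthy baseline from the example
print(pool_alert(active=17, pool_size=20))   # past the 80% line, before exhaustion
```

With a pool of 20, this rule would have fired at 17 active connections, before the pool ran out and users started seeing errors.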

This is the workflow AI cannot replace. AI can explain what yield does. It cannot look at your metrics, correlate them with your deploy history, trace a request through your logs, and determine that a missing finally block is leaking connections. That synthesis of system knowledge, observability data, and debugging judgment is uniquely human.

Quick reference

Step	Action	Tools
1. Assess impact	How many users? Since when? Getting worse?	Sentry, Grafana
2. Recent changes	Deploy? Config? Dependency update?	Git log, deploy history
3. Read logs	Trace request by correlation ID	Log aggregator
4. Check metrics	Error rate, latency, saturation trends	Grafana, Prometheus
5. Reproduce	Match production conditions locally	curl, load testing
6. Hotfix	Minimum change to stop impact	Rollback, feature flag
7. Proper fix	Root cause fix with tests	Code review, staging deploy
8. Post-mortem	Timeline, root cause, prevention	Written document