Integration & APIs - AI-Assisted Error Strategy


Lesson

Good to know

Error handling is the area where AI assistance has the widest gap between "looks correct" and "actually works in production." The generated code compiles and passes basic tests, but the subtle bugs (retrying non-retryable errors, missing timeout budgets, silent DLQs) only surface under real failure conditions that are hard to test.

Error handling is one of the areas where AI assistance is genuinely useful and also genuinely dangerous. AI can generate comprehensive error handling code in seconds, complete with retry logic, status code mapping, and dead-letter queue (DLQ) patterns. But it often produces code that looks correct while containing subtle bugs that only surface in production under real failure conditions.

The key is knowing exactly where AI helps and where it misleads.

What AI does well vs poorly

Task	AI quality	Why
Generating RFC 7807 error schemas	Excellent	Well-documented standard, lots of training data
Mapping HTTP codes to retry decisions	Good	Common pattern, but verify edge cases
Writing exponential backoff code	Good	Standard algorithm, but check for jitter
Designing DLQ message formats	Good	Structural task with clear requirements
Setting timeout values	Poor	Depends on your specific service latencies
Identifying which errors are retryable in YOUR system	Poor	Requires domain knowledge AI lacks
Timeout cascade analysis	Poor	Requires understanding your full service graph
Choosing between DLQ strategies	Mediocre	Depends on business criticality AI cannot assess
Error logging and alerting thresholds	Mediocre	Depends on your traffic volume and SLAs

Prompt templates

Designing error handling for an API

Edge case

AI-generated timeout values are almost always round numbers, typically 30 seconds for everything. Your actual services have wildly different latency profiles: an auth check might warrant 200 ms, a payment API 5 seconds, and report generation 30 seconds. Never use AI's default timeout values in production.

I'm building a [type of service] that calls [external service].
The downstream service has these characteristics:
- Average latency: [X]ms
- P99 latency: [Y]ms
- Known failure modes: [list]
- Rate limits: [X requests per second]

My upstream caller expects responses within [Z] seconds.

Design the error handling strategy including:
1. Which HTTP status codes I should return for each failure mode
2. Retry configuration (max retries, backoff strategy, timeout)
3. What to log at each failure stage
4. When to circuit-break vs retry

Use TypeScript. Follow RFC 7807 for error responses.

This prompt works well because it gives AI the specific constraints it needs. Without the latency numbers and upstream timeout, AI will pick generic values that may not work for your system.
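The same numbers you put in the prompt also let you sanity-check whatever retry configuration comes back. The sketch below is one way to do that, not a standard formula: `maxAttemptsWithinBudget` is a hypothetical helper that assumes the worst case, where every attempt runs to its full timeout and plain exponential backoff (`baseDelayMs * 2^n`) is applied between attempts.

```typescript
// Hypothetical helper: how many attempts fit inside the upstream caller's
// deadline if every attempt times out and full backoff is applied between them.
function maxAttemptsWithinBudget(
  upstreamBudgetMs: number,
  perAttemptTimeoutMs: number,
  baseDelayMs: number,
): number {
  if (perAttemptTimeoutMs <= 0) {
    throw new Error("perAttemptTimeoutMs must be positive");
  }
  let attempts = 0;
  let worstCaseMs = 0;
  while (true) {
    // No backoff before the first attempt; afterwards base * 2^(retry index).
    const backoffMs = attempts === 0 ? 0 : baseDelayMs * 2 ** (attempts - 1);
    const next = worstCaseMs + backoffMs + perAttemptTimeoutMs;
    if (next > upstreamBudgetMs) return attempts;
    worstCaseMs = next;
    attempts += 1;
  }
}

// Example: a 5 s upstream deadline, 1.2 s per attempt, 200 ms base backoff.
const attemptsThatFit = maxAttemptsWithinBudget(5000, 1200, 200);
```

If the AI proposes more attempts than this returns, the upstream caller will give up before the retries finish.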

Reviewing existing retry logic

AI pitfall

AI's strongest error-handling use case is not generating code but reviewing your existing code against a specific checklist. Feed it your retry logic and ask it to check for the five common bugs listed below. It often catches structural issues that human reviewers miss.

Review this retry implementation for bugs and missing edge cases:

[paste your code]

Specifically check for:
1. Does it retry non-retryable status codes (400, 403, 404)?
2. Does it have jitter to prevent thundering herd?
3. Is there a total time budget, or can retries run forever?
4. Does it respect Retry-After headers on 429 responses?
5. Are timeouts configured correctly relative to upstream callers?

This is one of AI's strongest use cases: reviewing code against a specific checklist. AI is good at pattern matching against known antipatterns.
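Checklist item 4 is easy to get half right, because the Retry-After header can carry either a number of seconds or an HTTP date. As a rough illustration of what a correct implementation has to handle, here is a minimal sketch; `parseRetryAfterMs` is a hypothetical helper name, not part of any library.

```typescript
// Hypothetical helper: convert a Retry-After header value into a delay in ms.
// Returns null when the header is absent or unparseable, so the caller can
// fall back to its normal backoff.
function parseRetryAfterMs(
  value: string | null,
  nowMs: number = Date.now(),
): number | null {
  if (value === null) return null;
  const trimmed = value.trim();

  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;

  // A date in the past means "retry now", never a negative delay.
  return Math.max(0, dateMs - nowMs);
}
```

A reviewer (human or AI) should confirm that both forms are handled and that the resulting delay is still capped by the overall time budget.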

Generating error response schemas

Generate RFC 7807 Problem Details error types for an e-commerce
order API with these operations:
- Create order
- Process payment
- Reserve inventory
- Ship order

For each operation, define the possible error types with:
- type URI
- title
- HTTP status code
- extension fields specific to this error
- Example response body

Use TypeScript interfaces.

AI is excellent at this because it is a structural, well-defined task. The RFC 7807 standard (since superseded by RFC 9457, which keeps the same Problem Details format) is well-represented in training data, and generating typed error schemas is exactly the kind of boilerplate AI handles well.
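For orientation, the output of such a prompt tends to look something like the sketch below. The base members (`type`, `title`, `status`, `detail`, `instance`) come from the Problem Details standard; the `example.com` type URI, the choice of 409, and the extension fields `sku`, `requested`, and `available` are illustrative assumptions for a "reserve inventory" failure.

```typescript
// Base shape defined by the Problem Details standard. The standard treats
// every member as optional; requiring type, title and status is a choice
// made here for stricter typing.
interface ProblemDetails {
  type: string;      // URI identifying the problem type
  title: string;     // short, human-readable summary
  status: number;    // HTTP status code
  detail?: string;   // explanation specific to this occurrence
  instance?: string; // URI identifying this occurrence
}

// Hypothetical error type for the "Reserve inventory" operation.
interface InsufficientInventoryProblem extends ProblemDetails {
  type: "https://example.com/problems/insufficient-inventory";
  status: 409;
  instance: string;
  sku: string;       // extension field
  requested: number; // extension field
  available: number; // extension field
}

function insufficientInventory(
  sku: string,
  requested: number,
  available: number,
  instance: string,
): InsufficientInventoryProblem {
  return {
    type: "https://example.com/problems/insufficient-inventory",
    title: "Insufficient inventory",
    status: 409,
    detail: `Requested ${requested} of ${sku}, but only ${available} available.`,
    instance,
    sku,
    requested,
    available,
  };
}

// Serialize with Content-Type: application/problem+json.
const exampleBody = JSON.stringify(insufficientInventory("SKU-1", 5, 2, "/orders/123"));
```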

What to verify in AI-generated error handling

AI-generated error handling code has predictable failure patterns. Here is what to check every time.

1. Retrying non-retryable errors. AI frequently generates retry logic that retries all non-2xx responses. Check that 400, 403, 404, and 422 are excluded from retry.

// AI often generates this -- WRONG
if (response.status >= 400) {
  await retry(); // Retries 400 Bad Request -- will never succeed
}

// Correct: only retry specific codes
const RETRYABLE = [408, 429, 500, 502, 503, 504];
if (RETRYABLE.includes(response.status)) {
  await retry();
}

2. Missing jitter. AI frequently generates pure exponential backoff without jitter. If you see Math.pow(2, attempt) * baseDelay without any Math.random(), add jitter.
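One common way to add jitter is the "full jitter" variant, where the delay is drawn uniformly between zero and the exponential ceiling. The sketch below assumes that variant; the function name, the `maxDelayMs` cap, and the injectable `random` parameter (there to make the function testable) are illustrative choices.

```typescript
// Exponential backoff with full jitter: pick a random delay in
// [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffWithJitterMs(
  attempt: number,     // 0 for the first retry, 1 for the second, ...
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Because each client draws a different delay, a crowd of clients that failed at the same moment will spread their retries out instead of arriving together.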

3. No total time budget. AI generates retry loops with a max attempt count but no total time limit. If each attempt can take up to 30 seconds and you allow 5 retries, your caller can wait 2.5 minutes on the retries alone, before counting the initial attempt and the backoff delays. Add maxTotalTime.
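A minimal sketch of a retry loop with both limits follows. It assumes the caller has already decided the error is retryable and that each attempt enforces its own timeout; `retryWithBudget`, `RetryOptions`, and the injectable `now` and `sleep` hooks are hypothetical names used for illustration.

```typescript
interface RetryOptions {
  maxAttempts: number;
  maxTotalTimeMs: number;              // total budget across attempts and waits
  baseDelayMs: number;
  now?: () => number;                  // injectable clock, for tests
  sleep?: (ms: number) => Promise<void>;
}

async function retryWithBudget<T>(
  operation: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  const now = opts.now ?? Date.now;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const deadline = now() + opts.maxTotalTimeMs;
  let lastError: unknown;

  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
    }
    // Jittered exponential delay before the next attempt.
    const delayMs = Math.random() * opts.baseDelayMs * 2 ** attempt;
    const isLastAttempt = attempt === opts.maxAttempts - 1;
    // Give up if attempts are exhausted or the wait would pass the deadline.
    if (isLastAttempt || now() + delayMs >= deadline) break;
    await sleep(delayMs);
  }
  throw lastError;
}
```

With a 60-second budget and attempts that each take 30 seconds, this loop stops after two attempts even if maxAttempts is 5.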

4. Timeout cascades. Ask AI to set timeouts for a three-service chain, and it will often set them all to the same value (e.g., 30 seconds each) or even set the upstream timeout shorter than the downstream. Always verify that upstream timeout > downstream timeout.
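This check is mechanical enough to script. The sketch below assumes you can describe the chain as an ordered list of hops from the outermost caller inward. It also adds a margin for the callee's own processing time, so the rule becomes "upstream timeout > downstream timeout + processing time". The `Hop` shape and `findTimeoutViolations` are hypothetical.

```typescript
interface Hop {
  name: string;               // e.g. "gateway -> orders"
  timeoutMs: number;          // how long the caller waits on this hop
  calleeProcessingMs: number; // callee's own work, excluding its downstream call
}

// Returns a message for every adjacent pair where the upstream timeout does
// not leave room for the downstream timeout plus the callee's processing.
function findTimeoutViolations(chain: Hop[]): string[] {
  const violations: string[] = [];
  for (let i = 0; i + 1 < chain.length; i++) {
    const upstream = chain[i];
    const downstream = chain[i + 1];
    const required = downstream.timeoutMs + upstream.calleeProcessingMs;
    if (upstream.timeoutMs <= required) {
      violations.push(
        `${upstream.name} (${upstream.timeoutMs} ms) must exceed ${required} ms to cover ${downstream.name}`,
      );
    }
  }
  return violations;
}

// The "30 seconds everywhere" pattern: the first hop leaves no headroom.
const uniformChain: Hop[] = [
  { name: "gateway -> orders", timeoutMs: 30000, calleeProcessingMs: 200 },
  { name: "orders -> payments", timeoutMs: 30000, calleeProcessingMs: 500 },
  { name: "payments -> bank", timeoutMs: 5000, calleeProcessingMs: 0 },
];
const uniformViolations = findTimeoutViolations(uniformChain);
```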

5. Silent DLQ. AI generates DLQ routing code that moves messages but forgets to alert anyone. A DLQ without alerting is a black hole.
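The fix is usually a few lines next to the routing call. The sketch below shows the shape of it under stated assumptions: `DeadLetterQueue` and `Alerter` are hypothetical interfaces standing in for whatever queue client and alerting tool you use, and alerting on every message is a simplification (a real setup might alert on queue depth or rate instead).

```typescript
interface DeadLetter<T> {
  payload: T;
  reason: string;
  attempts: number;
  failedAt: string; // ISO 8601 timestamp
}

// Hypothetical stand-ins for your queue client and alerting integration.
interface DeadLetterQueue<T> {
  publish(letter: DeadLetter<T>): Promise<void>;
}
interface Alerter {
  notify(summary: string, context: Record<string, unknown>): Promise<void>;
}

async function routeToDlq<T>(
  payload: T,
  error: unknown,
  attempts: number,
  dlq: DeadLetterQueue<T>,
  alerter: Alerter,
): Promise<DeadLetter<T>> {
  const letter: DeadLetter<T> = {
    payload,
    reason: error instanceof Error ? error.message : String(error),
    attempts,
    failedAt: new Date().toISOString(),
  };
  await dlq.publish(letter);
  // The step generated code tends to omit: tell someone the message was parked.
  await alerter.notify("Message routed to DLQ", { reason: letter.reason, attempts });
  return letter;
}
```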

Hybrid workflow

Here is the workflow that gets the best results by combining AI speed with human judgment.

Step 1: Generate with AI. Use AI to draft the initial error handling: retry logic, error response types, DLQ routing. This saves significant time on boilerplate.

Step 2: Review the retry list. Go through every status code in the retry logic. For each one, ask yourself: "If I retry this exact same request, will it succeed?" If the answer is no, remove it from the retry list.

Step 3: Verify timeouts. Draw your service call chain on paper. Write the timeout for each hop. Verify that each upstream timeout is greater than its downstream timeout plus processing time.

Step 4: Test failure modes. AI cannot test for you. Simulate actual failures: kill a downstream service, flood a rate limit, send malformed data. Verify that your error handling does what the code says it does.

Step 5: Add monitoring. AI often forgets this step entirely. For every error path, ask: "How would I know this is happening in production?" Add logging, metrics, and alerts.
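For Step 4, a scripted test double is a cheap first layer before you start killing real services. The sketch below is an in-process approximation only: `scriptedService` and `callWithRetry` are hypothetical names, the client under test is deliberately minimal, and delays are omitted to keep the test fast.

```typescript
// Status codes treated as retryable, mirroring the list earlier in this lesson.
const RETRYABLE_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

type Call = () => Promise<number>; // resolves to an HTTP status code

// Test double: returns the scripted statuses in order (repeating the last one)
// and counts how often it was called. Expects a non-empty script.
function scriptedService(statuses: number[]): { call: Call; callCount: () => number } {
  let calls = 0;
  return {
    call: async () => {
      const status = statuses[Math.min(calls, statuses.length - 1)];
      calls += 1;
      return status;
    },
    callCount: () => calls,
  };
}

// Minimal client under test: retries only retryable statuses.
async function callWithRetry(call: Call, maxAttempts: number): Promise<number> {
  let status = await call();
  for (let attempt = 1; attempt < maxAttempts && RETRYABLE_STATUSES.has(status); attempt++) {
    status = await call();
  }
  return status;
}

// Scenario: the downstream fails twice with 503, then recovers.
async function recoversAfterTwoFailures(): Promise<boolean> {
  const flaky = scriptedService([503, 503, 200]);
  const status = await callWithRetry(flaky.call, 5);
  return status === 200 && flaky.callCount() === 3;
}
```

The same pattern covers the negative case: script a 400 followed by a 200 and assert that the service was called exactly once.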

This workflow typically takes 30 minutes for a new service integration. Without AI, the boilerplate alone would take that long. With AI but without human review, you ship code that fails in predictable, preventable ways. The combination gives you speed and correctness.