Integration & APIs - AI for Reliability Engineering



Good to know

The strongest use of AI in reliability engineering is not generating new code but reviewing your existing code for missing reliability patterns. Give AI your service code and ask it to check for missing timeouts, shared connection pools, retry loops without backoff, and absent circuit breakers. It catches structural gaps faster than most human reviewers.

Reliability patterns involve a lot of boilerplate: repetitive, standardized code that follows a known pattern and appears in nearly every project. Circuit breakers, retry logic, timeout wrappers, and rate limiters all follow well-known templates, and AI is very good at generating them. The devil is in the details, though: the specific thresholds, the failure scenarios, and the interactions between patterns are where AI regularly gets things wrong.

What AI does well vs. poorly

AI does well	AI does poorly
Generating circuit breaker implementations	Choosing threshold values for your specific traffic
Writing retry logic with exponential backoff	Timeout propagation across service chains
Implementing token bucket rate limiters	Reasoning about cascading failure scenarios
Creating timeout wrapper utilities	Fallback strategies for partial failure states
Scaffolding bulkhead patterns with semaphores	Interactions between multiple reliability patterns
Identifying missing reliability patterns in code reviews	Tuning for production traffic patterns you describe vaguely

The pattern is consistent: AI handles the structural code well but struggles with the contextual decisions, the things that depend on your specific system, traffic, and failure modes.

AI pitfall

AI defaults to round-number thresholds: 5 failures, 30-second reset, 10-second timeout. These are reasonable starting points but rarely right for your specific system. A high-value payment service may need a threshold of 3 failures (trip fast), while a recommendation engine can tolerate 20 (trip slow). Always replace AI's defaults with values based on your actual traffic data and business criticality.

Prompt templates

1. Analyze failure modes for a service architecture

Prompt:

Analyze the failure modes of this architecture:
- API Gateway → User Service → PostgreSQL
- API Gateway → Order Service → PostgreSQL + Payment API (external)
- API Gateway → Notification Service → SendGrid API

For each service and connection, list:
1. What can fail
2. Impact on the user
3. Recommended reliability pattern (circuit breaker, timeout, fallback, etc.)
4. Specific configuration recommendations

AI will typically produce a thorough analysis covering the obvious failure modes. What to verify: AI often underestimates the impact of slow responses (as opposed to outright failures) and may miss correlated failures. For example, if User Service and Order Service share one PostgreSQL instance, a database issue affects both simultaneously.
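The correlated-failure check can be done mechanically before you trust the AI's analysis. A minimal sketch, assuming you can write your architecture down as a service-to-dependency map; the instance name `postgres-main` and the claim that both services use it are assumptions for illustration, not facts from the prompt.

```typescript
// Hypothetical dependency map for the architecture in the prompt.
// Assumption: User Service and Order Service point at the same PostgreSQL instance.
type DependencyMap = Record<string, string[]>;

const dependencies: DependencyMap = {
  "user-service": ["postgres-main"],
  "order-service": ["postgres-main", "payment-api"],
  "notification-service": ["sendgrid-api"],
};

// Returns every dependency used by more than one service,
// mapped to the services that would degrade together if it failed.
function findSharedDependencies(map: DependencyMap): Record<string, string[]> {
  const usedBy: Record<string, string[]> = {};
  for (const service of Object.keys(map)) {
    for (const dep of map[service]) {
      if (!usedBy[dep]) usedBy[dep] = [];
      usedBy[dep].push(service);
    }
  }
  const shared: Record<string, string[]> = {};
  for (const dep of Object.keys(usedBy)) {
    if (usedBy[dep].length > 1) shared[dep] = usedBy[dep];
  }
  return shared;
}

const sharedDependencies = findSharedDependencies(dependencies);
```

Anything that shows up in `sharedDependencies` is a place to ask the AI a follow-up question about simultaneous failure.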

2. Implement a circuit breaker with fallback

Prompt:

Implement a circuit breaker for a Node.js service calling an
external payment API with these requirements:
- Open after 5 failures in 30 seconds
- Half-open test after 15 seconds
- Fallback: queue the payment for async processing
- Log state transitions for monitoring
- TypeScript, using the opossum library

What to verify in AI output:

The volumeThreshold setting: AI often omits it, which can let the circuit trip on the very first failure
Fallback error handling: if the queue itself fails, is that handled?
The errorFilter: AI rarely includes it, but you may want to exclude 4xx client errors from tripping the circuit (a 400 Bad Request is not a service failure)
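To see why the first and last checks matter, it helps to look at the state machine itself. The following is a hand-rolled sketch, not opossum's implementation or API: `SketchBreaker` and all of its option names are hypothetical, chosen to mirror the idea of tripping on an error percentage over a rolling window. With `volumeThreshold` at 0, one failed call is a 100% error rate and opens the circuit; the `isServiceFailure` predicate plays the role of an error filter.

```typescript
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  errorThresholdPercentage: number; // open when this share of calls in the window failed
  volumeThreshold: number;          // minimum calls in the window before the circuit may open
  windowMs: number;                 // length of the rolling window
  resetMs: number;                  // time spent open before a half-open trial
  isServiceFailure: (status: number) => boolean; // which responses count against the service
  now: () => number;                // injectable clock, so tests do not need to sleep
  onTransition?: (from: BreakerState, to: BreakerState) => void; // hook for logging
}

class SketchBreaker {
  private state: BreakerState = "closed";
  private openedAt = 0;
  private calls: { at: number; failed: boolean }[] = [];
  private readonly opts: BreakerOptions;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
  }

  getState(): BreakerState {
    return this.state;
  }

  // Ask before sending a request; moves open -> half-open once resetMs has passed.
  canRequest(): boolean {
    if (this.state === "open" && this.opts.now() - this.openedAt >= this.opts.resetMs) {
      this.transition("half-open");
    }
    return this.state !== "open";
  }

  // Report the HTTP status of a completed call.
  record(status: number): void {
    if (this.state === "open") return; // late result from a call started before the trip
    const failed = this.opts.isServiceFailure(status);
    if (this.state === "half-open") {
      // The trial call decides: failure re-opens, anything else closes.
      if (failed) {
        this.trip();
      } else {
        this.calls = [];
        this.transition("closed");
      }
      return;
    }
    const t = this.opts.now();
    this.calls.push({ at: t, failed });
    this.calls = this.calls.filter((c) => t - c.at < this.opts.windowMs);
    const failures = this.calls.filter((c) => c.failed).length;
    const errorPct = (failures / this.calls.length) * 100;
    if (this.calls.length >= this.opts.volumeThreshold && errorPct >= this.opts.errorThresholdPercentage) {
      this.trip();
    }
  }

  private trip(): void {
    this.openedAt = this.opts.now();
    this.transition("open");
  }

  private transition(to: BreakerState): void {
    const from = this.state;
    this.state = to;
    if (this.opts.onTransition) this.opts.onTransition(from, to);
  }
}

// Example configuration; every value here is a placeholder to replace with your own data.
const paymentBreaker = new SketchBreaker({
  errorThresholdPercentage: 50,
  volumeThreshold: 5,
  windowMs: 30_000,
  resetMs: 15_000,
  isServiceFailure: (status) => status >= 500, // 4xx responses do not count
  now: () => Date.now(),
});
```

When you review AI output that uses a real library, look for the equivalents of `volumeThreshold` and `isServiceFailure` and confirm how the library interprets them, since the names and the sense of the filter can differ.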

3. Design a timeout strategy for a service chain

Prompt:

Design a timeout strategy for this request flow:
- Client → API Gateway (user expects response in 5 seconds)
- API Gateway → Auth Service (fast, should be <200ms)
- API Gateway → Product Service (usually 500ms, sometimes 2s)
- Product Service → Pricing Service (usually 100ms)
- API Gateway → Recommendation Service (optional, can be dropped)

Provide specific timeout values for each hop and explain
the timeout budget distribution.

AI typically generates reasonable values but watch for these mistakes:

Setting individual timeouts that sum to more than the total budget
Not accounting for network overhead between services (add 50-100ms per hop)
Omitting the recommendation service timeout: marking a call "optional" does not mean it needs no timeout
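These three mistakes are arithmetic, so they can be checked with a few lines of code instead of by eye. A rough sketch under stated assumptions: the `Hop` shape, the function name, and every timeout value below are hypothetical; calls are assumed to run one after another unless marked `parallel`, and parallel calls are assumed to start together with the sequential chain.

```typescript
interface Hop {
  name: string;
  timeoutMs: number;  // timeout the caller sets on this hop
  parallel?: boolean; // true if it runs alongside the sequential calls
  calls?: Hop[];      // downstream calls made while serving this hop
}

// Returns one message per hop whose downstream worst case does not fit in its own timeout.
function findBudgetViolations(hop: Hop, overheadMs: number): string[] {
  const violations: string[] = [];
  let sequentialMs = 0;
  let longestParallelMs = 0;
  for (const call of hop.calls ?? []) {
    const costMs = call.timeoutMs + overheadMs; // worst case: the call runs until its timeout
    if (call.parallel) {
      longestParallelMs = Math.max(longestParallelMs, costMs);
    } else {
      sequentialMs += costMs;
    }
    violations.push(...findBudgetViolations(call, overheadMs));
  }
  const worstCaseMs = Math.max(sequentialMs, longestParallelMs);
  if (worstCaseMs > hop.timeoutMs) {
    violations.push(
      `${hop.name}: downstream worst case ${worstCaseMs}ms exceeds its ${hop.timeoutMs}ms timeout`
    );
  }
  return violations;
}

// Hypothetical values for the flow in the prompt; the 5-second user expectation is the root budget.
const requestFlow: Hop = {
  name: "client -> gateway",
  timeoutMs: 5000,
  calls: [
    { name: "auth", timeoutMs: 300 },
    { name: "product", timeoutMs: 2500, calls: [{ name: "pricing", timeoutMs: 300 }] },
    { name: "recommendation", timeoutMs: 800, parallel: true }, // optional, but still capped
  ],
};

const budgetViolations = findBudgetViolations(requestFlow, 100);
```

Running the same check over the values an AI proposes is a quick way to catch a budget that cannot hold, before you argue about whether the individual numbers are right.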

4. Review code for missing reliability patterns

Prompt:

Review this service code for missing reliability patterns.
For each issue found, explain the failure scenario and
suggest a specific fix:

[paste your code]

Check for: missing timeouts, missing circuit breakers,
missing fallbacks, shared resource pools, missing rate
limiting, retry storms, and cascading failure risks.

This is where AI genuinely shines. It will catch obvious gaps like missing timeouts on fetch calls, shared connection pools with no bulkhead, and retry loops without backoff. It may miss subtler issues like timeout budget propagation or interactions between your circuit breaker and retry logic (retrying inside a circuit breaker can trip it faster than expected).
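For reference when reading such a review, this is roughly what "backoff" means in the retry finding. A small sketch, assuming exponential growth with a cap and full jitter (a random delay between zero and the capped value); the function name and parameters are hypothetical, and jitter is my choice here rather than something the lesson prescribes.

```typescript
// Delay before retry number `attempt` (0-based).
// `random` is injectable so the result can be made deterministic in tests.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random
): number {
  const exponentialMs = Math.min(capMs, baseMs * 2 ** attempt); // 1x, 2x, 4x, ... up to the cap
  return Math.floor(random() * exponentialMs);                  // full jitter spreads retries out
}
```

A retry loop that sleeps for a fixed interval, or not at all, is the pattern the review prompt is meant to flag; spreading retries out like this is one common way to avoid a retry storm.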

Edge case

AI rarely includes an errorFilter in circuit breaker configurations. A 400 Bad Request means the client sent invalid data; the server is fine. A 503 means the server is unavailable. Without error filtering, client errors trip the circuit breaker and block legitimate requests from reaching a healthy server.

What to always verify

After AI generates reliability code, check these specific things:

Circuit breaker thresholds: AI defaults to round numbers (5 failures, 30-second reset). Your service might need 3 failures if it handles payments, or 20 if it is a high-volume, error-tolerant endpoint. Base thresholds on your actual error rate baseline and SLAs.

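Timeout propagation: If AI sets a 10-second timeout on your gateway and 10-second timeouts on each downstream call, the math does not work: two sequential downstream calls can take up to 20 seconds, long after the gateway has given up. Verify the total time budget and distribute it across the hops.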

Fallback completeness: AI generates the happy-path fallback but often misses what happens when the fallback itself fails. Cache misses, empty defaults, and stale data all need handling.
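One way to make the fallback path explicit is to treat it as an ordered chain that always ends in a value. This is only a sketch: `resolveWithFallbacks`, the `Source` shape, and the loaders are hypothetical, and it assumes a source signals "cannot help" (service down, cache miss, data too stale) by rejecting.

```typescript
// One way to obtain a value. Rejecting means this source could not help.
interface Source<T> {
  name: string;
  load: () => Promise<T>;
}

interface Resolved<T> {
  value: T;
  source: string;   // which source finally answered
  errors: string[]; // why the earlier sources were skipped
}

// Tries each source in order and never rejects:
// if every source fails, the caller still gets the last-resort value.
async function resolveWithFallbacks<T>(sources: Source<T>[], lastResort: T): Promise<Resolved<T>> {
  const errors: string[] = [];
  for (const source of sources) {
    try {
      const value = await source.load();
      return { value, source: source.name, errors };
    } catch (err) {
      errors.push(`${source.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  return { value: lastResort, source: "last-resort", errors };
}

// Hypothetical loaders: the live service is down and the cache has no entry.
const recommendationSources: Source<string[]>[] = [
  { name: "recommendation-service", load: async () => { throw new Error("503 Service Unavailable"); } },
  { name: "cache", load: async () => { throw new Error("cache miss"); } },
];
```

Calling `resolveWithFallbacks(recommendationSources, [])` resolves with the empty list and records both failures, which gives you something to log and alert on. Whether an empty default is acceptable is a product decision; for a payment queue it probably is not, and the last resort might instead be an explicit error shown to the user.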

Error classification: Not every error should trip a circuit breaker. A 400 Bad Request is a client error; the server is fine. A 503 Service Unavailable is a server error. AI-generated circuit breakers rarely include error filters.

Retry and circuit breaker interaction: If each request makes 3 attempts inside a circuit breaker with a threshold of 5, two failed user requests (3 attempts each = 6 failures) will trip the circuit. Is that what you want?
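The arithmetic behind that question is simple enough to keep next to your configuration. A tiny sketch, assuming the breaker counts every attempt as its own failure and that every attempt fails; both function names are hypothetical.

```typescript
// Failures the breaker sees when each failing user request is tried `attemptsPerRequest` times.
function failuresRecorded(failedRequests: number, attemptsPerRequest: number): number {
  return failedRequests * attemptsPerRequest;
}

// Number of failing user requests needed to reach the breaker's failure threshold.
function requestsToTrip(failureThreshold: number, attemptsPerRequest: number): number {
  return Math.ceil(failureThreshold / attemptsPerRequest);
}

const recordedForTwoRequests = failuresRecorded(2, 3); // the example above: 6 failures
const requestsNeeded = requestsToTrip(5, 3);           // 2 user requests reach a threshold of 5
```

If you want the threshold to mean "5 failing users", either count one failure per request after its retries are exhausted, or scale the threshold by the number of attempts.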

Hybrid workflow

The most effective approach combines AI speed with human judgment:

1. AI generates the skeleton: circuit breaker, retry, timeout, and fallback boilerplate
2. You map your dependencies: classify each as critical, degraded, or optional
3. You set the thresholds: based on actual traffic data, not AI guesses
4. AI reviews your implementation: catches structural gaps you might have missed
5. You verify failure paths: manually trace what happens when each dependency goes down
6. AI generates tests: failure scenario tests, timeout tests, circuit breaker state transition tests

This workflow lets AI handle the repetitive scaffolding while you handle the decisions that require understanding your specific system.
