A network call will eventually fail. DNS resolution times out. A server restarts mid-request. A database connection pool fills up. The question is never "will it fail?" but "what happens when it does?" Naive retry logic is one of the most common causes of cascading failures in distributed systems. Done right, retries make your system resilient. Done wrong, they make outages worse.
Why naive retries are dangerous
Imagine a server that is struggling under load. It returns 503 to half its requests. Now imagine 1,000 clients all immediately retry. The server now has 2,000 requests instead of 1,000. It returns 503 to even more requests. Those clients retry again. The server collapses entirely.
This is the retry storm, and it happens constantly in production. The fix is not to stop retrying -- it is to retry intelligently.
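The amplification is easy to quantify with a toy model (the numbers here are illustrative, and the model optimistically assumes the failure rate stays flat; in a real storm the extra load drives it higher):

```typescript
// With failure rate p and r immediate retries, offered load is
// clients * (1 + p + p^2 + ... + p^r): every failure respawns a request.
function offeredLoad(clients: number, failureRate: number, retries: number): number {
  let load = 0;
  for (let i = 0; i <= retries; i++) {
    load += clients * failureRate ** i;
  }
  return Math.round(load);
}

// 1,000 clients, 50% failures, 3 immediate retries:
console.log(offeredLoad(1000, 0.5, 3)); // 1875 requests hit the server
```

And that 1,875 is the floor: once the extra load pushes the failure rate up, the series compounds.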
// BAD: Immediate retry with no delay
async function fetchWithRetry(url, retries = 3) {
for (let i = 0; i < retries; i++) {
const response = await fetch(url);
if (response.ok) return response;
// Immediate retry -- hammers the server
}
throw new Error('All retries failed');
}
Exponential backoff
The simplest improvement: wait longer between each retry. The delay grows exponentially, giving the server time to recover.
function exponentialDelay(attempt, baseDelay = 1000) {
return baseDelay * Math.pow(2, attempt);
// attempt 0: 1000ms (1s)
// attempt 1: 2000ms (2s)
// attempt 2: 4000ms (4s)
// attempt 3: 8000ms (8s)
}
But pure exponential backoff has a problem. If 1,000 clients all start retrying at the same time, they all wait 1 second, then all retry simultaneously, then all wait 2 seconds, then all retry simultaneously. The retries are synchronized, creating periodic bursts of traffic -- the thundering herd.
Why jitter matters
Jitter adds randomness to the delay, so clients spread their retries across time instead of all hitting at once. This single addition is the difference between a retry strategy that helps and one that kills your system. The tell in generated code: Math.pow(2, attempt) * baseDelay with no Math.random() anywhere. Without it, all your clients retry at the exact same moment and create a thundering herd that makes the outage worse.
// Full jitter: random delay between 0 and the exponential max
function fullJitter(attempt, baseDelay = 1000) {
const maxDelay = baseDelay * Math.pow(2, attempt);
return Math.random() * maxDelay;
}
// Equal jitter: half fixed + half random (more predictable minimum wait)
function equalJitter(attempt, baseDelay = 1000) {
const maxDelay = baseDelay * Math.pow(2, attempt);
const half = maxDelay / 2;
return half + Math.random() * half;
}
// Decorrelated jitter (AWS recommendation): each delay is random
// between baseDelay and 3x the previous delay
const MAX_DELAY = 30000; // cap so a single delay never grows unbounded
function decorrelatedJitter(previousDelay, baseDelay = 1000) {
return Math.min(
MAX_DELAY,
Math.random() * (previousDelay * 3 - baseDelay) + baseDelay
);
}
Retry strategy comparison
| Strategy | Formula | Spread | Use case |
|---|---|---|---|
| Fixed delay | baseDelay | None | Simple internal services with low traffic |
| Exponential | base * 2^attempt | None (thundering herd risk) | Only as a building block |
| Exponential + full jitter | random(0, base * 2^attempt) | Maximum | Default choice for most integrations |
| Exponential + equal jitter | half + random(0, half) | Good, with minimum wait | When you need a guaranteed minimum delay |
| Decorrelated jitter | random(base, prev * 3) | Excellent | AWS-recommended, good for high-concurrency |
| Linear backoff | base * attempt | None | Very slow growth, specific throttling scenarios |
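The Spread column is the part that matters. A quick simulation (client count and bucket size are arbitrary choices) makes it concrete: without jitter every client lands in the same instant, while full jitter spreads them across the whole window:

```typescript
// Count the largest number of clients whose retry delays land in the
// same 100ms window -- i.e. the size of the worst traffic burst.
function peakCollisions(delays: number[], bucketMs = 100): number {
  const buckets = new Map<number, number>();
  for (const d of delays) {
    const b = Math.floor(d / bucketMs);
    buckets.set(b, (buckets.get(b) ?? 0) + 1);
  }
  return Math.max(...buckets.values());
}

const CLIENTS = 1000;
const maxDelay = 1000 * 2 ** 2; // attempt 2 -> 4000ms window

const noJitter = Array.from({ length: CLIENTS }, () => maxDelay);
const withJitter = Array.from({ length: CLIENTS }, () => Math.random() * maxDelay);

console.log(peakCollisions(noJitter));   // 1000 -- one synchronized burst
console.log(peakCollisions(withJitter)); // roughly 30-50 -- spread over 40 windows
```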
Complete retry implementation
Here is a production-ready retry function that handles all the edge cases.
interface RetryOptions {
maxRetries: number; // Maximum number of retry attempts
baseDelay: number; // Base delay in milliseconds
maxDelay: number; // Cap on delay to prevent absurd waits
maxTotalTime: number; // Total time budget for all retries
retryableStatuses: number[]; // Which HTTP codes to retry
}
const DEFAULT_OPTIONS: RetryOptions = {
maxRetries: 3,
baseDelay: 1000,
maxDelay: 30000, // 30 seconds max per retry
maxTotalTime: 60000, // 60 seconds total budget
retryableStatuses: [408, 429, 500, 502, 503, 504],
};
async function fetchWithRetry(
url: string,
init: RequestInit = {},
options: Partial<RetryOptions> = {}
): Promise<Response> {
const opts = { ...DEFAULT_OPTIONS, ...options };
const startTime = Date.now();
for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
try {
const response = await fetch(url, init);
// Success -- return immediately
if (response.ok) return response;
// Non-retryable error -- fail immediately
if (!opts.retryableStatuses.includes(response.status)) {
throw new HttpError(response.status, await response.text());
}
// 429 with Retry-After header -- respect the server's request
if (response.status === 429) {
const retryAfter = response.headers.get('Retry-After');
if (retryAfter) {
// Seconds form only; Retry-After can also be an HTTP-date
const waitMs = parseInt(retryAfter, 10) * 1000;
if (!Number.isNaN(waitMs)) {
await sleep(Math.min(waitMs, opts.maxDelay));
continue;
}
}
}
} catch (error) {
// Network errors (DNS failure, connection refused) are retryable
if (error instanceof HttpError) throw error;
if (attempt === opts.maxRetries) throw error;
}
// Out of attempts -- fall through to the final throw, don't sleep pointlessly
if (attempt === opts.maxRetries) break;
// Check total time budget
if (Date.now() - startTime > opts.maxTotalTime) {
throw new Error(`Retry budget exhausted after ${Date.now() - startTime}ms`);
}
// Calculate delay with full jitter
const exponentialDelay = opts.baseDelay * Math.pow(2, attempt);
const cappedDelay = Math.min(exponentialDelay, opts.maxDelay);
const jitteredDelay = Math.random() * cappedDelay;
console.log(
`Retry ${attempt + 1}/${opts.maxRetries} after ${Math.round(jitteredDelay)}ms`
);
await sleep(jitteredDelay);
}
throw new Error(`All ${opts.maxRetries} retries exhausted`);
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
class HttpError extends Error {
constructor(public status: number, public body: string) {
super(`HTTP ${status}: ${body}`);
}
}
When NOT to retry
Retries are not always the answer; retrying the wrong errors makes things worse. AI-generated code typically retries everything. Here is which errors to retry and why:
| Scenario | Retry? | Why |
|---|---|---|
| 400 Bad Request | No | Your payload is wrong -- fix it |
| 401 Unauthorized | Only after re-auth | Retry the same token and you get the same error |
| 403 Forbidden | No | You lack permissions -- no amount of retrying changes that |
| 404 Not Found | No | The resource does not exist |
| 409 Conflict | No | State conflict -- investigate before retrying |
| 422 Unprocessable | No | Semantically invalid -- fix the data |
| 429 Too Many Requests | Yes, after delay | Respect Retry-After header |
| 500+ Server errors | Yes, with backoff | Server might recover |
| Network timeout | Yes, with backoff | Transient network issue |
| DNS resolution failure | Yes, with backoff | DNS might be temporarily down |
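The table collapses to a small predicate. A sketch (`shouldRetry` and its return shape are hypothetical names; the status set mirrors the `retryableStatuses` default from the implementation above):

```typescript
// Encode the retry decision table as a predicate.
const RETRYABLE = new Set([408, 429, 500, 502, 503, 504]);

type RetryDecision =
  | { retry: false; reason: string }
  | { retry: true; afterMs?: number };

function shouldRetry(status: number, retryAfter?: string): RetryDecision {
  if (!RETRYABLE.has(status)) {
    // 4xx client errors: fix the request or the auth state, not the timing
    return { retry: false, reason: `HTTP ${status}: retrying will not help` };
  }
  if (status === 429 && retryAfter) {
    // Seconds form only; Retry-After can also be an HTTP-date
    return { retry: true, afterMs: parseInt(retryAfter, 10) * 1000 };
  }
  return { retry: true };
}

console.log(shouldRetry(400)); // { retry: false, reason: '...' }
console.log(shouldRetry(429, '7')); // { retry: true, afterMs: 7000 }
```

Centralizing the decision this way also keeps individual call sites from quietly retrying a 403 in a loop.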
Timeout cascades
This is one of the most subtle and dangerous failure modes. AI-generated timeout configurations get this wrong almost every time, setting all timeouts to the same value across the service chain. Consider a request chain:
Client (timeout: 5s)
--> API Gateway (timeout: 10s)
--> Order Service (timeout: 30s)
--> Payment Service (timeout: 60s)
The client gives up after 5 seconds. But the API Gateway keeps waiting for 10 seconds. The Order Service keeps waiting for 30 seconds. The Payment Service keeps working for 60 seconds. You have three services doing work that nobody is waiting for anymore.
The rule: upstream timeout must be less than downstream timeout.
Client (timeout: 30s)
--> API Gateway (timeout: 25s)
--> Order Service (timeout: 20s)
--> Payment Service (timeout: 15s)
Now when the Payment Service is slow, the Order Service gives up at 20 seconds and returns an error to the API Gateway, which returns it to the client. No wasted work.
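One way to keep the rule honest is to derive every hop's timeout from the client's budget instead of configuring each service by hand. A sketch (the hop names and the flat 5-second per-hop margin are assumptions; real systems also subtract observed network latency):

```typescript
// Derive strictly decreasing timeouts from a single client budget,
// so "upstream timeout > downstream timeout" holds by construction.
function deriveTimeouts(
  clientBudgetMs: number,
  hops: string[],
  marginMs = 5000
): Map<string, number> {
  const timeouts = new Map<string, number>();
  let budget = clientBudgetMs;
  for (const hop of hops) {
    budget -= marginMs;
    if (budget <= 0) {
      throw new Error(`Budget too small for ${hops.length} hops`);
    }
    timeouts.set(hop, budget);
  }
  return timeouts;
}

// Reproduces the corrected chain: 30s client budget
const chain = deriveTimeouts(30_000, [
  'api-gateway',     // gets 25000ms
  'order-service',   // gets 20000ms
  'payment-service', // gets 15000ms
]);
```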
// Configuring timeouts correctly in a service chain
const DOWNSTREAM_TIMEOUT = 15000; // 15s for the service we call
const OUR_TIMEOUT = 20000; // 20s for our own processing (> downstream)
// When calling downstream
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), DOWNSTREAM_TIMEOUT);
try {
const response = await fetch('https://payment-service.internal/charge', {
signal: controller.signal,
method: 'POST',
body: JSON.stringify(chargeData),
});
return response;
} catch (error) {
if (error.name === 'AbortError') {
// Downstream timed out -- return 504 to our caller
return new Response('Payment service timeout', { status: 504 });
}
throw error;
} finally {
clearTimeout(timeout);
}
Circuit breakers
When a service is consistently failing, retries just add load. A circuit breaker detects persistent failures and stops calling the service entirely for a cooldown period.
class CircuitBreaker {
private failures = 0;
private lastFailureTime = 0;
private state: 'closed' | 'open' | 'half-open' = 'closed';
constructor(
private threshold: number = 5, // failures before opening
private cooldown: number = 30000 // 30s before trying again
) {}
async call<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'open') {
if (Date.now() - this.lastFailureTime > this.cooldown) {
this.state = 'half-open'; // Try one request
} else {
throw new Error('Circuit breaker is open');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
this.failures = 0;
this.state = 'closed';
}
private onFailure() {
this.failures++;
this.lastFailureTime = Date.now();
if (this.failures >= this.threshold) {
this.state = 'open';
}
}
}
// Usage
const paymentBreaker = new CircuitBreaker(5, 30000);
try {
const result = await paymentBreaker.call(() =>
fetchWithRetry('https://payment-service.internal/charge', { method: 'POST' })
);
} catch (error) {
// Circuit open, or the call still failed -- use a fallback or queue for later
await queueForLater(chargeData);
}
The combination of retries with backoff, jitter, timeouts, and circuit breakers gives you a defense-in-depth strategy. Each layer protects against a different failure mode, and together they keep a single failing service from bringing down your entire system.
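As a closing sketch, here is one way the layers compose. The ordering is a real design decision: put the breaker outside the retry loop, so it counts "still failing after retries" rather than tripping on every transient blip. Everything here is deliberately stripped down for illustration: the mini breaker has no cooldown or half-open state, the base delay is tiny, and the dependency is a fake that always fails.

```typescript
type Call<T> = () => Promise<T>;

// Retry with capped exponential backoff + full jitter, as above
async function withRetry<T>(fn: Call<T>, retries = 2, baseMs = 1): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      const delay = Math.random() * Math.min(baseMs * 2 ** attempt, 2000);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// Stripped-down breaker: opens permanently after `threshold` failures
class MiniBreaker {
  private failures = 0;
  constructor(private threshold = 2) {}
  async call<T>(fn: Call<T>): Promise<T> {
    if (this.failures >= this.threshold) throw new Error('circuit open');
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      throw err;
    }
  }
}

const demo = (async () => {
  const breaker = new MiniBreaker(2);
  let attempts = 0;
  const alwaysDown: Call<string> = async () => {
    attempts++;
    throw new Error('service down');
  };
  for (let i = 0; i < 5; i++) {
    try {
      await breaker.call(() => withRetry(alwaysDown));
    } catch {
      // expected: retries exhausted, or circuit open
    }
  }
  // Calls 1-2 each run 3 attempts (1 + 2 retries) = 6 total; calls 3-5
  // are short-circuited by the open breaker and never touch the network.
  return attempts;
})();

demo.then(total => console.log(total)); // 6
```

The full CircuitBreaker above also re-closes via its half-open state; this mini version stays open forever once tripped, which is exactly why production breakers need the cooldown.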