A network call will eventually fail. DNS resolution times out. A server restarts mid-request. A database connection pool fills up. The question is never "will it fail?" but "what happens when it does?" Naive retry logic is one of the most common causes of cascading failures in distributed systems. Done right, retries make your system resilient. Done wrong, they make outages worse.
Why naive retries are dangerous
Imagine a server that is struggling under load. It returns 503 to half its requests. Now imagine 1,000 clients all immediately retry. The server now has 2,000 requests instead of 1,000. It returns 503 to even more requests. Those clients retry again. The server collapses entirely.
This is the retry storm, and it happens constantly in production. The fix is not to stop retrying -- it is to retry intelligently.
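The amplification is easy to quantify with a toy model (the numbers here are illustrative, and the model optimistically assumes the failure rate stays flat; in a real storm the extra load drives it higher):

```typescript
// With failure rate p and r immediate retries, offered load is
// clients * (1 + p + p^2 + ... + p^r): every failure respawns a request.
function offeredLoad(clients: number, failureRate: number, retries: number): number {
  let load = 0;
  for (let i = 0; i <= retries; i++) {
    load += clients * failureRate ** i;
  }
  return Math.round(load);
}

// 1,000 clients, 50% failures, 3 immediate retries:
console.log(offeredLoad(1000, 0.5, 3)); // 1875 requests hit the server
```

And that 1,875 is the floor: once the extra load pushes the failure rate up, the series compounds.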
// BAD: Immediate retry with no delay
async function fetchWithRetry(url, retries = 3) {
for (let i = 0; i < retries; i++) {
const response = await fetch(url);
if (response.ok) return response;
// Immediate retry -- hammers the server
}
throw new Error('All retries failed');
}
Exponential backoff
The simplest improvement: wait longer between each retry. The delay grows exponentially, giving the server time to recover.
function exponentialDelay(attempt, baseDelay = 1000) {
return baseDelay * Math.pow(2, attempt);
// attempt 0: 1000ms (1s)
// attempt 1: 2000ms (2s)
// attempt 2: 4000ms (4s)
// attempt 3: 8000ms (8s)
}
But pure exponential backoff has a problem. If 1,000 clients all start retrying at the same time, they all wait 1 second, then all retry simultaneously, then all wait 2 seconds, then all retry simultaneously. The retries are synchronized, creating periodic bursts of traffic -- the thundering herd.
Why jitter matters
Jitter adds randomness to the delay, so clients spread their retries across time instead of all hitting at once. This single addition is the difference between a retry strategy that helps and one that kills your system. The tell in generated code: Math.pow(2, attempt) * baseDelay with no Math.random() anywhere. Without it, all your clients retry at the exact same moment and create a thundering herd that makes the outage worse.
// Full jitter: random delay between 0 and the exponential max
function fullJitter(attempt, baseDelay = 1000) {
const maxDelay = baseDelay * Math.pow(2, attempt);
return Math.random() * maxDelay;
}
// Equal jitter: half fixed + half random (more predictable minimum wait)
function equalJitter(attempt, baseDelay = 1000) {
const maxDelay = baseDelay * Math.pow(2, attempt);
const half = maxDelay / 2;
return half + Math.random() * half;
}
// Decorrelated jitter (AWS recommendation): each delay is random
// between baseDelay and 3x the previous delay
const MAX_DELAY = 30000; // cap so a single delay never grows unbounded
function decorrelatedJitter(previousDelay, baseDelay = 1000) {
return Math.min(
MAX_DELAY,
Math.random() * (previousDelay * 3 - baseDelay) + baseDelay
);
}
Retry strategy comparison
| Strategy | Formula | Spread | Use case |
|---|---|---|---|
| Fixed delay | baseDelay | None | Simple internal services with low traffic |
| Exponential | base * 2^attempt | None (thundering herd risk) | Only as a building block |
| Exponential + full jitter | random(0, base * 2^attempt) | Maximum | Default choice for most integrations |
| Exponential + equal jitter | half + random(0, half) | Good, with minimum wait | When you need a guaranteed minimum delay |
| Decorrelated jitter | random(base, prev * 3) | Excellent | AWS-recommended, good for high-concurrency |
| Linear backoff | base * attempt | None | Very slow growth, specific throttling scenarios |
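The Spread column is the part that matters. A quick simulation (client count and bucket size are arbitrary choices) makes it concrete: without jitter every client lands in the same instant, while full jitter spreads them across the whole window:

```typescript
// Count the largest number of clients whose retry delays land in the
// same 100ms window -- i.e. the size of the worst traffic burst.
function peakCollisions(delays: number[], bucketMs = 100): number {
  const buckets = new Map<number, number>();
  for (const d of delays) {
    const b = Math.floor(d / bucketMs);
    buckets.set(b, (buckets.get(b) ?? 0) + 1);
  }
  return Math.max(...buckets.values());
}

const CLIENTS = 1000;
const maxDelay = 1000 * 2 ** 2; // attempt 2 -> 4000ms window

const noJitter = Array.from({ length: CLIENTS }, () => maxDelay);
const withJitter = Array.from({ length: CLIENTS }, () => Math.random() * maxDelay);

console.log(peakCollisions(noJitter));   // 1000 -- one synchronized burst
console.log(peakCollisions(withJitter)); // roughly 30-50 -- spread over 40 windows
```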
Complete retry implementation
Here is a production-ready retry function that handles all the edge cases.
interface RetryOptions {
maxRetries: number; // Maximum number of retry attempts
baseDelay: number; // Base delay in milliseconds
maxDelay: number; // Cap on delay to prevent absurd waits
maxTotalTime: number; // Total time budget for all retries
retryableStatuses: number[]; // Which HTTP codes to retry
}
const DEFAULT_OPTIONS: RetryOptions = {
maxRetries: 3,
baseDelay: 1000,
maxDelay: 30000, // 30 seconds max per retry
maxTotalTime: 60000, // 60 seconds total budget
retryableStatuses: [408, 429, 500, 502, 503, 504],
};
async function fetchWithRetry(
url: string,
init: RequestInit = {},
options: Partial<RetryOptions> = {}
): Promise<Response> {
const opts = { ...DEFAULT_OPTIONS, ...options };
const startTime = Date.now();
for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
try {
const response = await fetch(url, init);
// Success -- return immediately
if (response.ok) return response;
// Non-retryable error -- fail immediately
if (!opts.retryableStatuses.includes(response.status)) {
throw new HttpError(response.status, await response.text());
}
// 429 with Retry-After header -- respect the server's request
if (response.status === 429) {
const retryAfter = response.headers.get('Retry-After');
if (retryAfter) {
// Seconds form only; Retry-After can also be an HTTP-date
const waitMs = parseInt(retryAfter, 10) * 1000;
if (!Number.isNaN(waitMs)) {
await sleep(Math.min(waitMs, opts.maxDelay));
continue;
}
}
}
} catch (error) {
// Network errors (DNS failure, connection refused) are retryable
if (error instanceof HttpError) throw error;
if (attempt === opts.maxRetries) throw error;
}
// Out of attempts -- fall through to the final throw, don't sleep pointlessly
if (attempt === opts.maxRetries) break;
// Check total time budget
if (Date.now() - startTime > opts.maxTotalTime) {
throw new Error(`Retry budget exhausted after ${Date.now() - startTime}ms`);
}
// Calculate delay with full jitter
const exponentialDelay = opts.baseDelay * Math.pow(2, attempt);
const cappedDelay = Math.min(exponentialDelay, opts.maxDelay);
const jitteredDelay = Math.random() * cappedDelay;
console.log(
`Retry ${attempt + 1}/${opts.maxRetries} after ${Math.round(jitteredDelay)}ms`
);
await sleep(jitteredDelay);
}
throw new Error(`All ${opts.maxRetries} retries exhausted`);
}
function sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
class HttpError extends Error {
constructor(public status: number, public body: string) {
super(`HTTP ${status}: ${body}`);
}
}
When NOT to retry
Retries are not always the answer; retrying the wrong errors makes things worse. AI-generated code typically retries everything. Here is which errors to retry and why:
| Scenario | Retry? | Why |
|---|---|---|
| 400 Bad Request | No | Your payload is wrong -- fix it |
| 401 Unauthorized | Only after re-auth | Retry the same token and you get the same error |
| 403 Forbidden | No | You lack permissions -- no amount of retrying changes that |
| 404 Not Found | No | The resource does not exist |
| 409 Conflict | No | State conflict -- investigate before retrying |
| 422 Unprocessable | No | Semantically invalid -- fix the data |
| 429 Too Many Requests | Yes, after delay | Respect Retry-After header |
| 500+ Server errors | Yes, with backoff | Server might recover |
| Network timeout | Yes, with backoff | Transient network issue |
| DNS resolution failure | Yes, with backoff | DNS might be temporarily down |
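The table collapses to a small predicate. A sketch (`shouldRetry` and its return shape are hypothetical names; the status set mirrors the `retryableStatuses` default from the implementation above):

```typescript
// Encode the retry decision table as a predicate.
const RETRYABLE = new Set([408, 429, 500, 502, 503, 504]);

type RetryDecision =
  | { retry: false; reason: string }
  | { retry: true; afterMs?: number };

function shouldRetry(status: number, retryAfter?: string): RetryDecision {
  if (!RETRYABLE.has(status)) {
    // 4xx client errors: fix the request or the auth state, not the timing
    return { retry: false, reason: `HTTP ${status}: retrying will not help` };
  }
  if (status === 429 && retryAfter) {
    // Seconds form only; Retry-After can also be an HTTP-date
    return { retry: true, afterMs: parseInt(retryAfter, 10) * 1000 };
  }
  return { retry: true };
}

console.log(shouldRetry(400)); // { retry: false, reason: '...' }
console.log(shouldRetry(429, '7')); // { retry: true, afterMs: 7000 }
```

Centralizing the decision this way also keeps individual call sites from quietly retrying a 403 in a loop.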
Timeout cascades
This is one of the most subtle and dangerous failure modes. AI-generated timeout configurations get this wrong almost every time, setting all timeouts to the same value across the service chain. Consider a request chain:
Client (timeout: 5s)
--> API Gateway (timeout: 10s)
--> Order Service (timeout: 30s)
--> Payment Service (timeout: 60s)
The client gives up after 5 seconds. But the API Gateway keeps waiting for 10 seconds. The Order Service keeps waiting for 30 seconds. The Payment Service keeps working for 60 seconds. You have three services doing work that nobody is waiting for anymore.
The rule: upstream timeout must be less than downstream timeout.
Client (timeout: 30s)
--> API Gateway (timeout: 25s)
--> Order Service (timeout: 20s)
--> Payment Service (timeout: 15s)
Now when the Payment Service is slow, the Order Service gives up at 20 seconds and returns an error to the API Gateway, which returns it to the client. No wasted work.
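One way to keep the rule honest is to derive every hop's timeout from the client's budget instead of configuring each service by hand. A sketch (the hop names and the flat 5-second per-hop margin are assumptions; real systems also subtract observed network latency):

```typescript
// Derive strictly decreasing timeouts from a single client budget,
// so "upstream timeout > downstream timeout" holds by construction.
function deriveTimeouts(
  clientBudgetMs: number,
  hops: string[],
  marginMs = 5000
): Map<string, number> {
  const timeouts = new Map<string, number>();
  let budget = clientBudgetMs;
  for (const hop of hops) {
    budget -= marginMs;
    if (budget <= 0) {
      throw new Error(`Budget too small for ${hops.length} hops`);
    }
    timeouts.set(hop, budget);
  }
  return timeouts;
}

// Reproduces the corrected chain: 30s client budget
const chain = deriveTimeouts(30_000, [
  'api-gateway',     // gets 25000ms
  'order-service',   // gets 20000ms
  'payment-service', // gets 15000ms
]);
```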
// Configuring timeouts correctly in a service chain
const DOWNSTREAM_TIMEOUT = 15000; // 15s for the service we call
const OUR_TIMEOUT = 20000; // 20s for our own processing (> downstream)
// When calling downstream
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), DOWNSTREAM_TIMEOUT);
try {
const response = await fetch('https://payment-service.internal/charge', {
signal: controller.signal,
method: 'POST',
body: JSON.stringify(chargeData),
});
return response;
} catch (error) {
if (error.name === 'AbortError') {
// Downstream timed out -- return 504 to our caller
return new Response('Payment service timeout', { status: 504 });
}
throw error;
} finally {
clearTimeout(timeout);
}
Circuit breakers
When a service is consistently failing, retries just add load. A circuit breaker detects persistent failures and stops calling the service entirely for a cooldown period.
class CircuitBreaker {
private failures = 0;
private lastFailureTime = 0;
private state: 'closed' | 'open' | 'half-open' = 'closed';
constructor(
private threshold: number = 5, // failures before opening
private cooldown: number = 30000 // 30s before trying again
) {}
async call<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'open') {
if (Date.now() - this.lastFailureTime > this.cooldown) {
this.state = 'half-open'; // Try one request
} else {
throw new Error('Circuit breaker is open');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
this.failures = 0;
this.state = 'closed';
}
private onFailure() {
this.failures++;
this.lastFailureTime = Date.now();
if (this.failures >= this.threshold) {
this.state = 'open';
}
}
}
// Usage
const paymentBreaker = new CircuitBreaker(5, 30000);
try {
const result = await paymentBreaker.call(() =>
fetchWithRetry('https://payment-service.internal/charge', { method: 'POST' })
);
} catch (error) {
// Circuit open, or the call still failed -- use a fallback or queue for later
await queueForLater(chargeData);
}
The combination of retries with backoff, jitter, timeouts, and circuit breakers gives you a defense-in-depth strategy. Each layer protects against a different failure mode, and together they keep a single failing service from bringing down your entire system.
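As a closing sketch, here is one way the layers compose. The ordering is a real design decision: put the breaker outside the retry loop, so it counts "still failing after retries" rather than tripping on every transient blip. Everything here is deliberately stripped down for illustration: the mini breaker has no cooldown or half-open state, the base delay is tiny, and the dependency is a fake that always fails.

```typescript
type Call<T> = () => Promise<T>;

// Retry with capped exponential backoff + full jitter, as above
async function withRetry<T>(fn: Call<T>, retries = 2, baseMs = 1): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      const delay = Math.random() * Math.min(baseMs * 2 ** attempt, 2000);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// Stripped-down breaker: opens permanently after `threshold` failures
class MiniBreaker {
  private failures = 0;
  constructor(private threshold = 2) {}
  async call<T>(fn: Call<T>): Promise<T> {
    if (this.failures >= this.threshold) throw new Error('circuit open');
    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      throw err;
    }
  }
}

const demo = (async () => {
  const breaker = new MiniBreaker(2);
  let attempts = 0;
  const alwaysDown: Call<string> = async () => {
    attempts++;
    throw new Error('service down');
  };
  for (let i = 0; i < 5; i++) {
    try {
      await breaker.call(() => withRetry(alwaysDown));
    } catch {
      // expected: retries exhausted, or circuit open
    }
  }
  // Calls 1-2 each run 3 attempts (1 + 2 retries) = 6 total; calls 3-5
  // are short-circuited by the open breaker and never touch the network.
  return attempts;
})();

demo.then(total => console.log(total)); // 6
```

The full CircuitBreaker above also re-closes via its half-open state; this mini version stays open forever once tripped, which is exactly why production breakers need the cooldown.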