AI pitfall
AI-generated HTTP client code almost never includes timeouts. A default fetch() call has no timeout; it will wait indefinitely. This is the single most common reliability mistake in distributed systems, and AI reproduces it faithfully because the training data is full of timeout-free code.

Without a timeout, a single hung connection can hold a thread, a socket, and a database connection hostage for minutes or hours. Multiply that across concurrent requests and you have a system that appears to be "slow" when it is actually stuck, waiting for a response that will never come.
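The fix is a single option. A minimal sketch, assuming Node 18+ where fetch and AbortSignal.timeout are built in (fetchWithTimeout and its 5-second default are illustrative, not a standard API):

```typescript
// Hypothetical helper: a fetch that always carries a deadline.
async function fetchWithTimeout(url: string, ms = 5000): Promise<Response> {
  // AbortSignal.timeout() aborts the request after `ms` milliseconds
  // instead of waiting forever on a hung connection.
  return fetch(url, { signal: AbortSignal.timeout(ms) });
}
```

Making the timeout a required part of the helper's signature is the point: callers can tune the number, but they cannot forget it.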

Types of timeouts

Good to know
Connect timeouts should always be short: 2-3 seconds at most. If a TCP connection cannot be established in that time, waiting longer will not help. Read timeouts depend on what the server is doing and should be tuned per endpoint.

Not all timeouts measure the same thing. Understanding the distinction between them helps you configure each one correctly.

| Timeout type | What it measures | Typical range | When it fires |
| --- | --- | --- | --- |
| Connect timeout | Time to establish a TCP connection | 1-3 seconds | The server is unreachable or a firewall is dropping packets |
| Read timeout | Time to receive the first byte of the response body | 3-30 seconds | The server accepted the connection but is processing slowly |
| Total timeout | Wall-clock time for the entire request/response cycle | 5-60 seconds | Large payloads, slow servers, or DNS resolution delays |
| Idle timeout | Time a connection can sit unused before being closed | 30-120 seconds | Connection pooling, keep-alive management |

Connect timeouts should always be short. If you cannot establish a TCP connection in 2-3 seconds, the server is either down or a network device is silently dropping packets. Waiting 30 seconds will not help; it will just waste 30 seconds.

Read timeouts depend on what the server is doing. A simple CRUD lookup should respond in under a second. A report generation endpoint might legitimately take 15 seconds. Set your read timeout based on what is reasonable for that specific operation.

// Using node-fetch with separate connect and total timeouts
import fetch from 'node-fetch';
import { Agent } from 'https';   // https.Agent, because the URL is https://

const agent = new Agent({
  timeout: 3000,          // socket timeout, covers the connect phase: 3 seconds
  keepAlive: true,
  maxSockets: 50,
});

const response = await fetch('https://api.example.com/data', {
  agent,
  signal: AbortSignal.timeout(10000),  // total timeout: 10 seconds
});

Axios timeout configuration

import axios from 'axios';
import { Agent } from 'https';

const client = axios.create({
  baseURL: 'https://api.example.com',
  timeout: 10000,                 // total timeout: 10 seconds
  // Axios has no separate connect timeout; approximate one at the
  // socket level via the agent (httpsAgent because the base URL is https)
  httpsAgent: new Agent({
    timeout: 3000,                // socket-level connect timeout
  }),
});

// Per-request override for slow endpoints
const report = await client.get('/reports/annual', {
  timeout: 30000,                 // this one is legitimately slow
});
Edge case
An AbortSignal.timeout(5000) in Node.js creates a timeout on the entire fetch operation including DNS resolution and TLS handshake. If DNS takes 3 seconds (rare but possible on cold starts), you only have 2 seconds left for the actual request. Account for infrastructure overhead when setting timeouts.
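Because the timeout covers everything from DNS to the last body byte, you often want to combine it with the caller's own abort signal so that upstream cancellation still wins. A sketch, assuming Node 20+ for AbortSignal.any (withDeadline is a hypothetical helper name):

```typescript
// Combine an overall deadline with a caller-supplied signal:
// the request aborts when EITHER one fires.
function withDeadline(caller: AbortSignal, totalMs: number): AbortSignal {
  return AbortSignal.any([caller, AbortSignal.timeout(totalMs)]);
}

// Usage: fetch(url, { signal: withDeadline(req.signal, 10000) })
```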

Timeout propagation

This is where most developers get tripped up. You set a 10-second timeout on your API gateway and think you are safe. But your handler calls three services sequentially, each with a 10-second timeout. The actual worst case is 30 seconds, three times your gateway timeout. The gateway times out after 10 seconds and returns a 504 to the user, but your handler keeps running in the background, consuming resources for another 20 seconds.

User request (10s budget)
  ├── Service A (10s timeout) → might use 10s
  ├── Service B (10s timeout) → might use 10s
  └── Service C (10s timeout) → might use 10s

Worst case: 30 seconds. Gateway timed out at 10s.
The user sees an error. Your server is still working on a dead request.

The fix is timeout budgets: distribute your total timeout across all downstream calls.

class TimeoutBudget {
  private deadline: number;

  constructor(totalMs: number) {
    this.deadline = Date.now() + totalMs;
  }

  remaining(): number {
    return Math.max(0, this.deadline - Date.now());
  }

  hasExpired(): boolean {
    return Date.now() >= this.deadline;
  }

  // Create a signal that aborts when the budget is exhausted
  toAbortSignal(): AbortSignal {
    return AbortSignal.timeout(this.remaining());
  }
}

// Usage: distribute a 10-second budget across three calls
async function handleRequest(req: Request): Promise<Response> {
  const budget = new TimeoutBudget(10000); // 10 seconds total

  // Each call gets whatever time is left
  const user = await fetch('https://users.internal/api/user/123', {
    signal: budget.toAbortSignal(),
  }).then((r) => r.json());

  if (budget.hasExpired()) {
    return Response.json({ error: 'Timeout' }, { status: 504 });
  }

  const orders = await fetch('https://orders.internal/api/user/123/orders', {
    signal: budget.toAbortSignal(),
  }).then((r) => r.json());

  if (budget.hasExpired()) {
    return Response.json(
      { user, orders: [], note: 'Orders timed out' },
      { status: 200 }  // partial success
    );
  }

  const recommendations = await fetch('https://recs.internal/api/user/123', {
    signal: budget.toAbortSignal(),
  }).then((r) => r.json()).catch(() => []);  // optional, fail silently

  return Response.json({ user, orders, recommendations });
}

Notice the progression: the user fetch is critical (fail the request if it times out), the orders fetch returns partial data, and the recommendations are optional (catch and return empty array). This is graceful degradation in action.
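That optional tier shows up so often that it is worth factoring into a helper. A sketch (optional is a hypothetical name, not a library function):

```typescript
// Hypothetical helper for the "optional" tier: resolve with a fallback
// instead of rejecting, so a failed nice-to-have never fails the request.
async function optional<T>(p: Promise<T>, fallback: T): Promise<T> {
  try {
    return await p;
  } catch {
    return fallback;
  }
}

// Usage: const recs = await optional(fetchRecommendations(userId), []);
```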


Graceful degradation

Graceful degradation means your application still works, just with reduced functionality, when a dependency fails. Instead of showing a blank error page, you show what you can with what you have.

| Dependency fails | Bad response | Graceful degradation |
| --- | --- | --- |
| Recommendation engine | 500 Internal Server Error | Show "Top sellers" from a static list |
| User avatar service | Broken image icon | Show initials or a default avatar |
| Search service | "Service unavailable" page | Show category browsing instead |
| Real-time pricing | Old price or no price shown | Show cached price with "prices may vary" |
| Analytics service | Entire page fails to load | Disable tracking silently; page loads fine |

async function getProductPage(productId: string): Promise<ProductPage> {
  // Critical: product data must exist
  const product = await productService.getProduct(productId);

  // Degraded: show cached price if pricing service is down
  let price: Price;
  try {
    price = await pricingService.getPrice(productId);
  } catch {
    price = await cache.get(`price:${productId}`) ?? product.basePrice;
  }

  // Optional: recommendations are nice but not essential
  const recommendations = await recommendationService
    .getRecommendations(productId)
    .catch(() => getFallbackRecommendations(product.category));

  // Optional: reviews are non-critical
  const reviews = await reviewService
    .getReviews(productId)
    .catch(() => ({ items: [], total: 0, note: 'Reviews temporarily unavailable' }));

  return { product, price, recommendations, reviews };
}

Cache as a safety net

Caching is the most common degradation strategy. When the live service is down, serve stale data. Stale data is almost always better than no data.

async function fetchWithCache<T>(
  key: string,
  fetcher: () => Promise<T>,
  ttlMs: number = 300000  // 5 minutes default
): Promise<T> {
  try {
    const fresh = await fetcher();
    await cache.set(key, fresh, ttlMs);
    return fresh;
  } catch (error) {
    // Live fetch failed - try cache
    const cached = await cache.get<T>(key);
    if (cached) {
      console.warn(`Serving stale cache for ${key}: ${error}`);
      return cached;
    }

    // Nothing in cache either - now we truly fail
    throw error;
  }
}

Bulkhead pattern

The bulkhead pattern is borrowed from ship design: ships have watertight compartments (bulkheads) so that a breach in one compartment does not sink the entire ship. In software, a bulkhead isolates failures so that one misbehaving dependency cannot consume all your resources.

Without a bulkhead, all your outbound HTTP calls share the same connection pool. If the recommendation service starts hanging, it consumes all connections. Now the payment service cannot get a connection either, even though it is perfectly healthy.

// Without bulkhead: shared connection pool
// If recs hangs, payments can't get a connection
const sharedPool = new ConnectionPool({ maxConnections: 50 });

// With bulkhead: isolated pools per dependency
const paymentPool = new ConnectionPool({ maxConnections: 20 });
const recsPool = new ConnectionPool({ maxConnections: 10 });
const notificationPool = new ConnectionPool({ maxConnections: 5 });

Semaphore-based bulkhead

When you cannot control connection pools directly, use a semaphore to limit concurrency per dependency.

class Semaphore {
  private queue: Array<() => void> = [];
  private active = 0;

  constructor(private maxConcurrency: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.maxConcurrency) {
      this.active++;
      return;
    }

    return new Promise<void>((resolve) => {
      this.queue.push(() => {
        this.active++;
        resolve();
      });
    });
  }

  release(): void {
    this.active--;
    const next = this.queue.shift();
    if (next) next();
  }
}

// Isolate each dependency
const paymentBulkhead = new Semaphore(10);
const recsBulkhead = new Semaphore(5);

async function callPaymentService(data: PaymentRequest): Promise<PaymentResult> {
  await paymentBulkhead.acquire();
  try {
    return await fetch('https://payments.internal/charge', {
      method: 'POST',
      body: JSON.stringify(data),
      signal: AbortSignal.timeout(5000),
    }).then((r) => r.json());
  } finally {
    paymentBulkhead.release();
  }
}
| Pattern | What it protects | How |
| --- | --- | --- |
| Timeout | Individual requests from hanging | Abort after N milliseconds |
| Timeout budget | Total request time across dependencies | Distribute the budget across calls |
| Graceful degradation | User experience when dependencies fail | Serve cached or default data |
| Bulkhead | Healthy dependencies from failing ones | Isolate resource pools |

These patterns are not alternatives; they are layers. A well-built system uses timeouts inside circuit breakers inside bulkheads, with graceful degradation as the final safety net. Each layer catches what the previous one missed.
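To make the layering concrete, here is a sketch that composes a bulkhead slot, a timeout, and a fallback in one call path. resilientGet and the Bulkhead interface are illustrative names (the Semaphore from this lesson satisfies the interface), not a library API:

```typescript
// Minimal shape a bulkhead needs; the Semaphore above fits it.
interface Bulkhead {
  acquire(): Promise<void>;
  release(): void;
}

// The layers composed: bounded concurrency (bulkhead), bounded latency
// (timeout), bounded damage (fallback instead of a thrown error).
async function resilientGet<T>(
  url: string,
  bulkhead: Bulkhead,
  fallback: () => Promise<T>,
  timeoutMs = 5000
): Promise<T> {
  await bulkhead.acquire();                      // bulkhead layer
  try {
    const res = await fetch(url, {
      signal: AbortSignal.timeout(timeoutMs),    // timeout layer
    });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return (await res.json()) as T;
  } catch {
    return fallback();                           // degradation layer
  } finally {
    bulkhead.release();
  }
}
```

A circuit breaker would slot in between the bulkhead and the fetch, skipping the call entirely once the dependency is known to be down.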