Every integration you build is a bet that someone else's service will be available when you need it. The question is not whether your integrations will fail; they will. The question is what happens to your system when they do.
Why integrations are fragile
Between your fetch() and the response body, your request crosses DNS resolution, TLS handshakes, load balancers, reverse proxies, application servers, database connections, and the return trip. Any link in that chain can break.
The math works against you: if each service has 99.9% uptime (roughly 8.7 hours of downtime per year), five services chained together give you 99.5% uptime, roughly 43 hours of downtime per year. That is five times the downtime of any individual service.
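The compounding above is just multiplication: every dependency in the chain multiplies its availability into the total. A quick sketch:

```typescript
// Compounded availability: each dependency in the chain multiplies in.
function chainedUptime(perServiceUptime: number, serviceCount: number): number {
  return Math.pow(perServiceUptime, serviceCount);
}

const uptime = chainedUptime(0.999, 5);
// Convert the unavailable fraction into hours of downtime per year.
const downtimeHoursPerYear = (1 - uptime) * 365 * 24;

console.log(uptime.toFixed(5));              // about 0.995
console.log(downtimeHoursPerYear.toFixed(1)); // about 43.7 hours
```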
```typescript
// This innocent-looking function has 3 external failure points
async function processOrder(order: Order): Promise<OrderResult> {
  // Failure point 1: payment service
  const payment = await paymentService.charge(order.amount);

  // Failure point 2: inventory service
  await inventoryService.reserve(order.items);

  // Failure point 3: notification service
  await notificationService.sendConfirmation(order.email);

  return { status: 'completed', paymentId: payment.id };
}
```

If the notification service is down, should the entire order fail? The customer already paid and inventory was reserved. But without explicit handling, the unhandled rejection bubbles up and the order appears to fail.
The five failure modes
| Failure mode | Symptoms | Impact | Typical cause |
|---|---|---|---|
| Timeout | Request hangs, eventually errors after N seconds | Threads/connections held open, cascading slowdowns | Overloaded service, network congestion, large payloads |
| Connection refused | Immediate error, no response at all | Fast failure, relatively low impact if handled | Service is down, port not listening, firewall blocking |
| Slow response | Response arrives but takes 5-30x longer than normal | Resource exhaustion, user-facing latency spikes | Database under load, GC pauses, cold starts |
| Corrupt data | 200 OK but response body is malformed or incorrect | Silent data corruption, downstream bugs | Serialization bugs, version mismatches, partial responses |
| Partial failure | Some items in a batch succeed, others fail | Inconsistent state, hard to retry safely | One database row locked, one item out of stock |
Timeouts and connection refused are loud and obvious. Slow responses and corrupt data are dangerous because they can go undetected for hours.
Detecting each failure mode
```typescript
async function callWithDiagnostics(url: string): Promise<Response> {
  const start = Date.now();
  try {
    const response = await fetch(url, {
      signal: AbortSignal.timeout(5000), // Timeout after 5s
    });
    const duration = Date.now() - start;

    // Detect slow responses (even if they succeed)
    if (duration > 2000) {
      console.warn(`Slow response from ${url}: ${duration}ms`);
      metrics.recordSlowCall(url, duration);
    }

    // Detect corrupt data (status is OK but body is wrong)
    if (response.ok) {
      const contentType = response.headers.get('content-type');
      if (!contentType?.includes('application/json')) {
        console.error(`Unexpected content type from ${url}: ${contentType}`);
        throw new Error('Corrupt response: unexpected content type');
      }
    }

    return response;
  } catch (error) {
    const duration = Date.now() - start;
    if (error instanceof DOMException && error.name === 'TimeoutError') {
      // Timeout: request took too long
      metrics.recordTimeout(url, duration);
      throw new IntegrationError('TIMEOUT', url, duration);
    }
    if (error instanceof TypeError && error.message.includes('fetch failed')) {
      // Connection refused: service unreachable
      metrics.recordConnectionRefused(url);
      throw new IntegrationError('CONNECTION_REFUSED', url, duration);
    }
    throw error;
  }
}
```

Cascading failures
A cascading failure happens when one service fails and the failure propagates to every service that depends on it, and then to their dependents.
Here is a typical cascade:
- The database gets slow due to a long-running query
- The API service holds connections open waiting for the database
- The API connection pool fills up, and new requests start queuing
- The frontend gateway times out waiting for the API
- Users start retrying their requests, tripling the load
- The entire system grinds to a halt
The root cause was a slow query. The actual impact was a complete outage. Circuit breakers, timeouts, rate limiting, and bulkheads are all designed to stop step 2 from becoming step 6.
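A circuit breaker is the most direct way to break the chain: after repeated failures it stops sending traffic to the struggling service and fails fast instead of holding connections open. A minimal illustrative sketch (not production-ready; the thresholds are arbitrary):

```typescript
// Minimal circuit breaker: after `maxFailures` consecutive failures it
// "opens" and fails fast for `cooldownMs`, so callers stop piling requests
// onto a service that is already struggling.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.maxFailures) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('Circuit open: failing fast');
      }
      // Half-open: the cooldown expired, allow one probe request through.
      this.failures = this.maxFailures - 1;
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Failing fast is what prevents the connection-pool exhaustion in step 3: an open circuit rejects in microseconds instead of holding a connection for the full timeout.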
```
[Database slow] → [API holds connections] → [Connection pool full]
                                                      ↓
   [Users retry] ← [Frontend timeouts] ← [Requests queue up]
         ↓
[3x more load] → [Complete system failure]
```

Dependency mapping
A dependency map shows every service your application depends on, how critical each one is, and what happens when it fails.
Building your dependency map
```typescript
interface Dependency {
  name: string;
  type: 'critical' | 'degraded' | 'optional';
  timeout: number; // ms
  fallback: string; // what to do when it fails
  healthCheck: string; // URL to check
}

const dependencies: Dependency[] = [
  {
    name: 'Payment Service',
    type: 'critical', // Order cannot proceed without it
    timeout: 10000,
    fallback: 'reject order with retry prompt',
    healthCheck: 'https://payments.internal/health',
  },
  {
    name: 'Inventory Service',
    type: 'critical',
    timeout: 5000,
    fallback: 'reject order, show out-of-stock',
    healthCheck: 'https://inventory.internal/health',
  },
  {
    name: 'Notification Service',
    type: 'optional', // Order succeeds without it
    timeout: 3000,
    fallback: 'queue email for later delivery',
    healthCheck: 'https://notifications.internal/health',
  },
  {
    name: 'Recommendation Engine',
    type: 'degraded', // Show generic recs if down
    timeout: 2000,
    fallback: 'return top-selling products',
    healthCheck: 'https://recommendations.internal/health',
  },
];
```

| Dependency type | Meaning | Failure strategy |
|---|---|---|
| Critical | System cannot function without it | Retry with circuit breaker, fail the request if unrecoverable |
| Degraded | System works but with reduced functionality | Return cached or default data, hide the feature |
| Optional | System works fine without it | Fire and forget, queue for later, log and move on |
An optional dependency failure should never block a response. A degraded dependency should trigger a fallback, not an error. Only critical dependencies should be able to fail your request, and even then, only after retries and circuit breakers have had their chance.
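These three rules can be encoded in one thin wrapper. The sketch below is a hypothetical helper, not part of the dependency map above; `callDependency` and its parameters are illustrative names:

```typescript
type DependencyType = 'critical' | 'degraded' | 'optional';

// Route a failed call according to the dependency's type:
// critical rethrows, degraded falls back, optional logs and moves on.
async function callDependency<T>(
  type: DependencyType,
  call: () => Promise<T>,
  fallback: () => T,
): Promise<T | undefined> {
  try {
    return await call();
  } catch (err) {
    switch (type) {
      case 'critical':
        throw err;         // surface the failure (after retries/breaker)
      case 'degraded':
        return fallback(); // cached or default data
      case 'optional':
        console.warn('Optional dependency failed', err);
        return undefined;  // log and move on
    }
  }
}
```

Centralizing the decision keeps call sites honest: nobody can accidentally let a recommendation-engine hiccup fail a checkout.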
Measuring reliability
These four metrics, the "four golden signals", tell you how your integrations are performing:
| Metric | What it measures | Warning sign |
|---|---|---|
| Error rate | Percentage of requests that fail | Sudden spike above baseline |
| Latency (p50, p95, p99) | How long requests take | p99 growing while p50 stays flat |
| Throughput | Requests per second | Sudden drop without a deploy |
| Saturation | How close resources are to capacity | Connection pool above 80% |