Integration & APIs

Every integration you build is a bet that someone else's service will be available when you need it. The question is not whether your integrations will fail; they will. The question is what happens to your system when they do.

AI pitfall
Ask AI to build a service that calls three external APIs and it will chain them sequentially, with no differentiation in error handling. Every dependency gets treated as equally critical. A failed recommendation engine should not prevent an order from completing, but a failed payment service should. AI does not make this distinction unless you explicitly tell it to classify dependencies.

Why integrations are fragile

Between your fetch() and the response body, your request crosses DNS resolution, TLS handshakes, load balancers, reverse proxies, application servers, database connections, and the return trip. Any link in that chain can break.

The math works against you: if each service has 99.9% uptime (roughly 8.7 hours of downtime per year), five services chained together give you only 99.5% uptime: roughly 43 hours of downtime per year, five times worse than any individual service.
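The compounding can be sketched directly. A minimal sketch, assuming failures are independent (real outages are often correlated, which makes this optimistic):

```typescript
// Composite availability of N chained services with the same per-service uptime.
function chainedUptime(perServiceUptime: number, services: number): number {
  return Math.pow(perServiceUptime, services);
}

const uptime = chainedUptime(0.999, 5);               // ~0.995
const downtimeHoursPerYear = (1 - uptime) * 24 * 365; // ~43.7 hours
```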

// This innocent-looking function has 3 external failure points
async function processOrder(order: Order): Promise<OrderResult> {
  // Failure point 1: payment service
  const payment = await paymentService.charge(order.amount);

  // Failure point 2: inventory service
  await inventoryService.reserve(order.items);

  // Failure point 3: notification service
  await notificationService.sendConfirmation(order.email);

  return { status: 'completed', paymentId: payment.id };
}

If the notification service is down, should the entire order fail? The customer already paid and inventory was reserved. But without explicit handling, the unhandled rejection bubbles up and the order appears to fail.
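One fix is to make the classification explicit in code: only the critical calls are allowed to fail the order, while a notification failure is caught and queued. A sketch, reusing the hypothetical services from above plus an assumed `emailQueue` helper; the stubs exist only so the example is self-contained:

```typescript
// Stubs so the sketch runs on its own; real clients would be injected.
type Order = { amount: number; items: string[]; email: string };
type OrderResult = { status: string; paymentId: string };

const paymentService = { charge: async (_amount: number) => ({ id: 'pay_1' }) };
const inventoryService = { reserve: async (_items: string[]) => {} };
const notificationService = {
  sendConfirmation: async (_email: string): Promise<void> => {
    throw new Error('notification service down'); // simulate the outage
  },
};
const emailQueue = { enqueue: async (_msg: object) => {} }; // assumed helper

async function processOrder(order: Order): Promise<OrderResult> {
  // Critical dependencies: let failures propagate and fail the order
  const payment = await paymentService.charge(order.amount);
  await inventoryService.reserve(order.items);

  // Optional dependency: a failed email must not undo a paid order
  try {
    await notificationService.sendConfirmation(order.email);
  } catch {
    await emailQueue.enqueue({ to: order.email, paymentId: payment.id });
  }

  return { status: 'completed', paymentId: payment.id };
}
```

Here the order completes even while the notification service is down; the confirmation email is retried later from the queue.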


The five failure modes

| Failure mode | Symptoms | Impact | Typical cause |
|---|---|---|---|
| Timeout | Request hangs, eventually errors after N seconds | Threads/connections held open, cascading slowdowns | Overloaded service, network congestion, large payloads |
| Connection refused | Immediate error, no response at all | Fast failure, relatively low impact if handled | Service is down, port not listening, firewall blocking |
| Slow response | Response arrives but takes 5-30x longer than normal | Resource exhaustion, user-facing latency spikes | Database under load, GC pauses, cold starts |
| Corrupt data | 200 OK but response body is malformed or incorrect | Silent data corruption, downstream bugs | Serialization bugs, version mismatches, partial responses |
| Partial failure | Some items in a batch succeed, others fail | Inconsistent state, hard to retry safely | One database row locked, one item out of stock |

Timeouts and connection refused are loud and obvious. Slow responses and corrupt data are dangerous because they can go undetected for hours.

Detecting each failure mode

// Helpers assumed below: a typed error and a stand-in metrics sink
class IntegrationError extends Error {
  constructor(public code: string, public url: string, public durationMs: number) {
    super(`${code} calling ${url} after ${durationMs}ms`);
  }
}

const metrics = {
  recordSlowCall: (_url: string, _ms: number) => {},
  recordTimeout: (_url: string, _ms: number) => {},
  recordConnectionRefused: (_url: string) => {},
};

async function callWithDiagnostics(url: string): Promise<Response> {
  const start = Date.now();

  try {
    const response = await fetch(url, {
      signal: AbortSignal.timeout(5000), // Timeout after 5s
    });

    const duration = Date.now() - start;

    // Detect slow responses (even if they succeed)
    if (duration > 2000) {
      console.warn(`Slow response from ${url}: ${duration}ms`);
      metrics.recordSlowCall(url, duration);
    }

    // Detect corrupt data (status is OK but body is wrong)
    if (response.ok) {
      const contentType = response.headers.get('content-type');
      if (!contentType?.includes('application/json')) {
        console.error(`Unexpected content type from ${url}: ${contentType}`);
        throw new Error('Corrupt response: unexpected content type');
      }
    }

    return response;
  } catch (error) {
    const duration = Date.now() - start;

    if (error instanceof DOMException && error.name === 'TimeoutError') {
      // Timeout: request took too long
      metrics.recordTimeout(url, duration);
      throw new IntegrationError('TIMEOUT', url, duration);
    }

    if (error instanceof TypeError && error.message.includes('fetch failed')) {
      // Connection refused: service unreachable
      metrics.recordConnectionRefused(url);
      throw new IntegrationError('CONNECTION_REFUSED', url, duration);
    }

    throw error;
  }
}

Cascading failures

A cascading failure happens when one service fails and the failure propagates to every service that depends on it, and then to their dependents.

Here is a typical cascade:

  1. The database gets slow due to a long-running query
  2. The API service holds connections open waiting for the database
  3. The API connection pool fills up and new requests start queuing
  4. The frontend gateway times out waiting for the API
  5. Users start retrying their requests, tripling the load
  6. The entire system grinds to a halt

The root cause was a slow query. The actual impact was a complete outage. Circuit breakers, timeouts, rate limiting, and bulkheads are all designed to stop step 2 from becoming step 6.

Database slow → API holds connections → Connection pool full → Users retry → Frontend timeouts → Requests queue up → 3x more load → Complete system failure

Edge case
The most dangerous cascading failures are caused by slow responses, not outright failures. A service returning 503 triggers your error handling. A service that takes 29 seconds to respond holds your threads, fills your connection pool, and brings down everything behind it, while technically still "working."
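A circuit breaker is one way to stop that: after repeated failures it fails fast instead of holding connections open for a dead or crawling dependency. A minimal sketch; the class name, thresholds, and half-open policy are illustrative, not a production implementation:

```typescript
// Opens after `maxFailures` consecutive errors, then fails fast until the
// cooldown elapses, at which point one trial call is allowed through.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private maxFailures = 5,
    private cooldownMs = 30_000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen()) {
      throw new Error('circuit open: failing fast');
    }
    try {
      const result = await fn();
      this.failures = 0; // any success closes the breaker
      return result;
    } catch (error) {
      this.failures++;
      if (this.failures >= this.maxFailures) {
        this.openedAt = Date.now();
      }
      throw error;
    }
  }

  private isOpen(): boolean {
    if (this.failures < this.maxFailures) return false;
    // Half-open: permit one trial call after the cooldown
    if (Date.now() - this.openedAt > this.cooldownMs) {
      this.failures = this.maxFailures - 1;
      return false;
    }
    return true;
  }
}
```

The key property is that an open breaker answers in microseconds, so the 29-second dependency stops consuming your threads and connection pool.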

Dependency mapping

A dependency map shows every service your application depends on, how critical each one is, and what happens when it fails.

Building your dependency map

interface Dependency {
  name: string;
  type: 'critical' | 'degraded' | 'optional';
  timeout: number;        // ms
  fallback: string;       // what to do when it fails
  healthCheck: string;    // URL to check
}

const dependencies: Dependency[] = [
  {
    name: 'Payment Service',
    type: 'critical',           // Order cannot proceed without it
    timeout: 10000,
    fallback: 'reject order with retry prompt',
    healthCheck: 'https://payments.internal/health',
  },
  {
    name: 'Inventory Service',
    type: 'critical',
    timeout: 5000,
    fallback: 'reject order, show out-of-stock',
    healthCheck: 'https://inventory.internal/health',
  },
  {
    name: 'Notification Service',
    type: 'optional',           // Order succeeds without it
    timeout: 3000,
    fallback: 'queue email for later delivery',
    healthCheck: 'https://notifications.internal/health',
  },
  {
    name: 'Recommendation Engine',
    type: 'degraded',           // Show generic recs if down
    timeout: 2000,
    fallback: 'return top-selling products',
    healthCheck: 'https://recommendations.internal/health',
  },
];

| Dependency type | Meaning | Failure strategy |
|---|---|---|
| Critical | System cannot function without it | Retry with circuit breaker, fail the request if unrecoverable |
| Degraded | System works but with reduced functionality | Return cached or default data, hide the feature |
| Optional | System works fine without it | Fire and forget, queue for later, log and move on |

An optional dependency failure should never block a response. A degraded dependency should trigger a fallback, not an error. Only critical dependencies should be able to fail your request, and even then, only after retries and circuit breakers have had their chance.
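That policy can be centralized in one wrapper keyed on the dependency type. A sketch, assuming the `Dependency` shape above; the `callDependency` name and the `fallback` argument are hypothetical:

```typescript
// Route a call through the failure strategy its dependency type implies.
async function callDependency<T>(
  dep: { type: 'critical' | 'degraded' | 'optional' },
  call: () => Promise<T>,
  fallback: () => T,
): Promise<T | undefined> {
  try {
    return await call();
  } catch (error) {
    switch (dep.type) {
      case 'critical':
        throw error; // surface the failure; retries/breakers live upstream
      case 'degraded':
        return fallback(); // e.g. cached or default data
      case 'optional':
        return undefined; // log or queue, then move on
    }
  }
}
```

With this in place, a handler can call every dependency the same way and the classification, not ad-hoc try/catch blocks, decides what a failure means.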


Measuring reliability

These four metrics, the "four golden signals", tell you how your integrations are performing:

| Metric | What it measures | Warning sign |
|---|---|---|
| Error rate | Percentage of requests that fail | Sudden spike above baseline |
| Latency (p50, p95, p99) | How long requests take | p99 growing while p50 stays flat |
| Throughput | Requests per second | Sudden drop without a deploy |
| Saturation | How close resources are to capacity | Connection pool above 80% |

Focus on p95 and p99 latency, not averages. A mean latency of 200ms can hide the fact that 1% of your users are waiting 5 seconds.
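Percentiles are straightforward to compute from raw samples. A sketch using the nearest-rank method:

```typescript
// Nearest-rank percentile over raw latency samples (in ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

// 98 requests at 200ms and 2 at 5000ms: the mean looks fine, p99 does not.
const latencies = [...Array(98).fill(200), 5000, 5000];
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length; // 296ms
const p99 = percentile(latencies, 99); // 5000ms
```

In production you would rarely sort raw samples like this; metrics systems keep approximate histograms instead, but the interpretation of p50/p95/p99 is the same.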