Every integration you build is a bet that someone else's service will be available when you need it. The question is not whether your integrations will fail; they will. The question is what happens to your system when they do.
Why integrations are fragile
Between your fetch() and the response body, your request crosses DNS resolution, TLS handshakes, load balancers, reverse proxies, application servers, database connections, and the return trip. Any link in that chain can break.
The math works against you: if each service has 99.9% uptime (roughly 8.7 hours of downtime per year), five services chained together give you 99.5% uptime, roughly 43 hours of downtime per year. That is five times the downtime of any individual service.
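The compounding above is just multiplication: every dependency in the chain multiplies its availability into the total. A quick sketch:

```typescript
// Compounded availability: each dependency in the chain multiplies in.
function chainedUptime(perServiceUptime: number, serviceCount: number): number {
  return Math.pow(perServiceUptime, serviceCount);
}

const uptime = chainedUptime(0.999, 5);
// Convert the unavailable fraction into hours of downtime per year.
const downtimeHoursPerYear = (1 - uptime) * 365 * 24;

console.log(uptime.toFixed(5));              // about 0.995
console.log(downtimeHoursPerYear.toFixed(1)); // about 43.7 hours
```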
```typescript
// This innocent-looking function has 3 external failure points
async function processOrder(order: Order): Promise<OrderResult> {
  // Failure point 1: payment service
  const payment = await paymentService.charge(order.amount);

  // Failure point 2: inventory service
  await inventoryService.reserve(order.items);

  // Failure point 3: notification service
  await notificationService.sendConfirmation(order.email);

  return { status: 'completed', paymentId: payment.id };
}
```

If the notification service is down, should the entire order fail? The customer already paid and inventory was reserved. But without explicit handling, the unhandled rejection bubbles up and the order appears to fail.
The five failure modes
| Failure mode | Symptoms | Impact | Typical cause |
|---|---|---|---|
| Timeout | Request hangs, eventually errors after N seconds | Threads/connections held open, cascading slowdowns | Overloaded service, network congestion, large payloads |
| Connection refused | Immediate error, no response at all | Fast failure, relatively low impact if handled | Service is down, port not listening, firewall blocking |
| Slow response | Response arrives but takes 5-30x longer than normal | Resource exhaustion, user-facing latency spikes | Database under load, GC pauses, cold starts |
| Corrupt data | 200 OK but response body is malformed or incorrect | Silent data corruption, downstream bugs | Serialization bugs, version mismatches, partial responses |
| Partial failure | Some items in a batch succeed, others fail | Inconsistent state, hard to retry safely | One database row locked, one item out of stock |
Timeouts and connection refused are loud and obvious. Slow responses and corrupt data are dangerous because they can go undetected for hours.
Detecting each failure mode
```typescript
async function callWithDiagnostics(url: string): Promise<Response> {
  const start = Date.now();
  try {
    const response = await fetch(url, {
      signal: AbortSignal.timeout(5000), // Timeout after 5s
    });
    const duration = Date.now() - start;

    // Detect slow responses (even if they succeed)
    if (duration > 2000) {
      console.warn(`Slow response from ${url}: ${duration}ms`);
      metrics.recordSlowCall(url, duration);
    }

    // Detect corrupt data (status is OK but body is wrong)
    if (response.ok) {
      const contentType = response.headers.get('content-type');
      if (!contentType?.includes('application/json')) {
        console.error(`Unexpected content type from ${url}: ${contentType}`);
        throw new Error('Corrupt response: unexpected content type');
      }
    }

    return response;
  } catch (error) {
    const duration = Date.now() - start;
    if (error instanceof DOMException && error.name === 'TimeoutError') {
      // Timeout: request took too long
      metrics.recordTimeout(url, duration);
      throw new IntegrationError('TIMEOUT', url, duration);
    }
    if (error instanceof TypeError && error.message.includes('fetch failed')) {
      // Connection refused: service unreachable
      metrics.recordConnectionRefused(url);
      throw new IntegrationError('CONNECTION_REFUSED', url, duration);
    }
    throw error;
  }
}
```

Cascading failures
A cascading failure happens when one service fails and the failure propagates to every service that depends on it, and then to their dependents.
Here is a typical cascade:
- The database gets slow due to a long-running query
- The API service holds connections open waiting for the database
- The API connection pool fills up, and new requests start queuing
- The frontend gateway times out waiting for the API
- Users start retrying their requests, tripling the load
- The entire system grinds to a halt
The root cause was a slow query. The actual impact was a complete outage. Circuit breakers, timeouts, rate limiting, and bulkheads are all designed to stop step 2 from becoming step 6.
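A circuit breaker is the most direct way to break the chain: after repeated failures it stops sending traffic to the struggling service and fails fast instead of holding connections open. A minimal illustrative sketch (not production-ready; the thresholds are arbitrary):

```typescript
// Minimal circuit breaker: after `maxFailures` consecutive failures it
// "opens" and fails fast for `cooldownMs`, so callers stop piling requests
// onto a service that is already struggling.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.maxFailures) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('Circuit open: failing fast');
      }
      // Half-open: the cooldown expired, allow one probe request through.
      this.failures = this.maxFailures - 1;
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Failing fast is what prevents the connection-pool exhaustion in step 3: an open circuit rejects in microseconds instead of holding a connection for the full timeout.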
```
[Database slow] → [API holds connections] → [Connection pool full]
                                                      ↓
   [Users retry] ← [Frontend timeouts] ← [Requests queue up]
         ↓
[3x more load] → [Complete system failure]
```

Dependency mapping
A dependency map shows every service your application depends on, how critical each one is, and what happens when it fails.
Building your dependency map
```typescript
interface Dependency {
  name: string;
  type: 'critical' | 'degraded' | 'optional';
  timeout: number; // ms
  fallback: string; // what to do when it fails
  healthCheck: string; // URL to check
}

const dependencies: Dependency[] = [
  {
    name: 'Payment Service',
    type: 'critical', // Order cannot proceed without it
    timeout: 10000,
    fallback: 'reject order with retry prompt',
    healthCheck: 'https://payments.internal/health',
  },
  {
    name: 'Inventory Service',
    type: 'critical',
    timeout: 5000,
    fallback: 'reject order, show out-of-stock',
    healthCheck: 'https://inventory.internal/health',
  },
  {
    name: 'Notification Service',
    type: 'optional', // Order succeeds without it
    timeout: 3000,
    fallback: 'queue email for later delivery',
    healthCheck: 'https://notifications.internal/health',
  },
  {
    name: 'Recommendation Engine',
    type: 'degraded', // Show generic recs if down
    timeout: 2000,
    fallback: 'return top-selling products',
    healthCheck: 'https://recommendations.internal/health',
  },
];
```

| Dependency type | Meaning | Failure strategy |
|---|---|---|
| Critical | System cannot function without it | Retry with circuit breaker, fail the request if unrecoverable |
| Degraded | System works but with reduced functionality | Return cached or default data, hide the feature |
| Optional | System works fine without it | Fire and forget, queue for later, log and move on |
An optional dependency failure should never block a response. A degraded dependency should trigger a fallback, not an error. Only critical dependencies should be able to fail your request, and even then, only after retries and circuit breakers have had their chance.
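These three rules can be encoded in one thin wrapper. The sketch below is a hypothetical helper, not part of the dependency map above; `callDependency` and its parameters are illustrative names:

```typescript
type DependencyType = 'critical' | 'degraded' | 'optional';

// Route a failed call according to the dependency's type:
// critical rethrows, degraded falls back, optional logs and moves on.
async function callDependency<T>(
  type: DependencyType,
  call: () => Promise<T>,
  fallback: () => T,
): Promise<T | undefined> {
  try {
    return await call();
  } catch (err) {
    switch (type) {
      case 'critical':
        throw err;         // surface the failure (after retries/breaker)
      case 'degraded':
        return fallback(); // cached or default data
      case 'optional':
        console.warn('Optional dependency failed', err);
        return undefined;  // log and move on
    }
  }
}
```

Centralizing the decision keeps call sites honest: nobody can accidentally let a recommendation-engine hiccup fail a checkout.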
Measuring reliability
These four metrics, the "four golden signals", tell you how your integrations are performing:
| Metric | What it measures | Warning sign |
|---|---|---|
| Error rate | Percentage of requests that fail | Sudden spike above baseline |
| Latency (p50, p95, p99) | How long requests take | p99 growing while p50 stays flat |
| Throughput | Requests per second | Sudden drop without a deploy |
| Saturation | How close resources are to capacity | Connection pool above 80% |