Integration & APIs
AI pitfall
AI-generated API clients almost never read rate limit headers proactively. They wait for a 429 response and then back off, but by that point, you have already hit the limit. Good clients track X-RateLimit-Remaining and slow down before they reach 0.

Rate limiting is the bouncer at the door. It decides how many requests can get in per unit of time, and it turns the rest away. Without it, a single misbehaving client, or a bug in your own code, can overwhelm a service and take it down for everyone. Rate limiting is one of the simplest and most effective reliability patterns, and yet it is frequently implemented wrong or ignored entirely.

Consuming rate-limited APIs

Before you think about implementing rate limiting, you need to know how to behave as a consumer. Every major API (Stripe, GitHub, Twitter, AWS) has rate limits. If you ignore them, you will get throttled, banned, or charged overage fees.

Handling 429 responses

When you exceed a rate limit, the API returns HTTP 429 Too Many Requests. The correct response is to read the Retry-After header and wait.

async function fetchWithRateLimit(
  url: string,
  options: RequestInit = {},
  maxRetries = 3
): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, options);

    if (response.status === 429) {
      const retryAfter = response.headers.get('Retry-After');
      const waitMs = retryAfter
        ? parseInt(retryAfter, 10) * 1000   // Retry-After is in seconds
        : Math.pow(2, attempt) * 1000;       // fallback: exponential backoff

      console.warn(
        `Rate limited by ${url}. Waiting ${waitMs}ms (attempt ${attempt + 1}/${maxRetries + 1})`
      );

      await sleep(waitMs);
      continue;
    }

    return response;
  }

  throw new Error(`Rate limited after ${maxRetries} retries: ${url}`);
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
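One detail the snippet above glosses over: per RFC 9110, Retry-After may be either delta-seconds ("30") or an HTTP date ("Wed, 21 Oct 2025 07:28:00 GMT"). A small helper that handles both forms (a sketch, not tied to any particular client library):

```typescript
// Retry-After can be delta-seconds ("30") or an HTTP date.
// Handle both, and fall back to 0 if the value is unparseable.
function parseRetryAfterMs(value: string): number {
  const seconds = Number(value);
  if (!Number.isNaN(seconds)) return Math.max(0, seconds * 1000);

  const timestamp = Date.parse(value); // NaN if not a valid HTTP date
  return Number.isNaN(timestamp) ? 0 : Math.max(0, timestamp - Date.now());
}
```

Dropping this in place of the bare `parseInt(retryAfter, 10) * 1000` makes the retry loop robust against servers that send the date form.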

Reading rate limit headers

Most APIs tell you how much budget you have left before hitting the limit. Read these headers proactively instead of waiting for a 429.

Header                  Purpose                                     Example
X-RateLimit-Limit       Maximum requests allowed per window         100
X-RateLimit-Remaining   Requests left in the current window         23
X-RateLimit-Reset       Unix timestamp when the window resets       1710500000
Retry-After             Seconds to wait before retrying (on 429)    30

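These header values arrive as strings, and lookups are case-insensitive. A quick self-contained check using the example values from the table (Node 18+ ships a global Headers implementation):

```typescript
// Hypothetical headers, using the example values from the table above.
const headers = new Headers({
  'X-RateLimit-Limit': '100',
  'X-RateLimit-Remaining': '23',
  'X-RateLimit-Reset': '1710500000',
});

// Values are strings until explicitly parsed.
const remaining = parseInt(headers.get('X-RateLimit-Remaining') ?? '0', 10);
const resetAtMs = parseInt(headers.get('X-RateLimit-Reset') ?? '0', 10) * 1000; // Unix seconds to ms
```
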
// Proactive rate limit tracking
class RateLimitTracker {
  private remaining: number = Infinity;
  private resetAt: number = 0;

  updateFromResponse(response: Response): void {
    const remaining = response.headers.get('X-RateLimit-Remaining');
    const reset = response.headers.get('X-RateLimit-Reset');

    if (remaining !== null) this.remaining = parseInt(remaining, 10);
    if (reset !== null) this.resetAt = parseInt(reset, 10) * 1000; // header is Unix seconds
  }

  async waitIfNeeded(): Promise<void> {
    if (this.remaining > 5) return; // comfortable buffer

    const waitMs = this.resetAt - Date.now();
    if (waitMs > 0) {
      console.log(`Rate limit low (${this.remaining} left). Waiting ${waitMs}ms.`);
      await sleep(waitMs);
    }
  }
}
Good to know
The Retry-After header is the most important header in a 429 response. It tells you exactly how long to wait. AI-generated retry logic often ignores this header and uses its own backoff calculation, which can be either too aggressive (hammering the server) or too conservative (waiting 30 seconds when the server said 2).

Providing rate limits on your own APIs

When you build APIs that others consume, or even internal microservices, you need to protect them from overload. The three main algorithms each have distinct characteristics.

Algorithm comparison

Algorithm        How it works                                                 Pros                                         Cons
Fixed window     Count requests in fixed time intervals (e.g., per minute)    Simple to implement, low memory              Burst at window boundaries (up to 2x limit)
Sliding window   Count requests in a rolling time window                      No boundary burst problem, smooth limiting   Slightly more memory, more complex
Token bucket     Tokens added at a fixed rate; each request costs one token   Allows controlled bursts, smooth rate        Slightly harder to reason about limits

Token bucket implementation

The token bucket is the most versatile algorithm. Think of it as a bucket that fills with tokens at a steady rate. Each request takes one token. When the bucket is empty, requests are rejected. The bucket has a maximum capacity, allowing short bursts up to that capacity.

class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,     // bucket capacity
    private refillRate: number      // tokens per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  tryConsume(): boolean {
    this.refill();

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;     // request allowed
    }

    return false;       // request rejected (rate limited)
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    const newTokens = elapsed * this.refillRate;

    this.tokens = Math.min(this.maxTokens, this.tokens + newTokens);
    this.lastRefill = now;
  }
}

// Example: 100 requests per minute, burst up to 20
const bucket = new TokenBucket(20, 100 / 60);
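A sanity check on the refill arithmetic for that configuration: at 100 requests per minute the bucket refills at roughly 1.67 tokens per second, so a fully drained bucket regains its 20-token burst capacity after 12 seconds of idle time:

```typescript
// Refill arithmetic for the configuration above (values repeated here
// so the snippet stands alone).
const refillRate = 100 / 60;               // ≈ 1.67 tokens per second
const idleSeconds = 12;
const regained = idleSeconds * refillRate; // 20 tokens regained while idle
const tokens = Math.min(20, 0 + regained); // capped at the bucket capacity of 20
```
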

Sliding window implementation

class SlidingWindowLimiter {
  private timestamps: number[] = [];

  constructor(
    private maxRequests: number,
    private windowMs: number
  ) {}

  tryConsume(): boolean {
    const now = Date.now();
    const windowStart = now - this.windowMs;

    // Remove timestamps outside the window
    this.timestamps = this.timestamps.filter((t) => t > windowStart);

    if (this.timestamps.length < this.maxRequests) {
      this.timestamps.push(now);
      return true;
    }

    return false;
  }

  getRetryAfterMs(): number {
    if (this.timestamps.length === 0) return 0;
    const oldestInWindow = this.timestamps[0];
    return Math.max(0, oldestInWindow + this.windowMs - Date.now());
  }
}
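The retry hint in getRetryAfterMs is plain arithmetic on the oldest timestamp still inside the window. With assumed values (a 1000 ms window whose oldest request arrived 400 ms ago):

```typescript
// Assumed values: a 1000 ms window whose oldest request arrived 400 ms ago.
const windowMs = 1000;
const now = Date.now();
const oldestInWindow = now - 400;

// That oldest entry leaves the window 600 ms from now, freeing one slot.
const retryAfterMs = oldestInWindow + windowMs - now;
```
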

Express middleware

import express, { Request, Response, NextFunction } from 'express';

// One limiter per client (identified by API key or IP).
// Note: this Map grows without bound; evict idle entries in production.
const limiters = new Map<string, TokenBucket>();

function rateLimit(maxPerMinute: number, burstSize: number) {
  return (req: Request, res: Response, next: NextFunction) => {
    const clientId = (req.headers['x-api-key'] as string) || req.ip || 'unknown';

    if (!limiters.has(clientId)) {
      limiters.set(clientId, new TokenBucket(burstSize, maxPerMinute / 60));
    }

    const limiter = limiters.get(clientId)!;

    if (limiter.tryConsume()) {
      next();
    } else {
      res.set('Retry-After', '60'); // tell clients how long to wait
      res.status(429).json({
        error: 'Too Many Requests',
        retryAfter: 60,
      });
    }
  };
}

const app = express();

// Apply: 100 requests/min with burst of 20
app.use('/api', rateLimit(100, 20));
Edge case
Fixed window rate limiting has a burst problem at window boundaries. A client can send 100 requests at 11:59:59 and another 100 at 12:00:01: 200 requests in two seconds while technically staying within a "100 per minute" limit. Sliding window or token bucket algorithms prevent this.
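To make the boundary burst concrete, here is a minimal fixed-window counter (a sketch with illustrative parameters, not a production implementation; the clock is injectable for clarity):

```typescript
// A minimal fixed-window counter, to illustrate the boundary burst.
class FixedWindowLimiter {
  private currentWindow = 0;
  private count = 0;

  constructor(
    private maxRequests: number,
    private windowMs: number
  ) {}

  tryConsume(now: number = Date.now()): boolean {
    const windowIndex = Math.floor(now / this.windowMs);
    if (windowIndex !== this.currentWindow) {
      this.currentWindow = windowIndex; // new window: the counter resets
      this.count = 0;
    }
    if (this.count < this.maxRequests) {
      this.count++;
      return true;
    }
    return false;
  }
}

// With a 100-per-minute limit, 100 requests at t=59,999 ms land in one
// window and 100 more at t=60,001 ms land in the next: all 200 pass.
```
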

Backpressure

Rate limiting says "no" to excess requests. Backpressure says "slow down": it is a feedback mechanism that propagates load information upstream so that producers adjust their rate instead of being abruptly rejected.

Think of it like water flowing through pipes. Rate limiting is a valve that shuts off when pressure is too high. Backpressure is a pressure gauge that tells the source to reduce flow before the valve needs to close.

Backpressure patterns

Pattern                How it works                                       Use case
Queue with max depth   Reject new items when the queue is full            Message processing, job queues
Load shedding          Drop low-priority requests when overloaded         API gateways, real-time systems
Adaptive concurrency   Dynamically adjust the number of parallel calls    Service-to-service communication
Reactive streams       Consumer signals how many items it can handle      Data streaming, file processing
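The first pattern is the simplest to sketch: a queue that rejects new items once it reaches its maximum depth, so producers learn immediately that the consumer is behind. The names here are illustrative, not from any library:

```typescript
// Queue with max depth: enqueue fails fast when the queue is full,
// signaling the producer to slow down or shed the item.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private maxDepth: number) {}

  enqueue(item: T): boolean {
    if (this.items.length >= this.maxDepth) {
      return false; // full: backpressure signal to the producer
    }
    this.items.push(item);
    return true;
  }

  dequeue(): T | undefined {
    return this.items.shift();
  }

  get depth(): number {
    return this.items.length;
  }
}
```

The boolean return value is the backpressure channel: a producer that sees false can pause, retry later, or drop the item, instead of piling unbounded work into memory.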
// Adaptive concurrency: adjust parallelism based on response times
class AdaptiveConcurrency {
  private concurrency: number;
  private inFlight = 0;
  private latencies: number[] = [];

  constructor(
    private minConcurrency: number,
    private maxConcurrency: number,
    private targetLatencyMs: number
  ) {
    this.concurrency = minConcurrency;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    while (this.inFlight >= this.concurrency) {
      await sleep(10); // wait for a slot
    }

    this.inFlight++;
    const start = Date.now();

    try {
      const result = await fn();
      this.recordLatency(Date.now() - start);
      return result;
    } finally {
      this.inFlight--;
    }
  }

  private recordLatency(ms: number): void {
    this.latencies.push(ms);
    if (this.latencies.length < 10) return;

    const avgLatency =
      this.latencies.reduce((a, b) => a + b, 0) / this.latencies.length;
    this.latencies = [];

    if (avgLatency < this.targetLatencyMs && this.concurrency < this.maxConcurrency) {
      this.concurrency++;  // things are fast, allow more
    } else if (avgLatency > this.targetLatencyMs * 1.5 && this.concurrency > this.minConcurrency) {
      this.concurrency--;  // things are slow, back off
    }
  }
}
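The adjustment rule inside recordLatency, extracted on its own with illustrative sample values:

```typescript
// Illustrative sample: ten latencies against a 150 ms target.
const targetLatencyMs = 150;
const latencies = [80, 90, 85, 95, 100, 88, 92, 79, 86, 91];

const avgLatency = latencies.reduce((a, b) => a + b, 0) / latencies.length; // 88.6 ms
const shouldIncrease = avgLatency < targetLatencyMs;        // fast: add a slot
const shouldDecrease = avgLatency > targetLatencyMs * 1.5;  // slow: back off
```

The 1.5x band between "increase" and "decrease" is deliberate: without that dead zone, concurrency would oscillate every ten samples.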

The key insight with backpressure is that it transforms hard failures (429 errors, dropped connections) into graceful slowdowns. Your system stays functional under load instead of falling off a cliff.

Rate limiting and backpressure work together. Rate limiting is your hard boundary, the absolute maximum; backpressure keeps you comfortably below that boundary under normal operation. When backpressure fails, rate limiting catches the overflow.