Rate limiting

Rate limiting is the bouncer at the door. It decides how many requests can get in per unit of time, and it turns the rest away. Without it, a single misbehaving client, or a bug in your own code, can overwhelm a service and take it down for everyone. Rate limiting is one of the simplest and most effective reliability patterns, yet it is frequently implemented wrong or ignored entirely.
Consuming rate-limited APIs
Before you think about implementing rate limiting, you need to know how to behave as a consumer. Every major API (Stripe, GitHub, Twitter, AWS) has rate limits. If you ignore them, you will get throttled, banned, or charged overage fees.
Handling 429 responses
When you exceed a rate limit, the API returns HTTP 429 Too Many Requests. The correct response is to read the Retry-After header and wait.
async function fetchWithRateLimit(
url: string,
options: RequestInit = {},
maxRetries = 3
): Promise<Response> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
const response = await fetch(url, options);
if (response.status === 429) {
const retryAfter = response.headers.get('Retry-After');
      const retryAfterSec = retryAfter ? parseInt(retryAfter, 10) : NaN;
      // Retry-After is usually a number of seconds, but it may also be an
      // HTTP-date, which parseInt cannot handle; fall back to exponential
      // backoff when the value is missing or non-numeric
      const waitMs = Number.isFinite(retryAfterSec)
        ? retryAfterSec * 1000
        : Math.pow(2, attempt) * 1000;
console.warn(
`Rate limited by ${url}. Waiting ${waitMs}ms (attempt ${attempt + 1}/${maxRetries})`
);
await sleep(waitMs);
continue;
}
return response;
}
throw new Error(`Rate limited after ${maxRetries} retries: ${url}`);
}
function sleep(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}

Reading rate limit headers
Most APIs tell you how much budget you have left before hitting the limit. Read these headers proactively instead of waiting for a 429.
| Header | Purpose | Example |
|---|---|---|
| X-RateLimit-Limit | Maximum requests allowed per window | 100 |
| X-RateLimit-Remaining | Requests left in the current window | 23 |
| X-RateLimit-Reset | Unix timestamp when the window resets | 1710500000 |
| Retry-After | Seconds to wait before retrying (on 429) | 30 |
// Proactive rate limit tracking
class RateLimitTracker {
private remaining: number = Infinity;
private resetAt: number = 0;
updateFromResponse(response: Response): void {
const remaining = response.headers.get('X-RateLimit-Remaining');
const reset = response.headers.get('X-RateLimit-Reset');
if (remaining) this.remaining = parseInt(remaining, 10);
if (reset) this.resetAt = parseInt(reset, 10) * 1000;
}
async waitIfNeeded(): Promise<void> {
if (this.remaining > 5) return; // comfortable buffer
const waitMs = this.resetAt - Date.now();
if (waitMs > 0) {
console.log(`Rate limit low (${this.remaining} left). Waiting ${waitMs}ms.`);
await sleep(waitMs);
}
}
}

The Retry-After header is the most important header in a 429 response. It tells you exactly how long to wait. AI-generated retry logic often ignores this header and uses its own backoff calculation, which can be either too aggressive (hammering the server) or too conservative (waiting 30 seconds when the server said 2).

Providing rate limits on your own APIs
When you build APIs that others consume, or even internal microservices, you need to protect them from overload. The three main algorithms each have distinct characteristics.
Algorithm comparison
| Algorithm | How it works | Pros | Cons |
|---|---|---|---|
| Fixed window | Count requests in fixed time intervals (e.g., per minute) | Simple to implement, low memory | Burst at window boundaries (up to 2x limit) |
| Sliding window | Count requests in a rolling time window | No boundary burst problem, smooth limiting | Slightly more memory, more complex |
| Token bucket | Tokens added at a fixed rate; each request costs one token | Allows controlled bursts, smooth rate | Slightly harder to reason about limits |
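To make the fixed-window row concrete, here is a minimal sketch. The class name and the injectable `now` parameters are illustrative (passing timestamps explicitly just makes the behavior easy to demonstrate); in production you would default to the system clock. It also exposes the weakness the table mentions: a client can spend its whole budget at the end of one window and again at the start of the next, briefly doubling the effective rate.

```typescript
// Minimal fixed-window limiter: count requests per fixed interval.
class FixedWindowLimiter {
  private count = 0;
  private windowStart: number;

  constructor(
    private maxRequests: number,
    private windowMs: number,
    now: number = Date.now()
  ) {
    this.windowStart = now;
  }

  tryConsume(now: number = Date.now()): boolean {
    // Roll over to a fresh window once the current one has elapsed
    if (now - this.windowStart >= this.windowMs) {
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count < this.maxRequests) {
      this.count++;
      return true;
    }
    return false;
  }
}

// Boundary burst: with a "2 per second" limit, 2 requests at t=998-999ms and
// 2 more at t=1000-1001ms all succeed - 4 requests in about 3ms.
```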
Token bucket implementation
The token bucket is the most versatile algorithm. Think of it as a bucket that fills with tokens at a steady rate. Each request takes one token. When the bucket is empty, requests are rejected. The bucket has a maximum capacity, allowing short bursts up to that capacity.
class TokenBucket {
private tokens: number;
private lastRefill: number;
constructor(
private maxTokens: number, // bucket capacity
private refillRate: number // tokens per second
) {
this.tokens = maxTokens;
this.lastRefill = Date.now();
}
tryConsume(): boolean {
this.refill();
if (this.tokens >= 1) {
this.tokens -= 1;
return true; // request allowed
}
return false; // request rejected (rate limited)
}
private refill(): void {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
const newTokens = elapsed * this.refillRate;
this.tokens = Math.min(this.maxTokens, this.tokens + newTokens);
this.lastRefill = now;
}
}
// Example: 100 requests per minute, burst up to 20
const bucket = new TokenBucket(20, 100 / 60);

Sliding window implementation
class SlidingWindowLimiter {
private timestamps: number[] = [];
constructor(
private maxRequests: number,
private windowMs: number
) {}
tryConsume(): boolean {
const now = Date.now();
const windowStart = now - this.windowMs;
// Remove timestamps outside the window
this.timestamps = this.timestamps.filter((t) => t > windowStart);
if (this.timestamps.length < this.maxRequests) {
this.timestamps.push(now);
return true;
}
return false;
}
getRetryAfterMs(): number {
if (this.timestamps.length === 0) return 0;
const oldestInWindow = this.timestamps[0];
return oldestInWindow + this.windowMs - Date.now();
}
}

Express middleware
import express, { Request, Response, NextFunction } from 'express';

const app = express();
// One limiter per client (identified by API key or IP).
// Note: in production, evict idle entries so this Map does not grow unboundedly.
const limiters = new Map<string, TokenBucket>();
function rateLimit(maxPerMinute: number, burstSize: number) {
return (req: Request, res: Response, next: NextFunction) => {
    const clientId = (req.headers['x-api-key'] as string) || req.ip || 'unknown';
if (!limiters.has(clientId)) {
limiters.set(clientId, new TokenBucket(burstSize, maxPerMinute / 60));
}
const limiter = limiters.get(clientId)!;
if (limiter.tryConsume()) {
next();
} else {
      res
        .status(429)
        .set('Retry-After', '60') // send the header, not just the JSON body
        .json({
          error: 'Too Many Requests',
          retryAfter: 60,
        });
}
};
}
// Apply: 100 requests/min with burst of 20
app.use('/api', rateLimit(100, 20));

Backpressure
Rate limiting says "no" to excess requests. Backpressure says "slow down": it is a feedback mechanism that propagates load information upstream so that producers adjust their rate instead of being abruptly rejected.
Think of it like water flowing through pipes. Rate limiting is a valve that shuts off when pressure is too high. Backpressure is a pressure gauge that tells the source to reduce flow before the valve needs to close.
Backpressure patterns
| Pattern | How it works | Use case |
|---|---|---|
| Queue with max depth | Reject new items when the queue is full | Message processing, job queues |
| Load shedding | Drop low-priority requests when overloaded | API gateways, real-time systems |
| Adaptive concurrency | Dynamically adjust the number of parallel calls | Service-to-service communication |
| Reactive streams | Consumer signals how many items it can handle | Data streaming, file processing |
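The first pattern in the table, a queue with max depth, is the simplest to sketch. The point is that enqueue returns a signal the producer can act on immediately, instead of the queue growing without bound. The class and method names here are illustrative, not from any library:

```typescript
// Bounded queue: rejecting enqueues at max depth is backpressure in its
// crudest form - the producer learns right away that the consumer is behind.
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private maxDepth: number) {}

  // false means "full": the producer's cue to slow down, buffer, or drop
  enqueue(item: T): boolean {
    if (this.items.length >= this.maxDepth) return false;
    this.items.push(item);
    return true;
  }

  dequeue(): T | undefined {
    return this.items.shift();
  }

  get depth(): number {
    return this.items.length;
  }
}
```

Real job-queue systems expose the same contract under different names; what matters is that producers see fullness synchronously rather than discovering it later as dropped work.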
// Adaptive concurrency: adjust parallelism based on response times
class AdaptiveConcurrency {
private concurrency: number;
private inFlight = 0;
private latencies: number[] = [];
constructor(
private minConcurrency: number,
private maxConcurrency: number,
private targetLatencyMs: number
) {
this.concurrency = minConcurrency;
}
async execute<T>(fn: () => Promise<T>): Promise<T> {
while (this.inFlight >= this.concurrency) {
await sleep(10); // wait for a slot
}
this.inFlight++;
const start = Date.now();
try {
const result = await fn();
this.recordLatency(Date.now() - start);
return result;
} finally {
this.inFlight--;
}
}
private recordLatency(ms: number): void {
this.latencies.push(ms);
if (this.latencies.length < 10) return;
const avgLatency =
this.latencies.reduce((a, b) => a + b, 0) / this.latencies.length;
this.latencies = [];
if (avgLatency < this.targetLatencyMs && this.concurrency < this.maxConcurrency) {
this.concurrency++; // things are fast, allow more
} else if (avgLatency > this.targetLatencyMs * 1.5 && this.concurrency > this.minConcurrency) {
this.concurrency--; // things are slow, back off
}
}
}

The key insight with backpressure is that it transforms hard failures (429 errors, dropped connections) into graceful slowdowns. Your system stays functional under load instead of falling off a cliff.