Messages fail. A payment gateway is down, an API returns unexpected data, a bug in your consumer crashes on a specific input. When retries are exhausted, the message needs to go somewhere. That somewhere is the dead letter queue (DLQ).
Without a DLQ, failed messages either disappear forever (data loss) or retry infinitely (wasting resources and blocking the queue). A DLQ is the safety net that catches messages your system cannot handle right now, so you can investigate and fix the problem later.
Poison pills
A poison pill is a message that will always fail, no matter how many times you retry it. Maybe the message is malformed. Maybe it references a deleted record. Maybe your code has a bug that crashes on a specific data pattern.
// This message will ALWAYS fail - it's a poison pill
const poisonPill = {
  orderId: 'ORD-999',
  items: [{ productId: null, quantity: -3 }], // Invalid data
  userId: undefined // Missing required field
};

// Consumer crashes every time it tries to process this
queue.subscribe('orders', async (message) => {
  const { orderId, items, userId } = message.data;
  // The lookup returns null because userId was undefined
  const user = await db.users.findById(userId);
  // TypeError: Cannot read properties of null (reading 'email')
  const email = user.email; // CRASH - user is null
  // This code never runs. Message gets redelivered. Crashes again.
});

Without a retry limit and a DLQ, this message blocks processing forever. The consumer picks it up, crashes, the queue redelivers it, and the consumer crashes again. Other messages pile up behind it.
The fix: Set a maximum retry count. After N failures, move the message to the DLQ.
// Consumer with retry limit
queue.subscribe('orders', {
  maxRetries: 3,
  onMaxRetries: 'deadLetterQueue' // Send to DLQ after 3 failures
}, async (message) => {
  try {
    await processOrder(message.data);
    message.ack();
  } catch (err) {
    console.error(`Attempt ${message.retryCount + 1} failed:`, err.message);
    message.nack(); // Will be retried up to 3 times, then DLQ
  }
});

Retry topology
A well-designed retry system has three layers: the main queue, one or more retry queues with increasing delays, and the DLQ as the final destination.
Main Queue
|
| (processing fails)
v
Retry Queue (delay: 5 seconds)
|
| (fails again)
v
Retry Queue (delay: 30 seconds)
|
| (fails again)
v
Retry Queue (delay: 5 minutes)
|
| (fails again)
v
Dead Letter Queue (manual investigation required)

Here is one way to implement this with exponential backoff. For brevity, it republishes to the main queue with a growing delay instead of maintaining separate retry queues:
// Retry topology with exponential backoff
class RetryableConsumer {
  constructor(mainQueue, dlq, options = {}) {
    this.mainQueue = mainQueue;
    this.dlq = dlq;
    // Use ?? so an explicit 0 is respected instead of falling back to the default
    this.maxRetries = options.maxRetries ?? 5;
    this.baseDelay = options.baseDelay ?? 1000; // 1 second
  }

  async processWithRetry(message, handler) {
    // First delivery has no header; normalize in case header values arrive as strings
    const retryCount = Number(message.headers?.['x-retry-count'] ?? 0);
    try {
      await handler(message.data);
      message.ack();
    } catch (err) {
      if (retryCount >= this.maxRetries) {
        // Max retries exceeded: send to DLQ
        console.error(
          `Message ${message.id} failed after ${retryCount} retries, sending to DLQ`
        );
        await this.dlq.publish({
          originalMessage: message.data,
          error: err.message,
          failedAt: new Date().toISOString(),
          retryCount,
          queue: this.mainQueue.name
        });
        message.ack(); // Remove from main queue
        return;
      }

      // Calculate delay with exponential backoff + jitter
      const delay = this.baseDelay * Math.pow(2, retryCount);
      const jitter = Math.random() * delay * 0.1; // Up to 10% jitter
      const totalDelay = Math.round(delay + jitter);
      console.log(
        `Retry ${retryCount + 1}/${this.maxRetries} in ${totalDelay}ms`
      );

      // Republish with incremented retry count and delay
      await this.mainQueue.publish(message.data, {
        headers: { 'x-retry-count': retryCount + 1 },
        delay: totalDelay
      });
      message.ack(); // Remove the current copy
    }
  }
}

// Usage
const consumer = new RetryableConsumer(ordersQueue, ordersDLQ, {
  maxRetries: 5,
  baseDelay: 2000 // 2s, 4s, 8s, 16s, 32s then DLQ
});

ordersQueue.subscribe(async (message) => {
  await consumer.processWithRetry(message, processOrder);
});

Why exponential backoff matters: If a downstream service is overwhelmed, retrying immediately makes things worse. Exponential backoff gives the failing service time to recover. The jitter prevents a thundering herd where all retries happen at exactly the same moment.
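To make the delays concrete, here is a small sketch that computes the schedule a consumer like `RetryableConsumer` would follow. The helper names `backoffDelay` and `backoffSchedule` are hypothetical, and the random source is passed in as a parameter (an assumption made here so the result can be reproduced).

```javascript
// Hypothetical helper: delay before retry number `retryCount` (0-based),
// using the same formula as RetryableConsumer: base * 2^n plus up to 10% jitter.
function backoffDelay(retryCount, baseDelay = 1000, random = Math.random) {
  const delay = baseDelay * Math.pow(2, retryCount);
  const jitter = random() * delay * 0.1;
  return Math.round(delay + jitter);
}

// Hypothetical helper: every delay a message waits through before it reaches the DLQ.
function backoffSchedule(maxRetries, baseDelay, random = Math.random) {
  const delays = [];
  for (let n = 0; n < maxRetries; n++) {
    delays.push(backoffDelay(n, baseDelay, random));
  }
  return delays;
}

// With jitter disabled, the schedule is exactly 2s, 4s, 8s, 16s, 32s.
console.log(backoffSchedule(5, 2000, () => 0));
// [ 2000, 4000, 8000, 16000, 32000 ]

// With real jitter, each delay lands between 100% and 110% of its base value.
console.log(backoffSchedule(5, 2000));
```

Because each consumer draws its own jitter, two consumers that fail at the same instant will most likely retry at slightly different times, which is what spreads the load on a recovering service.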
DLQ message structure
When a message lands in the DLQ, you need enough context to understand what went wrong and how to fix it.
// What a DLQ message should contain
const dlqMessage = {
  // The original message, untouched
  originalMessage: {
    orderId: 'ORD-456',
    items: [{ productId: 'PROD-789', quantity: 2 }],
    userId: 'USER-123'
  },
  // Metadata about the failure
  metadata: {
    sourceQueue: 'orders.process',
    firstFailedAt: '2025-03-15T10:30:00Z',
    lastFailedAt: '2025-03-15T10:35:12Z',
    retryCount: 5,
    lastError: 'PaymentGatewayTimeout: Connection timed out after 30s',
    errorStack: 'Error: PaymentGatewayTimeout\n at charge (payment.js:42)',
    consumerVersion: '2.3.1',
    environment: 'production'
  }
};

This context is invaluable when you are debugging at 2 AM. You can see exactly what the message contained, which error occurred, how many times it was retried, and which version of your code processed it.
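One way to produce this structure consistently is to build it in a single helper at the point of failure. The sketch below is only an illustration: `buildDlqMessage`, the `x-first-failed-at` header, and the fields of `context` are hypothetical names, not part of any particular broker's API.

```javascript
// Hypothetical helper that wraps a failed message in a DLQ envelope.
// `message` is assumed to expose `data` and `headers`; `context` carries
// deployment details the consumer already knows about itself.
function buildDlqMessage(message, err, context, now = new Date()) {
  const headers = message.headers || {};
  const failedAt = now.toISOString();
  return {
    // Keep the payload untouched so it can be replayed as-is
    originalMessage: message.data,
    metadata: {
      sourceQueue: context.sourceQueue,
      // Assumed header: set on the first failure and carried through retries
      firstFailedAt: headers['x-first-failed-at'] || failedAt,
      lastFailedAt: failedAt,
      retryCount: Number(headers['x-retry-count'] || 0),
      lastError: `${err.name}: ${err.message}`,
      errorStack: err.stack,
      consumerVersion: context.consumerVersion,
      environment: context.environment
    }
  };
}

// Example usage with made-up values
const failed = {
  data: { orderId: 'ORD-456', userId: 'USER-123' },
  headers: { 'x-retry-count': 5, 'x-first-failed-at': '2025-03-15T10:30:00.000Z' }
};
const envelope = buildDlqMessage(
  failed,
  new Error('Connection timed out after 30s'),
  { sourceQueue: 'orders.process', consumerVersion: '2.3.1', environment: 'production' }
);
console.log(envelope.metadata.retryCount, envelope.metadata.lastError);
// 5 Error: Connection timed out after 30s
```

Building the envelope in one place means every consumer records the same fields, so DLQ tooling can filter on them without special cases.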
Manual vs. automatic replay
Once you have fixed the issue that caused messages to fail, you need to reprocess them. There are two approaches.
Manual replay: An operator reviews DLQ messages, selects which ones to reprocess, and triggers replay. Safer but slower.
// Manual replay tool

// Small helper used to pace bulk replays
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

class DLQReplayTool {
  constructor(dlq, originalQueue) {
    this.dlq = dlq; // Queue holding the failed messages
    this.originalQueue = originalQueue; // Queue to republish into
  }

  async listMessages(filters = {}) {
    const messages = await this.dlq.peek(100); // Look without consuming
    return messages.filter(msg => {
      if (filters.sourceQueue && msg.metadata.sourceQueue !== filters.sourceQueue) {
        return false;
      }
      if (filters.errorType && !msg.metadata.lastError.includes(filters.errorType)) {
        return false;
      }
      return true;
    });
  }

  async replayOne(messageId) {
    const msg = await this.dlq.getMessage(messageId);
    // Publish first, delete second: a crash in between duplicates rather than loses
    await this.originalQueue.publish(msg.originalMessage);
    await this.dlq.delete(messageId);
    console.log(`Replayed message ${messageId}`);
  }

  async replayAll(filter) {
    const messages = await this.listMessages(filter);
    console.log(`Replaying ${messages.length} messages...`);
    for (const msg of messages) {
      await this.replayOne(msg.id);
      await sleep(100); // Don't overwhelm the main queue
    }
  }
}

Automatic replay: After a configured cool-down period, DLQ messages are automatically pushed back to the main queue. Faster but risky: if the underlying issue is not fixed, messages bounce between the main queue and the DLQ forever.
| Approach | Speed | Safety | Best for |
|---|---|---|---|
| Manual replay | Slow (human in the loop) | High (human verifies fix) | Payment processing, critical data |
| Automatic replay | Fast (no human needed) | Medium (may create retry loops) | Non-critical tasks, known transient failures |
| Automatic with circuit breaker | Medium | High (stops if failures continue) | Best of both worlds |
The circuit breaker approach is a good compromise: automatically replay, but stop if the replayed messages keep failing. This catches transient issues automatically while protecting against persistent bugs.
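A minimal sketch of that idea follows. It assumes a hypothetical async `replayFn` that republishes one DLQ message and rejects if the message fails again; the function name `replayWithCircuitBreaker` and the `maxConsecutiveFailures` option are also illustrative. It covers only the "stop when failures continue" part, not the cool-down and half-open probing that a fuller circuit breaker would add.

```javascript
// Automatic replay guarded by a simple circuit breaker (illustrative sketch).
// `replayFn` is a hypothetical async function that replays one message and
// rejects when that message fails again.
async function replayWithCircuitBreaker(messages, replayFn, options = {}) {
  const maxConsecutiveFailures = options.maxConsecutiveFailures ?? 3;
  const result = { replayed: 0, failed: 0, skipped: 0, tripped: false };
  let consecutiveFailures = 0;

  for (let i = 0; i < messages.length; i++) {
    try {
      await replayFn(messages[i]);
      result.replayed++;
      consecutiveFailures = 0; // A success resets the breaker
    } catch (err) {
      result.failed++;
      consecutiveFailures++;
      if (consecutiveFailures >= maxConsecutiveFailures) {
        // Breaker opens: leave the remaining messages in the DLQ for a human
        result.tripped = true;
        result.skipped = messages.length - (i + 1);
        break;
      }
    }
  }
  return result;
}

// Example: every replay fails, so the breaker opens after 3 attempts
const stillBroken = async () => { throw new Error('PaymentGatewayTimeout'); };
replayWithCircuitBreaker([1, 2, 3, 4, 5], stillBroken).then(console.log);
// { replayed: 0, failed: 3, skipped: 2, tripped: true }
```

Counting consecutive failures rather than total failures is a design choice: a handful of genuinely bad messages scattered through the batch will not stop the replay, but an unfixed root cause will stop it quickly.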
Monitoring and alerting
A DLQ that nobody watches is useless. Here is what to monitor:
// DLQ monitoring configuration
const dlqAlerts = {
  // Alert immediately: any message in a critical DLQ
  critical: {
    queues: ['payments-dlq', 'orders-dlq'],
    threshold: 1,
    action: 'page-oncall',
    message: 'Critical DLQ has messages. Data is not being processed.'
  },
  // Alert after 10 messages: non-critical DLQ growing
  warning: {
    queues: ['emails-dlq', 'analytics-dlq'],
    threshold: 10,
    action: 'slack-channel',
    message: 'DLQ depth growing. Check consumer health.'
  },
  // Alert on rate: sudden spike in DLQ messages
  rateAlert: {
    window: '5m',
    threshold: 50, // More than 50 DLQ messages in 5 minutes
    action: 'page-oncall',
    message: 'DLQ spike detected. Possible upstream issue or deployment bug.'
  }
};

| Metric | What it tells you | Alert threshold |
|---|---|---|
| DLQ depth | How many unprocessed failures exist | > 0 for critical queues |
| DLQ growth rate | How fast failures are accumulating | Sudden spikes |
| Oldest message age | How long the oldest failure has been waiting | > 1 hour for critical |
| Replay success rate | Whether replayed messages are succeeding | < 90% means the fix did not work |
| Error type distribution | What kinds of failures are happening | New error types |
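
Several of these metrics can be derived from a snapshot of the DLQ plus a count of replay outcomes. The sketch below is a rough illustration under assumptions: `dlqHealth` and `criticalAlerts` are hypothetical names, messages are assumed to carry `metadata.firstFailedAt` as an ISO timestamp, and the thresholds are the critical-queue values from the table.

```javascript
// Hypothetical snapshot-based health check for one DLQ.
// `messages` is assumed to be an array of DLQ envelopes whose
// metadata.firstFailedAt is an ISO timestamp; `replayStats` counts replay outcomes.
function dlqHealth(messages, replayStats, now = Date.now()) {
  const depth = messages.length;
  const oldestAgeMs = messages.reduce(
    (max, msg) => Math.max(max, now - Date.parse(msg.metadata.firstFailedAt)),
    0
  );
  const attempts = replayStats.succeeded + replayStats.failed;
  // No replays yet means there is no success rate to judge
  const replaySuccessRate = attempts === 0 ? null : replayStats.succeeded / attempts;
  return { depth, oldestAgeMs, replaySuccessRate };
}

// Apply the critical-queue thresholds: depth > 0, age > 1 hour, success rate < 90%
function criticalAlerts(health) {
  const alerts = [];
  if (health.depth > 0) alerts.push('depth');
  if (health.oldestAgeMs > 60 * 60 * 1000) alerts.push('oldest-message-age');
  if (health.replaySuccessRate !== null && health.replaySuccessRate < 0.9) {
    alerts.push('replay-success-rate');
  }
  return alerts;
}

const now = Date.parse('2025-03-15T12:00:00Z');
const health = dlqHealth(
  [{ metadata: { firstFailedAt: '2025-03-15T10:30:00Z' } }],
  { succeeded: 8, failed: 2 },
  now
);
console.log(criticalAlerts(health));
// [ 'depth', 'oldest-message-age', 'replay-success-rate' ]
```

Growth rate and error type distribution need history across snapshots, so they are left out of this single-snapshot sketch.
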
DLQ strategies by system criticality:
| System type | Retry count | Retry delay | DLQ review | Replay method |
|---|---|---|---|---|
| Payment processing | 5 | Exponential (2s-60s) | Immediate (page) | Manual only |
| Order fulfillment | 3 | Exponential (5s-5min) | Within 1 hour | Manual or auto with circuit breaker |
| Email notifications | 3 | Fixed (30s) | Daily review | Automatic replay |
| Analytics ingestion | 1 | None | Weekly review | Automatic replay or discard |
| Audit logging | 10 | Exponential (1s-10min) | Within 4 hours | Manual only (compliance) |
The general principle: the more critical the data, the more retries you allow, the faster you investigate DLQ messages, and the more carefully you replay them.
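
The strategy table can also be kept as configuration next to the consumers. The sketch below is one possible encoding, not a prescribed format: the field names and the `nextDelay` helper are hypothetical, each delay range is read as "start at the lower bound and cap at the upper bound", and doubling between retries is an assumption.

```javascript
// The strategy table expressed as configuration. Field names are hypothetical.
const retryPolicies = {
  payments:  { maxRetries: 5,  backoff: 'exponential', minDelayMs: 2000,  maxDelayMs: 60000,  replay: 'manual' },
  orders:    { maxRetries: 3,  backoff: 'exponential', minDelayMs: 5000,  maxDelayMs: 300000, replay: 'manual-or-auto-with-breaker' },
  emails:    { maxRetries: 3,  backoff: 'fixed',       minDelayMs: 30000, maxDelayMs: 30000,  replay: 'automatic' },
  analytics: { maxRetries: 1,  backoff: 'none',        minDelayMs: 0,     maxDelayMs: 0,      replay: 'automatic-or-discard' },
  audit:     { maxRetries: 10, backoff: 'exponential', minDelayMs: 1000,  maxDelayMs: 600000, replay: 'manual' }
};

// Delay before retry `retryCount` (0-based), or null when retries are exhausted.
function nextDelay(policy, retryCount) {
  if (retryCount >= policy.maxRetries) return null; // Send to the DLQ
  if (policy.backoff !== 'exponential') return policy.minDelayMs;
  // Double from the minimum, capped at the policy's maximum
  return Math.min(policy.minDelayMs * 2 ** retryCount, policy.maxDelayMs);
}

console.log([0, 1, 2, 3, 4, 5].map((n) => nextDelay(retryPolicies.payments, n)));
// [ 2000, 4000, 8000, 16000, 32000, null ]
```

A `null` result is the signal to stop retrying and hand the message to the DLQ, where the policy's `replay` value decides what happens next.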