Messages fail. A payment gateway is down, an API returns unexpected data, a bug in your consumer crashes on a specific input. When retries are exhausted, the message needs to go somewhere. That somewhere is the dead letter queue (DLQ).

Without a DLQ, failed messages either disappear forever (data loss) or retry infinitely (wasting resources and blocking the queue). A DLQ is the safety net that catches messages your system cannot handle right now, so you can investigate and fix the problem later.

Poison pills

A poison pill is a message that will always fail, no matter how many times you retry it. Maybe the message is malformed. Maybe it references a deleted record. Maybe your code has a bug that crashes on a specific data pattern.

// This message will ALWAYS fail - it's a poison pill
const poisonPill = {
  orderId: 'ORD-999',
  items: [{ productId: null, quantity: -3 }], // Invalid data
  userId: undefined // Missing required field
};

// Consumer crashes every time it tries to process this
queue.subscribe('orders', async (message) => {
  const { orderId, items, userId } = message.data;

  // userId is undefined, so the lookup returns null
  const user = await db.users.findById(userId);
  const email = user.email; // CRASH - TypeError: Cannot read properties of null (reading 'email')

  // This code never runs. Message gets redelivered. Crashes again.
});

Without a retry limit and a DLQ, this message blocks processing forever. The consumer picks it up, crashes, the queue redelivers it, and the consumer crashes again. Other messages pile up behind it.

The fix: Set a maximum retry count. After N failures, move the message to the DLQ.

// Consumer with retry limit
queue.subscribe('orders', {
  maxRetries: 3,
  onMaxRetries: 'deadLetterQueue' // Send to DLQ after 3 failures
}, async (message) => {
  try {
    await processOrder(message.data);
    message.ack();
  } catch (err) {
    console.error(`Attempt ${message.retryCount + 1} failed:`, err.message);
    message.nack(); // Will be retried up to 3 times, then DLQ
  }
});
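
A poison pill gains nothing from being retried, so one option is to classify the failure before spending any retries on it. The following is a minimal sketch of that idea, not part of any specific queue library: `PermanentError`, `validateOrder`, and `routeFailure` are hypothetical names introduced here for illustration.

```javascript
// Hypothetical error type marking failures that retrying cannot fix.
class PermanentError extends Error {
  constructor(message) {
    super(message);
    this.name = 'PermanentError';
  }
}

// Hypothetical validator: throws PermanentError for structurally invalid orders.
function validateOrder(order) {
  if (!order || typeof order.userId !== 'string' || order.userId === '') {
    throw new PermanentError('userId is missing');
  }
  if (!Array.isArray(order.items) || order.items.length === 0) {
    throw new PermanentError('order has no items');
  }
  for (const item of order.items) {
    if (!item.productId || !Number.isInteger(item.quantity) || item.quantity <= 0) {
      throw new PermanentError('item has an invalid productId or quantity');
    }
  }
  return order;
}

// Decide where a failed message goes next: 'retry' or 'dlq'.
function routeFailure(err, retryCount, maxRetries) {
  if (err instanceof PermanentError) return 'dlq'; // retrying cannot help
  return retryCount < maxRetries ? 'retry' : 'dlq';
}

// The poison pill from earlier is dead-lettered on the first attempt.
try {
  validateOrder({
    orderId: 'ORD-999',
    items: [{ productId: null, quantity: -3 }],
    userId: undefined
  });
} catch (err) {
  console.log(routeFailure(err, 0, 3)); // 'dlq'
}
```

Transient failures such as timeouts still go through the normal retry path; only errors explicitly marked as permanent skip it.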

Retry topology

A well-designed retry system has three layers: the main queue, one or more retry queues with increasing delays, and the DLQ as the final destination.

Main Queue
   |
   | (processing fails)
   v
Retry Queue (delay: 5 seconds)
   |
   | (fails again)
   v
Retry Queue (delay: 30 seconds)
   |
   | (fails again)
   v
Retry Queue (delay: 5 minutes)
   |
   | (fails again)
   v
Dead Letter Queue (manual investigation required)

Here is one way to implement this with exponential backoff. Instead of separate retry queues, this version republishes the message to the main queue with an increasing delay:

// Retry topology with exponential backoff
class RetryableConsumer {
  constructor(mainQueue, dlq, options = {}) {
    this.mainQueue = mainQueue;
    this.dlq = dlq;
    this.maxRetries = options.maxRetries || 5;
    this.baseDelay = options.baseDelay || 1000; // 1 second
  }

  async processWithRetry(message, handler) {
    // Headers may be absent on first delivery or arrive as strings
    const retryCount = Number(message.headers?.['x-retry-count'] ?? 0);

    try {
      await handler(message.data);
      message.ack();
    } catch (err) {
      if (retryCount >= this.maxRetries) {
        // Max retries exceeded: send to DLQ
        console.error(
          `Message ${message.id} failed after ${retryCount} retries, sending to DLQ`
        );
        await this.dlq.publish({
          originalMessage: message.data,
          error: err.message,
          failedAt: new Date().toISOString(),
          retryCount,
          queue: this.mainQueue.name
        });
        message.ack(); // Remove from main queue
        return;
      }

      // Calculate delay with exponential backoff + jitter
      const delay = this.baseDelay * Math.pow(2, retryCount);
      const jitter = Math.random() * delay * 0.1; // 10% jitter
      const totalDelay = Math.round(delay + jitter); // whole milliseconds

      console.log(
        `Retry ${retryCount + 1}/${this.maxRetries} in ${totalDelay}ms`
      );

      // Republish with incremented retry count and delay
      await this.mainQueue.publish(message.data, {
        headers: { 'x-retry-count': retryCount + 1 },
        delay: totalDelay
      });
      message.ack(); // Remove the current copy
    }
  }
}

// Usage
const consumer = new RetryableConsumer(ordersQueue, ordersDLQ, {
  maxRetries: 5,
  baseDelay: 2000 // 2s, 4s, 8s, 16s, 32s then DLQ
});

ordersQueue.subscribe(async (message) => {
  await consumer.processWithRetry(message, processOrder);
});

Why exponential backoff matters: If a downstream service is overwhelmed, retrying immediately makes things worse. Exponential backoff gives the failing service time to recover. The jitter prevents a thundering herd where all retries happen at exactly the same moment.
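
The delay calculation can also be pulled out into a small pure function, which makes the schedule easy to inspect and test. This is a sketch under stated assumptions: the `maxDelay` cap and the injectable `random` parameter are additions for illustration and are not part of the `RetryableConsumer` class above.

```javascript
// Delay in milliseconds before retry number `attempt` (0-based).
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(attempt, options = {}) {
  const {
    baseDelay = 1000,
    maxDelay = 60000, // hypothetical cap so delays do not grow without bound
    jitterRatio = 0.1, // up to 10% extra, matching the class above
    random = Math.random
  } = options;

  const exponential = baseDelay * 2 ** attempt;
  const capped = Math.min(exponential, maxDelay);
  const jitter = random() * capped * jitterRatio;
  return Math.round(capped + jitter);
}

// Schedule for a 2-second base delay with jitter switched off:
const schedule = [0, 1, 2, 3, 4].map((attempt) =>
  backoffDelay(attempt, { baseDelay: 2000, random: () => 0 })
);
console.log(schedule); // [ 2000, 4000, 8000, 16000, 32000 ]
```

Without a cap, a high retry count can produce delays of hours; whether that is acceptable depends on how long a message may reasonably wait.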


DLQ message structure

When a message lands in the DLQ, you need enough context to understand what went wrong and how to fix it.

// What a DLQ message should contain
const dlqMessage = {
  // The original message, untouched
  originalMessage: {
    orderId: 'ORD-456',
    items: [{ productId: 'PROD-789', quantity: 2 }],
    userId: 'USER-123'
  },

  // Metadata about the failure
  metadata: {
    sourceQueue: 'orders.process',
    firstFailedAt: '2025-03-15T10:30:00Z',
    lastFailedAt: '2025-03-15T10:35:12Z',
    retryCount: 5,
    lastError: 'PaymentGatewayTimeout: Connection timed out after 30s',
    errorStack: 'Error: PaymentGatewayTimeout\n    at charge (payment.js:42)',
    consumerVersion: '2.3.1',
    environment: 'production'
  }
};

This context is invaluable when you are debugging at 2 AM. You can see exactly what the message contained, which error occurred, how many times it was retried, and which version of your code processed it.
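
One way to make sure every DLQ message carries this context is to build the envelope in a single helper. The sketch below assumes a message object with `data`, `retryCount`, and an optional `firstFailedAt` field, plus a `context` object supplied by the consumer; these shapes are hypothetical and will differ between brokers.

```javascript
// Build a DLQ envelope from a failed message and the error from its last attempt.
// `message.firstFailedAt` and the `context` fields are hypothetical inputs;
// adapt them to whatever your broker and deployment expose.
function buildDlqEnvelope(message, err, context, now = new Date()) {
  const timestamp = now.toISOString();
  return {
    originalMessage: message.data, // keep the payload untouched for replay
    metadata: {
      sourceQueue: context.sourceQueue,
      firstFailedAt: message.firstFailedAt || timestamp,
      lastFailedAt: timestamp,
      retryCount: message.retryCount || 0,
      lastError: `${err.name}: ${err.message}`,
      errorStack: err.stack || '',
      consumerVersion: context.consumerVersion,
      environment: context.environment
    }
  };
}

const envelope = buildDlqEnvelope(
  { data: { orderId: 'ORD-456' }, retryCount: 5 },
  new Error('Connection timed out after 30s'),
  { sourceQueue: 'orders.process', consumerVersion: '2.3.1', environment: 'production' }
);
console.log(envelope.metadata.lastError); // Error: Connection timed out after 30s
```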


Manual vs. automatic replay

Once you have fixed the issue that caused messages to fail, you need to reprocess them. There are two approaches.

Manual replay: An operator reviews DLQ messages, selects which ones to reprocess, and triggers replay. Safer but slower.

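// Manual replay tool
// Small helper so replayAll can pause between messages
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

class DLQReplayTool {
  constructor(dlq, originalQueue) {
    this.dlq = dlq;
    this.originalQueue = originalQueue;
  }
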
  async listMessages(filters = {}) {
    const messages = await this.dlq.peek(100); // Look without consuming
    return messages.filter(msg => {
      if (filters.sourceQueue && msg.metadata.sourceQueue !== filters.sourceQueue) {
        return false;
      }
      if (filters.errorType && !msg.metadata.lastError.includes(filters.errorType)) {
        return false;
      }
      return true;
    });
  }

  async replayOne(messageId) {
    const msg = await this.dlq.getMessage(messageId);
    await this.originalQueue.publish(msg.originalMessage);
    await this.dlq.delete(messageId);
    console.log(`Replayed message ${messageId}`);
  }

  async replayAll(filter) {
    const messages = await this.listMessages(filter);
    console.log(`Replaying ${messages.length} messages...`);

    for (const msg of messages) {
      await this.replayOne(msg.id);
      await sleep(100); // Don't overwhelm the main queue
    }
  }
}

Automatic replay: After a configured cool-down period, DLQ messages are automatically pushed back to the main queue. Faster but risky: if the underlying issue is not fixed, messages bounce between the main queue and the DLQ forever.

| Approach | Speed | Safety | Best for |
| --- | --- | --- | --- |
| Manual replay | Slow (human in the loop) | High (human verifies fix) | Payment processing, critical data |
| Automatic replay | Fast (no human needed) | Medium (may create retry loops) | Non-critical tasks, known transient failures |
| Automatic with circuit breaker | Medium | High (stops if failures continue) | Best of both worlds |

The circuit breaker approach is a good compromise: automatically replay, but stop if the replayed messages keep failing. This catches transient issues automatically while protecting against persistent bugs.
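
A minimal sketch of automatic replay guarded by a circuit breaker might look like the following. `ReplayBreaker` and `replayWithBreaker` are hypothetical names, `replayFn` stands in for whatever republishes and processes a message, and the breaker here only counts consecutive failures; it has no half-open or timed reset state.

```javascript
// Minimal breaker: opens after `threshold` consecutive failures.
class ReplayBreaker {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.consecutiveFailures = 0;
  }

  recordSuccess() {
    this.consecutiveFailures = 0;
  }

  recordFailure() {
    this.consecutiveFailures += 1;
  }

  get isOpen() {
    return this.consecutiveFailures >= this.threshold;
  }
}

// Replay messages in order and stop replaying once the breaker opens.
// `replayFn` is a hypothetical async function that rejects when replay fails.
async function replayWithBreaker(messages, replayFn, breaker = new ReplayBreaker()) {
  const result = { replayed: [], failed: [], skipped: [] };

  for (const msg of messages) {
    if (breaker.isOpen) {
      result.skipped.push(msg); // leave it in the DLQ for a human to inspect
      continue;
    }
    try {
      await replayFn(msg);
      breaker.recordSuccess();
      result.replayed.push(msg);
    } catch (err) {
      breaker.recordFailure();
      result.failed.push(msg);
    }
  }

  return result;
}
```

Skipped messages stay in the DLQ, so a persistent bug costs only `threshold` wasted attempts instead of a full replay of every message.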


Monitoring and alerting

A DLQ that nobody watches is useless. Here is what to monitor:

// DLQ monitoring configuration
const dlqAlerts = {
  // Alert immediately: any message in a critical DLQ
  critical: {
    queues: ['payments-dlq', 'orders-dlq'],
    threshold: 1,
    action: 'page-oncall',
    message: 'Critical DLQ has messages. Data is not being processed.'
  },

  // Alert after 10 messages: non-critical DLQ growing
  warning: {
    queues: ['emails-dlq', 'analytics-dlq'],
    threshold: 10,
    action: 'slack-channel',
    message: 'DLQ depth growing. Check consumer health.'
  },

  // Alert on rate: sudden spike in DLQ messages
  rateAlert: {
    window: '5m',
    threshold: 50, // More than 50 DLQ messages in 5 minutes
    action: 'page-oncall',
    message: 'DLQ spike detected. Possible upstream issue or deployment bug.'
  }
};

| Metric | What it tells you | Alert threshold |
| --- | --- | --- |
| DLQ depth | How many unprocessed failures exist | > 0 for critical queues |
| DLQ growth rate | How fast failures are accumulating | Sudden spikes |
| Oldest message age | How long the oldest failure has been waiting | > 1 hour for critical |
| Replay success rate | Whether replayed messages are succeeding | < 90% means the fix did not work |
| Error type distribution | What kinds of failures are happening | New error types |
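
Several of these metrics can be checked with a small evaluation function that runs on a schedule. The sketch below assumes a hypothetical `stats` object that your broker or metrics system would populate; the field names and the `rules` shape are illustrative only.

```javascript
// Compare DLQ stats against thresholds and return human-readable alerts.
// Hypothetical stats shape: { depth, oldestMessageAgeMs, replayed, replaySucceeded }.
function evaluateDlq(stats, rules) {
  const alerts = [];

  if (stats.depth > rules.maxDepth) {
    alerts.push(`depth ${stats.depth} exceeds ${rules.maxDepth}`);
  }
  if (stats.oldestMessageAgeMs > rules.maxAgeMs) {
    alerts.push('oldest message exceeds the maximum age');
  }
  if (stats.replayed > 0) {
    const successRate = stats.replaySucceeded / stats.replayed;
    if (successRate < rules.minReplaySuccessRate) {
      alerts.push(`replay success rate ${(successRate * 100).toFixed(0)}% is too low`);
    }
  }

  return alerts;
}

// A critical queue: any message, anything older than one hour, or <90% replay success.
const criticalRules = { maxDepth: 0, maxAgeMs: 60 * 60 * 1000, minReplaySuccessRate: 0.9 };
const alerts = evaluateDlq(
  { depth: 3, oldestMessageAgeMs: 2 * 60 * 60 * 1000, replayed: 10, replaySucceeded: 7 },
  criticalRules
);
console.log(alerts.length); // 3
```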

DLQ strategies by system criticality:

| System type | Retry count | Retry delay | DLQ review | Replay method |
| --- | --- | --- | --- | --- |
| Payment processing | 5 | Exponential (2s-60s) | Immediate (page) | Manual only |
| Order fulfillment | 3 | Exponential (5s-5min) | Within 1 hour | Manual or auto with circuit breaker |
| Email notifications | 3 | Fixed (30s) | Daily review | Automatic replay |
| Analytics ingestion | 1 | None | Weekly review | Automatic replay or discard |
| Audit logging | 10 | Exponential (1s-10min) | Within 4 hours | Manual only (compliance) |

The general principle: the more critical the data, the more retries you allow, the faster you investigate DLQ messages, and the more carefully you replay them.

AI pitfall: AI-generated retry logic often uses a fixed delay (for example, retry every 5 seconds). If the downstream service is down, hammering it every 5 seconds makes the problem worse. Use exponential backoff with jitter: each retry waits longer than the last, with a random offset so that consumers do not all retry at the same instant.

Good to know: DLQ messages are often more valuable than they look. Patterns in DLQ failures can reveal bugs, schema mismatches, or downstream service changes before they become widespread outages. Treat your DLQ dashboard as a canary signal, not just an error log.

Edge case: Replaying DLQ messages for payment processing requires extreme care. If a payment message failed because the external payment provider was down, replaying it after recovery might double-charge the customer if the original charge actually went through. Always check idempotency keys before replaying financial messages.
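
The idempotency check can be sketched as a guard around the charge call. This is an illustration under assumptions, not a production design: `chargeOnce` is a hypothetical helper, the `Map` is an in-memory stand-in for a durable store, and the real charge call would be asynchronous.

```javascript
// Run `chargeFn` at most once per idempotency key.
// `store` is a Map here as an in-memory stand-in (hypothetical); a real system
// needs a durable store with an atomic "insert if absent" operation.
function chargeOnce(store, idempotencyKey, chargeFn) {
  if (store.has(idempotencyKey)) {
    // Already charged: return the recorded result instead of charging again.
    return { ...store.get(idempotencyKey), duplicate: true };
  }
  const result = chargeFn();
  store.set(idempotencyKey, result);
  return { ...result, duplicate: false };
}

// Replaying the same message twice triggers only one charge.
const seen = new Map();
let chargeCount = 0;
const charge = () => ({ chargeId: `ch_${++chargeCount}` });

console.log(chargeOnce(seen, 'ORD-456', charge).duplicate); // false
console.log(chargeOnce(seen, 'ORD-456', charge).duplicate); // true
console.log(chargeCount); // 1
```

A local record cannot capture a charge whose response was lost in transit, so where the payment provider supports idempotency keys, the same key should be sent with the request as well.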