You have built retries with backoff and jitter (random variation added to retry delays so that many clients don't all retry at the exact same moment and overwhelm a recovering server). Your circuit breakers trip when a service is down. But what happens to the message that still fails after all retries are exhausted? Without a Dead Letter Queue (DLQ) — a holding area for messages that failed processing too many times, letting you inspect and fix them later without blocking the main flow — the answer is: it disappears. You lose the data, the customer does not get their order confirmation, and nobody knows anything went wrong until someone complains.
A DLQ catches those messages and holds them for investigation or replay. It is your safety net when everything else fails.
## The problem DLQs solve
Consider a message queue processing orders:
```text
Order Queue: [order-1] [order-2] [order-3] [order-4] [order-5]
                 |
      Consumer picks up order-1
                 |
      Processing fails (payment service down)
                 |
      Retry 1... Retry 2... Retry 3... all fail
                 |
             What now?
```

Without a DLQ, you have two bad options:
- Drop the message. The order is lost. The customer paid but never gets their product.
- Keep retrying forever. The message blocks the queue. Orders 2-5 wait indefinitely.
A DLQ gives you a third option: move the failed message to a separate queue, continue processing the rest, and deal with the failure later.
```text
Order Queue: [order-2] [order-3] [order-4] [order-5]   (processing continues)
Dead Letter Queue: [order-1]                           (held for investigation)
```

## Poison pills
A poison pill is a message that will never succeed no matter how many times you retry it. Examples:
- Malformed JSON that cannot be parsed
- Missing required field (e.g., an order with no customer ID)
- Reference to deleted data (e.g., product ID that no longer exists)
- Schema version mismatch (producer sends v2, consumer expects v1)
- Message too large for the consumer to handle
Poison pills are dangerous because they can block an entire queue. If your consumer processes messages sequentially and a poison pill is at the front, nothing behind it gets processed. The DLQ removes the poison pill from the main queue so healthy messages can flow.
```javascript
async function processMessage(message) {
  try {
    const order = JSON.parse(message.body);

    // Validate before processing
    if (!order.customerId || !order.items?.length) {
      // This will NEVER succeed -- send to DLQ immediately
      await sendToDLQ(message, 'Missing required fields');
      await message.ack(); // Remove from main queue
      return;
    }

    await fulfillOrder(order);
    await message.ack();
  } catch (error) {
    if (error instanceof SyntaxError) {
      // Malformed JSON -- poison pill, send to DLQ immediately
      await sendToDLQ(message, `Parse error: ${error.message}`);
      await message.ack();
      return;
    }
    // Transient error -- let the queue's retry policy handle it
    await message.nack();
  }
}
```

## DLQ flow
Here is the complete lifecycle of a message from the main queue to the DLQ and back:
```text
Producer --> Main Queue --> Consumer
                |               |
                |        Processing fails
                |               |
                |       Retry 1 (1s delay)
                |               |
                |       Retry 2 (2s delay)
                |               |
                |       Retry 3 (4s delay)
                |               |
                |      Max retries reached
                |               |
                |         Move to DLQ
                |               |
                |      Dead Letter Queue
                |               |
                |        Alert ops team
                |               |
                | Investigate and fix root cause
                |               |
                |        Replay message
                |               |
                +---- Back to Main Queue (reprocessing)
```

## DLQ design patterns
### Pattern 1: automatic DLQ routing
Most message brokers support automatic DLQ routing. You configure a max retry count, and the broker moves the message to the DLQ after that many failures.
```javascript
// AWS SQS example: automatic DLQ after 3 failures
const queueConfig = {
  QueueName: 'order-processing',
  Attributes: {
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: 'arn:aws:sqs:us-east-1:123456:order-processing-dlq',
      maxReceiveCount: 3, // After 3 failed processing attempts
    }),
    VisibilityTimeout: '30', // 30 seconds to process before retry
  },
};
```

### Pattern 2: manual DLQ routing with metadata
Sometimes you want more control over what goes to the DLQ and what metadata is attached.
```typescript
interface DLQMessage {
  originalMessage: unknown;
  originalQueue: string;
  failureReason: string;
  failureCount: number;
  firstFailedAt: string;
  lastFailedAt: string;
  errorStack?: string;
}

async function sendToDLQ(message: QueueMessage, reason: string) {
  const dlqMessage: DLQMessage = {
    originalMessage: message.body,
    originalQueue: message.queue,
    failureReason: reason,
    failureCount: message.attributes.receiveCount || 1,
    firstFailedAt: message.attributes.firstReceivedAt || new Date().toISOString(),
    lastFailedAt: new Date().toISOString(),
    errorStack: new Error().stack,
  };

  await dlqProducer.send({
    queue: `${message.queue}-dlq`,
    body: JSON.stringify(dlqMessage),
    attributes: {
      originalMessageId: message.id,
    },
  });

  console.error(`Message ${message.id} sent to DLQ: ${reason}`);
}
```

### Pattern 3: DLQ with automatic replay
For transient failures, you can configure automatic replay from the DLQ after a delay.
```javascript
// DLQ consumer that replays messages after a cooldown
async function dlqConsumer(message) {
  const dlqData = JSON.parse(message.body);
  const age = Date.now() - new Date(dlqData.lastFailedAt).getTime();

  // Wait at least 5 minutes before replaying
  if (age < 5 * 60 * 1000) {
    await message.nack(); // Put it back, not ready yet
    return;
  }

  // Max 3 replays from DLQ
  if (dlqData.failureCount > 6) { // 3 original + 3 DLQ replays
    await escalateToHuman(dlqData);
    await message.ack();
    return;
  }

  // Replay to original queue
  await producer.send({
    queue: dlqData.originalQueue,
    body: JSON.stringify(dlqData.originalMessage),
  });
  await message.ack();
}
```

## DLQ strategies comparison
There is no single right way to handle failed messages. Here is how the common strategies compare:
| Strategy | When to use | Pros | Cons |
|---|---|---|---|
| Automatic broker DLQ | Standard retry exhaustion | Zero code, built into SQS/RabbitMQ/etc. | Limited metadata, no custom logic |
| Manual DLQ with metadata | Need rich failure context | Full control, detailed debugging info | More code to maintain |
| Automatic replay | Transient failures (service was down) | Self-healing, reduces manual work | Risk of infinite loops if misconfigured |
| DLQ with human review | Business-critical data (payments) | Maximum safety, no data loss | Requires ops staffing and tooling |
| Tiered DLQ (DLQ -> DLQ2) | High-volume systems | Separates transient from permanent failures | Added complexity |
| DLQ with transformation | Schema mismatches | Can fix and replay automatically | Transformation logic can itself fail |
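The last row is worth a sketch. A transformation DLQ consumer upgrades a failed payload to the schema the consumer now expects, then replays it to the original queue. This is a minimal illustration under stated assumptions, not a real broker integration: the v1/v2 field names (`productId`/`quantity` versus an `items` array) are hypothetical, and the `producer.send` shape mirrors the earlier examples.

```javascript
// Hypothetical v1 -> v2 upgrade: v1 orders had flat productId/quantity
// fields; v2 consumers expect an items array.
function upgradeOrderSchema(order) {
  if (order.items) return order; // Already v2, nothing to do
  return {
    customerId: order.customerId,
    items: [{ productId: order.productId, quantity: order.quantity ?? 1 }],
  };
}

// Transform a DLQ entry (shaped like the DLQMessage above) and replay it.
// `producer` is injected so the function is easy to test in isolation.
async function transformAndReplay(dlqMessage, producer) {
  const upgraded = upgradeOrderSchema(dlqMessage.originalMessage);
  await producer.send({
    queue: dlqMessage.originalQueue,
    body: JSON.stringify(upgraded),
  });
}
```

Keeping the transformation a pure function makes it easy to unit-test against captured DLQ payloads before replaying anything in production; and remember that the transformation itself can fail, so it needs the same poison-pill handling as any other consumer.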
## Alerting on DLQ depth
```javascript
// Monitor DLQ depth and alert
async function monitorDLQ() {
  const dlqDepth = await getQueueDepth('order-processing-dlq');

  // Alert thresholds
  if (dlqDepth > 100) {
    await alert('critical', `DLQ depth is ${dlqDepth} -- systematic failure likely`);
  } else if (dlqDepth > 10) {
    await alert('warning', `DLQ depth is ${dlqDepth} -- investigate`);
  } else if (dlqDepth > 0) {
    await alert('info', `${dlqDepth} messages in DLQ`);
  }

  // Also alert on growth rate
  const previousDepth = await getMetric('dlq_depth_5min_ago');
  if (dlqDepth - previousDepth > 50) {
    await alert('critical', `DLQ growing rapidly: +${dlqDepth - previousDepth} in 5 minutes`);
  }
}
```

Key metrics to track:
| Metric | What it tells you | Alert threshold |
|---|---|---|
| DLQ depth (absolute) | How many messages are stuck | > 10 warning, > 100 critical |
| DLQ growth rate | How fast failures are occurring | > 50/5min critical |
| DLQ message age | How long the oldest message has been waiting | > 1 hour warning |
| DLQ replay success rate | Whether replays are actually working | < 80% warning |
| Main queue → DLQ ratio | What percentage of messages are failing | > 5% warning |
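The last two metrics are ratios derived from counters rather than read directly from the broker. As a rough sketch (the `counters` object is a hypothetical stand-in for whatever your metrics store returns, and the thresholds match the table above), they can be computed like this:

```javascript
// Derive the failure ratio and replay success rate from raw counters.
// Thresholds: > 5% of messages failing, or < 80% of replays succeeding.
function dlqHealth(counters) {
  const failureRatio = counters.movedToDlq / counters.mainQueueProcessed;
  const replaySuccessRate = counters.replaysSucceeded / counters.replaysAttempted;
  return {
    failureRatio,
    replaySuccessRate,
    alerts: [
      ...(failureRatio > 0.05 ? ['warning: >5% of messages failing'] : []),
      ...(replaySuccessRate < 0.8 ? ['warning: replays not working'] : []),
    ],
  };
}
```

Tracking the ratio rather than the absolute DLQ depth matters on high-volume systems: 100 stuck messages is critical at 1,000 messages a day and background noise at 10 million.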
## Building a DLQ dashboard
At minimum, your DLQ management interface needs to support these operations:
- Inspect: View the message body, failure reason, and processing history
- Replay single: Send one message back to the main queue for reprocessing
- Replay all: Bulk replay after fixing the root cause
- Delete: Remove messages that are no longer relevant
- Filter: Search by failure reason, date range, or message content
```javascript
// Simple DLQ management API
const dlqRouter = {
  // List DLQ messages with pagination
  async list(req, res) {
    const { page = 1, limit = 50, reason } = req.query;
    const messages = await dlqStore.find({
      ...(reason && { failureReason: { $contains: reason } }),
    }, { page, limit });
    return res.json(messages);
  },

  // Replay a single message
  async replay(req, res) {
    const message = await dlqStore.findById(req.params.id);
    await producer.send({
      queue: message.originalQueue,
      body: JSON.stringify(message.originalMessage),
    });
    await dlqStore.delete(req.params.id);
    return res.json({ status: 'replayed' });
  },

  // Bulk replay all messages matching a filter
  async replayAll(req, res) {
    const { reason } = req.body;
    const messages = await dlqStore.find({ failureReason: reason });
    let replayed = 0;
    for (const msg of messages) {
      await producer.send({
        queue: msg.originalQueue,
        body: JSON.stringify(msg.originalMessage),
      });
      await dlqStore.delete(msg.id);
      replayed++;
    }
    return res.json({ replayed });
  },
};
```

DLQs are not glamorous, but they are the difference between "we lost 500 orders and found out a week later" and "we caught the problem in 5 minutes and replayed everything." Every message-based system needs one.