
Event-Driven Architecture for LLM Pipelines: Queues, Retries, and the Cost of Synchronous Chains

An LLM pipeline is a distributed system with one new property: every step can cost actual money, take seconds to minutes, and fail in ways that look like success. The first version is usually a function that chains calls in a row. The second version, the one that ships, looks a lot like the event-driven architectures the platform engineering community has been recommending for a decade — because the same pressures apply.

This is the shape I have ended up with across three production LLM pipelines. The lessons are boring. The boring lessons are usually the ones that save the shift.

The Problem With the Synchronous Version

A typical "summarize this document and classify it" pipeline has four steps:

  1. Load the document
  2. Summarize it (LLM call)
  3. Classify it using the summary (LLM call)
  4. Store the result

Written as one function, it is 20 lines of code. It is also a contraption with five failure modes stacked in series:

  • The document source is slow or down
  • The summarization call times out
  • The summarization call returns a degenerate response
  • The classification call rate-limits
  • The database write fails

If any of these fail, the whole request fails, and your retry blows through however many LLM calls have already succeeded. If step 2 costs $0.05 and step 3 costs $0.02, a retry after step 4 fails costs $0.07 to get back to where you were — and it re-consumes the token budget.
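
For concreteness, a minimal sketch of that synchronous version, with hypothetical helper names standing in for the four steps:

async def handle_document(doc_id: str) -> None:
    # Hypothetical helpers; each await is a failure point, and a failure
    # anywhere means re-running everything that already succeeded.
    document = await load_document(doc_id)          # source slow or down
    summary = await summarize(document)             # LLM call: timeouts, degenerate output
    label = await classify(summary)                 # LLM call: rate limits
    await store_result(doc_id, summary, label)      # database write can fail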

The Event-Driven Version

Break the pipeline into stages. Each stage reads a message from a queue, does one thing, writes a message to the next queue. Durable storage (the queue) between stages means a failure in step 3 does not lose the work of step 2.

[documents]        [pending_summaries]     [pending_classifications]     [completed]
      │                    │                        │                         │
      ▼                    ▼                        ▼                         ▼
   fetcher ─────────►  summarizer ─────────►   classifier ─────────►      storer

Each consumer is small. Each has its own retry policy, its own rate-limit budget, its own observability. When the classification service rate-limits, only the classification queue backs up; the summarization side keeps running and the fetcher keeps pulling documents.

The infrastructure cost is one queue per boundary (SQS, Pub/Sub, Kafka, Redis Streams — pick what you know). The engineering cost is making each stage idempotent.

Idempotency at Every Boundary

The golden rule of event-driven systems: every consumer should be able to process the same message twice without causing incorrect behavior. You will get duplicates. The queue might deliver the same message twice; a consumer might crash after doing the work but before acking; an operator might replay a dead-letter queue. Any of these means the same work runs twice, and without idempotency you pay for it twice.

For LLM pipelines specifically, idempotency is worth money. A duplicate summarization call is $0.03 you did not need to spend. A duplicate classification might write a conflicting result. Both are avoidable with a write pattern that encodes the message ID into the target table:

INSERT INTO summaries (document_id, summary, model, message_id)
VALUES (...)
ON CONFLICT (message_id) DO NOTHING;

The message_id from the incoming message serves as the idempotency key. Double-deliver all you want — the second write is a no-op. The LLM call itself is the expensive part; if you have already done the work, skip the call entirely by checking existence first.
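
As a sketch of the same write from application code, assuming an asyncpg-style connection (the connection object, table, and helper name are illustrative):

async def save_summary(db, message_id: str, document_id: str, summary: str, model: str) -> bool:
    status = await db.execute(
        """
        INSERT INTO summaries (document_id, summary, model, message_id)
        VALUES ($1, $2, $3, $4)
        ON CONFLICT (message_id) DO NOTHING
        """,
        document_id, summary, model, message_id,
    )
    # asyncpg returns a command tag like "INSERT 0 1"; "INSERT 0 0" means the
    # row already existed and this delivery was a duplicate.
    return status == "INSERT 0 1"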

The Shape of a Stage

Every stage in the pipeline ends up looking roughly the same:

async def process_message(msg: Message) -> None:
    msg_id = msg.attributes["message_id"]

    # Have we already done this work?
    if await storage.exists(msg_id):
        return

    payload = msg.body

    try:
        result = await do_the_work(payload)  # The LLM call
    except RateLimitError as e:
        # Send back for retry with exponential backoff
        raise RetryableError(e, retry_after=e.retry_after)
    except ModelError as e:
        # Non-retryable; send to DLQ for human review
        await dead_letter.push(msg, error=str(e))
        return

    # Idempotent write
    await storage.save(msg_id, result)

    # Emit to next stage
    await outbox.push({"type": "summary_ready", "document_id": payload["id"], ...})

The five elements: idempotency check up front, retry-aware error handling, DLQ for unrecoverable errors, idempotent write, event emission. Every stage, same shape. Once you have the template, new stages are a day's work.

Retry Budgets Are Not Infinite

Retries in an LLM pipeline cost real tokens. A stage that retries 10 times on a 500 can burn through a serious amount of compute before giving up.

Set a retry budget per message. Something like:

  • 3 retries on transient errors (network, rate limit, 5xx)
  • Exponential backoff with jitter (1s, 5s, 25s with ±20% jitter)
  • No retry on 4xx business errors (bad input, content policy violations) — straight to DLQ

And cap the total retries per model across all stages. If your classification model is having a bad hour, every stage that calls it will retry in concert, and you will exhaust your rate limit trying to help. A circuit breaker at the model-client level prevents this cascade.
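
A sketch of both pieces, a jittered backoff schedule and a model-level breaker shared by every stage that calls the model (thresholds and class names are illustrative, not from a specific library):

import random
import time

def backoff_delay(attempt: int, base: float = 1.0, factor: float = 5.0, jitter: float = 0.2) -> float:
    # attempts 0, 1, 2 -> roughly 1s, 5s, 25s, each with +/-20% jitter.
    delay = base * (factor ** attempt)
    return delay * random.uniform(1 - jitter, 1 + jitter)

class ModelCircuitBreaker:
    # One instance per model, shared by every stage that calls that model,
    # so a struggling model is not hammered by concerted retries.

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()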

Dead Letter Queues Are Operational Artifacts

A dead-letter queue (DLQ) is where messages go when they have failed too many times. DLQs are where you discover the most interesting bugs in your pipeline.

Three things to get right:

Alert on DLQ depth. An empty DLQ is normal. A DLQ with 5 messages an hour is suspicious. A DLQ with 500 messages an hour is an outage. Wire up the alert on day one.

Include the full context when sending to DLQ. The original message, the last exception, the stage name, the timestamp. Good DLQ entries are self-contained — an engineer can look at one and figure out what happened without cross-referencing 4 logs.

Provide a replay mechanism. Fixing the code that caused the DLQ entries does not fix the entries themselves. Write a tool (a script is fine) that reads the DLQ and pushes messages back to the appropriate stage's input queue.
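
The replay tool does not need to be fancy. A sketch, assuming the queue client and DLQ entry fields follow the conventions above (names are illustrative):

async def replay_dead_letters(dead_letter, input_queues, stage: str, limit: int = 100) -> int:
    # Push original messages for one stage back onto that stage's input queue.
    replayed = 0
    async for entry in dead_letter.read(limit=limit):
        # Entries are self-contained: original message, stage, error, timestamp.
        if entry["stage"] != stage:
            continue
        await input_queues[stage].push(entry["original_message"])
        await dead_letter.ack(entry)
        replayed += 1
    return replayed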

Observability: Per-Stage, Not End-to-End

Single-request traces across a pipeline are possible (OpenTelemetry, distributed tracing) but they are the secondary view, not the primary one. The primary view is per-stage metrics:

  • Input queue depth
  • Processing rate
  • P50/P95/P99 latency
  • Success/failure/DLQ counts
  • Token usage and cost per successful message

When a pipeline misbehaves, the diagnostic path is "which stage has the anomaly?" — and per-stage dashboards answer that in seconds. Trace views are great for drilling into a specific failed request after you already know which stage to look at.
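
One way to wire the per-stage counters up, sketched with prometheus_client (metric and label names are illustrative; queue depth usually comes from the broker rather than the consumer):

from prometheus_client import Counter, Histogram

MESSAGES = Counter("pipeline_messages_total", "Messages processed", ["stage", "outcome"])
LATENCY = Histogram("pipeline_processing_seconds", "Per-message processing time", ["stage"])
TOKENS = Counter("pipeline_tokens_total", "Tokens consumed", ["stage", "model"])

def record_result(stage: str, outcome: str, seconds: float, tokens: int, model: str) -> None:
    # outcome is one of "success", "failure", "dlq".
    MESSAGES.labels(stage=stage, outcome=outcome).inc()
    LATENCY.labels(stage=stage).observe(seconds)
    TOKENS.labels(stage=stage, model=model).inc(tokens)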

Cost Attribution: Tag Messages With Origin

One of the harder things in an LLM pipeline is answering "which customer/feature/request caused this $1,200 of LLM spend this week?" If your stages do not carry provenance through, the answer is a long SQL query against raw logs.

The fix: tag every message with its origin when it enters the pipeline, and propagate those tags through every subsequent message. A message leaving the fetcher carries customer_id and feature; the summarizer preserves them when it emits the next message; the classifier does the same. Every LLM call you bill records these tags alongside the token count.
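
In code this is a small helper that every stage uses when building its outgoing message (field names are illustrative):

PROVENANCE_FIELDS = ("customer_id", "feature")

def next_message(incoming: dict, **new_fields) -> dict:
    # Copy provenance tags forward unchanged; they were set once, when the
    # document first entered the pipeline, and every billed LLM call records
    # them alongside its token counts.
    message = dict(new_fields)
    for field in PROVENANCE_FIELDS:
        message[field] = incoming[field]
    return message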

At that point, per-customer cost is a simple aggregation. Without it, you will be guessing for months.

The Patterns That Save You From Yourself

Outbox for cross-stage emissions. If stage B has done the work but crashes before emitting to stage C, you have lost the chain. The fix is the outbox pattern: write the "next stage" message to an outbox table in the same transaction as the result, and have a separate worker publish outbox rows to the queue. Works; boring; effective.
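
A sketch of the outbox write, assuming an asyncpg-style connection and JSON-encoded event columns (table names and helpers are illustrative):

import json

async def save_and_stage_event(db, msg_id: str, result: dict, next_event: dict) -> None:
    # The result and the "next stage" event commit atomically; a separate
    # publisher worker later pushes unpublished outbox rows to the queue.
    async with db.transaction():
        await db.execute(
            "INSERT INTO summaries (message_id, payload) VALUES ($1, $2) "
            "ON CONFLICT (message_id) DO NOTHING",
            msg_id, json.dumps(result),
        )
        await db.execute(
            "INSERT INTO outbox (event, published) VALUES ($1, false)",
            json.dumps(next_event),
        )

async def publish_outbox(db, queue) -> None:
    rows = await db.fetch("SELECT id, event FROM outbox WHERE published = false LIMIT 100")
    for row in rows:
        # Crashing between push and update only re-publishes the event;
        # idempotent consumers make that harmless.
        await queue.push(json.loads(row["event"]))
        await db.execute("UPDATE outbox SET published = true WHERE id = $1", row["id"])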

Versioned schemas. The message emitted from stage 2 is the contract stage 3 depends on. Version the schema (v1, v2) and reject messages of wrong versions with a clear error. Rolling out a schema change becomes a two-deploy dance — support v2 in the consumer, then switch the producer — instead of a coordinated release that breaks in flight.
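
The consumer-side check is small (version strings are illustrative):

SUPPORTED_VERSIONS = {"v1", "v2"}

class UnsupportedSchemaVersion(Exception):
    pass

def check_version(msg: dict) -> None:
    version = msg.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        # A clear, non-retryable error: these go to the DLQ instead of being
        # half-parsed into silently wrong results downstream.
        raise UnsupportedSchemaVersion(f"got {version!r}, supported: {sorted(SUPPORTED_VERSIONS)}")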

Kill switches per stage. A feature flag that lets you pause consumption on a stage without redeploying. When the classification model starts misbehaving and you need to stop the bleeding while you investigate, you want a pause button.

Deterministic retry. Same input, same prompt, same model, same seed (where supported). A non-deterministic LLM call that retries can produce different outputs on each attempt; choosing the first successful attempt and persisting it keeps your data consistent across retries.

When You Do Not Need This

If your pipeline is one or two LLM calls and they always complete in seconds, a synchronous handler is fine. If the total cost per run is pennies and retries are cheap, you are overbuilding if you reach for queues.

The break point I use: three or more LLM-adjacent steps, or any single step that takes >5 seconds, or any workload where duplicate execution would cost real money. Below that, stay synchronous and spend your complexity budget elsewhere.

The Shape That Holds

An event-driven LLM pipeline is an old pattern applied to a new workload. The ingredients — queues, idempotency, DLQs, retries with budgets, observability per stage — are the same ones that have been making distributed systems resilient for fifteen years. What makes them urgent in LLM pipelines is that each step costs real money and takes real time, so the failures compound in a way a pure-code pipeline never did.

Build it boring. The boring version is the one that is still standing at 3am when something unusual happens.
