Event-Driven System Design: The Decisions That Bite You Later
Event-driven systems fail slowly. The first version works fine. A year in, you have duplicate orders, out-of-order updates, a dead letter queue nobody reads, and a schema change nobody can ship without coordinating across four teams. None of this is the broker's fault. It is the consequence of decisions made in the first week that nobody revisited.
This post is about those decisions. Not the broker comparison — pick Kafka or SQS or NATS based on your operational posture; that is a separate question. This is about the choices that determine whether your event-driven system is a useful asset or a quiet liability two years from now.
The First Decision: What Is an Event?
People conflate two very different things and call them both "events."
Event-as-fact is a record of something that happened. OrderPlaced(order_id=123, items=[...], total=99.99). It is immutable, has a clear owner, and means the same thing forever. Consumers can replay it years later and reconstruct state.
Event-as-message is a notification that something changed and you should go look. OrderUpdated(order_id=123). The receiver calls back to the source for the actual data. It is essentially a fancy webhook.
These have completely different operational profiles. Event-as-fact lets consumers be independent — they can replay history, recover from outages, evolve their data model without coordinating. Event-as-message couples every consumer back to the source service's API and database.
The default in most teams I see is event-as-message, because it feels easier — you do not have to think hard about what to put in the payload. The cost shows up later: every consumer becomes a load source on the producer, replay is impossible, and the "event-driven" architecture is in practice a star topology with one service in the middle.
Pick event-as-fact unless you have a specific reason not to. Put enough in the payload that a consumer can do its job without calling back.
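To make the difference concrete, here is roughly what the two payloads look like side by side (field names and values are illustrative, not a spec):

# Event-as-fact: the consumer can do its job from the payload alone.
order_placed = {
    "event_type": "OrderPlaced",
    "order_id": 123,
    "items": [{"sku": "SKU-1", "qty": 1, "unit_price": 99.99}],
    "total": 99.99,
    "placed_at": "2026-02-01T12:00:00Z",
}

# Event-as-message: the consumer learns nothing without calling back.
order_updated = {
    "event_type": "OrderUpdated",
    "order_id": 123,
}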
Delivery Guarantees: Pick One and Mean It
Brokers advertise delivery semantics: at-most-once, at-least-once, exactly-once. The honest version is shorter:
- At-most-once means you will lose messages. Acceptable for telemetry. Never acceptable for business events.
- At-least-once means you will see duplicates. This is what you actually get from every production-grade broker.
- Exactly-once is a marketing term. Some brokers offer it within their boundary (Kafka transactions, for example) but the moment a consumer writes to an external system, you are back to at-least-once.
The practical implication: assume duplicates and design for them. Every consumer that mutates state must be idempotent. Not "we'll add idempotency later" — built in from the first handler.
The cheapest idempotency is a dedupe table:
CREATE TABLE processed_events (
    event_id UUID NOT NULL,
    consumer_name TEXT NOT NULL,
    processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- Composite key: two different consumers may each process the same event once.
    PRIMARY KEY (event_id, consumer_name)
);
Before processing, insert with ON CONFLICT DO NOTHING and check the row count. If zero, you have already processed this event — skip the side effects. Do this in the same transaction as the business write and you have correct at-least-once handling.
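In application code, that shape looks roughly like this. A minimal sketch using psycopg 3; apply_business_write and the consumer name are placeholders for your own logic:

def process_event(conn, event):
    # The dedupe insert and the business write share one transaction,
    # so they can never disagree after a crash.
    with conn.transaction():
        cur = conn.execute(
            "INSERT INTO processed_events (event_id, consumer_name)"
            " VALUES (%s, %s) ON CONFLICT DO NOTHING",
            (event["event_id"], "billing-consumer"),
        )
        if cur.rowcount == 0:
            return  # duplicate delivery: already processed, skip the side effects
        apply_business_write(conn, event)  # placeholder for your actual state change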
Ordering: The Question You Avoid Asking
Out of the box, Kafka gives you ordering within a partition. SQS standard does not give you ordering at all. SQS FIFO does, within a message group. NATS JetStream does, within a stream.
The trap is assuming you have ordering when you do not, or assuming you need it when you do not.
Ask the actual question: what is the smallest scope within which order matters? Usually it is "events for the same entity." Order across entities almost never matters.
For Kafka, that means partitioning by entity ID — order_id, user_id, whatever. Same key always lands on the same partition, so per-entity order is preserved and across-entity parallelism still works. For SQS FIFO, the same idea: use the entity ID as the message group ID.
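With confluent-kafka, keying by entity ID is a single argument. A sketch; the topic name and broker address are assumptions:

import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_order_event(event):
    # Same key, same partition: per-order ordering is preserved while
    # different orders still spread across partitions for parallelism.
    producer.produce(
        "orders",
        key=str(event["order_id"]),
        value=json.dumps(event).encode("utf-8"),
    )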
If you need global order, you have a single partition, which means a single consumer, which means your throughput ceiling is one consumer's throughput. This is almost never what you want. If your design seems to require it, the design is probably wrong.
Schema Evolution: The Slow Killer
The first event you publish is fine. The hundredth is fine. The problem is in year two, when the consumer team needs a field changed and the producer team has shipped seven versions since.
Two rules that prevent most pain:
1. Treat the schema as a contract, not a data structure. Use a schema registry (Confluent, Apicurio) or, at minimum, version your event types explicitly: OrderPlaced.v2. Consumers declare which versions they handle. Producers cannot break a version that has live consumers.
2. Only make additive changes by default. Add fields, do not remove or rename. Consumers must tolerate unknown fields. Producers must keep old fields populated until every consumer has migrated. This is boring and it works.
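On the consumer side, additive evolution is mostly discipline in how you read the payload. A sketch, where currency is a hypothetical field added in v2 and record_revenue is a placeholder downstream call:

def handle_order_placed(event):
    # Read only the fields you need; silently ignore anything unknown.
    order_id = event["order_id"]
    total = event["total"]
    # Added in v2. The default covers events published before the field existed.
    currency = event.get("currency", "USD")
    record_revenue(order_id, total, currency)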
The teams I have seen skip this end up with a "schema document" that diverged from reality six months ago, a Slack channel where people ask what fields actually exist, and a producer service that nobody dares change.
The Outbox Pattern
The most underrated pattern in event-driven systems is the simplest. You want to update your database and publish an event. Doing both atomically across two systems is not possible without a distributed transaction, and you do not want a distributed transaction.
The outbox pattern: write the event to a regular database table in the same transaction as your state change. A separate process reads from the outbox and publishes to the broker.
BEGIN;
UPDATE orders SET status = 'paid' WHERE id = 123;
INSERT INTO outbox (event_type, payload, created_at)
VALUES ('OrderPaid', '{"order_id":123,...}', NOW());
COMMIT;
The publisher process is a small loop: read unpublished rows, publish, mark them published. If the publisher crashes after the broker accepts but before the row is marked, you republish — at-least-once, which you already designed for.
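A minimal version of that loop, sketched with psycopg 3 and the producer from earlier. It assumes the outbox table also carries an id primary key and a published_at column, neither of which is in the snippet above:

import time

def run_outbox_publisher(conn, producer, topic="orders"):
    while True:
        with conn.transaction():
            rows = conn.execute(
                "SELECT id, payload FROM outbox"
                " WHERE published_at IS NULL"
                " ORDER BY id LIMIT 100"
                " FOR UPDATE SKIP LOCKED"
            ).fetchall()
            if rows:
                for _id, payload in rows:
                    producer.produce(topic, value=payload.encode("utf-8"))
                producer.flush()  # wait for broker acks before marking rows published
                conn.execute(
                    "UPDATE outbox SET published_at = NOW() WHERE id = ANY(%s)",
                    ([r[0] for r in rows],),
                )
        time.sleep(0.5)  # simple polling; a crash here only ever causes a republish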
What this gives you: your application code never has to think about partial failure between "wrote to database" and "published event." Either both happen or neither does. There is no broker-down scenario where state and events drift apart.
The variations — change data capture (CDC) reading the database WAL, transactional outbox via Debezium — are the same idea with different mechanics. The principle is what matters: events are derived from durable database state, not produced alongside it.
Dead Letter Queues That Anyone Reads
Every team I have worked with has a dead letter queue. Almost none of them have a process for what happens when something lands in it.
The pattern that actually works:
- Bounded retries with backoff. Three attempts, exponential. After that, DLQ (see the sketch after this list).
- Alert on first message. Not on a threshold — on the first one. A DLQ that is silently filling is a DLQ nobody reads.
- A replay tool that ships with the consumer. A simple CLI or admin endpoint that replays a message from the DLQ after a fix. If replaying requires custom code each time, nobody will do it and the DLQ will become a graveyard.
- A dashboard with the count and age of the oldest message. Old messages in the DLQ are an outage, even if the rest of the system is green.
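The retry shape from the first item, sketched in Python; handle and publish_to_dlq stand in for your handler and your broker client:

import time

MAX_ATTEMPTS = 3

def process_with_retries(event, handle, publish_to_dlq):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handle(event)
            return
        except Exception as exc:
            if attempt == MAX_ATTEMPTS:
                # The alert fires on the first DLQ message, not on a threshold.
                publish_to_dlq(event, reason=str(exc))
                return
            time.sleep(2 ** attempt)  # exponential backoff: 2s, then 4s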
A DLQ is not error handling. It is the receipt that error handling failed. Treat it accordingly.
Observability: The Trace Across Async Boundaries
Synchronous systems get distributed tracing nearly for free from a tracing library. Async systems do not — the broker breaks the trace by default.
Two things to do from day one:
Propagate trace context through events. Add a trace_id (or full W3C Trace Context) to every event's headers. The consumer extracts it and continues the trace. OpenTelemetry specifies messaging conventions for this; use them.
Log the event ID in every log line a consumer produces while handling that event. When something goes wrong, you want to grep for event_id=abc-123 and see the entire processing trace. Without this, debugging an async system means staring at unrelated log lines and guessing.
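Both habits together, sketched against the OpenTelemetry Python API and a confluent-kafka message; handle and the span name are placeholders:

import json
import logging

from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)
log = logging.getLogger(__name__)

def consume(message):
    # Kafka headers arrive as (key, bytes) pairs; the W3C propagator
    # expects a str-to-str mapping.
    carrier = {k: v.decode("utf-8") for k, v in (message.headers() or [])}
    event = json.loads(message.value())
    with tracer.start_as_current_span("process-order-event", context=extract(carrier)):
        # Every log line carries the event ID, so one grep reconstructs
        # the full processing history.
        log.info("processing event", extra={"event_id": event["event_id"]})
        handle(event)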
Without these two, you will have a system that works fine when nothing goes wrong, and is impossible to debug when something does.
The Decisions, Compressed
If you only remember a checklist:
- Events are facts, not notifications. Put enough in the payload.
- Assume duplicates. Every state-mutating consumer is idempotent from day one.
- Partition by entity ID. You almost never need global order.
- Version the schema, only add fields, register the contract.
- Use the outbox pattern for atomic state-change-plus-publish.
- Treat the DLQ as an alert, not a backlog.
- Propagate trace context and event IDs through every consumer.
None of these are exotic. They are the boring, durable choices. The teams whose event-driven systems still feel manageable in year three made these in week one. The teams who skipped them spent year three rewriting.
Closing
Event-driven design is mostly negative space — the things you decide not to do, the assumptions you decide not to rely on. The broker matters less than you think. The discipline around what an event is, how consumers behave, and how schemas evolve matters far more.
If you are starting a new event-driven system, the highest-leverage thing you can do is write down the answers to the seven questions above before you ship the first consumer. Re-deciding them later is expensive. Deciding them deliberately at the start is free.