The Transactional Outbox Pattern: Publishing Events Without Losing Them
Every event-driven system eventually writes this code: save the order to the database, then publish OrderCreated to the message broker. Two operations, two systems, no shared transaction. If the process dies between them, the order exists but the event never happened — and every downstream consumer's view of the world is now quietly wrong. This is the dual-write problem, and the transactional outbox is the standard cure.
The dual write fails in both directions
You cannot order the two writes safely. Publish after commit, and a crash between them loses the event. Publish before commit, and a rollback means consumers received an event for an order that does not exist. Wrapping both in a distributed transaction is theoretically possible and practically miserable — most brokers do not participate in two-phase commit, and the ones that do make you pay for it in throughput and operational pain.
BEGIN; INSERT order; COMMIT; → crash → event lost
PUBLISH event; BEGIN; INSERT; ROLLBACK; → ghost event
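The first failure mode can be reproduced in a few lines. This is a minimal sketch, not production code: it assumes an in-memory SQLite database in place of the real one, a plain Python list as a hypothetical stand-in for the broker, and a `Crash` exception to simulate the process dying between the two writes.

```python
import sqlite3


class Crash(Exception):
    """Stands in for the process dying at an arbitrary point."""


def create_order_naive(conn, broker, order_id, crash_before_publish=False):
    # Write 1: the database commit.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, 4900)
        )
    if crash_before_publish:
        raise Crash("died between commit and publish")
    # Write 2: the broker publish, outside any transaction.
    broker.append({"type": "OrderCreated", "orderId": order_id})


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER)")
broker = []  # hypothetical in-memory stand-in for a message broker

try:
    create_order_naive(conn, broker, "ord_81f2", crash_before_publish=True)
except Crash:
    pass

orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orders, len(broker))  # the order exists, the event does not
```

Nothing in the code is wrong in the usual sense. The gap between the commit and the publish is simply not covered by any transaction.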
The insight behind the outbox: you already have one system that does transactions well. Use it for both writes.
Write the event where you write the data
The outbox is a table in the same database, written in the same transaction. Alongside the order insert, you insert a row describing the event. Either both commit or neither does. The business data and the intent to publish can no longer disagree.
BEGIN;

INSERT INTO orders (id, customer_id, total)
VALUES ('ord_81f2', 'cus_note', 4900);

INSERT INTO outbox (id, aggregate_id, event_type, payload, created_at)
VALUES (
    'evt_3c9a',
    'ord_81f2',
    'OrderCreated',
    '{"orderId": "ord_81f2", "total": 4900}',
    now()
);

COMMIT;
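From application code, the same transaction might look like the sketch below. It assumes SQLite via Python's standard `sqlite3` module rather than Postgres, a nullable `published_at` column on the outbox, and a hypothetical `fail` flag that simulates an error between the two inserts, to show that neither row survives a rollback.

```python
import json
import sqlite3
import uuid


def create_order(conn, order_id, customer_id, total, fail=False):
    """Insert the order and its outbox row in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT INTO orders (id, customer_id, total) VALUES (?, ?, ?)",
            (order_id, customer_id, total),
        )
        if fail:
            raise RuntimeError("simulated failure mid-transaction")
        conn.execute(
            "INSERT INTO outbox "
            "(id, aggregate_id, event_type, payload, created_at) "
            "VALUES (?, ?, ?, ?, datetime('now'))",
            (
                "evt_" + uuid.uuid4().hex[:8],
                order_id,
                "OrderCreated",
                json.dumps({"orderId": order_id, "total": total}),
            ),
        )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT, total INTEGER);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY, aggregate_id TEXT, event_type TEXT,
        payload TEXT, created_at TEXT, published_at TEXT
    );
""")

create_order(conn, "ord_81f2", "cus_note", 4900)
try:
    # Hypothetical second order that fails after the first insert.
    create_order(conn, "ord_0000", "cus_note", 100, fail=True)
except RuntimeError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
event_count = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(order_count, event_count)
```

The failed call leaves no order and no event behind, so the two tables stay in agreement.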
A separate relay moves rows from the outbox to the broker. The relay reads unpublished rows, publishes them, and marks them done. If it crashes after publishing but before marking, it will publish again on restart — the pattern is at-least-once by construction, which is the correct default for event delivery.
Two relay styles: polling and log tailing
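Polling is the simple version. A background worker selects a batch of unpublished rows on an interval, publishes them, and marks them sent. It is easy to build, easy to reason about, and fine for most systems. Lock the batch (FOR UPDATE SKIP LOCKED in Postgres) so multiple relay instances do not fight over rows. Row locks last only until the transaction ends, so the select, the publish, and the update that marks rows sent have to run inside one transaction.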
SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
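A polling relay built around a query like this can be sketched as follows. The assumptions are significant: SQLite has no FOR UPDATE SKIP LOCKED, so the sketch presumes a single relay instance; `publish` is a hypothetical callback standing in for a broker client; and `flaky_publish` simulates a crash after the broker has accepted a message but before the row is marked, which is exactly the case that produces a duplicate.

```python
import sqlite3


def relay_batch(conn, publish, limit=100):
    """Publish up to `limit` unpublished rows, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published_at IS NULL ORDER BY created_at, id LIMIT ?",
        (limit,),
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_id, event_type, payload)  # if this raises, the row stays unpublished
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = datetime('now') WHERE id = ?",
                (event_id,),
            )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, event_type TEXT, "
    "payload TEXT, created_at TEXT, published_at TEXT)"
)
with conn:
    conn.executemany(
        "INSERT INTO outbox (id, event_type, payload, created_at) VALUES (?, ?, ?, ?)",
        [
            ("evt_1", "OrderCreated", "{}", "2024-01-01 00:00:01"),
            ("evt_2", "OrderCreated", "{}", "2024-01-01 00:00:02"),
        ],
    )

sent = []  # hypothetical stand-in for what the broker received
state = {"crashed": False}


def flaky_publish(event_id, event_type, payload):
    sent.append(event_id)
    if not state["crashed"]:
        state["crashed"] = True
        raise ConnectionError("simulated crash after the broker accepted the message")


try:
    relay_batch(conn, flaky_publish)
except ConnectionError:
    pass  # the relay "restarts" below
relay_batch(conn, flaky_publish)

pending = conn.execute(
    "SELECT COUNT(*) FROM outbox WHERE published_at IS NULL"
).fetchone()[0]
print(sent, pending)
```

After the restart every row is marked, but the first event reached the broker twice. That is the at-least-once behavior the pattern accepts in exchange for never losing an event.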
Change data capture is the low-latency version. Tools like Debezium tail the database's replication log and turn outbox inserts into broker messages without polling. You get lower latency and shed the extra query load, at the cost of running and operating a CDC pipeline. Choose it when event latency genuinely matters, not because the architecture diagram looks better.
| Concern | Polling relay | CDC relay |
|---|---|---|
| Latency | Up to one poll interval | Near real time |
| Operational complexity | A worker and a query | A CDC platform |
| Database load | Periodic reads | Log tailing, minimal |
| Ordering control | Explicit in query | Log order |
The details that decide whether it works in production
Consumers must be idempotent, because duplicates are guaranteed. At-least-once delivery means the same event will occasionally arrive twice. Put a unique event ID in every outbox row and have consumers record processed IDs. This is the same discipline as idempotency keys on an API — the outbox solves the producer side, not the consumer side.
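One way to record processed IDs is a table with the event ID as its primary key, written in the same transaction as the consumer's side effect. The sketch below assumes the consumer's own state lives in a database that can hold that table; `processed_events`, `stats`, and `handle_once` are hypothetical names chosen for illustration.

```python
import sqlite3


def handle_once(conn, event_id, apply):
    """Run `apply` only if `event_id` has not been processed before."""
    with conn:  # dedup record and side effect commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery, nothing to do
        apply(conn)
        return True


def count_order(conn):
    # Hypothetical side effect: bump a counter of orders seen.
    conn.execute("UPDATE stats SET orders_seen = orders_seen + 1")


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE stats (orders_seen INTEGER);
    INSERT INTO stats VALUES (0);
""")

first = handle_once(conn, "evt_3c9a", count_order)
second = handle_once(conn, "evt_3c9a", count_order)  # redelivery of the same event

orders_seen = conn.execute("SELECT orders_seen FROM stats").fetchone()[0]
print(first, second, orders_seen)
```

Because the ID record and the side effect share a transaction, a failure inside `apply` rolls back the ID as well, and the next delivery gets a clean retry.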
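Ordering only exists per aggregate. Publishing in created_at order across the whole table is fragile under concurrency: timestamps are assigned when a row is written, not when its transaction commits, so a row with an earlier created_at can become visible after the relay has already published a later one. What consumers usually need is order per entity: all events for order ord_81f2 in sequence. Use the aggregate ID as the partition key on the broker so per-entity ordering survives fan-out, and do not promise global ordering you cannot keep.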
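The partition-key idea can be illustrated without a broker. In the sketch below, `partition_for` is a hypothetical helper that hashes the aggregate ID with CRC32. Real brokers use their own partitioners, so the hash function is an assumption. What matters is that it is deterministic, so every event for one aggregate lands on the same partition in publish order. The second order ID and the `OrderPaid` and `OrderShipped` event types are invented for the example.

```python
import zlib


def partition_for(aggregate_id: str, num_partitions: int) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(aggregate_id.encode("utf-8")) % num_partitions


events = [
    ("ord_81f2", "OrderCreated"),
    ("ord_77aa", "OrderCreated"),
    ("ord_81f2", "OrderPaid"),
    ("ord_81f2", "OrderShipped"),
]

# Simulate a broker with 8 partitions, each an append-only list.
partitions = {}
for aggregate_id, event_type in events:
    key = partition_for(aggregate_id, 8)
    partitions.setdefault(key, []).append((aggregate_id, event_type))

target = partitions[partition_for("ord_81f2", 8)]
sequence = [etype for agg, etype in target if agg == "ord_81f2"]
print(sequence)
```

Events for other aggregates may interleave on the same partition, which is fine: the guarantee is order within one aggregate, not across them.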
The outbox grows forever unless you clean it. Delete or archive published rows on a schedule, and keep enough history to replay recent events when a consumer needs recovery. A common split: hot rows in the outbox for days, archived events in cheap storage for as long as audit requires.
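A scheduled cleanup can be as small as one DELETE. The sketch below assumes SQLite, a `published_at` column stored as a UTC text timestamp, and a hypothetical seven-day retention. It only deletes, so an archive step that copies rows to cheaper storage would have to run before it.

```python
import sqlite3


def purge_published(conn, retention_days=7):
    """Delete rows published more than `retention_days` ago; return the count."""
    with conn:
        cur = conn.execute(
            "DELETE FROM outbox WHERE published_at IS NOT NULL "
            "AND published_at < datetime('now', ?)",
            (f"-{int(retention_days)} days",),
        )
        return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id TEXT PRIMARY KEY, published_at TEXT)")
with conn:
    conn.execute("INSERT INTO outbox VALUES ('evt_old', datetime('now', '-30 days'))")
    conn.execute("INSERT INTO outbox VALUES ('evt_new', datetime('now'))")
    conn.execute("INSERT INTO outbox VALUES ('evt_pending', NULL)")

deleted = purge_published(conn, retention_days=7)
remaining = [r[0] for r in conn.execute("SELECT id FROM outbox ORDER BY id")]
print(deleted, remaining)
```

Unpublished rows are never touched, whatever their age, so a stalled relay cannot cause the cleanup job to drop events that were never sent.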
The transactional outbox is not glamorous — a table, an insert, and a loop. But it converts "we hope the publish worked" into a guarantee bounded only by consumer idempotency, and that is usually the single biggest reliability upgrade available to an event-driven system.