4 min read · Rishi

The Transactional Outbox Pattern: Publishing Events Without Losing Them

Every event-driven system eventually writes this code: save the order to the database, then publish OrderCreated to the message broker. Two operations, two systems, no shared transaction. If the process dies between them, the order exists but the event never happened — and every downstream consumer's view of the world is now quietly wrong. This is the dual-write problem, and the transactional outbox is the standard cure.

The dual write fails in both directions

You cannot order the two writes safely. Publish after commit, and a crash between them loses the event. Publish before commit, and a rollback means consumers received an event for an order that does not exist. Wrapping both in a distributed transaction is theoretically possible and practically miserable — most brokers do not participate in two-phase commit, and the ones that do make you pay for it in throughput and operational pain.

BEGIN;  INSERT order;  COMMIT;      → crash →   event lost
PUBLISH event;  BEGIN; INSERT; ROLLBACK;        → ghost event

The insight behind the outbox: you already have one system that does transactions well. Use it for both writes.

Write the event where you write the data

The outbox is a table in the same database, written in the same transaction. Alongside the order insert, you insert a row describing the event. Either both commit or neither does. The business data and the intent to publish can no longer disagree.

BEGIN;
  INSERT INTO orders (id, customer_id, total)
  VALUES ('ord_81f2', 'cus_note', 4900);

  INSERT INTO outbox (id, aggregate_id, event_type, payload, created_at)
  VALUES (
    'evt_3c9a',
    'ord_81f2',
    'OrderCreated',
    '{"orderId": "ord_81f2", "total": 4900}',
    now()
  );
COMMIT;
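The same idea from application code, as a minimal sketch: it uses Python's built-in sqlite3 module in place of Postgres so it runs anywhere, and the function names (create_schema, place_order) are made up for illustration. The nullable published_at column is an assumption about how the relay tracks progress.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema mirroring the columns used in the article,
    # plus a nullable published_at that the relay fills in later.
    conn.executescript(
        """
        CREATE TABLE orders (
            id TEXT PRIMARY KEY,
            customer_id TEXT NOT NULL,
            total INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id TEXT PRIMARY KEY,
            aggregate_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            published_at TEXT
        );
        """
    )


def place_order(conn, order_id, customer_id, total, event_id):
    payload = json.dumps({"orderId": order_id, "total": total})
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the order row and the outbox row land together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, customer_id, total) VALUES (?, ?, ?)",
            (order_id, customer_id, total),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (event_id, order_id, "OrderCreated", payload),
        )
```

If the second insert fails for any reason, the first one is rolled back with it, which is the whole point: there is no state in which the order exists without its event.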

A separate relay moves rows from the outbox to the broker. The relay reads unpublished rows, publishes them, and marks them done, for example by setting a published_at timestamp that starts out NULL. If it crashes after publishing but before marking, it will publish again on restart. The pattern is at-least-once by construction, which is the correct default for event delivery.

Two relay styles: polling and log tailing

Polling is the simple version. A background worker selects a batch of unpublished rows on an interval, publishes, marks them sent. It is easy to build, easy to reason about, and fine for most systems. Lock the batch (FOR UPDATE SKIP LOCKED in Postgres) so multiple relay instances do not fight over rows. Row locks are released when the transaction ends, so the select and the update that marks rows sent must run in the same transaction.

SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;
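One pass of a polling relay might look like the sketch below. It is an illustration under stated assumptions, not a production relay: it runs against SQLite through Python's sqlite3 module, which has no SKIP LOCKED, so it assumes a single relay instance. The publish callable is a placeholder for whatever broker client you use, and relay_once is a made-up name.

```python
import sqlite3
from typing import Callable


def relay_once(
    conn: sqlite3.Connection,
    publish: Callable[[str, str, str], None],
    batch_size: int = 100,
) -> int:
    """Publish one batch of unpublished outbox rows; return how many were sent."""
    rows = conn.execute(
        "SELECT id, aggregate_id, payload FROM outbox "
        "WHERE published_at IS NULL "
        "ORDER BY created_at, id "
        "LIMIT ?",
        (batch_size,),
    ).fetchall()

    sent = 0
    for event_id, aggregate_id, payload in rows:
        # If publish raises, the row keeps published_at NULL and is retried
        # on the next pass. A crash after publish but before the UPDATE
        # produces a duplicate: this is the at-least-once behaviour.
        publish(event_id, aggregate_id, payload)
        with conn:
            conn.execute(
                "UPDATE outbox SET published_at = CURRENT_TIMESTAMP WHERE id = ?",
                (event_id,),
            )
        sent += 1
    return sent
```

A worker would call relay_once in a loop with a sleep between passes. Marking rows one at a time keeps the duplicate window small; marking the whole batch in one update is cheaper but re-sends the full batch after a crash.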

Change data capture is the low-latency version. Tools like Debezium tail the database's replication log and turn outbox inserts into broker messages without polling. You get lower latency and shed the extra query load, at the cost of running and operating a CDC pipeline. Choose it when event latency genuinely matters, not because the architecture diagram looks better.

Concern                | Polling relay        | CDC relay
-----------------------|----------------------|---------------------
Latency                | Poll interval        | Near real time
Operational complexity | A worker and a query | A CDC platform
Database load          | Periodic reads       | Log tailing, minimal
Ordering control       | Explicit in query    | Log order

The details that decide whether it works in production

Consumers must be idempotent, because duplicates are guaranteed. At-least-once delivery means the same event will occasionally arrive twice. Put a unique event ID in every outbox row and have consumers record processed IDs. This is the same discipline as idempotency keys on an API — the outbox solves the producer side, not the consumer side.
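A consumer-side sketch of that discipline, assuming the consumer has its own database with a processed_events table keyed by event ID (the table name and the handle_once helper are hypothetical). It again uses sqlite3 so it runs without setup.

```python
import sqlite3
from typing import Callable


def handle_once(
    conn: sqlite3.Connection,
    event_id: str,
    apply: Callable[[sqlite3.Connection], None],
) -> bool:
    """Apply an event's side effects at most once; return True if applied."""
    with conn:
        # The primary key on event_id turns a duplicate delivery into a no-op.
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # already processed, skip the side effects
        # Same transaction as the marker: if apply raises, the marker is
        # rolled back too, so a redelivery gets another attempt.
        apply(conn)
        return True
```

The important design choice is that the marker and the side effects commit together. Recording the ID in one transaction and doing the work in another reintroduces the dual write on the consumer side.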

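Ordering only exists per aggregate. Publishing in created_at order across the whole table is fragile under concurrency: in Postgres, now() is the transaction's start time, so a transaction that started earlier can commit later, and its row becomes visible after the relay has already published rows with later timestamps. What consumers usually need is order per entity: all events for order ord_81f2 in sequence. Use the aggregate ID as the partition key on the broker so per-entity ordering survives fan-out, and do not promise global ordering you cannot keep.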
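To see why a partition key preserves per-entity order, here is a small illustration. Real brokers do this hashing themselves when you pass a message key, and they use their own hash function, so partition_for and route below are stand-ins written for this example rather than any broker's API.

```python
import hashlib
from collections import defaultdict


def partition_for(aggregate_id: str, num_partitions: int) -> int:
    # A stable hash of the key: the same aggregate always maps to the
    # same partition. (Illustrative; not any particular broker's hash.)
    digest = hashlib.sha256(aggregate_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(events, num_partitions):
    """Group (aggregate_id, payload) pairs by partition, preserving arrival order."""
    partitions = defaultdict(list)
    for aggregate_id, payload in events:
        partitions[partition_for(aggregate_id, num_partitions)].append(
            (aggregate_id, payload)
        )
    return dict(partitions)
```

Because every event for one aggregate lands in the same partition, and a partition is consumed in order, the events for ord_81f2 stay in sequence even while other orders are processed in parallel on other partitions.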

The outbox grows forever unless you clean it. Delete or archive published rows on a schedule, and keep enough history to replay recent events when a consumer needs recovery. A common split: hot rows in the outbox for days, archived events in cheap storage for as long as audit requires.
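A cleanup job can be as small as the sketch below. It only deletes; an archive step would copy the rows somewhere cheaper first. The purge_published name, the retention window, and the batch size are all assumptions for illustration, and the date arithmetic uses SQLite's datetime function, so it would need adapting for Postgres.

```python
import sqlite3


def purge_published(
    conn: sqlite3.Connection,
    retention_days: int,
    batch_size: int = 1000,
) -> int:
    """Delete up to batch_size published rows older than the retention window."""
    with conn:
        cur = conn.execute(
            """
            DELETE FROM outbox
            WHERE id IN (
                SELECT id FROM outbox
                WHERE published_at IS NOT NULL
                  AND published_at < datetime('now', ?)
                LIMIT ?
            )
            """,
            (f"-{retention_days} days", batch_size),
        )
    # Unpublished rows are never touched, however old they are.
    return cur.rowcount
```

Deleting in bounded batches keeps each transaction short, so the cleanup does not hold locks long enough to get in the relay's way. Run it repeatedly until it returns 0.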

The transactional outbox is not glamorous — a table, an insert, and a loop. But it converts "we hope the publish worked" into a guarantee bounded only by consumer idempotency, and that is usually the single biggest reliability upgrade available to an event-driven system.
