June 14, 2026 · 6 min read · Rishi

The Saga Pattern: Transactions That Span Services You Can't Roll Back

In a monolith, "place an order" is one database transaction. Reserve inventory, charge the card, create the shipment — all inside BEGIN/COMMIT. If anything fails, the database rolls everything back and the world is clean. You probably didn't appreciate how much that single transaction was doing for you until you split the monolith into services.

Now inventory, payments, and shipping each own their own database. There is no shared transaction. You cannot ROLLBACK a charge that already settled on a payment processor, and you cannot hold a database lock open across three network calls for two seconds while you wait for them. The classic answer — two-phase commit across services — is slow, fragile, and couples your availability to your least-available participant. The practical answer is the saga.

What a saga actually is

A saga is a sequence of local transactions, one per service. Each step commits in its own database and then triggers the next step. The catch: there is no global rollback. So instead, every step that has a side effect gets a paired compensating transaction — an explicit action that semantically undoes it.

You don't roll back. You roll forward with a correction.

Step 1: Reserve inventory     ⟷  Compensation: Release inventory
Step 2: Charge payment        ⟷  Compensation: Refund payment
Step 3: Create shipment       ⟷  Compensation: Cancel shipment

If step 3 fails, the saga runs the compensations for steps 2 and 1 in reverse order: refund the payment, then release the inventory. The end state is consistent; it was reached by applying corrective actions rather than by rolling back.

This reframes the whole problem. You are no longer asking "how do I make these atomic?" You are asking "for each step, what is the action that semantically reverses it?" Sometimes that is clean (release a reservation). Sometimes it is messy (you already shipped the item — the compensation is a return-merchandise flow, not a delete). The messiness is real, and it is the actual work of designing a saga.

Compensation is not rollback

The single most important thing to internalize: a compensating transaction is a new business action, not a magic undo. If you charged a card and need to compensate, you issue a refund — a real, visible, auditable event. The customer may see a charge and then a refund on their statement. That is a business reality you have to design for, not paper over.

This has consequences:

Compensations can fail too. Your refund call can time out. You need to retry compensations, and they must be idempotent so retrying is safe.
Some steps aren't cleanly reversible. Sending an email can't be unsent. Order such steps last, after everything reversible has succeeded, so you rarely have to compensate around them.
Isolation is gone. Between a step and its potential compensation, other transactions can observe the intermediate state. A reserved-but-not-yet-paid order is visible. You handle this with semantic locks (a PENDING status) rather than database locks.
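
The first consequence, a compensation that can itself fail, might be handled with a small retry wrapper. This is a minimal sketch, not a definitive implementation: `TransientError`, `run_compensation`, and the toy `refund` function are hypothetical names invented for illustration, and the in-memory set stands in for whatever record a real payment service would keep to make refunds idempotent.

```python
import time

class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a timeout)."""

def run_compensation(compensate, order_id, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call an idempotent compensation until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return compensate(order_id)
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: surface it so a human can step in
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

# Toy refund (assumption): fails twice, then succeeds. Remembering which
# orders were refunded is what makes a repeated call safe.
refunded = set()
calls = {"n": 0}

def refund(order_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("payment provider timed out")
    if order_id in refunded:
        return "already-refunded"
    refunded.add(order_id)
    return "refunded"

result = run_compensation(refund, "order-42", sleep=lambda s: None)
again = run_compensation(refund, "order-42", sleep=lambda s: None)
```

Here `result` is `"refunded"` after three attempts, and the second call returns `"already-refunded"` without issuing another refund. The retry loop is only safe because the compensation is idempotent.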

Two ways to coordinate: choreography vs. orchestration

Once you have steps and compensations, something has to drive the sequence. There are two styles, and the choice shapes your whole system.

Choreography — no central coordinator. Each service listens for events and reacts, emitting its own events that the next service listens for.

OrderCreated  →  [Inventory] reserves, emits InventoryReserved
InventoryReserved  →  [Payment] charges, emits PaymentCompleted
PaymentCompleted  →  [Shipping] ships, emits OrderShipped

It is beautifully decoupled and has no single point of failure. But the saga's logic is smeared across every service — no one place tells you what the flow is. With more than a few steps, reasoning about "what happens if step 4 fails after step 2 already compensated" becomes genuinely hard, and debugging means reconstructing the flow from event logs across five services.
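
To make the smearing concrete, here is a rough in-process sketch of the choreographed flow. The `EventBus` class is a stand-in for a real message broker, and the `PaymentFailed` and `InventoryReleased` events and the `card_declined` flag are assumptions added for illustration; they do not appear in the flow above.

```python
from collections import defaultdict

class EventBus:
    """Toy synchronous bus; a real system would use a message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # every event published, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, order):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(order)

bus = EventBus()

# Each "service" knows only the event it reacts to and the event it emits.
def reserve_inventory(order):
    bus.publish("InventoryReserved", order)

def charge_payment(order):
    if order.get("card_declined"):
        bus.publish("PaymentFailed", order)  # hypothetical failure event
    else:
        bus.publish("PaymentCompleted", order)

def create_shipment(order):
    bus.publish("OrderShipped", order)

def release_inventory(order):
    bus.publish("InventoryReleased", order)  # compensation for the reservation

bus.subscribe("OrderCreated", reserve_inventory)
bus.subscribe("InventoryReserved", charge_payment)
bus.subscribe("PaymentCompleted", create_shipment)
bus.subscribe("PaymentFailed", release_inventory)

bus.publish("OrderCreated", {"id": 1})
happy_path = list(bus.log)

bus.log.clear()
bus.publish("OrderCreated", {"id": 2, "card_declined": True})
failure_path = list(bus.log)
```

Notice that the only place the full sequence is visible is the list of `subscribe` calls, which in a real deployment would be scattered across separate codebases.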

Orchestration — a central coordinator (the orchestrator) explicitly calls each service and decides what comes next, including which compensations to run on failure.

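class StepFailed(Exception): pass
class SagaAborted(Exception): pass

class OrderSaga:
    def __init__(self, inventory, payment, shipping):
        # Service clients are injected; each exposes a step and its compensation.
        self.inventory, self.payment, self.shipping = inventory, payment, shipping

    def execute(self, order):
        completed = []  # compensations for steps that have already committed
        try:
            self.inventory.reserve(order);  completed.append(self.inventory.release)
            self.payment.charge(order);     completed.append(self.payment.refund)
            self.shipping.create(order);    completed.append(self.shipping.cancel)
        except StepFailed as exc:
            # Undo committed steps in reverse order.
            for compensate in reversed(completed):
                compensate(order)           # must be idempotent; retry on failure
            raise SagaAborted(order.id) from exc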

The flow lives in one readable place. You can see the whole transaction, log its progress, and reason about failure. The cost is a component that knows about every service — more coupling, and a coordinator you must make highly available and crash-recoverable (which usually means persisting saga state in a database so it can resume after a restart).
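
One way to get crash recovery is to record the index of the next step after every successful step. The sketch below uses Python's built-in `sqlite3` module; the `saga_state` table, the `STEPS` list, and the `run_saga` helper are all hypothetical names chosen for illustration, and the in-memory database stands in for a durable one.

```python
import sqlite3

STEPS = ["reserve_inventory", "charge_payment", "create_shipment"]

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS saga_state ("
        "saga_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL)"
    )
    return db

def run_saga(db, saga_id, actions):
    """Run the remaining steps, saving progress after each one."""
    row = db.execute(
        "SELECT next_step FROM saga_state WHERE saga_id = ?", (saga_id,)
    ).fetchone()
    start = row[0] if row else 0  # resume where the last run stopped
    for index in range(start, len(STEPS)):
        # If we crash after the action but before the commit below, this step
        # runs again on resume, which is why steps must be idempotent.
        actions[STEPS[index]](saga_id)
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT OR REPLACE INTO saga_state (saga_id, next_step) VALUES (?, ?)",
                (saga_id, index + 1),
            )

# Simulate a crash during the payment step, then a restart.
calls = []
crashed = {"once": False}

def charge(saga_id):
    if not crashed["once"]:
        crashed["once"] = True
        raise RuntimeError("orchestrator crashed mid-saga")
    calls.append("charge_payment")

actions = {
    "reserve_inventory": lambda saga_id: calls.append("reserve_inventory"),
    "charge_payment": charge,
    "create_shipment": lambda saga_id: calls.append("create_shipment"),
}

db = open_store()
try:
    run_saga(db, "saga-1", actions)
except RuntimeError:
    pass  # the process "restarts" here
run_saga(db, "saga-1", actions)  # resumes at charge_payment, not at step 1
```

After the second call, `calls` contains each step exactly once: the restarted run skipped the reservation because its completion had already been recorded.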

My default is orchestration for anything with three or more steps. The decoupling of choreography sounds appealing until you are on call at 2 a.m. trying to figure out why an order is stuck in limbo, with no single system that knows the answer. An explicit, persisted, observable orchestrator is worth its weight in gold when things go wrong, and in distributed systems, things go wrong.

The hard parts that aren't optional

Sagas give up the isolation and all-or-nothing rollback of ACID transactions in exchange for availability, and the bill comes due in these details:

Idempotency everywhere. Every step and every compensation will be retried (networks fail, coordinators restart). Each must be safe to run twice. Use a saga/request ID and dedup on it. This is non-negotiable, not a nice-to-have.
Persist saga state. An in-memory orchestrator that crashes mid-saga leaves the system in an unknown state. Persist each transition so a restarted orchestrator can pick up exactly where it left off.
Eventual consistency is the deal. During a saga, the system is observably inconsistent — inventory reserved, payment not yet charged. The UI and downstream consumers must tolerate "pending" states. If the business genuinely cannot tolerate any intermediate visibility, a saga is the wrong tool, and you should question whether those steps belong in separate services at all.
Timeouts and stuck sagas. A step that never responds leaves the saga hanging. Every step needs a timeout, and a hung saga needs a path to either retry or compensate — plus alerting so a human knows it's stuck.
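
As a rough illustration of the first point, deduplicating on a saga ID can be as simple as remembering the result of the first execution. `PaymentService` below is a hypothetical toy that keeps everything in memory; a real service would presumably store the dedup record in its own database and commit it in the same local transaction as the side effect.

```python
class PaymentService:
    """Toy in-memory service, for illustration only."""

    def __init__(self):
        self.processed = {}  # (saga_id, step) -> result of the first execution
        self.charges = []    # side effects that must not be duplicated

    def charge(self, saga_id, amount_cents):
        key = (saga_id, "charge")
        if key in self.processed:
            # A retry of a request we already handled: return the same answer
            # without charging again.
            return self.processed[key]
        self.charges.append(amount_cents)
        receipt = f"receipt-{len(self.charges)}"
        self.processed[key] = receipt
        return receipt

payments = PaymentService()
first = payments.charge("saga-7", 2500)
retry = payments.charge("saga-7", 2500)  # e.g. the orchestrator timed out and retried
```

Both calls return the same receipt and only one charge is recorded, so the orchestrator can retry freely without knowing whether its first request got through.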

When to use one

Reach for a saga when a single business operation must update multiple services that each own their data, and you need consistency without a distributed lock — orders, payments, bookings, provisioning workflows. That is the bread and butter of microservice transactions.

Don't reach for one if the operation lives within a single service's database; just use a local ACID transaction and enjoy the isolation you'd otherwise be giving up. And if you find yourself writing sprawling sagas with eight steps and tangled compensations across six services, treat it as a design smell: the data that changes together may want to live together. Sometimes the best saga is the one you avoid by drawing your service boundaries so the transaction stays local in the first place.
