June 17, 2026 · 7 min read · Rishi

Write-Ahead Logging: The Unsung Hero of Database Durability

A database is in the middle of updating a row. The new value is partly written to disk — a few bytes of the page flushed, the rest still in memory — when the server loses power. When it boots back up, that page on disk is now a torn, half-old half-new mess. The row is corrupt. In a naive design, you've just lost data and possibly broken the file.

Yet real databases survive this exact scenario thousands of times a day. Postgres, SQLite, MySQL's InnoDB, and essentially every serious storage engine come back from a hard crash with every committed transaction intact and every uncommitted one cleanly discarded. The mechanism that makes this possible is write-ahead logging, and the rule at its heart is almost insultingly simple.

The rule

Before you modify the actual data on disk, first write a record describing the change to an append-only log — and make sure that log record is durably on disk.

That's it. Log the intent to change, durably, before touching the real data. The log is sequential and append-only; the data files are random-access and updated lazily. The ordering is the whole game: the log entry that says "row 42 changed from A to B" hits stable storage before the database is allowed to consider that change real.

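Why does this fix the torn-page disaster? Because now the durable source of truth for a recent change is the log, not the half-written data page. After a crash, the database doesn't trust the data files alone; it replays the log to reconstruct a correct state. (A physically torn page still needs an intact image to replay onto, which is why engines pair the log with extras such as Postgres's full-page writes or InnoDB's doublewrite buffer.)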
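To make the ordering concrete, here is a minimal sketch of the rule in Python. The TinyWAL class, the update_row helper, and the JSON record format are all hypothetical, and the "data file" is just a dict; the only point is that the log append and fsync happen before the data is touched.

```python
import json
import os
import tempfile


class TinyWAL:
    """Hypothetical append-only log; names and record format are illustrative."""

    def __init__(self, path):
        # Append mode: every write lands at the current end of the file.
        self._file = open(path, "ab")

    def append(self, record):
        line = json.dumps(record).encode("utf-8") + b"\n"
        self._file.write(line)
        self._file.flush()             # push Python's buffer to the OS
        os.fsync(self._file.fileno())  # ask the OS to push it to stable storage

    def close(self):
        self._file.close()


def update_row(wal, table, key, new_value):
    # Step 1: durably log the intent, including the old value.
    wal.append({"key": key, "old": table.get(key), "new": new_value})
    # Step 2: only now touch the "real" data (an in-memory dict here).
    table[key] = new_value


workdir = tempfile.mkdtemp()
wal_path = os.path.join(workdir, "wal.log")
wal = TinyWAL(wal_path)
table = {42: "A"}
update_row(wal, table, 42, "B")
wal.close()

# Read the log back to confirm the change was recorded.
with open(wal_path, "rb") as f:
    logged = [json.loads(line) for line in f]
```

If the process died between step 1 and step 2, the record in wal.log would still describe the change, which is exactly what recovery relies on.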

Recovery: redo and undo

When a database with WAL restarts after a crash, it runs a recovery procedure over the log:

Redo. Walk forward through the log and re-apply every change belonging to a transaction that committed. Even if those changes never made it from memory to the data files before the crash, the log has them, so they get re-applied. This is what makes durability real: once your COMMIT returns, the change is in the log, and the log guarantees it survives.
Undo. For any transaction that was in progress but not committed at crash time, roll its changes back. Its log records describe enough to reverse any partial effects that leaked to disk. This is what makes atomicity real: a half-done transaction leaves no trace.

Together, redo and undo turn a chaotic post-crash disk into a clean, consistent state that reflects exactly the set of committed transactions — no more, no less. This is the textbook ARIES recovery algorithm in spirit, and variants of it run inside the database you used five minutes ago.
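The two passes can be sketched in a few lines of Python. This is a deliberately simplified illustration, not ARIES itself: the tuple record shapes are made up, and it assumes no two in-flight transactions touch the same key.

```python
def recover(log, data):
    """Return the post-recovery state given log records and on-disk data.

    Hypothetical record shapes:
      ("update", txid, key, old, new)  or  ("commit", txid)
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    state = dict(data)

    # Redo: walk forward, re-applying updates from committed transactions.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, _, new = rec
            state[key] = new

    # Undo: walk backward, restoring old values for uncommitted transactions.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, key, old, _ = rec
            state[key] = old

    return state


log = [
    ("update", 1, "x", "A", "B"),
    ("update", 2, "y", "C", "D"),
    ("commit", 1),
    # crash here: transaction 2 never committed
]
# On-disk state at crash time: tx 1's change was never flushed,
# while tx 2's uncommitted change leaked to disk.
on_disk = {"x": "A", "y": "D"}
recovered = recover(log, on_disk)
```

Here recovered ends up as {"x": "B", "y": "C"}: the committed change is restored and the uncommitted one is reversed.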

Why this is also a performance win

Here's the part that surprises people: WAL isn't just a safety tax, it often makes the database faster.

Think about the alternative. Without a log, every commit would have to flush all the modified data pages to their scattered locations across the disk before returning — random I/O, slow, and you'd have to do it synchronously on the commit path. With WAL, a commit only has to do one thing synchronously: append its records to the end of the log and fsync. Sequential appends to a single file are dramatically faster than scattered random writes, even on SSDs and especially on spinning disks.

The actual data pages get updated lazily, in the background, batched up and written when convenient. They can be dirty in memory long after the transaction committed, because the log already guarantees durability. The database flushes them on its own schedule.

COMMIT path (synchronous, fast):
  append change records to WAL  →  fsync  →  return "committed"

Background (asynchronous, batched):
  flush dirty data pages to their real locations
  (a "checkpoint")

So WAL converts slow, random, synchronous-on-commit writes into fast, sequential, synchronous log appends plus lazy background data writes. You get both better durability and better throughput. That's a rare combination, and it's why the pattern is universal.
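A toy model of that split, in Python, might look like the following. The Engine class and its fields are hypothetical stand-ins (a list for the log, dicts for the page cache and data files); a real engine would fsync the log inside commit.

```python
class Engine:
    """Toy model of the commit path versus background page flushing."""

    def __init__(self):
        self.wal = []        # stands in for the durable, fsynced log
        self.cache = {}      # in-memory pages holding the latest values
        self.dirty = set()   # keys changed in memory but not yet in data files
        self.data_file = {}  # stands in for on-disk data pages

    def commit(self, changes):
        # Synchronous part: one sequential append (plus fsync in a real engine).
        self.wal.append(dict(changes))
        # Data pages are only updated in memory and marked dirty.
        for key, value in changes.items():
            self.cache[key] = value
            self.dirty.add(key)

    def flush_dirty_pages(self):
        # Background part: write dirty pages to their real locations in a batch.
        for key in sorted(self.dirty):
            self.data_file[key] = self.cache[key]
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed


engine = Engine()
engine.commit({"row42": "B"})
engine.commit({"row42": "C", "row7": "X"})
engine.commit({"row42": "D"})
on_disk_before = dict(engine.data_file)   # nothing has reached the data files yet
pages_written = engine.flush_dirty_pages()
```

Three commits produce three log appends but only two page writes, because the repeated updates to row42 are absorbed in memory before the flush. That batching is where much of the throughput gain comes from.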

Checkpoints: keeping recovery bounded

If the log just grew forever, two problems would appear: it would eventually fill the disk, and recovery would have to replay the entire history from the beginning of time. The fix is a checkpoint: periodically, the database flushes all currently-dirty data pages to disk and writes a checkpoint marker into the log.

After a checkpoint, every change before it is guaranteed to be in the data files, so recovery never needs to look further back than the last checkpoint. The log segments before it can be recycled or archived.
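As a rough illustration of how a checkpoint bounds replay, the Python sketch below scans for the last checkpoint marker and replays only what follows. The record shapes and the function name are assumptions, and every "set" record is treated as already committed to keep the example short.

```python
def replay_from_last_checkpoint(log, data_files):
    """Replay only the log records written after the last checkpoint marker.

    Hypothetical record shapes: ("set", key, value) or ("checkpoint",).
    """
    # Find the position just past the most recent checkpoint marker, if any.
    start = 0
    for i, rec in enumerate(log):
        if rec[0] == "checkpoint":
            start = i + 1

    state = dict(data_files)
    replayed = 0
    for rec in log[start:]:
        if rec[0] == "set":
            _, key, value = rec
            state[key] = value
            replayed += 1
    return state, replayed


log = [
    ("set", "a", 1),
    ("set", "b", 2),
    ("checkpoint",),   # everything above is already in the data files
    ("set", "a", 3),
]
data_files = {"a": 1, "b": 2}  # on-disk state as of the checkpoint
state, replayed = replay_from_last_checkpoint(log, data_files)
```

Only one record is replayed here, even though the log holds three changes. The two records before the marker are the ones that could be recycled or archived.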

This creates a tuning tension worth understanding:

Frequent checkpoints → fast recovery (less log to replay) but more constant background I/O, since you're flushing dirty pages aggressively.
Infrequent checkpoints → less steady-state I/O but slower recovery, and more log to keep around.

Most databases expose this as a knob (checkpoint_timeout, max_wal_size, and friends in Postgres). The right setting depends on how much recovery time you can tolerate versus how much steady write amplification you want.

The fsync truth nobody can escape

WAL's entire guarantee rests on one assumption: when the database calls fsync (or fdatasync) on the log, the data is actually, physically on stable storage and will survive power loss. If that's a lie, durability is a lie.

And it has been a lie, repeatedly, in real systems:

Disks and controllers with volatile write caches that acknowledge a flush before the data is truly persisted. A power cut then loses "committed" data.
Filesystem and OS bugs where fsync errors were swallowed or mishandled — the infamous "fsyncgate" that affected Postgres and others, where a failed flush could be reported as success on a later call.
Virtualized and networked storage that quietly reorders or buffers writes in ways that violate the ordering WAL depends on.

The lesson: durability is only as strong as the weakest layer beneath your fsync. If you're running a database where data loss is unacceptable, you need to know that your storage stack — disk firmware, RAID controller, filesystem, hypervisor, cloud block store — actually honors flush semantics. This is also why "we have a WAL" doesn't mean "we can't lose data" until you've verified the whole chain.
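At the application layer, the most a log writer can do is refuse to report success unless fsync itself succeeded. The sketch below shows that pattern with Python's os module; durable_append and commit are hypothetical names, and nothing here can detect hardware that acknowledges a flush it has not performed.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append payload and return only once fsync has reported success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # raises OSError if the OS reports a flush failure
    finally:
        os.close(fd)


def commit(path, payload):
    try:
        durable_append(path, payload)
    except OSError:
        # After a failed fsync the kernel may already have dropped the
        # dirty pages, so retrying and trusting a later success is unsafe.
        return False
    return True


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
ok_first = commit(log_path, b"record-1\n")
ok_second = commit(log_path, b"record-2\n")

with open(log_path, "rb") as f:
    contents = f.read()
```

Treating a failed fsync as fatal for that commit, rather than retrying, is the conservative choice. A newly created log file may also need an fsync on its parent directory on POSIX systems before its directory entry is durable; that step is omitted here.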

Where you'll meet it

Once you know the shape of WAL, you start seeing it everywhere, because the idea generalizes far beyond relational databases:

Relational engines — Postgres WAL, InnoDB redo log, SQLite's WAL mode.
Key-value and LSM stores — RocksDB and Cassandra write to a commit log before the in-memory memtable, for exactly the same reason.
Distributed consensus — Raft and Paxos implementations persist a log of operations before applying them to the state machine. Same rule, different altitude.
Filesystems — journaling filesystems (ext4's journal, NTFS) log metadata changes before applying them so a crash doesn't corrupt the directory structure.

It's the same principle at every layer: record what you're about to do, durably, before you do it, so a crash can never catch you in an unrecoverable in-between state. Write-ahead logging is one of those ideas that, once it clicks, reframes how you think about every system that has to survive being killed at the worst possible moment — which, eventually, is all of them.
