PostgreSQL Backup & Recovery

WAL (Write-Ahead Logging)

Learn how PostgreSQL WAL works, why it is required for durability, and how it supports crash recovery, replication, and point-in-time recovery (PITR).

Definition

A durability mechanism in which every change is recorded in a sequential log before it is applied to the database's data files.

How WAL Works Under the Hood

PostgreSQL writes every change as a WAL record before the corresponding data pages are flushed to the data files. This ordering is what makes crash recovery reliable and provides the durability guarantee in ACID.

A commit is considered durable only after its WAL has been safely flushed according to your durability settings (most notably synchronous_commit). Data files can be written later because WAL replay can reconstruct the final consistent state.

During restart recovery, PostgreSQL replays WAL records from the redo point of the last completed checkpoint forward until the cluster reaches a consistent state.

  • Write change records to WAL first
  • Flush WAL to durable storage on commit path
  • Replay WAL after crashes to restore consistency
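
A quick way to observe this ordering on a running cluster is to inspect the durability-related settings and the current WAL positions. A minimal psql sketch, assuming a role allowed to read these settings:

  -- Durability-related settings; the values shown depend on your deployment.
  SHOW wal_level;
  SHOW synchronous_commit;
  SHOW fsync;

  -- Current WAL positions: the insert position advances as records are
  -- generated, while the flush position only advances once WAL has been
  -- written to durable storage.
  SELECT pg_current_wal_insert_lsn() AS insert_lsn,
         pg_current_wal_flush_lsn()  AS flush_lsn;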

WAL in Backup and Recovery

WAL is mandatory for PITR. A base backup provides the starting state, while the WAL archive provides every change made after that snapshot.

If archive shipping breaks for even a short interval, the WAL chain has a gap, and recovery targets beyond that gap become unreachable in practice.

For production readiness, teams should pair WAL archiving with scheduled restore drills, not backup-job success metrics alone.

  • No WAL chain, no precise PITR
  • Base backup + WAL archives are both required
  • Restore drills validate real recoverability
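
One concrete check on the archive chain is the built-in archiver statistics view. A minimal sketch, assuming archive_mode is enabled and the monitoring role can read pg_stat_archiver:

  -- Failed archive attempts indicate a gap forming in the WAL chain.
  SELECT archived_count,
         failed_count,
         last_archived_wal,
         last_archived_time,
         last_failed_wal,
         last_failed_time
  FROM pg_stat_archiver;

A climbing failed_count, or a last_archived_time that stops advancing under write load, is an early warning that PITR targets are drifting out of reach.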

WAL and Replication

Physical streaming replication sends WAL from primary to standbys. Replica lag is directly tied to WAL transport and replay throughput.

In logical replication pipelines, slot and consumer health still determine when the WAL they require can be safely removed.

Mismanaged replication slots are a common reason for uncontrolled WAL growth and storage incidents.
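
Transport and replay lag per standby are visible directly from the primary. A minimal sketch using the standard statistics view (PostgreSQL 10 and later):

  -- Per-standby transport and replay lag as seen from the primary.
  SELECT application_name,
         state,
         sync_state,
         write_lag,
         flush_lag,
         replay_lag
  FROM pg_stat_replication;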

Why WAL Is Operationally Critical

WAL is foundational for PITR and replication. Any gap in WAL retention can break restore guarantees.

Teams should monitor archive lag, replication lag, and WAL volume growth continuously.

In high-change systems, WAL volume can grow faster than expected; capacity planning must account for peak write bursts, not averages.

WAL volume often spikes during backfills, large transactions, index builds, or deploy windows with schema/data rewrites.
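
To size for bursts rather than averages, measure how much WAL the workload actually generates over a representative window. A minimal psql sketch that samples the WAL position twice:

  -- Record the current WAL position in a psql variable.
  SELECT pg_current_wal_lsn() AS start_lsn \gset

  -- ...wait for a representative interval (for example, a deploy window)...

  -- Bytes of WAL generated since the first sample.
  SELECT pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), :'start_lsn')
         ) AS wal_generated;

On PostgreSQL 14 and later, the cumulative pg_stat_wal view (wal_bytes in particular) gives the same signal without manual sampling.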

Key WAL Monitoring Signals

  • Archived WAL success/failure rate
  • Replication replay lag by standby
  • WAL disk usage growth slope
  • WAL generation spikes during deploy windows
  • Stale replication slots and retained WAL size
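
The slot-related signals in this list can be checked with a single query against the slots catalog. A minimal sketch, run on the primary:

  -- WAL retained on disk for each replication slot; inactive slots with a
  -- large retained size are the usual culprits behind runaway pg_wal growth.
  SELECT slot_name,
         slot_type,
         active,
         pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
         ) AS retained_wal
  FROM pg_replication_slots
  ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;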

WAL Configuration Knobs Teams Should Know

WAL behavior is influenced by checkpoint cadence, retention thresholds, and archiving settings. Tuning these knobs changes both performance and recoverability characteristics.

Configuration should be validated against your recovery point and recovery time objectives (RPO/RTO) and against peak write patterns, not just average traffic; a minimal configuration sketch follows the list below.

  • Checkpoint cadence: impacts write amplification and restart time
  • WAL size bounds: affects retention pressure and burst handling
  • archive_mode + archive_command: determines off-node durability
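
A minimal sketch of how these knobs are commonly set via ALTER SYSTEM. The values are illustrative only, and the archive script path is a hypothetical placeholder, not a recommendation:

  ALTER SYSTEM SET checkpoint_timeout = '15min';         -- checkpoint cadence
  ALTER SYSTEM SET checkpoint_completion_target = 0.9;   -- spread checkpoint I/O
  ALTER SYSTEM SET max_wal_size = '4GB';                 -- soft bound between checkpoints
  ALTER SYSTEM SET archive_mode = on;                    -- requires a server restart
  -- Hypothetical archive script; %p and %f are expanded by PostgreSQL.
  ALTER SYSTEM SET archive_command = '/usr/local/bin/archive_wal.sh %p %f';

  SELECT pg_reload_conf();  -- applies reloadable settings; archive_mode still needs a restart

Lowering max_wal_size increases checkpoint frequency and write amplification, while raising it lengthens crash-recovery replay, so it should be tuned against your restart-time target.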

Common WAL Failure Modes

  • Archive command fails silently or intermittently
  • Replication slot stops advancing and WAL accumulates
  • Disk fills due to unexpected WAL spikes
  • No tested restore path from base backup through replay
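
The disk-fill failure mode in particular is easy to watch before it becomes an incident. A minimal sketch, assuming a role that is a member of pg_monitor (or a superuser):

  -- Files under pg_wal and their total on-disk size.
  SELECT count(*)                  AS wal_files,
         pg_size_pretty(sum(size)) AS wal_dir_size
  FROM pg_ls_waldir();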

Related Concepts

WAL is tightly linked to archive_mode, archive_command, and recovery targets.

Teams evaluating tooling should also compare the operational workflows of PostgreSQL backup/restore tools.

  • archive_mode and archive_command
  • Base backups and recovery targets
  • Replication slots and timeline management

Frequently Asked Questions

What is WAL in PostgreSQL?
WAL is the write-ahead log. PostgreSQL records changes there before data-file writes so crash recovery and replication can replay consistent history.

Does WAL affect performance?
Yes. WAL adds write overhead, but it is required for durability. Correct disk and checkpoint tuning typically keeps this overhead manageable.

How is WAL related to PITR?
PITR restores from a base backup and then replays WAL to a chosen recovery target. Without WAL archives, precise point-in-time restore is not possible.

Can WAL growth cause incidents?
Yes. Failed archiving, stuck replication slots, or heavy write bursts can fill storage. Monitor WAL generation and retention continuously.

Where should teams start with WAL operations?
Start with archive reliability, replay lag visibility, and restore drills. WAL is only safe when replay is regularly tested.