PostgreSQL Replication

Replication Slot

A practical guide to PostgreSQL replication slots, WAL retention safety, and the monitoring needed to avoid storage incidents.

Definition

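A replication slot is a server-side mechanism that retains the write-ahead log (WAL) still required by a standby, logical subscriber, or backup stream, so that the server does not recycle it before the consumer has received it.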

Why Replication Slots Matter

Replication slots prevent required WAL from being removed before a consumer has processed it.

They protect replication and change data capture (CDC) pipelines from gaps in history, which matters most when a consumer can temporarily lag or disconnect.

Physical vs Logical Slots

Physical slots are commonly used with standbys; logical slots are used by logical replication and CDC consumers.

Both protect continuity, but both can retain WAL indefinitely if consumers stop advancing.

  • Physical slots: standby replay continuity
  • Logical slots: change-stream continuity for subscribers/CDC
  • Both require strict monitoring and ownership
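
The two slot types are created with different built-in functions. As a minimal sketch, the helper below returns the statement and parameters for either kind. The helper name create_slot_statement, the slot names, and the %s placeholder style (as used by psycopg-like drivers) are illustrative assumptions; pg_create_physical_replication_slot, pg_create_logical_replication_slot, and the pgoutput plugin are part of PostgreSQL.

```python
from typing import Optional, Tuple

# Built-in PostgreSQL functions; the %s placeholders assume a psycopg-style driver.
PHYSICAL_SQL = "SELECT pg_create_physical_replication_slot(%s)"
LOGICAL_SQL = "SELECT pg_create_logical_replication_slot(%s, %s)"


def create_slot_statement(slot_name: str,
                          plugin: Optional[str] = None) -> Tuple[str, tuple]:
    """Return (sql, params) for creating a physical or logical slot.

    A logical slot needs an output plugin (for example the built-in
    "pgoutput"); a physical slot does not take one.
    """
    if plugin is None:
        return PHYSICAL_SQL, (slot_name,)
    return LOGICAL_SQL, (slot_name, plugin)
```

For example, create_slot_statement("standby_1") yields the physical-slot statement, while create_slot_statement("cdc_orders", "pgoutput") yields the logical one. Returning parameters separately, rather than formatting them into the SQL string, leaves quoting to the driver.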

Main Risk

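An inactive or stuck slot keeps holding WAL for as long as it exists. By default there is no upper bound, so the pg_wal directory can grow until the disk fills and the server fails. Since PostgreSQL 13, max_slot_wal_keep_size can cap retention, at the cost of invalidating slots that fall too far behind.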

  • Alert on slot lag and retained WAL size
  • Drop stale slots only after confirming the consumer is gone
  • Track replication consumer health continuously
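
One way to quantify the risk is to compare each slot's restart_lsn with the current WAL position. The sketch below assumes the query runs on a primary (PostgreSQL 10 or later function names) and that the driver returns LSN values as text. The helper names and the threshold are illustrative, and the byte difference only approximates on-disk usage because WAL is removed in whole segments.

```python
from typing import Dict, List

# Intended to run on the primary; pg_current_wal_lsn() is not usable on a standby.
SLOT_QUERY = """
SELECT slot_name, slot_type, active, restart_lsn,
       pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Approximate WAL bytes held back by a slot."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def slots_over_threshold(rows: List[Dict], threshold_bytes: int) -> List[str]:
    """Return names of slots retaining more WAL than threshold_bytes."""
    risky = []
    for row in rows:
        if row["restart_lsn"] is None:
            continue  # the slot has not reserved any WAL yet
        if retained_bytes(row["current_lsn"], row["restart_lsn"]) > threshold_bytes:
            risky.append(row["slot_name"])
    return risky
```

The same difference can be computed server-side with pg_wal_lsn_diff; doing it in the client here simply keeps the alerting logic easy to unit test.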

Operational Policy

Set ownership, alerting, and cleanup policy for slots. Replication slots are safe when actively governed, risky when ignored.

  • Define slot owner and expected consumer heartbeat
  • Alert on retained WAL and stale slot age
  • Review slots during failover, migration, and decommission events
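
A policy like this can be checked mechanically. The following is a rough sketch that assumes a hand-maintained registry mapping slot names to owning teams and a last-seen timestamp per slot recorded by your own monitoring. Both inputs and both function names are hypothetical; they are not part of PostgreSQL.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


def audit_slot_ownership(observed_slots: Iterable[str],
                         registry: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare slots found on the server with the ownership registry.

    "unowned" slots exist on the server but have no registered owner;
    "missing" slots are registered but no longer exist on the server.
    """
    observed = set(observed_slots)
    registered = set(registry)
    return {
        "unowned": sorted(observed - registered),
        "missing": sorted(registered - observed),
    }


def stale_slots(last_seen: Dict[str, datetime],
                now: datetime,
                max_idle: timedelta) -> List[str]:
    """Return slots whose consumer heartbeat is older than max_idle."""
    return sorted(name for name, seen in last_seen.items()
                  if now - seen > max_idle)
```

Running both checks during failover, migration, and decommission reviews gives a short list of slots that need a decision rather than a full manual inventory.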

Frequently Asked Questions

What problem do replication slots solve?
They keep the WAL a consumer still needs, so a lagging consumer can catch up instead of having to be rebuilt from a fresh copy.
Can replication slots cause disk growth?
Yes. If a consumer stops progressing, its slot keeps retaining WAL, and without a configured limit that can continue until storage is exhausted.
Should every replica use a slot?
Often yes for reliability, but only with monitoring and cleanup discipline.
How do I monitor replication slot health?
Track retained WAL size, slot lag (how far each slot's position trails the current WAL position), consumer activity, and stale slot age, with alerts tied to storage risk thresholds.
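
On PostgreSQL 13 and later, the pg_replication_slots view also exposes a wal_status column (reserved, extended, unreserved, or lost) that can feed alerting directly. The mapping to severities below is only one possible policy, and the function name is illustrative.

```python
from typing import Optional

# wal_status values come from pg_replication_slots (PostgreSQL 13+).
# The severity assigned to each value is an assumed policy, not a PostgreSQL rule.
SEVERITY_BY_WAL_STATUS = {
    "reserved": "ok",          # required WAL is within max_wal_size
    "extended": "warning",     # max_wal_size exceeded, WAL still retained
    "unreserved": "critical",  # required WAL may be removed at the next checkpoint
    "lost": "critical",        # required WAL is gone; the slot is unusable
}


def slot_severity(wal_status: Optional[str], active: bool) -> str:
    """Map a slot's wal_status and activity flag to an alert severity."""
    if wal_status is None:
        return "unknown"  # the slot has not reserved WAL yet
    severity = SEVERITY_BY_WAL_STATUS.get(wal_status, "unknown")
    if severity == "ok" and not active:
        return "warning"  # healthy for now, but no consumer is attached
    return severity
```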
What is the safest way to remove stale slots?
Confirm that the consumer has been decommissioned and that the slot is inactive, then drop the slot with pg_drop_replication_slot during a controlled change, with monitoring in place.
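
As a sketch of that order of checks, the helper below refuses to produce a drop statement unless the consumer is confirmed gone and the slot is inactive. The function name, the exception class, and the consumer_decommissioned flag are hypothetical; pg_drop_replication_slot is the built-in function, and PostgreSQL itself rejects the call while a connection is still using the slot.

```python
from typing import Tuple

# Built-in PostgreSQL function; the %s placeholder assumes a psycopg-style driver.
DROP_SQL = "SELECT pg_drop_replication_slot(%s)"


class SlotDropRefused(Exception):
    """Raised when a slot does not meet the preconditions for removal."""


def plan_slot_drop(slot_name: str, active: bool,
                   consumer_decommissioned: bool) -> Tuple[str, tuple]:
    """Return (sql, params) to drop a slot, or raise if it is not safe yet."""
    if not consumer_decommissioned:
        raise SlotDropRefused(
            f"{slot_name}: consumer decommissioning is not confirmed")
    if active:
        raise SlotDropRefused(
            f"{slot_name}: slot is still in use by a connection")
    return DROP_SQL, (slot_name,)
```

Here active is the value of the active column in pg_replication_slots, read shortly before the drop.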