PostgreSQL high availability¶
This chapter explains the database concepts the operator automates for you: replication, write-ahead logging, backups, point-in-time recovery, and connection pooling. No Kubernetes here — just Postgres.
Primary and standbys¶
A highly-available Postgres setup runs more than one copy of the database:
- One primary accepts writes (and reads).
- One or more standbys (also called replicas) keep a continuously-updated copy of the data. They can serve read-only queries.
If the primary fails, a standby is promoted to become the new primary. The goal is to survive the loss of a node without losing data and with minimal downtime.
flowchart LR
app["Application"] -->|writes & reads| P["Primary"]
app -.->|optional read-only| S1
P -->|stream changes| S1["Standby 1"]
P -->|stream changes| S2["Standby 2"]
The Write-Ahead Log (WAL): the heart of everything¶
Before Postgres changes a data file, it first writes a record of the change to the Write-Ahead Log (WAL). The WAL is an append-only stream of "here is what changed" records. This single mechanism powers three things we care about:
flowchart TB
txn["A transaction commits"] --> wal["WAL record written first<br/>(durable, append-only)"]
wal --> data["Data files updated later"]
wal --> repl["Streamed to standbys<br/>→ replication"]
wal --> arch["Copied to object storage<br/>→ archiving & PITR"]
wal --> crash["Replayed after a crash<br/>→ durability"]
- Crash recovery — after an unclean shutdown, Postgres replays the WAL to reach a consistent state.
- Streaming replication — standbys receive the primary's WAL stream and replay it, staying nearly in sync.
- Archiving / point-in-time recovery — if every WAL segment is also copied to durable storage, you can rebuild the database to any moment in time.
Keep this sentence in mind
A base backup is a photo of the database at one instant; the archived WAL is the film of everything that happened afterwards. Together they let you rewind to any frame.
Synchronous vs asynchronous replication¶
When the primary commits a write, how long does it wait for standbys?
- Asynchronous (default): the primary commits immediately and ships WAL to standbys "soon after". Fast, but if the primary dies before a standby caught up, the very last transactions can be lost.
- Synchronous: the primary waits until at least one standby confirms it has the WAL before telling the client "committed". No data loss on failover, at the cost of some write latency.
sequenceDiagram
participant C as Client
participant P as Primary
participant S as Standby
Note over C,S: Synchronous commit
C->>P: COMMIT
P->>S: here is the WAL
S-->>P: got it, flushed
P-->>C: committed (only now)
We will turn on synchronous replication for durability, and discuss the trade-off when we build the Cluster.
Failover vs switchover¶
Two words that sound similar but are not:
- Failover = unplanned. The primary died; the system promotes a standby to limit the damage. There may be a brief interruption.
- Switchover = planned. You deliberately move the primary role to another instance (e.g. before maintenance) with no data loss.
This distinction matters a lot in our environment, where the OS auto-upgrades will reboot nodes on a schedule — see Failover & switchover.
Backups: physical vs logical¶
- A physical backup copies the actual data files (plus the WAL). It is what you use for disaster recovery and PITR. CloudNativePG does this for you via base backups + WAL archiving to object storage.
- A logical backup (
pg_dump) exports SQL statements. Useful for migrating one database or restoring a single table, but not a business-continuity tool and not what we rely on here.
flowchart LR
subgraph src["Running database"]
base["Base backup<br/>(periodic full copy)"]
walc["WAL archive<br/>(continuous)"]
end
base --> store[("Object storage<br/>e.g. Cloudflare R2")]
walc --> store
store --> restore["Restore = base backup<br/>+ replay WAL up to target time"]
Point-in-Time Recovery (PITR) is the payoff: restore the latest base backup,
then replay archived WAL up to, say, 13:59 — one minute before someone ran the
bad DELETE.
Backups are only real once restored
An untested backup is a hope, not a backup. This is doubly true for us because of the reported R2 restore issue.
Connection pooling¶
Each PostgreSQL connection is a separate operating-system process that consumes
memory. In Kubernetes you might have dozens of application pods, each opening
its own pool — hundreds of connections, most of them idle. That wastes memory
and can exhaust max_connections.
A connection pooler (we use PgBouncer) sits in front of Postgres and multiplexes many short-lived client connections onto a small number of real database connections.
flowchart LR
a1["app pod"] --> pb
a2["app pod"] --> pb
a3["app pod"] --> pb
pb["PgBouncer<br/>(e.g. 500 clients →<br/>25 server connections)"] --> pg["PostgreSQL primary"]
Pooling has modes (session, transaction, statement) with different trade-offs; we cover them in Connecting & pooling.
Where to go deeper¶
- PostgreSQL: High Availability, Load Balancing, and Replication
- PostgreSQL: Write-Ahead Logging (WAL)
- PostgreSQL: Continuous Archiving and PITR
- PgBouncer documentation
Next: Why CloudNativePG — how an operator turns all of the above into something you declare in a few lines of YAML.