Skip to content

PostgreSQL high availability

This chapter explains the database concepts the operator automates for you: replication, write-ahead logging, backups, point-in-time recovery, and connection pooling. No Kubernetes here — just Postgres.

Primary and standbys

A highly-available Postgres setup runs more than one copy of the database:

  • One primary accepts writes (and reads).
  • One or more standbys (also called replicas) keep a continuously-updated copy of the data. They can serve read-only queries.

If the primary fails, a standby is promoted to become the new primary. The goal is to survive the loss of a node without losing data and with minimal downtime.

flowchart LR
    app["Application"] -->|writes & reads| P["Primary"]
    app -.->|optional read-only| S1
    P -->|stream changes| S1["Standby 1"]
    P -->|stream changes| S2["Standby 2"]

The Write-Ahead Log (WAL): the heart of everything

Before Postgres changes a data file, it first writes a record of the change to the Write-Ahead Log (WAL). The WAL is an append-only stream of "here is what changed" records. This single mechanism powers three things we care about:

flowchart TB
    txn["A transaction commits"] --> wal["WAL record written first<br/>(durable, append-only)"]
    wal --> data["Data files updated later"]
    wal --> repl["Streamed to standbys<br/>→ replication"]
    wal --> arch["Copied to object storage<br/>→ archiving & PITR"]
    wal --> crash["Replayed after a crash<br/>→ durability"]
  1. Crash recovery — after an unclean shutdown, Postgres replays the WAL to reach a consistent state.
  2. Streaming replication — standbys receive the primary's WAL stream and replay it, staying nearly in sync.
  3. Archiving / point-in-time recovery — if every WAL segment is also copied to durable storage, you can rebuild the database to any moment in time.

Keep this sentence in mind

A base backup is a photo of the database at one instant; the archived WAL is the film of everything that happened afterwards. Together they let you rewind to any frame.

Synchronous vs asynchronous replication

When the primary commits a write, how long does it wait for standbys?

  • Asynchronous (default): the primary commits immediately and ships WAL to standbys "soon after". Fast, but if the primary dies before a standby caught up, the very last transactions can be lost.
  • Synchronous: the primary waits until at least one standby confirms it has the WAL before telling the client "committed". No data loss on failover, at the cost of some write latency.
sequenceDiagram
    participant C as Client
    participant P as Primary
    participant S as Standby
    Note over C,S: Synchronous commit
    C->>P: COMMIT
    P->>S: here is the WAL
    S-->>P: got it, flushed
    P-->>C: committed (only now)

We will turn on synchronous replication for durability, and discuss the trade-off when we build the Cluster.

Failover vs switchover

Two words that sound similar but are not:

  • Failover = unplanned. The primary died; the system promotes a standby to limit the damage. There may be a brief interruption.
  • Switchover = planned. You deliberately move the primary role to another instance (e.g. before maintenance) with no data loss.

This distinction matters a lot in our environment, where the OS auto-upgrades will reboot nodes on a schedule — see Failover & switchover.

Backups: physical vs logical

  • A physical backup copies the actual data files (plus the WAL). It is what you use for disaster recovery and PITR. CloudNativePG does this for you via base backups + WAL archiving to object storage.
  • A logical backup (pg_dump) exports SQL statements. Useful for migrating one database or restoring a single table, but not a business-continuity tool and not what we rely on here.
flowchart LR
    subgraph src["Running database"]
        base["Base backup<br/>(periodic full copy)"]
        walc["WAL archive<br/>(continuous)"]
    end
    base --> store[("Object storage<br/>e.g. Cloudflare R2")]
    walc --> store
    store --> restore["Restore = base backup<br/>+ replay WAL up to target time"]

Point-in-Time Recovery (PITR) is the payoff: restore the latest base backup, then replay archived WAL up to, say, 13:59 — one minute before someone ran the bad DELETE.

Backups are only real once restored

An untested backup is a hope, not a backup. This is doubly true for us because of the reported R2 restore issue.

Connection pooling

Each PostgreSQL connection is a separate operating-system process that consumes memory. In Kubernetes you might have dozens of application pods, each opening its own pool — hundreds of connections, most of them idle. That wastes memory and can exhaust max_connections.

A connection pooler (we use PgBouncer) sits in front of Postgres and multiplexes many short-lived client connections onto a small number of real database connections.

flowchart LR
    a1["app pod"] --> pb
    a2["app pod"] --> pb
    a3["app pod"] --> pb
    pb["PgBouncer<br/>(e.g. 500 clients →<br/>25 server connections)"] --> pg["PostgreSQL primary"]

Pooling has modes (session, transaction, statement) with different trade-offs; we cover them in Connecting & pooling.

Where to go deeper

Next: Why CloudNativePG — how an operator turns all of the above into something you declare in a few lines of YAML.