10. Failover & switchover

Goal: understand and test what happens when the primary goes away. In this environment it is not hypothetical — Kured reboots make failovers routine, so you must know the behavior and confirm your app survives it.

The vocabulary, once more

Failover = unplanned. The primary disappeared; the operator promotes the most up-to-date standby and re-points pg-rw.
Switchover = planned. You deliberately move the primary role with no data loss (the right tool before maintenance).

What the operator does on failover

sequenceDiagram
    participant N as Primary's node
    participant O as CNPG operator
    participant S as Best standby
    participant Svc as pg-rw Service
    participant App as Application
    N--xO: primary unreachable
    O->>S: promote (most up-to-date)
    O->>Svc: re-point pg-rw → new primary
    App->>Svc: reconnect → writes resume
    N-->>O: node returns
    O->>N: rebuild old primary as a standby

With our 1-replica disposable storage, the old primary is often rebuilt from scratch (a fresh base copy from the new primary) rather than reusing its old volume. That is by design.

Step 10.1 — Find the current primary

kubectl cnpg status pg -n production | grep -i primary
# or:
kubectl get pods -n production -L cnpg.io/instanceRole -l cnpg.io/cluster=pg
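For scripting, it can help to get the primary's name as a bare string instead of grepping status output. A minimal sketch, assuming the Cluster resource reports the primary in `.status.currentPrimary` (verify this on your CNPG version); the function name `current_primary` is hypothetical, and the defaults `pg` and `production` are the names used in this guide:

```shell
# Print the name of the current primary instance of a CNPG cluster.
# Assumption: the Cluster resource reports it in .status.currentPrimary.
# Usage: current_primary [cluster] [namespace]
current_primary() {
  kubectl get clusters.postgresql.cnpg.io "${1:-pg}" -n "${2:-production}" \
    -o jsonpath='{.status.currentPrimary}'
}
```

Calling `current_primary` before and after a test gives you the two names to compare.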

Step 10.2 — Test a planned switchover (safe, no data loss)

Always prefer this for anything intentional:

kubectl cnpg promote pg <standby-pod-name> -n production

The named standby becomes primary via a clean switchover. Watch pg-rw follow:

kubectl cnpg status pg -n production
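Rather than re-running the status command by hand, you can poll until the promotion has landed. This is only a sketch: it assumes the operator updates `.status.currentPrimary` once the named standby has taken over, and `wait_for_primary` is a hypothetical helper, not part of the plugin.

```shell
# Poll until the named instance is reported as primary, or give up.
# Assumption: .status.currentPrimary reflects the promoted instance.
# Usage: wait_for_primary <instance-name> [timeout-seconds]
wait_for_primary() {
  target="$1"
  timeout="${2:-120}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    now=$(kubectl get clusters.postgresql.cnpg.io pg -n production \
      -o jsonpath='{.status.currentPrimary}' 2>/dev/null)
    if [ "$now" = "$target" ]; then
      echo "primary is now $target after ~${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "timed out waiting for $target to become primary" >&2
  return 1
}
```

Run `wait_for_primary <standby-pod-name>` right after the promote command; a non-zero exit means the switchover did not complete within the timeout.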

Step 10.3 — Test an unplanned failover (simulate a crash)

In a non-production, disposable cluster, delete the primary pod and watch the operator react:

kubectl delete pod <current-primary-pod> -n production
kubectl get pods -n production -w
kubectl cnpg status pg -n production

A standby should be promoted within seconds to tens of seconds, pg-rw should re-point, and the deleted instance should be recreated as a standby.
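To turn "should be promoted" and "should be recreated" into a pass/fail check, something like the following could work. It is a sketch under assumptions: the `cnpg.io/instanceRole` label is taken to hold either `primary` or `replica`, both function names are hypothetical, and the expected instance count is whatever your cluster is configured with.

```shell
# List "<pod> <role>" pairs for the cluster's instances.
# Assumption: the cnpg.io/instanceRole label is "primary" or "replica".
instance_roles() {
  kubectl get pods -n production -l cnpg.io/cluster=pg \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.labels.cnpg\.io/instanceRole}{"\n"}{end}'
}

# Read "<pod> <role>" lines on stdin. Succeed only if there is exactly one
# primary, it is not the old primary, and the total number of instances
# with a role matches the expected count.
# Usage: instance_roles | failover_settled <old-primary> <instance-count>
failover_settled() {
  awk -v old="$1" -v want="$2" '
    $2 == "primary" { primaries++; if ($1 == old) same = 1 }
    $2 == "replica" { replicas++ }
    END { exit !(primaries == 1 && !same && primaries + replicas == want) }
  '
}
```

Note the name of the primary before deleting it, then re-run the check every few seconds until it succeeds. The check counts instances rather than looking for the old pod name, because a rebuilt instance may come back under a different name.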

Step 10.4 — Prove the application reconnects

This is the point of the whole exercise in this environment. Start a tiny write loop, trigger a failover (Step 10.3) while it runs, and confirm the writes recover after a brief blip:

# from a debug pod or via port-forward + psql in a loop
while true; do
  psql -h pg-rw.production.svc.cluster.local -U app_user -d app \
    -c "insert into ping(ts) values (now());" || echo "retry...";
  sleep 1;
done

You should see a short burst of "retry..." during the failover, then writes resume. If your real application doesn't recover, fix its connection-retry logic now, because Kured will trigger this for real.
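If you want a number for "brief blip", a timestamped variant of the loop plus a small summary step is one way to get it. This is a sketch, not a benchmark tool: it reuses the host, user, database and `ping` table from the loop above, sets libpq's `PGCONNECT_TIMEOUT` so a dead connection attempt does not hang the probe, and the function names are hypothetical.

```shell
# Attempt one write per second and log "<epoch> ok" or "<epoch> fail".
# PGCONNECT_TIMEOUT caps each connection attempt at 3 seconds.
probe_writes() {
  while true; do
    if PGCONNECT_TIMEOUT=3 psql -h pg-rw.production.svc.cluster.local \
         -U app_user -d app \
         -c "insert into ping(ts) values (now());" >/dev/null 2>&1; then
      echo "$(date +%s) ok"
    else
      echo "$(date +%s) fail"
    fi
    sleep 1
  done
}

# Read that log on stdin and print the longest outage in seconds, measured
# from the first failed attempt to the next successful one. An outage that
# is still open when the log ends is not counted.
longest_outage() {
  awk '
    $2 == "fail" && !down { down = 1; start = $1 }
    $2 == "ok" && down    { d = $1 - start; if (d > max) max = d; down = 0 }
    END { print max + 0 }
  '
}
```

For example, run `probe_writes | tee writes.log` in one terminal, trigger the failover in another, stop the probe with Ctrl-C once writes have resumed, and then run `longest_outage < writes.log`. The result is only accurate to the probe interval plus the connection timeout.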

Make maintenance gentle later

For the production cluster, configure Kured maintenance windows so OS reboots (and the failovers they cause) happen during quiet hours, and rely on the operator's PodDisruptionBudgets to avoid draining a primary and its only in-sync standby together.

What could go wrong

Failover doesn't happen → check the operator logs and confirm that more than one instance is healthy; with no healthy standby there is nothing to promote.
Writes lost on failover → expected only if you run asynchronous replication, or synchronous replication with dataDurability: preferred; with strict synchronous mode (dataDurability: required) the promoted standby had every committed write.
App errors persist after failover → the app is caching a dead connection; it must reconnect to the Service name, not a pod IP.
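The first check in the list above can be sketched as two small helpers. Both rest on assumptions you should verify: that the Cluster status exposes the ready count as `.status.readyInstances`, and that the operator runs as the Deployment `cnpg-controller-manager` in namespace `cnpg-system` (the defaults of a stock manifest install). The function names are hypothetical.

```shell
# Succeed if more than one instance is ready, i.e. a standby exists to promote.
# Assumption: the Cluster status exposes the count as .status.readyInstances.
has_promotable_standby() {
  ready=$(kubectl get clusters.postgresql.cnpg.io pg -n production \
    -o jsonpath='{.status.readyInstances}')
  echo "ready instances: ${ready:-0}"
  [ "${ready:-0}" -gt 1 ]
}

# Show the most recent operator log lines.
# Assumption: default install, i.e. Deployment cnpg-controller-manager in
# namespace cnpg-system; adjust both if your install differs.
operator_logs() {
  kubectl logs -n cnpg-system deployment/cnpg-controller-manager --tail=100
}
```

If `has_promotable_standby` fails, look at the instances first; if it succeeds and no promotion happened, `operator_logs` is the next place to look.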

Where to go deeper

Next: Disaster recovery & PITR — the chapter that makes your "rebuild from zero" dream real (and validates R2).