Operations runbook

A quick reference for day-to-day operations. The examples use a cluster named pg in the production namespace; substitute your own cluster name and namespace.

Daily lifecycle (your create/destroy workflow)

# Morning: bring up the platform
terraform apply
export KUBECONFIG=$PWD/<clustername>_kubeconfig.yaml
kubectl get nodes

# Apply Layer 2 (until it becomes Layer 3)
kubectl apply -f longhorn-postgres.yaml
kubectl apply --server-side -f <cnpg-1.29.1.yaml>
kubectl apply -f <barman-plugin manifest>
kubectl apply -f image-catalog.yaml
kubectl create namespace production
# ... secrets, objectstore, cluster, pooler, scheduledbackup, netpol ...

# Night: tear everything down
terraform destroy
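
The morning steps can be wrapped so that Layer 2 is only applied once the nodes are actually Ready. This is a minimal sketch, not part of the original workflow: the function names and the 300s timeout are assumptions, and only the kubeconfig naming pattern comes from the steps above.

```shell
# Build the kubeconfig path used in the morning steps:
#   <dir>/<clustername>_kubeconfig.yaml
# (function name is hypothetical)
kubeconfig_path() {
  printf '%s/%s_kubeconfig.yaml\n' "$1" "$2"
}

# Block until every node reports Ready.
# The 300s default timeout is an assumption; tune it for your provider.
wait_for_nodes() {
  kubectl wait --for=condition=Ready nodes --all --timeout="${1:-300s}"
}

# Morning bring-up: stop at the first failing step.
morning_up() {
  cluster_name="$1"
  terraform apply || return 1
  KUBECONFIG=$(kubeconfig_path "$PWD" "$cluster_name")
  export KUBECONFIG
  wait_for_nodes || return 1
  kubectl get nodes
}
```

Call it as `morning_up <clustername>` from the Terraform directory, then continue with the Layer 2 manifests.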

Inspecting the cluster

kubectl cnpg status pg -n production          # the single most useful command
kubectl get cluster pg -n production
kubectl get pods -n production -o wide -l cnpg.io/cluster=pg
kubectl describe cluster pg -n production      # events at the bottom
kubectl logs -n cnpg-system deploy/cnpg-controller-manager
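
For scripts, a yes/no health answer is often easier to use than the human-readable status output. The sketch below compares the ready and desired instance counts on the Cluster resource. It assumes the `.status.readyInstances` and `.spec.instances` fields are populated on your operator version; the function names are hypothetical.

```shell
# Pure helper: succeed only when at least one instance is desired
# and every desired instance is ready. Empty input counts as 0.
all_instances_ready() {
  ready="${1:-0}"
  desired="${2:-0}"
  [ "$desired" -gt 0 ] && [ "$ready" -eq "$desired" ]
}

# Wrapper: read both counters from the Cluster resource.
# Assumes .status.readyInstances and .spec.instances exist.
check_cluster() {
  cluster="$1"
  ns="$2"
  ready=$(kubectl get cluster "$cluster" -n "$ns" \
    -o jsonpath='{.status.readyInstances}')
  desired=$(kubectl get cluster "$cluster" -n "$ns" \
    -o jsonpath='{.spec.instances}')
  all_instances_ready "$ready" "$desired"
}
```

Example: `check_cluster pg production && echo healthy || echo degraded`.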

Backups

kubectl cnpg backup pg -n production \
  --method plugin --plugin-name barman-cloud.cloudnative-pg.io   # on-demand base backup through the Barman Cloud Plugin
kubectl get backup -n production
kubectl cnpg status pg -n production | grep -i archiv   # WAL archiving health
# verify objects landed in R2:
aws s3 ls --endpoint-url https://<account-id>.r2.cloudflarestorage.com \
  s3://<bucket>/pg-cluster/ --recursive | tail
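
The same checks can be made machine-readable. This is a sketch under two assumptions: that Backup resources report a lowercase `completed` phase, and that the Cluster exposes a `ContinuousArchiving` condition (this may differ by operator version or when the plugin handles archiving). Function names are hypothetical.

```shell
# Read "name<TAB>phase" lines on stdin and print the names of
# backups whose phase is anything other than "completed".
incomplete_backups() {
  awk -F '\t' 'NF >= 1 && $2 != "completed" { print $1 }'
}

# Emit one "name<TAB>phase" line per Backup in the namespace.
list_backup_phases() {
  kubectl get backup -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Print the status ("True"/"False") of the ContinuousArchiving condition.
archiving_status() {
  kubectl get cluster "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="ContinuousArchiving")].status}'
}
```

Example: `list_backup_phases production | incomplete_backups` prints nothing when every backup has completed, and `archiving_status pg production` should print True when WAL archiving is working.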

Failover / switchover

kubectl cnpg promote pg <standby-pod> -n production   # planned switchover
kubectl delete pod <primary-pod> -n production         # simulate a crash (test only)
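
To avoid looking up the standby pod by hand, the switchover can select one by label. A minimal sketch, assuming the operator labels replicas with `cnpg.io/instanceRole=replica`; the function names are hypothetical, and it simply takes the first replica returned rather than the most up-to-date one.

```shell
# Read `kubectl get ... -o name` output on stdin and print the first
# pod name without its "pod/" prefix.
first_pod_name() {
  head -n 1 | sed 's|^pod/||'
}

# Pick one standby of the given cluster (first match, no lag check).
pick_standby() {
  kubectl get pods -n "$2" \
    -l "cnpg.io/cluster=$1,cnpg.io/instanceRole=replica" -o name |
    first_pod_name
}

# Planned switchover to the selected standby.
switchover() {
  standby=$(pick_standby "$1" "$2")
  if [ -z "$standby" ]; then
    echo "no standby found for cluster $1" >&2
    return 1
  fi
  kubectl cnpg promote "$1" "$standby" -n "$2"
}
```

Example: `switchover pg production`. Check `kubectl cnpg status pg -n production` first if replication lag matters for your choice of standby.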

Recovery

# Re-create secret + ObjectStore on the fresh cluster, then:
kubectl apply -f restore-latest.yaml
kubectl cnpg status pg-restored -n production
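
A restore can take a while, so polling is more practical than re-running the status command by hand. The sketch below is a generic retry loop plus a readiness probe. It assumes the Cluster resource exposes a `Ready` condition; the helper names, attempt count, and delay are hypothetical.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or ATTEMPTS tries have been used.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Succeed when the Cluster's Ready condition is "True".
restored_is_ready() {
  status=$(kubectl get cluster "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}
```

Example: `retry 60 10 restored_is_ready pg-restored production` polls every 10 seconds for up to 60 attempts.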

Common mistakes (and the fix)

| Symptom | Likely cause | Fix |
|---|---|---|
| Pods stuck Pending | storage not ready / wrong class | check Longhorn + longhorn-postgres SC |
| Writes hang | synchronous mode, no standby available | wait for standby, or set dataDurability: preferred |
| WAL archive failing | R2 checksum env vars / wrong endpoint | fix ObjectStore (chapter 5) |
| Backup OK but restore fails | the R2 restore issue | validate early; use Longhorn-to-R2 or AWS S3 |
| No pg-superuser secret | enableSuperuserAccess default false | set it true if you need superuser |
| NetworkPolicy ignored | CNI not enforcing | verify k3s policy controller / Cilium / Calico |
| App breaks on every failover | app connects to pod IP | connect to the Service/pooler DNS name |
| Mixed-arch error | a cax (ARM) node slipped in | all nodes must be amd64 (cx/cpx/ccx) |
| cnpg plugin not found | wrong Homebrew formula | brew install kubectl-cnpg |
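
The mixed-arch case is easy to catch before any workload is scheduled. A small sketch (function names are hypothetical) that lists every node whose reported architecture is not amd64:

```shell
# Read "node<TAB>arch" lines on stdin; print nodes that are not amd64.
non_amd64_nodes() {
  awk -F '\t' 'NF >= 2 && $2 != "amd64" { print $1 }'
}

# Emit one "node<TAB>arch" line per node, taken from nodeInfo.
list_node_arch() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'
}
```

Example: `list_node_arch | non_amd64_nodes` prints nothing when all nodes are amd64.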

Best-practice checklist

  • [ ] Operator pinned to 1.29.1+ (CVE fix).
  • [ ] Barman Cloud Plugin (not in-tree barmanObjectStore).
  • [ ] All nodes amd64, one location.
  • [ ] longhorn-postgres StorageClass with 1 replica (disposable).
  • [ ] Separate walStorage volume.
  • [ ] Synchronous replication configured; durability trade-off chosen deliberately.
  • [ ] WAL archiving verified OK (not just base backups).
  • [ ] Full backup → restore cycle validated on R2.
  • [ ] App retries on connection loss (Kured will cause failovers).
  • [ ] Secrets kept out of Git (Sealed/External Secrets) before Layer 3.
  • [ ] Versions captured and pinned for reproducibility.
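
The first checklist item can be verified from the running operator image. This is a sketch under stated assumptions: the image is referenced by a plain version tag (not a digest), `sort -V` is available (GNU coreutils or BusyBox), and the helper names are hypothetical.

```shell
# Print the tag portion of an image reference (text after the last ':').
# Does not handle digest references such as image@sha256:...
image_tag() {
  printf '%s\n' "${1##*:}"
}

# version_ge A B: succeed when version A is greater than or equal to B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print the image of the first container in the operator Deployment.
operator_image() {
  kubectl get deploy cnpg-controller-manager -n cnpg-system \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}
```

Example: `version_ge "$(image_tag "$(operator_image)")" 1.29.1 || echo "operator older than 1.29.1"`.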