Operations runbook
A quick reference for day-to-day operations. The examples use pg as the cluster
name and production as the namespace; replace them with your own.
Daily lifecycle (your create/destroy workflow)
# Morning: bring up the platform
terraform apply
export KUBECONFIG=$PWD/<clustername>_kubeconfig.yaml
kubectl get nodes
# Apply Layer 2 (until it becomes Layer 3)
kubectl apply -f longhorn-postgres.yaml
kubectl apply --server-side -f <cnpg-1.29.1.yaml>
kubectl apply -f <barman-plugin manifest>
kubectl apply -f image-catalog.yaml
kubectl create namespace production
# ... secrets, objectstore, cluster, pooler, scheduledbackup, netpol ...
# Night: tear everything down
terraform destroy
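Applying the Layer 2 manifests immediately after `terraform apply` can race against nodes and an operator that are still starting. The following is a minimal sketch of two readiness gates, assuming the operator runs as `deploy/cnpg-controller-manager` in `cnpg-system` (the same Deployment the log command in this runbook reads from). The function names and the 300-second default timeout are illustrative, not part of the original workflow.

```shell
# Hypothetical helpers; names and default timeouts are assumptions.

# Block until every node reports the Ready condition.
wait_for_nodes() {
  t="${1:-300}"
  kubectl wait --for=condition=Ready nodes --all --timeout="${t}s"
}

# Block until the CNPG operator Deployment has finished rolling out.
# Assumes the operator lives in the cnpg-system namespace.
wait_for_operator() {
  t="${1:-300}"
  kubectl rollout status deploy/cnpg-controller-manager \
    -n cnpg-system --timeout="${t}s"
}
```

One way to use them: call `wait_for_nodes` right after `kubectl get nodes`, and `wait_for_operator` after applying the operator manifest but before creating the cluster resources.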
Inspecting the cluster
kubectl cnpg status pg -n production # the single most useful command
kubectl get cluster pg -n production
kubectl get pods -n production -o wide -l cnpg.io/cluster=pg
kubectl describe cluster pg -n production # events at the bottom
kubectl logs -n cnpg-system deploy/cnpg-controller-manager
Backups
kubectl cnpg backup pg -n production # on-demand base backup
kubectl get backup -n production
kubectl cnpg status pg -n production | grep -i archiv # WAL archiving health
# verify objects landed in R2:
aws s3 ls --endpoint-url https://<account-id>.r2.cloudflarestorage.com \
s3://<bucket>/pg-cluster/ --recursive | tail
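`kubectl get backup` lists every Backup object, so a failed or still-running one is easy to miss. The sketch below filters the list down to backups that have not completed. It assumes the Backup resource exposes `.status.phase` and reports `completed` on success; confirm this with `kubectl get backup -o yaml` on your cluster. The helper name `unfinished_backups` is hypothetical.

```shell
# Hypothetical helper: print backups whose phase is not "completed".
# Assumption: Backup objects expose .status.phase with "completed" on success.
unfinished_backups() {
  ns="${1:-production}"
  kubectl get backup -n "$ns" --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase |
    awk '$2 != "completed" { print $1 " (" $2 ")" }'
}
```

Empty output means every listed backup reached the assumed success phase. It says nothing about WAL archiving, which still needs the status check above.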
Failover / switchover
kubectl cnpg promote pg <standby-pod> -n production # planned switchover
kubectl delete pod <primary-pod> -n production # simulate a crash (test only)
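After a switchover or a simulated crash, it helps to confirm that a different instance actually took over. The sketch below polls until a pod other than the old primary holds the primary role. It assumes the operator labels instance pods with `cnpg.io/instanceRole` (verify on your operator version); the helper name, the retry count, and the 2-second interval are illustrative.

```shell
# Hypothetical helper: wait until a pod other than $1 is the primary.
# Assumption: instance pods carry the cnpg.io/instanceRole label.
wait_for_new_primary() {
  old="$1"
  ns="${2:-production}"
  cluster="${3:-pg}"
  tries="${4:-60}"
  while [ "$tries" -gt 0 ]; do
    current=$(kubectl get pods -n "$ns" \
      -l "cnpg.io/cluster=${cluster},cnpg.io/instanceRole=primary" \
      -o name 2>/dev/null | sed 's|^pod/||' | head -n 1)
    if [ -n "$current" ] && [ "$current" != "$old" ]; then
      echo "$current"   # name of the newly elected primary
      return 0
    fi
    tries=$((tries - 1))
    sleep 2
  done
  echo "no new primary elected" >&2
  return 1
}
```

For example, `wait_for_new_primary <primary-pod>` run right after the `kubectl delete pod` test prints the new primary's name, or fails once the retries run out.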
Recovery
# Re-create secret + ObjectStore on the fresh cluster, then:
kubectl apply -f restore-latest.yaml
kubectl cnpg status pg-restored -n production
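A restore can take a while, and `kubectl cnpg status` only shows a snapshot. The sketch below blocks until the restored cluster becomes ready. It assumes the Cluster resource exposes a `Ready` condition and belongs to the `postgresql.cnpg.io` API group; the helper name and the 900-second default timeout are illustrative.

```shell
# Hypothetical helper: block until the restored Cluster reports Ready.
# Assumptions: a "Ready" condition on the Cluster resource, and the
# postgresql.cnpg.io API group; the timeout default is arbitrary.
wait_for_cluster() {
  name="${1:-pg-restored}"
  ns="${2:-production}"
  t="${3:-900}"
  kubectl wait --for=condition=Ready \
    "clusters.postgresql.cnpg.io/${name}" -n "$ns" --timeout="${t}s"
}
```

A non-zero exit status means the timeout expired; `kubectl describe cluster` on the restored cluster then shows the recovery events.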
Common mistakes (and the fix)
| Symptom | Likely cause | Fix |
|---|---|---|
| Pods stuck Pending | storage not ready / wrong class | check Longhorn + longhorn-postgres SC |
| Writes hang | synchronous mode, no standby available | wait for standby, or set dataDurability: preferred |
| WAL archive failing | R2 checksum env vars / wrong endpoint | fix ObjectStore (chapter 5) |
| Backup OK but restore fails | the R2 restore issue | validate early; use Longhorn-to-R2 or AWS S3 |
| No pg-superuser secret | enableSuperuserAccess default false | set it true if you need superuser |
| NetworkPolicy ignored | CNI not enforcing | verify k3s policy controller / Cilium / Calico |
| App breaks on every failover | app connects to pod IP | connect to the Service/pooler DNS name |
| Mixed-arch error | a cax (ARM) node slipped in | all nodes must be amd64 (cx/cpx/ccx) |
| cnpg plugin not found | wrong Homebrew formula | brew install kubectl-cnpg |
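The mixed-architecture mistake can be caught before any database pods are scheduled. The sketch below lists nodes whose reported architecture is not amd64, reading the standard `.status.nodeInfo.architecture` field; the helper name `non_amd64_nodes` is hypothetical.

```shell
# Hypothetical helper: print nodes whose architecture is not amd64.
non_amd64_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture |
    awk '$2 != "amd64" { print $1 " (" $2 ")" }'
}
```

Empty output means every node is amd64; any line printed names a node to replace before applying the cluster manifests.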
Best-practice checklist
- [ ] Operator pinned to 1.29.1+ (CVE fix).
- [ ] Barman Cloud Plugin (not in-tree barmanObjectStore).
- [ ] All nodes amd64, one location.
- [ ] longhorn-postgres StorageClass with 1 replica (disposable).
- [ ] Separate walStorage volume.
- [ ] Synchronous replication configured; durability trade-off chosen deliberately.
- [ ] WAL archiving verified OK (not just base backups).
- [ ] Full backup → restore cycle validated on R2.
- [ ] App retries on connection loss (Kured will cause failovers).
- [ ] Secrets kept out of Git (Sealed/External Secrets) before Layer 3.
- [ ] Versions captured and pinned for reproducibility.
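The checklist item on retrying after connection loss applies to scripts and cron jobs as well as applications. Below is a minimal sketch of a retry wrapper with exponential backoff. The name `retry`, the 1-second starting delay, and the doubling factor are illustrative choices, not something the original setup prescribes.

```shell
# Hypothetical helper: run a command up to $1 times, doubling the delay
# between attempts (1s, 2s, 4s, ...). Returns 0 on the first success,
# 1 if every attempt fails.
retry() {
  attempts="$1"
  shift
  delay=1
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$attempts" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
}
```

For example, `retry 5 psql -h <service-dns-name> -c 'SELECT 1'` keeps a maintenance script alive through a short failover, provided it targets the Service or pooler DNS name rather than a pod IP.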