Operations runbook

A quick reference for day-to-day operations. Replace `pg` and `production` with your own cluster name and namespace.

Daily lifecycle (your create/destroy workflow)

# Morning: bring up the platform
terraform apply
export KUBECONFIG=$PWD/<clustername>_kubeconfig.yaml
kubectl get nodes

# Apply Layer 2 (until it becomes Layer 3)
kubectl apply -f longhorn-postgres.yaml
kubectl apply --server-side -f <cnpg-1.29.1.yaml>
kubectl apply -f <barman-plugin manifest>
kubectl apply -f image-catalog.yaml
kubectl create namespace production
# ... secrets, objectstore, cluster, pooler, scheduledbackup, netpol ...

# Night: tear everything down
terraform destroy
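
The morning sequence runs `kubectl apply` right after `terraform apply`, so it can help to confirm that every node is Ready before applying Layer 2. Below is a minimal sketch of such a guard for a POSIX shell. The function name `wait_for_nodes` and the 300s default are illustrative assumptions, not part of the original workflow.

```shell
# Hypothetical helper: block until every node reports Ready, or fail.
# Usage: wait_for_nodes [timeout]   (a kubectl duration such as 300s)
wait_for_nodes() {
  # kubectl wait exits non-zero if any node is not Ready within the timeout.
  if kubectl wait --for=condition=Ready nodes --all --timeout="${1:-300s}"; then
    echo "nodes ready"
  else
    echo "nodes not ready within ${1:-300s}" >&2
    return 1
  fi
}
```

Calling `wait_for_nodes 300s` between `kubectl get nodes` and the first `kubectl apply` stops the sequence early if a node never joins.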

Inspecting the cluster

kubectl cnpg status pg -n production          # the single most useful command
kubectl get cluster pg -n production
kubectl get pods -n production -o wide -l cnpg.io/cluster=pg
kubectl describe cluster pg -n production      # events at the bottom
kubectl logs -n cnpg-system deploy/cnpg-controller-manager
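
For scripts, a pass/fail check is easier to consume than the human-readable status output. The sketch below compares the ready-instance count with the desired instance count from the Cluster status. It assumes the `.status.readyInstances` and `.status.instances` fields exposed by CloudNativePG; the function name `cluster_healthy` is hypothetical.

```shell
# Hypothetical helper: succeed only when all instances of a cluster are ready.
# Usage: cluster_healthy <cluster> <namespace>
cluster_healthy() {
  # Prints "<ready>/<total>"; either side may be empty if the field is unset.
  counts=$(kubectl get cluster "$1" -n "$2" \
    -o jsonpath='{.status.readyInstances}/{.status.instances}') || return 1
  ready=${counts%/*}
  total=${counts#*/}
  echo "ready instances: ${ready:-0}/${total:-0}"
  # Fail when the total is unknown or the two counts differ.
  [ -n "$total" ] && [ "${ready:-0}" = "$total" ]
}
```

For example, `cluster_healthy pg production && echo OK` can gate a later step in a script.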

Backups

kubectl cnpg backup pg -n production --method plugin --plugin-name barman-cloud.cloudnative-pg.io   # on-demand base backup via the Barman Cloud Plugin
kubectl get backup -n production
kubectl cnpg status pg -n production | grep -i archiv   # WAL archiving health
# verify objects landed in R2:
aws s3 ls --endpoint-url https://<account-id>.r2.cloudflarestorage.com \
  s3://<bucket>/pg-cluster/ --recursive | tail
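
Grepping the status output works interactively. For a scripted check, the Cluster's `ContinuousArchiving` condition can be read directly. This is a sketch that assumes the operator reports that condition type on the Cluster resource; the function name `wal_archiving_ok` is hypothetical.

```shell
# Hypothetical helper: report the ContinuousArchiving condition of a cluster.
# Usage: wal_archiving_ok <cluster> <namespace>
wal_archiving_ok() {
  status=$(kubectl get cluster "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="ContinuousArchiving")].status}') || return 1
  echo "ContinuousArchiving=${status:-unknown}"
  # Only an explicit "True" counts as healthy.
  [ "$status" = "True" ]
}
```

A non-zero exit from `wal_archiving_ok pg production` is a cue to look at the `ObjectStore` configuration before trusting any base backup.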

Failover / switchover

kubectl cnpg promote pg <standby-pod> -n production   # planned switchover
kubectl delete pod <primary-pod> -n production         # simulate a crash (test only)
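
Both commands need concrete pod names. The sketch below resolves them from the cluster itself, assuming the `.status.currentPrimary` field and the `cnpg.io/instanceRole` pod label that CloudNativePG sets. Confirm both against your operator version. The function names are hypothetical.

```shell
# Hypothetical helpers for filling in <primary-pod> and <standby-pod>.
# Usage: primary_pod <cluster> <namespace>
primary_pod() {
  kubectl get cluster "$1" -n "$2" -o jsonpath='{.status.currentPrimary}'
}

# Usage: standby_pods <cluster> <namespace>   (space-separated pod names)
standby_pods() {
  kubectl get pods -n "$2" \
    -l "cnpg.io/cluster=$1,cnpg.io/instanceRole=replica" \
    -o jsonpath='{.items[*].metadata.name}'
}
```

With these, a crash test becomes `kubectl delete pod "$(primary_pod pg production)" -n production`.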

Recovery

# Re-create secret + ObjectStore on the fresh cluster, then:
kubectl apply -f restore-latest.yaml
kubectl cnpg status pg-restored -n production
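
A restore can take a while, so polling `kubectl cnpg status` by hand gets tedious. Here is a minimal sketch that blocks until the restored cluster reports the `Ready` condition. The function name `wait_for_restore` and the 30m default timeout are assumptions for illustration.

```shell
# Hypothetical helper: wait until a restored cluster reports Ready.
# Usage: wait_for_restore <cluster> <namespace> [timeout]
wait_for_restore() {
  # Exits non-zero if the cluster is missing or not Ready within the timeout.
  kubectl wait --for=condition=Ready "cluster/$1" -n "$2" --timeout="${3:-30m}"
}
```

For example: `wait_for_restore pg-restored production 45m && kubectl cnpg status pg-restored -n production`.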

Common mistakes (and the fix)

Symptom	Likely cause	Fix
Pods stuck `Pending`	storage not ready / wrong class	check Longhorn + `longhorn-postgres` SC
Writes hang	synchronous mode, no standby available	wait for standby, or set `dataDurability: preferred`
WAL archive failing	R2 checksum env vars / wrong endpoint	fix `ObjectStore` (chapter 5)
Backup OK but restore fails	the R2 restore issue	validate early; use Longhorn-to-R2 or AWS S3
No `pg-superuser` secret	`enableSuperuserAccess` defaults to `false`	set it to `true` if you need superuser access
NetworkPolicy ignored	CNI not enforcing	verify k3s policy controller / Cilium / Calico
App breaks on every failover	app connects to pod IP	connect to the Service/pooler DNS name
Mixed-arch error	a `cax` (ARM) node slipped in	all nodes must be amd64 (`cx`/`cpx`/`ccx`)
`cnpg` plugin not found	wrong Homebrew formula	`brew install kubectl-cnpg`
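
The mixed-architecture case in the table can be checked up front. The sketch below lists any node whose standard `kubernetes.io/arch` label is not `amd64`. The function name `check_node_arch` is hypothetical.

```shell
# Hypothetical helper: fail if any node is not amd64.
# Usage: check_node_arch
check_node_arch() {
  # With -o name, an empty result prints nothing on stdout.
  offenders=$(kubectl get nodes -l 'kubernetes.io/arch!=amd64' -o name) || return 1
  if [ -n "$offenders" ]; then
    echo "non-amd64 nodes found:"
    echo "$offenders"
    return 1
  fi
  echo "all nodes are amd64"
}
```

Running it right after `terraform apply` catches a stray `cax` node before any workload is scheduled.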

Best-practice checklist

[ ] Operator pinned to 1.29.1+ (CVE fix).
[ ] Barman Cloud Plugin (not in-tree barmanObjectStore).
[ ] All nodes amd64, one location.
[ ] longhorn-postgres StorageClass with 1 replica (disposable).
[ ] Separate walStorage volume.
[ ] Synchronous replication configured; durability trade-off chosen deliberately.
[ ] WAL archiving verified OK (not just base backups).
[ ] Full backup → restore cycle validated on R2.
[ ] App retries on connection loss (Kured will cause failovers).
[ ] Secrets kept out of Git (Sealed/External Secrets) before Layer 3.
[ ] Versions captured and pinned for reproducibility.
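
For the last checklist item, one way to capture versions is to write the tool output to a file kept next to the manifests. This is only a sketch: the function name `capture_versions` and the default file name `versions.txt` are assumptions. It records versions but does not pin them.

```shell
# Hypothetical helper: record tool and plugin versions for later comparison.
# Usage: capture_versions [output-file]
capture_versions() {
  {
    terraform version
    kubectl version        # client and server versions
    kubectl cnpg version   # cnpg plugin version
  } > "${1:-versions.txt}" 2>&1
}
```

Committing the resulting file makes it easy to diff versions between one day's rebuild and the next.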