11. Disaster recovery & PITR
Goal: rebuild the database from R2 on a fresh, empty cluster, perform point-in-time recovery, and, most importantly, prove the restore actually works, given the known R2 issue. This chapter is the heart of your "bring it up from scratch from the backups" goal.
Recovery is a bootstrap, not an in-place operation
In CloudNativePG you never "restore into" an existing cluster. Instead you
create a new Cluster whose bootstrap reads from the backup. Recovery and
PITR are the same mechanism with a different target.
flowchart LR
r2[("R2: base backups + WAL")] --> ext["externalClusters:<br/>points at the ObjectStore"]
ext --> boot["new Cluster<br/>bootstrap.recovery"]
boot --> latest["restore to latest"]
boot --> pit["or PITR: stop at targetTime"]
Step 11.1: Recreate the ObjectStore (on the fresh cluster)
On a brand-new cluster you must re-create the secret and the ObjectStore
(same content as chapter 5) so the plugin can
read the existing backups in R2. The bucket already holds your data.
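As a reminder of the shape of those two objects, here is a minimal sketch. It is not a copy of chapter 5: the secret name, its key names, the bucket, and the account ID are placeholders assumed for illustration, and only the ObjectStore name pg-r2-store and the production namespace come from this chapter. The destinationPath must be identical to the one the original cluster wrote to, otherwise the plugin will not find the existing backups.

```yaml
# Sketch only: secret name, key names, bucket, and account ID are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: r2-credentials            # hypothetical name
  namespace: production
type: Opaque
stringData:
  ACCESS_KEY_ID: "<r2-access-key-id>"
  ACCESS_SECRET_KEY: "<r2-secret-access-key>"
---
apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: pg-r2-store               # must match barmanObjectName in the Cluster
  namespace: production
spec:
  configuration:
    # Same bucket and path the original cluster archived to.
    destinationPath: s3://<bucket-name>/
    endpointURL: https://<account-id>.r2.cloudflarestorage.com
    s3Credentials:
      accessKeyId:
        name: r2-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: r2-credentials
        key: ACCESS_SECRET_KEY
```

If chapter 5 set extra options on the ObjectStore (compression, retention, sidecar environment variables), carry them over unchanged.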
Step 11.2: A recovery Cluster (restore to latest)
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-restored
  namespace: production
spec:
  instances: 3
  imageCatalogRef:
    apiGroup: postgresql.cnpg.io
    kind: ClusterImageCatalog
    name: postgresql
    major: 18
  storage:
    storageClass: longhorn-postgres
    size: 20Gi
  walStorage:
    storageClass: longhorn-postgres
    size: 10Gi
  bootstrap:
    recovery: # (1)!
      source: pg-origin
  externalClusters: # (2)!
    - name: pg-origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: pg-r2-store # (3)!
          serverName: pg # (4)!
  plugins: # (5)!
    - name: barman-cloud.cloudnative-pg.io
      isWALArchiver: true
      parameters:
        barmanObjectName: pg-r2-store
1. bootstrap.recovery (not initdb): start by restoring, not by creating an empty database.
2. externalClusters describes where the backups live: the same plugin + ObjectStore, naming the original server.
3. Reuse the ObjectStore you re-created in Step 11.1.
4. serverName must match the original cluster's name/prefix in the bucket (here pg).
5. Wire the plugin again so the restored cluster also archives going forward.
kubectl apply -f restore-latest.yaml
kubectl get cluster pg-restored -n production -w
kubectl cnpg status pg-restored -n production
Step 11.3: Point-in-time recovery
To rewind to a moment (e.g. just before a bad DELETE), add a recovery target:
bootstrap:
  recovery:
    source: pg-origin
    recoveryTarget:
      targetTime: "2026-06-15 13:59:00" # restore up to this instant
The operator restores the latest base backup before that time and replays WAL up to it.
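A timestamp without a time zone leaves room for doubt about which instant you mean. CloudNativePG documents targetTime as an RFC 3339 timestamp, so a variant of the snippet above with an explicit offset may be safer. The Z (UTC) suffix here is an assumption; replace it with the offset of whatever clock you read the time from.

```yaml
bootstrap:
  recovery:
    source: pg-origin
    recoveryTarget:
      # Same instant as above, assuming it was noted in UTC.
      targetTime: "2026-06-15T13:59:00Z"
```

When you note the time of an incident, record it with its time zone so the target can be written unambiguously later.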
Step 11.4: VALIDATE (do not skip this)
This is the most important step in the entire guide for your environment.
The R2 restore caveat
There is a reported case where the Barman Cloud Plugin uploads to Cloudflare R2 fine but fails to restore. Until you have personally seen a full backup → restore cycle succeed on your R2, treat backups as unproven.
A concrete validation drill (do it on day one, not in a crisis):
- Insert a known marker row into the running database and note the time.
- Trigger an on-demand backup; confirm WAL archiving is OK.
- Apply restore-latest.yaml to a new cluster name.
- Connect to the restored cluster and confirm the marker row is present.
- Optionally repeat with a targetTime before the marker and confirm it is absent, which proves PITR, not just restore.
If any of this fails on R2, go to the fallbacks below.
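The on-demand backup in the drill can be requested with a Backup resource that uses the plugin method. This is a minimal sketch: the Backup name is hypothetical, and the cluster name pg is an assumption inferred from the serverName used in Step 11.2, so adjust both to your setup.

```yaml
# Sketch only: metadata.name is hypothetical; spec.cluster.name is assumed to be "pg".
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: validation-drill
  namespace: production
spec:
  cluster:
    name: pg                      # the running (source) cluster
  method: plugin                  # delegate the backup to the Barman Cloud Plugin
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
```

Apply it with kubectl apply -f, wait for the Backup to report that it completed, and only then start the restore part of the drill.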
Fallbacks if R2 restore proves unreliable
Because the Hetzner CSI has no snapshots, your fallbacks are:
- Longhorn backup to R2. Longhorn can back up its volumes to S3-compatible storage (R2) and restore them on a fresh cluster, giving you an independent, volume-level safety net. Configure a Longhorn backup target and a recurring backup job for the longhorn-postgres volumes.
- Switch the Barman target to a known-good S3. AWS S3 works cleanly; Backblaze B2 works with the documented checksum + region env workaround. You only change the ObjectStore endpoint/credentials; the rest of the guide is unchanged.
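For Fallback A, the recurring backup job might look like the sketch below. The job name, group name, schedule, and retention count are all assumptions chosen for illustration. It also assumes Longhorn runs in the longhorn-system namespace and that a backup target pointing at R2 has already been configured in Longhorn's settings, which this fragment does not cover.

```yaml
# Sketch only: name, group, cron, and retain are illustrative values.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: postgres-backup           # hypothetical name
  namespace: longhorn-system      # assumes the default Longhorn namespace
spec:
  task: backup                    # back up to the configured backup target
  cron: "0 3 * * *"               # daily at 03:00
  groups:
    - postgres                    # hypothetical group; volumes must be assigned to it
  retain: 7                       # keep the last 7 backups per volume
  concurrency: 1                  # back up one volume at a time
```

The job only affects volumes that belong to the postgres group, so the longhorn-postgres volumes still need to be assigned to that group. As with the Barman path, treat this fallback as unproven until you have restored a volume from it once.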
flowchart TB
primary["Primary path:<br/>Barman → R2 (validate!)"]
fb1["Fallback A:<br/>Longhorn volume backup → R2"]
fb2["Fallback B:<br/>Barman → AWS S3 / B2"]
primary -->|if restore fails| fb1
primary -->|or| fb2
Why this unlocks "from zero in minutes"
Once restore is proven, your full-IaC flow becomes: terraform apply builds the
platform, the operator + plugin deploy, and a bootstrap.recovery Cluster pulls
your data from R2 — a working database in minutes, from nothing. That is
Layer 3.
Where to go deeper
Next: Monitoring.