
11. Disaster recovery & PITR

Goal: rebuild the database from R2 on a fresh, empty cluster, perform point-in-time recovery (PITR), and, most importantly, prove that the restore actually works, given the known R2 issue. This chapter is the heart of your "bring it up from scratch with the backups" goal.

Recovery is a bootstrap, not an in-place operation

In CloudNativePG you never "restore into" an existing cluster. Instead you create a new Cluster whose bootstrap reads from the backup. Recovery and PITR are the same mechanism with a different target.

flowchart LR
    r2[("R2: base backups + WAL")] --> ext["externalClusters:<br/>points at the ObjectStore"]
    ext --> boot["new Cluster<br/>bootstrap.recovery"]
    boot --> latest["restore to latest"]
    boot --> pit["or PITR: stop at targetTime"]

Step 11.1 — Recreate the ObjectStore (on the fresh cluster)

On a brand-new cluster you must re-create the secret and the ObjectStore (same content as chapter 5) so the plugin can read the existing backups in R2. The bucket already holds your data.
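
As a reminder of the shape of those two objects, here is a minimal sketch. It is not a copy of chapter 5: the secret name, its key names, the bucket, and the account ID are placeholders assumed for illustration, so substitute the values you actually used. Only the ObjectStore name (pg-r2-store) and the namespace come from this chapter.

```yaml
# Sketch only: secret name, key names, bucket and account ID are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: r2-credentials                # hypothetical name
  namespace: production
type: Opaque
stringData:
  ACCESS_KEY_ID: "<r2-access-key-id>"
  SECRET_ACCESS_KEY: "<r2-secret-access-key>"
---
apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: pg-r2-store                   # must match barmanObjectName in the Cluster
  namespace: production
spec:
  configuration:
    # Must be the same bucket and path the original cluster wrote to.
    destinationPath: "s3://<bucket-name>/"
    endpointURL: "https://<account-id>.r2.cloudflarestorage.com"
    s3Credentials:
      accessKeyId:
        name: r2-credentials
        key: ACCESS_KEY_ID
      secretAccessKey:
        name: r2-credentials
        key: SECRET_ACCESS_KEY
```

The destinationPath has to point at the existing bucket contents. A different path gives the plugin an empty archive and nothing to restore from.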

Step 11.2 — A recovery Cluster (restore to latest)

restore-latest.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-restored
  namespace: production
spec:
  instances: 3
  imageCatalogRef:
    apiGroup: postgresql.cnpg.io
    kind: ClusterImageCatalog
    name: postgresql
    major: 18
  storage:
    storageClass: longhorn-postgres
    size: 20Gi
  walStorage:
    storageClass: longhorn-postgres
    size: 10Gi

  bootstrap:
    recovery:                         # (1)!
      source: pg-origin

  externalClusters:                   # (2)!
    - name: pg-origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: pg-r2-store   # (3)!
          serverName: pg                  # (4)!

  plugins:                            # (5)!
    - name: barman-cloud.cloudnative-pg.io
      isWALArchiver: true
      parameters:
        barmanObjectName: pg-r2-store

  1. bootstrap.recovery (not initdb) → start by restoring, not by creating an empty database.
  2. externalClusters describes where the backups live: the same plugin + ObjectStore, naming the original server.
  3. Reuse the ObjectStore you re-created in Step 11.1.
  4. serverName must match the original cluster's name/prefix in the bucket (here pg).
  5. Wire the plugin again so the restored cluster also archives going forward.

kubectl apply -f restore-latest.yaml
kubectl get cluster pg-restored -n production -w
kubectl cnpg status pg-restored -n production

Step 11.3 — Point-in-time recovery

To rewind to a moment (e.g. just before a bad DELETE), add a recovery target:

  bootstrap:
    recovery:
      source: pg-origin
      recoveryTarget:
        targetTime: "2026-06-15 13:59:00"   # restore up to this instant

The operator restores the latest base backup before that time and replays WAL up to it.
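
Put together, a PITR cluster is the restore-latest manifest with a new name and a recoveryTarget. The sketch below assumes a hypothetical cluster name (pg-pitr) and assumes the timestamp is meant as UTC. The explicit +00 offset is an addition here, included so the target does not depend on how a bare timestamp is interpreted.

```yaml
# Sketch only: pg-pitr is a hypothetical name; the +00 offset assumes UTC.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-pitr                       # must differ from any existing cluster
  namespace: production
spec:
  instances: 3
  imageCatalogRef:
    apiGroup: postgresql.cnpg.io
    kind: ClusterImageCatalog
    name: postgresql
    major: 18
  storage:
    storageClass: longhorn-postgres
    size: 20Gi
  walStorage:
    storageClass: longhorn-postgres
    size: 10Gi

  bootstrap:
    recovery:
      source: pg-origin
      recoveryTarget:
        targetTime: "2026-06-15 13:59:00+00"   # stop replaying WAL at this instant

  externalClusters:
    - name: pg-origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: pg-r2-store
          serverName: pg              # original cluster's prefix in the bucket
```

This sketch leaves out the plugins stanza, on the assumption that a PITR cluster used only for inspection does not need to archive its own WAL. Add it back, as in restore-latest.yaml, if you intend to keep the cluster.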

Step 11.4 — VALIDATE (do not skip this)

This is the most important step in the entire guide for your environment.

The R2 restore caveat

There is a reported case where the Barman Cloud Plugin uploads to Cloudflare R2 fine but fails to restore. Until you have personally seen a full backup → restore cycle succeed on your R2, treat backups as unproven.

A concrete validation drill (do it on day one, not in a crisis):

  1. Insert a known marker row into the running database and note the time.
  2. Trigger an on-demand backup; confirm WAL archiving is OK.
  3. Apply restore-latest.yaml to a new cluster name.
  4. Connect to the restored cluster and confirm the marker row is present.
  5. Optionally repeat with a targetTime before the marker and confirm it is absent — that proves PITR, not just restore.
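
For step 2, an on-demand backup through the plugin can be requested with a Backup resource. This is a minimal sketch: the backup name is hypothetical, and the cluster name pg is assumed from the serverName used in this chapter.

```yaml
# Sketch only: pg-drill-backup is a hypothetical name; cluster name pg is assumed.
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: pg-drill-backup
  namespace: production
spec:
  cluster:
    name: pg                          # the running cluster that holds the marker row
  method: plugin                      # delegate the backup to the plugin
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
```

Apply it with kubectl apply -f, then check kubectl get backup pg-drill-backup -n production until the phase reports completed before moving on to step 3.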

If any of this fails on R2, go to the fallbacks below.

Fallbacks if R2 restore proves unreliable

Because the Hetzner CSI has no snapshots, your fallbacks are:

  1. Longhorn backup to R2. Longhorn can back up its volumes to S3-compatible storage (R2) and restore them on a fresh cluster — an independent, volume-level safety net. Configure a Longhorn backup target and a recurring backup job for the longhorn-postgres volumes.
  2. Switch the Barman target to a known-good S3. AWS S3 works cleanly; Backblaze B2 works with the documented checksum + region env workaround. You only change the ObjectStore endpoint/credentials; the rest of the guide is unchanged.

flowchart TB
    primary["Primary path:<br/>Barman → R2 (validate!)"]
    fb1["Fallback A:<br/>Longhorn volume backup → R2"]
    fb2["Fallback B:<br/>Barman → AWS S3 / B2"]
    primary -->|if restore fails| fb1
    primary -->|or| fb2
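
For Fallback A, the recurring backup can be declared as a Longhorn RecurringJob. The sketch below assumes Longhorn already has a backup target pointing at R2 and that Longhorn runs in the longhorn-system namespace. The job name, group name, schedule, and retention are illustrative values, not recommendations from this guide.

```yaml
# Sketch only: name, group, schedule and retention are illustrative assumptions.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: postgres-volume-backup        # hypothetical name
  namespace: longhorn-system
spec:
  task: backup                        # upload to the configured backup target
  cron: "0 3 * * *"                   # daily at 03:00
  groups:
    - postgres                        # applies to volumes in this group
  retain: 7                           # keep the last 7 backups per volume
  concurrency: 2                      # volumes backed up in parallel
```

A volume joins the group through the label recurring-job-group.longhorn.io/postgres: enabled on the Longhorn volume. As with the Barman path, restore one of these backups on a scratch cluster before relying on it.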

Why this unlocks "from zero in minutes"

Once restore is proven, your full-IaC flow becomes: terraform apply builds the platform, the operator + plugin deploy, and a bootstrap.recovery Cluster pulls your data from R2 — a working database in minutes, from nothing. That is Layer 3.

Where to go deeper

Next: Monitoring.