12. Monitoring

Goal: expose and scrape CloudNativePG metrics so you can see replication lag, backup health, and connection saturation — the signals that tell you the cluster is healthy before something breaks.

What CNPG exposes

Each instance exposes Prometheus metrics on port 9187. Useful series:

| Metric | Tells you |
| --- | --- |
| cnpg_pg_replication_lag | How far standbys trail the primary (watch during reboots) |
| cnpg_backends_total | Connection count (is the pooler doing its job?) |
| cnpg_pg_postmaster_start_time | Restarts/failovers |
| cnpg_collector_last_* | Collector health |

flowchart LR
    pg["PostgreSQL instances<br/>:9187 /metrics"] --> pm["PodMonitor / ServiceMonitor"]
    pm --> prom["Prometheus"]
    prom --> graf["Grafana dashboards"]

Step 12.1 — Turn on the PodMonitor

If you run the Prometheus Operator (it provides the PodMonitor CRD), enable metrics scraping in the Cluster:

spec:
  monitoring:
    enablePodMonitor: true

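Apply it (edit the Cluster or re-apply cluster.yaml). CNPG then creates a PodMonitor for the cluster. Prometheus picks it up automatically as long as the PodMonitor matches the podMonitorSelector and podMonitorNamespaceSelector of your Prometheus resource.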
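For orientation, a hand-written PodMonitor that targets the same pods could look roughly like the sketch below. The cluster name pg-main is a placeholder, and the object the operator generates may carry additional fields; the cnpg.io/cluster label and the metrics port name are the ones CloudNativePG sets on instance pods.

```yaml
# Sketch of a manually managed PodMonitor (assumes the Prometheus Operator CRDs are installed)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pg-main                 # placeholder: use your Cluster name
spec:
  selector:
    matchLabels:
      cnpg.io/cluster: pg-main  # label CNPG puts on every instance pod of the cluster
  podMetricsEndpoints:
    - port: metrics             # named container port that serves :9187/metrics
```

Managing the PodMonitor yourself is mainly useful when you need custom relabeling or extra labels for your Prometheus selectors. Use either this or enablePodMonitor: true, not both, so that the pods are not scraped twice.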

No Prometheus yet?

On kube-hetzner you can add the kube-prometheus-stack Helm chart (Prometheus Operator + Grafana). Until then, you can still scrape :9187 manually or just read health via kubectl cnpg status.
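One thing to watch when you do install kube-prometheus-stack: with the chart defaults, its Prometheus only selects PodMonitors that carry the Helm release label, so the CNPG-generated PodMonitor can be silently ignored. A minimal values sketch that lifts this restriction, assuming you are comfortable letting Prometheus select every PodMonitor and ServiceMonitor it can see:

```yaml
# values.yaml fragment for the kube-prometheus-stack chart (sketch, not a full values file)
prometheus:
  prometheusSpec:
    # Do not restrict selection to monitors labelled with this Helm release
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
```

The alternative is to keep the defaults and make sure the PodMonitor carries the label your Prometheus selects on.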

Step 12.2 — Security note on the metrics exporter (important)

The metrics exporter is the component affected by the critical CVE that the pinned operator version fixes.

In 1.29.1 / 1.28.3, the metrics exporter no longer connects as the PostgreSQL superuser. It now uses a dedicated, low-privilege cnpg_metrics_exporter role with only pg_monitor rights.

Custom monitoring queries may need a GRANT

If you add custom metric queries that read your own tables, or use target_databases: '*' where PUBLIC CONNECT was revoked, you must now GRANT the needed access to cnpg_metrics_exporter explicitly — the exporter is no longer a superuser. The built-in queries are already handled.
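As an illustration, a custom query against an application table could be wired up as sketched below. The ConfigMap name app-monitoring, the database app, and the table orders are hypothetical; the cnpg_metrics_exporter role name is taken from the text above.

```yaml
# Sketch: custom metric query plus the grant it needs after the exporter lost superuser.
# Run the grant once in the target database as a privileged user, for example:
#   GRANT SELECT ON orders TO cnpg_metrics_exporter;
# (a table outside the public schema would also need USAGE on that schema)
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-monitoring            # hypothetical name
  labels:
    cnpg.io/reload: ""            # lets the operator reload the queries on change
data:
  custom-queries: |
    app_orders:
      query: "SELECT count(*) AS pending FROM orders WHERE status = 'pending'"
      target_databases: ["app"]
      metrics:
        - pending:
            usage: "GAUGE"
            description: "Number of pending orders"
---
# Cluster fragment to merge into cluster.yaml (not a complete manifest)
spec:
  monitoring:
    customQueriesConfigMap:
      - name: app-monitoring
        key: custom-queries
```

Without the grant, the query fails with a permission error and its series simply do not appear in the exporter output.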

Step 12.3 — What to alert on

Practical alerts for this environment:

  • Replication lag rising and not recovering (a standby fell behind or is stuck) — especially relevant around Kured reboots.
  • WAL archiving failing (no PITR!) — arguably your most important alert.
  • Backup age exceeding your schedule (a base backup hasn't completed).
  • Connection saturation approaching max_connections.
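The four alerts above could be expressed as a PrometheusRule along the following lines. This is a sketch under assumptions: the thresholds and "for" durations are placeholders to tune, the release label must match whatever your Prometheus ruleSelector expects, the pod label is assumed to be attached by the PodMonitor scrape, and the backup metric name assumes backups are reported by the operator's built-in collector (verify the series exists in your setup before relying on it).

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-alerts                      # hypothetical name
  labels:
    release: kube-prometheus-stack       # assumption: matches your ruleSelector
spec:
  groups:
    - name: cnpg
      rules:
        # Lag stays high for 10 minutes, i.e. it is not recovering on its own
        - alert: CNPGReplicationLagHigh
          expr: cnpg_pg_replication_lag > 300
          for: 10m
          labels:
            severity: warning
        # The most recent archive attempt failed after the last successful one
        - alert: CNPGWALArchivingFailing
          expr: (cnpg_pg_stat_archiver_last_failed_time - cnpg_pg_stat_archiver_last_archived_time) > 1
          for: 5m
          labels:
            severity: critical
        # No successful base backup within two days (placeholder for a daily schedule)
        - alert: CNPGBackupTooOld
          expr: time() - cnpg_collector_last_available_backup_timestamp > 2 * 86400
          for: 15m
          labels:
            severity: warning
        # Backends above 80% of max_connections on an instance
        - alert: CNPGConnectionsNearLimit
          expr: |
            sum by (pod) (cnpg_backends_total)
              / max by (pod) (cnpg_pg_settings_setting{name="max_connections"}) > 0.8
          for: 5m
          labels:
            severity: warning
```

The "for" clause on the lag rule is what separates a brief spike during a Kured reboot from a standby that is genuinely stuck.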

Step 12.4 — Dashboards

The community publishes a Grafana dashboard for CloudNativePG. Import it and point it at your Prometheus to get replication, backup, and connection panels out of the box.

What could go wrong

  • enablePodMonitor: true but no PodMonitor CRD → the Prometheus Operator is not installed. Install it first, or scrape :9187 with a plain Prometheus scrape job. A ServiceMonitor is not a workaround here, because it is provided by the same operator CRDs.
  • Custom query returns no data after upgrading to 1.29.1 → the exporter lost superuser; add the required GRANT to cnpg_metrics_exporter.
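For the case without the Prometheus Operator, a plain Prometheus scrape job can discover the instance pods directly. The sketch below assumes Prometheus runs inside the cluster with permission to list pods, and that instance pods carry the cnpg.io/cluster label; the job name is arbitrary.

```yaml
# prometheus.yml fragment (sketch): scrape CNPG instance pods without PodMonitor CRDs
scrape_configs:
  - job_name: cnpg
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that belong to a CloudNativePG cluster
      - source_labels: [__meta_kubernetes_pod_label_cnpg_io_cluster]
        regex: .+
        action: keep
      # Keep only the metrics port of those pods
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9187"
        action: keep
      # Carry pod and cluster names over as labels for dashboards and alerts
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_cnpg_io_cluster]
        target_label: cluster
```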

Where to go deeper

Next: Toward full IaC.