12. Monitoring

Goal: expose and scrape CloudNativePG metrics so you can see replication lag, backup health, and connection saturation. These are the signals that warn you of trouble before something breaks.

What CNPG exposes

Each instance exposes Prometheus metrics on port 9187. Useful series:

| Metric | Tells you |
| --- | --- |
| `cnpg_pg_replication_lag` | How far standbys trail the primary (watch during reboots) |
| `cnpg_backends_total` | Connection count (is the pooler doing its job?) |
| `cnpg_pg_postmaster_start_time` | Restarts/failovers |
| `cnpg_collector_last_*` | Collector health |

```mermaid
flowchart LR
    pg["PostgreSQL instances<br/>:9187 /metrics"] --> pm["PodMonitor / ServiceMonitor"]
    pm --> prom["Prometheus"]
    prom --> graf["Grafana dashboards"]
```

Step 12.1 — Turn on the PodMonitor

If you run the Prometheus Operator (it provides the PodMonitor CRD), enable metrics scraping in the Cluster:

```yaml
spec:
  monitoring:
    enablePodMonitor: true
```

Apply it (edit the Cluster or re-apply cluster.yaml). CNPG creates a PodMonitor for the cluster, and Prometheus picks it up as long as its podMonitorSelector matches that PodMonitor.
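
For orientation, a hand-written PodMonitor for a CNPG cluster looks roughly like the sketch below. It is not the operator's exact output. The cluster name `pg-main`, the namespace `database`, and the `release: kube-prometheus-stack` label are assumptions; the label only matters if your Prometheus selects PodMonitors by label.

```yaml
# Sketch of a PodMonitor for a CNPG cluster; names below are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pg-main                      # hypothetical: match your Cluster name
  namespace: database                # hypothetical: namespace of the Cluster
  labels:
    release: kube-prometheus-stack   # assumption: label your Prometheus selects on
spec:
  selector:
    matchLabels:
      cnpg.io/cluster: pg-main       # label CNPG sets on the instance pods
  podMetricsEndpoints:
    - port: metrics                  # named container port serving :9187
```

If you apply a PodMonitor like this yourself, leave enablePodMonitor off so the instances are not scraped twice.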

No Prometheus yet?

On kube-hetzner you can add the kube-prometheus-stack Helm chart (Prometheus Operator + Grafana). Until then, you can still scrape :9187 manually or just read health via kubectl cnpg status.
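
If you do install kube-prometheus-stack, a values file along these lines is one way to make its Prometheus pick up PodMonitors that do not carry the chart's release label. Treat it as a sketch: the key names are those I know from the chart's values, so check them against the chart version you install.

```yaml
# values.yaml for the kube-prometheus-stack Helm chart (sketch)
prometheus:
  prometheusSpec:
    # Select PodMonitors regardless of their labels, instead of only
    # those labelled with the Helm release name.
    podMonitorSelectorNilUsesHelmValues: false
    # Same relaxation for PrometheusRule and ServiceMonitor objects.
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
grafana:
  enabled: true   # ship Grafana alongside Prometheus
```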

Step 12.2 — Security note on the metrics exporter (important)

This is the component affected by the critical CVE that we pinned the operator version to fix.

In 1.29.1 / 1.28.3, the metrics exporter no longer connects as the PostgreSQL superuser. It now uses a dedicated, low-privilege cnpg_metrics_exporter role with only pg_monitor rights.

Custom monitoring queries may need a GRANT

If you add custom metric queries that read your own tables, or use target_databases: '*' where PUBLIC CONNECT was revoked, you must now GRANT the needed access to cnpg_metrics_exporter explicitly — the exporter is no longer a superuser. The built-in queries are already handled.
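
As an illustration, a custom query that reads an application table could be declared as in the sketch below, with the matching grants shown as SQL comments at the top. The database `app`, the table `public.orders`, its `status` column, and the ConfigMap name and namespace are all hypothetical; only the `cnpg_metrics_exporter` role name comes from the text above.

```yaml
# Grants to run once in the target database as a privileged user
# (SQL shown as comments; object names are hypothetical):
#   GRANT CONNECT ON DATABASE app TO cnpg_metrics_exporter;
#   GRANT USAGE ON SCHEMA public TO cnpg_metrics_exporter;
#   GRANT SELECT ON public.orders TO cnpg_metrics_exporter;
apiVersion: v1
kind: ConfigMap
metadata:
  name: pg-main-custom-metrics   # hypothetical name
  namespace: database            # hypothetical namespace
  labels:
    cnpg.io/reload: ""           # lets the operator reload the queries on change
data:
  custom-queries: |
    orders_backlog:
      query: "SELECT count(*) AS pending FROM public.orders WHERE status = 'pending'"
      target_databases:
        - app
      metrics:
        - pending:
            usage: "GAUGE"
            description: "Orders waiting to be processed"
```

The Cluster then references the ConfigMap under `spec.monitoring.customQueriesConfigMap` (an entry with `name` and `key: custom-queries`). Following the naming of the built-in metrics, the series should appear as `cnpg_orders_backlog_pending`; confirm on the instance's /metrics endpoint.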

Step 12.3 — What to alert on

Practical alerts for this environment:

- Replication lag rising and not recovering (a standby fell behind or is stuck). This is especially relevant around Kured reboots.
- WAL archiving failing (without archived WAL there is no point-in-time recovery). This is arguably your most important alert.
- Backup age exceeding your schedule (a base backup hasn't completed).
- Connection saturation approaching max_connections.
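
The four alerts above could be expressed as a PrometheusRule along these lines. This is a starting sketch, not a tuned rule set: the thresholds, the daily-backup assumption, and the object name and namespace are made up, and the archiver, backup-timestamp, and pg_settings metric names are taken from CNPG's default monitoring queries as I know them, so verify each against your own /metrics output.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pg-main-alerts    # hypothetical name
  namespace: database     # hypothetical namespace
spec:
  groups:
    - name: cnpg
      rules:
        - alert: CNPGReplicationLagHigh
          expr: cnpg_pg_replication_lag > 60   # seconds; threshold is an assumption
          for: 10m                             # ride out short spikes during reboots
          labels:
            severity: warning
        - alert: CNPGWALArchivingFailing
          # The last failed archive attempt is newer than the last success.
          expr: (cnpg_pg_stat_archiver_last_failed_time - cnpg_pg_stat_archiver_last_archived_time) > 1
          for: 5m
          labels:
            severity: critical
        - alert: CNPGBackupTooOld
          # Assumes a daily schedule: newest backup older than 36 hours.
          expr: time() - cnpg_collector_last_available_backup_timestamp > 36 * 3600
          for: 15m
          labels:
            severity: warning
        - alert: CNPGConnectionsNearLimit
          # Share of max_connections in use, per instance pod.
          expr: |
            sum by (pod) (cnpg_backends_total)
              / max by (pod) (cnpg_pg_settings_setting{name="max_connections"}) > 0.8
          for: 5m
          labels:
            severity: warning
```

The `for:` durations are what turn "rising and not recovering" into an alert condition: a brief lag spike during a node reboot clears before the timer expires.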

Step 12.4 — Dashboards

The community publishes a Grafana dashboard for CloudNativePG. Import it and point it at your Prometheus to get replication, backup, and connection panels out of the box.

What could go wrong

- enablePodMonitor: true but no PodMonitor CRD → the Prometheus Operator is not installed. Install it first, or scrape :9187 with a plain Prometheus scrape config. A ServiceMonitor is not a workaround, because that CRD comes from the same operator.
- Custom query returns no data after upgrading to 1.29.1 → the exporter no longer runs as superuser; add the required GRANT to cnpg_metrics_exporter.
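
For the case without the Prometheus Operator, a raw scrape of the instances can be sketched as a plain Prometheus configuration fragment. It assumes Prometheus runs inside the cluster with RBAC permission to list pods; the job name and target label names are arbitrary choices.

```yaml
# Fragment of prometheus.yml (plain Prometheus, no operator CRDs needed)
scrape_configs:
  - job_name: cnpg
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that carry the CNPG cluster label.
      - source_labels: [__meta_kubernetes_pod_label_cnpg_io_cluster]
        regex: .+
        action: keep
      # Keep only the container port named "metrics" (9187).
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: metrics
        action: keep
      # Copy useful Kubernetes metadata onto the scraped series.
      - source_labels: [__meta_kubernetes_pod_label_cnpg_io_cluster]
        target_label: cluster
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```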

Where to go deeper

Next: Toward full IaC.