12. Monitoring
Goal: expose and scrape CloudNativePG metrics so you can see replication lag, backup health, and connection saturation — the signals that tell you the cluster is healthy before something breaks.
What CNPG exposes
Each instance exposes Prometheus metrics on port 9187. Useful series:
| Metric | Tells you |
|---|---|
| cnpg_pg_replication_lag | How far standbys trail the primary (watch during reboots) |
| cnpg_backends_total | Connection count (is the pooler doing its job?) |
| cnpg_pg_postmaster_start_time | Restarts/failovers |
| cnpg_collector_last_* | Collector health |
flowchart LR
pg["PostgreSQL instances<br/>:9187 /metrics"] --> pm["PodMonitor / ServiceMonitor"]
pm --> prom["Prometheus"]
prom --> graf["Grafana dashboards"]
Step 12.1: Turn on the PodMonitor
If you run the Prometheus Operator (it provides the PodMonitor CRD), enable
metrics scraping in the Cluster:
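A minimal sketch of that change, using the enablePodMonitor field. The cluster name pg-main is a placeholder assumption. Only the monitoring stanza matters here, and it is meant to be merged into your existing manifest rather than applied on its own.

```yaml
# Fragment of cluster.yaml (sketch; merge into your existing Cluster).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main              # hypothetical name; use your cluster's name
spec:
  # ... the rest of your existing spec stays unchanged ...
  monitoring:
    enablePodMonitor: true   # ask the operator to create a PodMonitor for this cluster
```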
Apply it (edit the Cluster or re-apply cluster.yaml). CNPG creates a
PodMonitor, which Prometheus picks up automatically as long as the
Prometheus resource's podMonitorSelector and podMonitorNamespaceSelector
match it.
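If you would rather own the PodMonitor yourself (for example to add labels your Prometheus selects on), a hand-written one might look like the sketch below. It assumes the same placeholder cluster name pg-main and relies on the cnpg.io/cluster pod label and the metrics port name that CNPG sets on instance pods. The release label value is an assumption about your Prometheus setup.

```yaml
# Sketch of a manually managed PodMonitor for one CNPG cluster.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: pg-main                      # hypothetical; any name works
  labels:
    release: kube-prometheus-stack   # assumption: only needed if your Prometheus selects on this label
spec:
  selector:
    matchLabels:
      cnpg.io/cluster: pg-main       # selects the instance pods of that cluster
  podMetricsEndpoints:
    - port: metrics                  # named container port serving :9187/metrics
```

If you use this variant, leave enablePodMonitor off so that two monitors do not scrape the same pods.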
No Prometheus yet?
On kube-hetzner you can add the kube-prometheus-stack Helm chart (Prometheus
Operator + Grafana). Until then, you can still scrape :9187 manually or
just read health via kubectl cnpg status.
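If you do install kube-prometheus-stack, note that by default its Prometheus only selects monitors and rules carrying the Helm release label. A values fragment along these lines relaxes that so a PodMonitor created for CNPG is discovered. Treat it as a sketch to adapt, not a complete values file.

```yaml
# values.yaml fragment for the kube-prometheus-stack chart (sketch).
prometheus:
  prometheusSpec:
    # Select PodMonitors and PrometheusRules regardless of the Helm release label.
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
```

The alternative is to keep the defaults and label each PodMonitor and PrometheusRule with the chart's release name.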
Step 12.2: Security note on the metrics exporter (important)
This is the component behind the critical CVE we pinned the operator to fix.
In 1.29.1 / 1.28.3, the metrics exporter no longer connects as the
PostgreSQL superuser. It now uses a dedicated, low-privilege
cnpg_metrics_exporter role with only pg_monitor rights.
Custom monitoring queries may need a GRANT
If you add custom metric queries that read your own tables, or use
target_databases: '*' where PUBLIC CONNECT was revoked, you must now
GRANT the needed access to cnpg_metrics_exporter explicitly — the
exporter is no longer a superuser. The built-in queries are already handled.
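To make the GRANT requirement concrete, here is a sketch of a custom query that reads an application table. The database app, the table public.orders, the ConfigMap name, and the cluster name pg-main are all hypothetical. The GRANT statements appear as comments because they are run in PostgreSQL, not applied through this manifest.

```yaml
# Custom metric query (sketch). Before this returns data, run in PostgreSQL:
#   GRANT CONNECT ON DATABASE app TO cnpg_metrics_exporter;    -- only if PUBLIC CONNECT was revoked
#   GRANT SELECT ON public.orders TO cnpg_metrics_exporter;    -- table read by the query below
apiVersion: v1
kind: ConfigMap
metadata:
  name: pg-main-custom-metrics       # hypothetical name
data:
  custom-queries: |
    app_orders:
      query: "SELECT count(*) AS pending FROM public.orders WHERE status = 'pending'"
      target_databases:
        - app                        # hypothetical application database
      metrics:
        - pending:
            usage: "GAUGE"
            description: "Number of orders still pending"
---
# Fragment of the Cluster that references the ConfigMap above.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main                      # hypothetical; use your cluster's name
spec:
  monitoring:
    customQueriesConfigMap:
      - name: pg-main-custom-metrics
        key: custom-queries
```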
Step 12.3: What to alert on
Practical alerts for this environment:
- Replication lag rising and not recovering (a standby fell behind or is stuck) — especially relevant around Kured reboots.
- WAL archiving failing (no PITR!) — arguably your most important alert.
- Backup age exceeding your schedule (a base backup hasn't completed).
- Connection saturation approaching max_connections.
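The four alerts above could be sketched as a PrometheusRule like the one below. The thresholds (300 seconds of lag, 36 hours of backup age for a daily schedule, 80% of max_connections) are assumptions to tune. The expressions rely on series from CNPG's default monitoring queries, so check that each one exists in your Prometheus before relying on it. The backup-age rule in particular assumes backups are reported through cnpg_collector_last_available_backup_timestamp.

```yaml
# Alerting rules (sketch; thresholds and names are assumptions).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-alerts                  # hypothetical name
spec:
  groups:
    - name: cloudnative-pg
      rules:
        - alert: CNPGReplicationLagHigh
          # Lag in seconds stays above 5 minutes for 10 minutes.
          expr: cnpg_pg_replication_lag > 300
          for: 10m
          labels:
            severity: warning
        - alert: CNPGWALArchivingFailing
          # The most recent archive failure is newer than the most recent success.
          expr: (cnpg_pg_stat_archiver_last_failed_time - cnpg_pg_stat_archiver_last_archived_time) > 1
          for: 5m
          labels:
            severity: critical
        - alert: CNPGBackupTooOld
          # No successful base backup in the last 36 hours (assumes a daily schedule).
          expr: time() - cnpg_collector_last_available_backup_timestamp > 129600
          for: 15m
          labels:
            severity: warning
        - alert: CNPGConnectionsNearLimit
          # Backends as a percentage of max_connections, per instance pod.
          expr: sum by (pod) (cnpg_backends_total) / max by (pod) (cnpg_pg_settings_setting{name="max_connections"}) * 100 > 80
          for: 5m
          labels:
            severity: warning
```

The `for` durations keep a brief spike during a Kured reboot from paging anyone, while a standby that stays behind still fires.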
Step 12.4: Dashboards
The community publishes a Grafana dashboard for CloudNativePG. Import it and point it at your Prometheus to get replication, backup, and connection panels out of the box.
What could go wrong
- enablePodMonitor: true but no PodMonitor CRD → you don't have the Prometheus Operator installed; install it first, or fall back to a raw scrape of :9187 (a ServiceMonitor will not help here, since that CRD also comes from the Prometheus Operator).
- Custom query returns no data after upgrading to 1.29.1 → the exporter lost superuser; add the required GRANT to cnpg_metrics_exporter.
Where to go deeper
Next: Toward full IaC.