
1. Platform layer (kube.tf)

Goal of this chapter: declare the cluster, cert-manager, Longhorn, and a Postgres-tuned StorageClass in Terraform, so every fresh cluster comes up with the platform ready. This is Layer 1 — pure Infrastructure-as-Code.

Prerequisites

  • A Hetzner Cloud project + API token (Read & Write).
  • terraform/tofu, packer, kubectl, and hcloud installed.
  • You have run the kube-hetzner create.sh script once to generate the MicroOS snapshot and a starter kube.tf. See the kube-hetzner Getting Started guide.

What we are declaring

flowchart TB
    tf["kube.tf"] --> cluster["3 amd64 nodes, one location,<br/>all schedulable"]
    tf --> certm["enable_cert_manager = true"]
    tf --> longhorn["enable_longhorn = true<br/>(node storage)"]
    sc["longhorn-postgres StorageClass<br/>(1 replica, disposable)"]
    longhorn --> sc

Step 1.1 — Pin the module

In your kube.tf, pin the module to an exact version. A floating version constraint makes the infrastructure non-reproducible:

module "kube-hetzner" {
  source  = "kube-hetzner/kube-hetzner/hcloud"
  version = "2.20.0"   # (1)!
  # ... provider, hcloud_token, ssh keys, network_region ...
}
  1. Pin to the version you tested. Check the releases page and bump deliberately, not automatically.
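
The pinning rule can be checked mechanically before each apply. The following is a minimal sketch, not part of the module: check_module_pin is a hypothetical helper, and it assumes the module block is named "kube-hetzner" and closes with a } in the first column, as in the snippet above. It only accepts an exact x.y.z value and rejects constraints such as "~> 2.20".

```shell
# Hypothetical helper: verifies that the kube-hetzner module block in the
# given file carries an exact x.y.z version pin.
check_module_pin() {
  awk '
    # Enter the module block (assumes the block is named "kube-hetzner").
    /^[ \t]*module[ \t]+"kube-hetzner"/ { in_mod = 1 }
    # Inspect version lines only while inside that block.
    in_mod && /^[ \t]*version[ \t]*=/ {
      found = 1
      if ($0 !~ /=[ \t]*"[0-9]+\.[0-9]+\.[0-9]+"/) bad = 1
    }
    # A closing brace in column 0 ends the block (layout assumption).
    in_mod && /^}/ { in_mod = 0 }
    END {
      if (!found) { print "no version line in module block"; exit 1 }
      if (bad)    { print "module version is not an exact pin"; exit 1 }
      print "module version pinned"
    }
  ' "$1"
}
```

Run it as check_module_pin kube.tf; a non-zero exit status means the version is missing or floating.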

Step 1.2 — Nodes: 3 × amd64, one location, schedulable

  # All three nodes are control-planes AND run workloads (learning-phase layout).
  allow_scheduling_on_control_plane = true   # (1)!

  control_plane_nodepools = [
    {
      name        = "cp"
      server_type = "cpx31"   # (2)!  amd64 — adjust size to your workload
      location    = "nbg1"    # (3)!  one location for all three
      labels = [
        "node.longhorn.io/create-default-disk=true",   # (4)!
        "node.kubernetes.io/server-usage=storage",
      ]
      taints               = []
      count                = 3
      longhorn_volume_size = 0   # (5)!  0 = node storage (fast, recommended for DBs)
    }
  ]

  agent_nodepools = []
  1. Production should use dedicated agent nodes instead — see Toward full IaC. For daily create/destroy, this is fine.
  2. amd64 only. cpx31 is an AMD shared-vCPU type. Never use cax* (ARM) here, or you risk a mixed-architecture cluster that CNPG cannot run on.
  3. A single location avoids zonal-volume scheduling problems.
  4. Tells Longhorn to create a default disk on these nodes.
  5. longhorn_volume_size = 0 uses the node's own disk (faster) instead of an attached Hetzner volume — the module's documented recommendation for databases.
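
Annotations 2 and 3 lend themselves to a quick static check of kube.tf before terraform apply. This is a rough sketch under simple assumptions: check_nodepools is a hypothetical helper, it matches server_type and location lines textually rather than parsing HCL, and it ignores commented-out lines only when the comment marker precedes the key.

```shell
# Hypothetical pre-flight check: fails if the file names an ARM (cax*)
# server type or configures more than one distinct location.
check_nodepools() {
  file="$1"

  # Any server_type whose value starts with "cax" is an ARM type.
  if grep -E '^[[:space:]]*server_type[[:space:]]*=[[:space:]]*"cax' "$file" >/dev/null; then
    echo "ARM server type (cax*) found" >&2
    return 1
  fi

  # Extract the quoted value of every location line and count distinct ones.
  locations=$(grep -E '^[[:space:]]*location[[:space:]]*=' "$file" \
    | sed -E 's/^[^"]*"([^"]*)".*/\1/' | sort -u | wc -l | tr -d ' ')
  if [ "$locations" -gt 1 ]; then
    echo "more than one location configured" >&2
    return 1
  fi

  echo "nodepools OK"
}
```

Run it as check_nodepools kube.tf. A textual match like this is deliberately conservative and would not see values supplied through variables.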

Step 1.3 — Enable cert-manager and Longhorn

  # cert-manager: required by the Barman Cloud Plugin's TLS channel.
  enable_cert_manager = true
  # cert_manager_version = "v1.x.y"   # pin after first deploy (see Versions)

  # Longhorn: our storage layer.
  enable_longhorn = true
  # longhorn_version = "vX.Y.Z"       # pin after first deploy
  longhorn_values = <<-EOT
    defaultSettings:
      defaultDataPath: /var/lib/longhorn
    persistence:
      defaultClassReplicaCount: 2   # (1)!
  EOT
  1. This is the cluster-wide default Longhorn class replica count, used by things other than the database. Postgres gets its own class with 1 replica in the next step.

Enabling Longhorn also enables iscsid

Longhorn needs iscsid on the nodes. The module turns it on automatically when enable_longhorn = true, so there is nothing extra to do on MicroOS.

Step 1.4 — A StorageClass tuned for Postgres

PostgreSQL provides the redundancy itself, so the database volumes use a single Longhorn replica and are treated as disposable storage. Save the manifest below as longhorn-postgres.yaml. During the learning phase, apply it with kubectl; in Layer 3 it moves into the module's extra-manifests.

longhorn-postgres.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-postgres
provisioner: driver.longhorn.io
allowVolumeExpansion: true          # (1)!
reclaimPolicy: Delete               # (2)!
volumeBindingMode: WaitForFirstConsumer
parameters:
  numberOfReplicas: "1"             # (3)!
  staleReplicaTimeout: "2880"
  dataLocality: "best-effort"       # (4)!
  fsType: "ext4"
  1. Lets a PVC grow later. Shrinking is never supported, so size generously.
  2. Delete removes the volume when the PVC is deleted — appropriate for disposable storage where R2 is the durable copy. Use Retain for the production cluster if you want volumes to survive accidental deletion.
  3. One replica. If the node dies, the operator rebuilds that instance from the primary. PostgreSQL is the redundancy.
  4. Keep a replica on the same node as the pod → lower latency.
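
One way to confirm that the class actually provisions volumes, once it has been applied to the cluster in Step 1.5, is a throwaway claim and pod. Because the class uses WaitForFirstConsumer, a PVC on its own stays Pending until a pod mounts it, so the sketch renders both. The name pg-sc-smoke, the busybox:1.36 image, and the 1Gi size are illustrative assumptions, not part of the original setup.

```shell
# Hypothetical smoke test: prints a PVC bound to longhorn-postgres and a
# one-shot pod that writes to it and reads the value back.
render_smoke_test() {
  cat <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-sc-smoke
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-postgres
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pg-sc-smoke
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox:1.36
      command: ["sh", "-c", "echo ok > /data/probe && cat /data/probe"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pg-sc-smoke
EOF
}
```

To run it, pipe the output into kubectl with render_smoke_test | kubectl apply -f -, inspect the result with kubectl logs pg-sc-smoke, and clean up with render_smoke_test | kubectl delete -f -. With reclaimPolicy: Delete, removing the PVC also removes the Longhorn volume.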

Step 1.5 — Apply and verify

cd <your-project-folder>
terraform init --upgrade
terraform validate
terraform apply           # review the plan, then approve

export KUBECONFIG=$PWD/<clustername>_kubeconfig.yaml

kubectl get nodes -o wide                 # 3 nodes, all Ready, amd64
kubectl get pods -n cert-manager          # cert-manager Running
kubectl get pods -n longhorn-system       # Longhorn Running
kubectl apply -f longhorn-postgres.yaml
kubectl get storageclass                  # longhorn-postgres listed

Confirm cert-manager's API is actually ready (the Barman plugin will need it):

kubectl get crd | grep cert-manager.io    # certificates.cert-manager.io etc.
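
Listing the CRDs shows that they exist, but not that the cert-manager deployments are serving yet. A stricter variant blocks until both are ready, sketched here with kubectl wait. The function name wait_for_cert_manager and the 300s and 60s timeouts are illustrative assumptions; the cert-manager namespace is the one used in the commands above.

```shell
# Hypothetical helper: blocks until cert-manager is usable, or fails once a
# timeout expires. Timeout values are arbitrary illustrative choices.
wait_for_cert_manager() {
  # All deployments in the namespace must report the Available condition.
  kubectl wait --for=condition=Available deployment --all \
    -n cert-manager --timeout=300s &&
  # The core CRDs must be registered with the API server.
  kubectl wait --for=condition=established \
    crd/certificates.cert-manager.io crd/issuers.cert-manager.io \
    --timeout=60s
}
```

Calling wait_for_cert_manager before installing the plugin in the next chapter turns the manual check into a gate that a script can rely on.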

What could go wrong

  • cax server type by mistake → mixed/ARM cluster. Double-check every server_type is cx/cpx/ccx.
  • Nodes in different locations → Postgres pods can get stuck Pending after a reschedule because their volume is in another location. Keep one location.
  • cert-manager not ready before the plugin → plugin install fails to get certificates. Always verify cert-manager first.
  • MicroOS auto-reboots during a long session → expected; your workloads should tolerate it. This is the failover reality from the constraints chapter.
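
The first item in the list can also be checked against the live cluster rather than the Terraform file. The sketch below reads each node's reported architecture through kubectl's jsonpath output; check_node_arch is a hypothetical helper name, and it assumes KUBECONFIG already points at the cluster.

```shell
# Hypothetical helper: fails if any node reports an architecture other
# than amd64, or if no nodes are returned at all.
check_node_arch() {
  archs=$(kubectl get nodes \
    -o jsonpath='{.items[*].status.nodeInfo.architecture}')

  if [ -z "$archs" ]; then
    echo "no nodes returned" >&2
    return 1
  fi

  # The jsonpath output is a space-separated list, one entry per node.
  for arch in $archs; do
    if [ "$arch" != "amd64" ]; then
      echo "non-amd64 node found: $arch" >&2
      return 1
    fi
  done

  echo "all nodes amd64"
}
```

Run check_node_arch right after terraform apply; a failure here means a cax type slipped into one of the nodepools.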

Where to go deeper

Next: Operator & cnpg plugin — Layer 2 begins.