Deploy CloudNativePG operator + Postgres cluster to k3s #11

Closed
opened 2026-03-02 18:31:34 +00:00 by forgejo_admin · 0 comments

Plan

plan-2026-02-26-tf-modularize-postgres -- Phase 2

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want a shared Postgres instance running on k3s via CloudNativePG
So that services can migrate from SQLite to enterprise-grade Postgres with transactional DDL, automated failover, and continuous backup

Context

Two production outages were caused by SQLite's auto-committed DDL crashing Alembic migrations on pal-e-docs. Root cause: SQLite cannot roll back DDL inside a transaction, so a failed migration leaves the schema half-applied. Postgres eliminates this entire class of bugs.

We chose CloudNativePG (CNCF project) over Bitnami (too simple) and Zalando operator (heavier, older architecture). CloudNativePG gives us k8s-native CRDs, automated failover, built-in WAL archiving to object storage, and PgBouncer integration.

Architecture: shared Postgres cluster with per-service databases (same pattern as MinIO). One CloudNativePG Cluster resource, CREATE DATABASE per service.

MinIO is already deployed in the cluster and will serve as the WAL archive target for continuous backup with point-in-time recovery.

Key decisions:

  • CloudNativePG operator via Helm
  • 1 primary, 0 replicas to start (scale later when needed)
  • WAL archiving to MinIO (s3://postgres-wal/)
  • Per-service databases created via bootstrap SQL
  • Credentials stored as k8s Secrets, referenced by pal-e-docs deployment
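
The decisions above could translate into a Cluster resource in main.tf roughly like the following sketch. Resource names, the namespace, the in-cluster MinIO endpoint, and the credentials secret name are all illustrative assumptions, not final values:

```hcl
# Sketch only -- names, namespace, and secret references are placeholders
# to be aligned with the existing patterns in main.tf.
resource "kubernetes_manifest" "postgres_cluster" {
  manifest = {
    apiVersion = "postgresql.cnpg.io/v1"
    kind       = "Cluster"
    metadata = {
      name      = "shared-postgres" # placeholder cluster name
      namespace = "postgres"        # placeholder namespace
    }
    spec = {
      instances = 1 # 1 primary, 0 replicas to start

      # Bootstrap creates the first per-service database; further
      # per-service databases follow the bootstrap SQL pattern.
      bootstrap = {
        initdb = {
          database = "pal_e_docs"
          owner    = "pal_e_docs"
        }
      }

      # Continuous WAL archiving to the MinIO bucket.
      backup = {
        barmanObjectStore = {
          destinationPath = "s3://postgres-wal/"
          endpointURL     = "http://minio.minio.svc.cluster.local:9000" # assumed endpoint
          s3Credentials = {
            accessKeyId = {
              name = "minio-creds" # placeholder secret name
              key  = "ACCESS_KEY_ID"
            }
            secretAccessKey = {
              name = "minio-creds"
              key  = "ACCESS_SECRET_KEY"
            }
          }
        }
      }
    }
  }
}
```

The `barmanObjectStore` stanza is what gives point-in-time recovery: CloudNativePG ships WAL segments to the bucket continuously rather than taking periodic dumps.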

File Targets

Files to create or modify:

  • terraform/main.tf — add CloudNativePG operator helm_release in a new section, add CNPG Cluster resource via kubernetes_manifest, add MinIO bucket for WAL archive
  • terraform/variables.tf — add Postgres-related variables (admin password, pal-e-docs db credentials)
  • terraform/terraform.tfvars or salt/pillar/secrets/platform.sls — actual credential values (GPG-encrypted if using Salt)
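
The new variables in variables.tf might take a shape like this (variable names are illustrative; match the conventions already used in the file):

```hcl
# Illustrative variable names -- both must be sensitive so values never
# appear in plan/apply output.
variable "postgres_admin_password" {
  description = "Superuser password for the shared Postgres cluster"
  type        = string
  sensitive   = true
}

variable "pal_e_docs_db_password" {
  description = "Password for the pal_e_docs service database owner"
  type        = string
  sensitive   = true
}
```

Actual values then live in terraform.tfvars or the GPG-encrypted Salt pillar, never in main.tf.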

Files NOT to touch:

  • Existing helm_release resources for other services — this is additive only
  • Anything in pal-e-docs repo — that's Phase 3

Acceptance Criteria

  • When I run tofu apply, then the CloudNativePG operator is installed in its own namespace
  • When I run kubectl get clusters -A, then a Postgres cluster is running with 1 primary instance
  • When I exec into a debug pod and run psql -h <service> -U <user> -d pal_e_docs, then I connect successfully
  • When I check MinIO bucket postgres-wal, then WAL files are being archived
  • When I run tofu plan after apply, then there is zero diff (idempotent)

Test Expectations

  • tofu validate passes
  • tofu fmt -check passes
  • tofu plan shows only additive changes (no modifications to existing resources)
  • Post-apply: psql connectivity verified
  • Post-apply: WAL archiving to MinIO verified
  • Run command: cd terraform && tofu validate && tofu fmt -check

Constraints

  • Use OpenTofu (tofu not terraform) — this is a k3s cluster managed by tofu
  • Match existing patterns in main.tf (helm_release for operators, kubernetes_manifest for CRDs)
  • CloudNativePG Helm chart: cloudnative-pg/cloudnative-pg from https://cloudnative-pg.github.io/charts
  • Postgres credentials must be in terraform variables with sensitive = true
  • MinIO access for WAL archiving: reuse existing MinIO credentials pattern (see how litestream-backups bucket is configured)
  • Do NOT create Terraform modules — add directly to main.tf with clear section comments
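
Following the existing helm_release pattern for operators, the operator install could look roughly like this sketch (the release name and namespace are assumptions; the repository URL and chart name are from the constraint above):

```hcl
# --- CloudNativePG operator (sketch; align naming with existing sections) ---
resource "helm_release" "cloudnative_pg" {
  name             = "cloudnative-pg"
  repository       = "https://cloudnative-pg.github.io/charts"
  chart            = "cloudnative-pg"
  namespace        = "cnpg-system" # placeholder namespace
  create_namespace = true
}
```

The kubernetes_manifest Cluster resource should depend on this release so the CRDs exist before the Cluster is applied.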

Checklist

  • PR opened
  • tofu validate passes
  • tofu plan output included in PR
  • No unrelated changes

Related

  • phase-postgres-2-deploy-cnpg — phase note
  • todo-pal-e-docs-deployment-reliability — the incident analysis