P2: off-cluster postgres backup destination — same-cluster MinIO is not DR #299
Reference
ldraney/pal-e-platform#299
Type
Feature
Lineage
Discovered while scoping pal-e-platform#297 (P0 tf-state drift). CNPG backups land in s3://postgres-wal/ on minio.minio.svc.cluster.local:9000, the in-cluster MinIO. If the cluster melts (etcd corruption, archbox node failure with no replacement, ransomware, accidental kubectl delete ns postgres+minio), backups die with it. Today we have local resilience (CNPG can restore from local MinIO if the pal-e-postgres-1 pod fails). We do NOT have disaster recovery (cluster-level loss = data loss).
Repo
forgejo_admin/pal-e-platform
User Story
As Lucas (and the platform's downstream users — Marcus, Westside parents, agents writing notes), I want postgres backups stored OFF the production cluster so that loss of the k3s cluster, archbox node, or in-cluster MinIO does NOT cause permanent data loss. After this lands, "the cluster died" is recoverable; today it is not.
Context
CNPG Cluster.spec.backup.barmanObjectStore accepts any S3-compatible endpoint. Today's endpointURL is http://minio.minio.svc.cluster.local:9000, an internal cluster service. To get DR, we need a destination that is BOTH:
Options worth evaluating: Backblaze B2 (cheap, S3 API), AWS S3 (canonical, costlier), Cloudflare R2 (cheap, S3 API, no egress), DigitalOcean Spaces, off-cluster MinIO on a separate machine (most operational overhead but cheapest infrastructure).
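
A minimal sketch of what an off-cluster barmanObjectStore block could look like, written as plain YAML rather than the kubernetes_manifest HCL used in cnpg.tf. The bucket name, endpoint host, and secret key names are placeholders (assumptions, not values from this ticket); cnpg-s3-creds-offcluster is the secret name proposed under File Targets, and the provider has not been chosen.

```yaml
# Fragment of a CNPG Cluster manifest; only the backup section is shown.
# All values marked "hypothetical" are illustrative placeholders.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pal-e-postgres
  namespace: postgres
spec:
  backup:
    barmanObjectStore:
      # hypothetical bucket and prefix on the chosen off-cluster provider
      destinationPath: s3://example-offcluster-bucket/pal-e-postgres/
      # hypothetical endpoint; must resolve OUTSIDE the cluster,
      # unlike http://minio.minio.svc.cluster.local:9000
      endpointURL: https://s3.example-provider.invalid
      s3Credentials:
        accessKeyId:
          name: cnpg-s3-creds-offcluster
          key: ACCESS_KEY_ID        # hypothetical key name inside the secret
        secretAccessKey:
          name: cnpg-s3-creds-offcluster
          key: ACCESS_SECRET_KEY    # hypothetical key name inside the secret
```

Only destinationPath, endpointURL, and the referenced secret differ from an in-cluster setup, which is why switching a single destination is simpler than duplicating backups.
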
Backup duplication (write to BOTH local MinIO AND off-cluster) is feasible via CNPG replicaCluster or by running barman-cloud-wal-archive twice; both add operational complexity. Simpler: pick one off-cluster destination and switch.
This work has spillover: every other CNPG cluster on the platform (today: woodpecker-db) should likely get the same treatment.
File Targets
- pal-e-services/terraform/cnpg.tf: kubernetes_manifest.cnpg_cluster spec.backup.barmanObjectStore block update (or extension to multi-destination)
- pal-e-services/terraform/k3s.tfvars (gitignored): new variables for off-cluster S3 credentials/endpoint
- pal-e-services/terraform/k3s.tfvars.example: document the new variables
- cnpg-s3-creds-offcluster k8s secret (or extend existing) with credentials for chosen provider
- sop-postgres-restore updated with off-cluster restore procedure
- convention-postgres-backup-destination documenting the "every CNPG cluster gets an off-cluster backup destination" rule
Test Expectations
- validation-postgres-restore drill from sibling ticket can be re-run AGAINST the off-cluster backup and succeed
Constraints
Acceptance Criteria
- convention-postgres-backup-destination with rationale (cost, latency, vendor)
- cnpg.tf updated to point at off-cluster destination (or multi-destination if pursued); tofu plan shows the intended change cleanly
- sop-platform-tf-changes flow)
- validation-postgres-restore drill re-run against off-cluster backup → PASS, validation note published
- sop-postgres-restore updated with off-cluster destination steps
- convention-postgres-backup-destination published as pal-e-docs note (tags: convention, active)
- woodpecker-db (and any future CNPG clusters)
Checklist
Same as Acceptance Criteria; tracked there.
Out of Scope
Environment
- s3://postgres-wal/ on minio.minio.svc.cluster.local:9000
- pal-e-postgres (postgres ns), woodpecker-db (woodpecker ns)
Related
- pal-e-platform#297: drift reconcile work; this ticket can land independently, but it is cleaner if #297 is DONE first (canonical tf flow restored)
- pal-e-platform#{TBD}: sibling ticket to validate the sop-postgres-restore drill (must be PASS before this work goes near prod)
- feedback_funnel_requires_auth.md: postgres holds PII; off-site backup is part of PII protection
- feedback_enterprise_no_workarounds.md: single-point-of-failure for prod data is exactly the kind of thing that needs the "do it right" treatment
Scope Review: NEEDS_REFINEMENT
Review note: review-1066-2026-04-21
Premise is correct (in-cluster MinIO = local resilience, not DR). But the ticket has a wrong file target, a competing active plan, missing backing notes, and an AC with no scratch environment named. A major scope conflict needs a human call before this can advance to todo.
Blockers (need human decision before execution):
- plan-pal-e-backup Phase 2: existing active plan scopes off-site postgres backup via daily pg_dump to Backblaze B2 (a different approach than this ticket's CNPG-native barmanObjectStore off-cluster). Ava + Lucas must decide: kill one, or coexist as complementary layers.
- arch-cnpg note in pal-e-docs (label references it; note missing).
- sop-postgres-restore via sibling #298 before this ticket's AC can be satisfied.
Body fixes needed:
- woodpecker-db backup config is in pal-e-platform/terraform/modules/ci/main.tf, NOT pal-e-services/terraform/cnpg.tf as implied. Either narrow the ticket to pal-e-postgres only + sibling for woodpecker, or broaden File Targets to cover both repos.
- pal-e-platform/terraform/modules/database/main.tf to File Targets (cnpg-s3-creds secret + cnpg_backup_verify CronJob both live here and need updates when destinations move).
- feedback_funnel_requires_auth.md 4-hour PII leak lesson.
Verified correct:
- pal-e-services/terraform/cnpg.tf lines 123-145 hold the pal-e-postgres barmanObjectStore; matches ticket.
- story:superuser-recover verified on project-pal-e-platform user-stories table.
Full analysis + evidence in review-1066-2026-04-21.
Closing: scope collision with existing plan-pal-e-backup Phase 2.
Review review-1066-2026-04-21 surfaced that an existing active plan already scopes this exact problem with a different (and more complete) technical approach:
- plan-pal-e-backup Phase 2: pg_dump per database → Backblaze B2, 30-day retention, covers all 4 DBs (pal-e-docs CNPG, woodpecker CNPG, basketball-api plain pod, mcd-tracker plain pod).
- This ticket (#299): barmanObjectStore.destinationPath off-cluster (continuous WAL). Covers only the 2 CNPG clusters; leaves plain-pod DBs uncovered.
Decision (2026-04-21): Path A, adopt plan-pal-e-backup Phase 2 as canonical, close #299.
Rationale:
See plan-pal-e-backup for the canonical scope. Off-cluster destination work will be delivered as tickets cut from Phase 1 (foundation: B2 bucket + creds) and Phase 2 (pg_dump CronJobs).