ldraney/pal-e-platform

P2: off-cluster postgres backup destination — same-cluster MinIO is not DR #299

Closed

opened 2026-04-21 03:17:05 +00:00 by forgejo_admin · 2 comments

forgejo_admin commented

2026-04-21 03:17:05 +00:00

Contributor

Type

Feature

Lineage

Discovered while scoping pal-e-platform#297 (P0 tf-state drift). CNPG backups land in s3://postgres-wal/ on minio.minio.svc.cluster.local:9000 — the in-cluster MinIO. If the cluster melts (etcd corruption, archbox node failure with no replacement, ransomware, accidental kubectl delete ns postgres + minio), backups die with it. Today we have local resilience (CNPG can restore from local MinIO if pal-e-postgres-1 pod fails). We do NOT have disaster recovery (cluster-level loss = data loss).

Repo

forgejo_admin/pal-e-platform

User Story

As Lucas (and the platform's downstream users — Marcus, Westside parents, agents writing notes), I want postgres backups stored OFF the production cluster so that loss of the k3s cluster, archbox node, or in-cluster MinIO does NOT cause permanent data loss. After this lands, "the cluster died" is recoverable; today it is not.

Context

CNPG Cluster.spec.backup.barmanObjectStore accepts any S3-compatible endpoint. Today's endpointURL is http://minio.minio.svc.cluster.local:9000 — internal cluster service. To get DR, we need a destination that is BOTH:

Survives cluster loss (lives on different infrastructure)
Network-reachable from inside the cluster (so CNPG can write WAL + scheduled backups)

Options worth evaluating: Backblaze B2 (cheap, S3 API), AWS S3 (canonical, costlier), Cloudflare R2 (cheap, S3 API, no egress), DigitalOcean Spaces, off-cluster MinIO on a separate machine (most operational overhead but cheapest infrastructure).

Backup duplication (write to BOTH local MinIO AND off-cluster) is feasible via CNPG replicaCluster or by running barman-cloud-wal-archive twice — both add operational complexity. Simpler: pick one off-cluster destination and switch.

This work has spillover: every other CNPG cluster on the platform (today: woodpecker-db) should likely get the same treatment.
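A minimal sketch of the destination swap described above, written as a Python helper that builds the `spec.backup.barmanObjectStore` fragment and refuses an in-cluster endpoint. The helper names, the `https://s3.example.invalid` endpoint, and the key names inside the secret are illustrative assumptions, not values from the repo; the real change would live in `cnpg.tf`.

```python
from urllib.parse import urlparse


def is_in_cluster(endpoint_url: str) -> bool:
    """True when the endpoint host is a Kubernetes Service DNS name."""
    host = urlparse(endpoint_url).hostname or ""
    return host.endswith(".svc.cluster.local") or host.endswith(".svc")


def barman_object_store(bucket: str, endpoint_url: str, secret_name: str) -> dict:
    """Build a spec.backup.barmanObjectStore fragment for a CNPG Cluster."""
    if is_in_cluster(endpoint_url):
        # An in-cluster endpoint gives local resilience only, not DR.
        raise ValueError(f"{endpoint_url} is in-cluster; it does not survive cluster loss")
    return {
        "destinationPath": f"s3://{bucket}/",
        "endpointURL": endpoint_url,
        "s3Credentials": {
            # Key names inside the secret are assumptions for this sketch.
            "accessKeyId": {"name": secret_name, "key": "ACCESS_KEY_ID"},
            "secretAccessKey": {"name": secret_name, "key": "ACCESS_SECRET_KEY"},
        },
    }


# Hypothetical off-cluster endpoint; the provider is not yet chosen.
store = barman_object_store(
    "postgres-wal", "https://s3.example.invalid", "cnpg-s3-creds-offcluster"
)
```

Passing today's `http://minio.minio.svc.cluster.local:9000` to `barman_object_store` raises `ValueError`, which is the property this ticket cares about: the destination must not depend on cluster DNS.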

File Targets

pal-e-services/terraform/cnpg.tf — kubernetes_manifest.cnpg_cluster spec.backup.barmanObjectStore block update (or extension to multi-destination)
pal-e-services/terraform/k3s.tfvars (gitignored) — new variables for off-cluster S3 credentials/endpoint
pal-e-services/terraform/k3s.tfvars.example — document the new variables
New cnpg-s3-creds-offcluster k8s secret (or extend existing) with credentials for chosen provider
pal-e-docs SOP: sop-postgres-restore updated with off-cluster restore procedure
pal-e-docs convention: new convention-postgres-backup-destination documenting "every CNPG cluster gets an off-cluster backup destination" rule

Test Expectations

After apply: CNPG operator status shows backups landing at the new off-cluster destination AND (if multi-destination chosen) the existing in-cluster MinIO continues to receive them in parallel
A backup file is verifiably present in the off-cluster bucket within 24 hours of apply (next scheduled run)
The validation-postgres-restore drill from sibling ticket can be re-run AGAINST the off-cluster backup and succeed
Cluster-loss simulation (in a scratch test environment, not prod): destroy a test cluster, recreate, restore from off-cluster backup → succeeds
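The "present within 24 hours of apply" expectation can be checked mechanically. The sketch below assumes the object timestamps have already been listed from the off-cluster bucket by whatever client the chosen provider supports; `backup_landed` and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def backup_landed(apply_time, object_times, window=timedelta(hours=24)):
    """True if at least one object was written after apply_time and within the window."""
    deadline = apply_time + window
    return any(apply_time <= t <= deadline for t in object_times)


# Hypothetical apply time and bucket listing, for illustration only.
applied = datetime(2026, 4, 22, 9, 0, tzinfo=timezone.utc)
listing = [
    datetime(2026, 4, 21, 23, 0, tzinfo=timezone.utc),  # written before the apply
    datetime(2026, 4, 23, 2, 0, tzinfo=timezone.utc),   # next scheduled run
]
ok = backup_landed(applied, listing)
```

Objects written before the apply are ignored on purpose, so a stale file copied into the new bucket does not make the check pass.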

Constraints

No data loss during the cutover. Existing in-cluster backups must keep working until the off-cluster destination is verified writing successfully.
Cost target: under $5/month for current data volume (paledocs + twitch2kwager + basketball_test are small; MinIO doesn't cost anything but off-cluster will). Estimate volume before picking provider.
Credentials live in pillar / sealed secret, not committed plaintext.
DO NOT change the in-cluster MinIO setup as part of this ticket — that's a separate decision.
Cluster loss test happens in a SCRATCH environment, not prod.

Acceptance Criteria

Provider chosen and documented in convention-postgres-backup-destination with rationale (cost, latency, vendor)
Credentials stored as k8s secret + pillar/sealed secret, NOT in committed plaintext
cnpg.tf updated to point at off-cluster destination (or multi-destination if pursued); tofu plan shows the intended change cleanly
Apply through PR #297-style review-fix-Lucas-approve loop (and by then #297 is DONE so the apply is the canonical sop-platform-tf-changes flow)
CNPG operator status reports successful backup to new destination within 24h of apply
validation-postgres-restore drill re-run against off-cluster backup → PASS, validation note published
sop-postgres-restore updated with off-cluster destination steps
convention-postgres-backup-destination published as pal-e-docs note (tags: convention,active)
Follow-up ticket filed: same treatment for woodpecker-db (and any future CNPG clusters)

Checklist

Same as Acceptance Criteria; tracked there.

Out of Scope

HA replicas for the prod CNPG cluster (separate ticket — different problem)
Off-cluster MinIO infrastructure setup (if that's the chosen path, file it separately)
Changing the in-cluster MinIO destination (it stays as the local-resilience path)
Backup encryption at rest (separate ticket if not already provider-default)

Environment

Cluster: prod (single k3s cluster on archbox)
Existing destination: s3://postgres-wal/ on minio.minio.svc.cluster.local:9000
Existing CNPG clusters affected: pal-e-postgres (postgres ns), woodpecker-db (woodpecker ns)
Retention: 7 days currently; off-cluster retention TBD per AC

Related

pal-e-platform#297: drift reconcile work; this ticket can land independently, but it is cleaner if #297 is DONE first (canonical tf flow restored)
pal-e-platform#{TBD} — sibling ticket: validate sop-postgres-restore drill (must be PASS before this work goes near prod)
feedback_funnel_requires_auth.md — postgres holds PII; off-site backup is part of PII protection
feedback_enterprise_no_workarounds.md — single-point-of-failure for prod data is exactly the kind of thing that needs the "do it right" treatment


forgejo_admin commented

2026-04-21 12:08:57 +00:00

Author

Contributor

Scope Review: NEEDS_REFINEMENT

Review note: review-1066-2026-04-21

Premise is correct (in-cluster MinIO = local resilience, not DR). But the ticket has a wrong file target, a competing active plan, missing backing notes, and an AC with no scratch environment named. Major scope conflict needs a human call before this can advance to todo.

Blockers (need human decision before execution):

[SCOPE] Reconcile with plan-pal-e-backup Phase 2 — existing active plan scopes off-site postgres backup via daily pg_dump to Backblaze B2 (different approach than this ticket's CNPG-native barmanObjectStore off-cluster). Ava + Lucas must decide: kill one, or coexist as complementary layers.
[SCOPE] Create arch-cnpg note in pal-e-docs (label references it; note missing).
[SCOPE] Coordinate sop-postgres-restore via sibling #298 before this ticket's AC can be satisfied.

Body fixes needed:

Fix File Targets: woodpecker-db backup config is in pal-e-platform/terraform/modules/ci/main.tf, NOT pal-e-services/terraform/cnpg.tf as implied. Either narrow ticket to pal-e-postgres only + sibling for woodpecker, or broaden File Targets to cover both repos.
Add pal-e-platform/terraform/modules/database/main.tf to File Targets (cnpg-s3-creds secret + cnpg_backup_verify CronJob both live here and need updates when destinations move).
Name the scratch environment for cluster-loss drill (AC currently untestable on our single archbox node; suggest "Hetzner VPS per plan-pal-e-backup Phase 7").
Mark #298 as hard dependency (AC "validation-postgres-restore re-run PASS" requires #298 to have delivered the drill).
State #297 relationship explicitly (blocked-by, or parallel-with-rebase).
Add encryption-at-rest AC — postgres holds PII; "out of scope" per body is wrong here given feedback_funnel_requires_auth.md 4-hour PII leak lesson.
Add one-line data-residency constraint.
Spell out cutover mechanism (classic footgun: WAL archive gap between old destination stopping and new destination starting).
Replace $5/month cost target (meaningless at current DB volumes — pennies) with operational-simplicity + encryption-defaults driver.
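The WAL archive gap called out above can be detected by comparing segment names on both sides of the cutover. This is a sketch under stated assumptions: default 16 MiB WAL segments, and bare 24-hex-digit names with any path prefix or compression suffix already stripped. The function names are hypothetical.

```python
SEGMENTS_PER_LOG = 0x100  # default 16 MiB segments: 256 per 4 GiB "log" file


def wal_position(name: str) -> tuple[int, int]:
    """Split a 24-hex-digit WAL segment name into (timeline, linear segment number)."""
    if len(name) != 24:
        raise ValueError(f"not a WAL segment name: {name!r}")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * SEGMENTS_PER_LOG + seg


def missing_segments(old_dest: list[str], new_dest: list[str]) -> int:
    """Count segments missing between the last WAL in old_dest and the first in new_dest."""
    old_tl, old_last = max(wal_position(n) for n in old_dest)
    new_tl, new_first = min(wal_position(n) for n in new_dest)
    if old_tl != new_tl:
        raise ValueError("timeline changed during cutover; compare manually")
    return max(0, new_first - old_last - 1)


# Illustrative listings: the new destination starts three segments late.
gap = missing_segments(
    ["000000010000000000000010", "000000010000000000000011"],
    ["000000010000000000000014"],
)
```

A non-zero result means point-in-time recovery cannot roll forward across the cutover from the new destination alone, so the old destination has to stay readable until a fresh base backup lands in the new one.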

Verified correct:

Premise: same-cluster MinIO = local resilience, not DR. True.
pal-e-services/terraform/cnpg.tf lines 123-145 hold pal-e-postgres barmanObjectStore — matches ticket.
story:superuser-recover verified on project-pal-e-platform user-stories table.
Provider list (B2, AWS S3, R2, Spaces, off-cluster MinIO) — reasonable set; no sibling spike needed, decision gate is AC #1.

Full analysis + evidence in review-1066-2026-04-21.


forgejo_admin referenced this issue

2026-04-21 12:26:30 +00:00

P1: validate sop-postgres-restore via dry-run drill — backup we've never tested = no backup #298

forgejo_admin commented

2026-04-21 12:26:54 +00:00

Author

Contributor

Closing — scope collision with existing plan-pal-e-backup Phase 2.

Review review-1066-2026-04-21 surfaced that an existing active plan already scopes this exact problem with a different (and more complete) technical approach:

Plan Phase 2 approach: Daily pg_dump per database → Backblaze B2, 30-day retention, covers all 4 DBs (pal-e-docs CNPG, woodpecker CNPG, basketball-api plain pod, mcd-tracker plain pod).
This ticket's approach: Redirect CNPG barmanObjectStore.destinationPath off-cluster (continuous WAL). Covers only the 2 CNPG clusters; leaves plain-pod DBs uncovered.

Decision (2026-04-21): Path A — adopt plan-pal-e-backup Phase 2 as canonical, close #299.

Rationale:

pg_dump artifacts are restorable anywhere (no CNPG operator dependency at restore time). WAL-to-B2 introduces a tighter dependency chain in the DR scenario.
24h RPO is acceptable for these DBs (journal + wager volumes are small; the plan's DR math was done deliberately — "CNPG does continuous WAL locally. Cloud copy is disaster insurance, daily is sufficient.").
Plan covers plain-pod DBs that #299 would have left unaddressed.
The plan is the more thoroughly thought-through artifact (7 phases covering forgejo + MinIO + identity, not just databases). Pain of #297 is the right trigger to execute the plan rather than fork a parallel track.

See plan-pal-e-backup for the canonical scope. Off-cluster destination work will be delivered as tickets cut from Phase 1 (foundation — B2 bucket + creds) and Phase 2 (pg_dump CronJobs).
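To make the Phase 2 shape concrete (one dump per database per day, 30-day retention), here is a small sketch of the naming and pruning logic. The key layout and function names are assumptions made for illustration; `plan-pal-e-backup` remains the source of truth for the actual CronJob design.

```python
from datetime import date, timedelta


def dump_key(database: str, day: date) -> str:
    """Object key for one database's daily dump (layout is hypothetical)."""
    return f"pg_dump/{database}/{day.isoformat()}.dump"


def expired(days_with_dumps: list[date], today: date, retention_days: int = 30) -> list[date]:
    """Return the dump dates that fall outside the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in days_with_dumps if d < cutoff)


# Illustrative dates only.
today = date(2026, 5, 31)
to_delete = expired([date(2026, 5, 1), date(2026, 4, 21), date(2026, 5, 30)], today)
```

With a daily schedule the worst-case data loss after a cluster-level failure is roughly one day of writes, which is the 24h RPO accepted in the rationale above.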


forgejo_admin closed this issue

2026-04-21 12:27:02 +00:00