P1: validate sop-postgres-restore via dry-run drill — backup we've never tested = no backup #298
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
ldraney/pal-e-platform#298
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Type
Spike
Lineage
Discovered while scoping
pal-e-platform#297(P0 tf-state drift). Live verification confirmed daily CNPG backups are completing, WAL continuous archive is active, and the SOPsop-postgres-restoreexists. But no validation note proves the procedure actually works. A backup that's never been restored is a backup that doesn't exist. This blocks #297 — the drift-reconcile work has a hard gate that says "if any step would causepal-e-postgres-1pod restart, run sop-postgres-restore dry-run first." That dry-run cannot happen for the first time during a P0 incident.Repo
forgejo_admin/pal-e-platformQuestion
Does
sop-postgres-restoreactually restore a working pal-e-postgres database from a daily backup + WAL replay, end-to-end, with current data integrity? How long does it take? What unstated prerequisites does it have?What to Explore
sop-postgres-restorestep-by-step against a recent daily backup (e.g.pal-e-postgres-daily-20260420020000).pal-e-postgres-restore-testinpostgres-restore-testns. Do NOT restore over the prod cluster. See "Scratch Namespace Setup" below for required YAML deltas.pal-e-postgres-rw.$(date -u -d '30 minutes ago' +%FT%TZ)). This is safely within the last-archived-WAL window AND safely after the most recent base backup. Do NOT target "5 minutes ago" — WAL archive is async and that window may not yet be archived (causes SOP failure mode: "recovery ended before configured recovery target was reached").Success Criteria
recoveryTargetapplied).validation-postgres-restore-2026-04-XXpublished with: timing, exact commands run, any gaps in the SOP, screenshots/log excerpts of success.sop-postgres-restoreupdated in pal-e-docs with any gap fixes; cross-referenced from the validation note.Time-box
4 hours of agent + Lucas execution time, measured from the moment the pre-flight checklist passes. If the procedure can't complete in that window (or fails), STOP and surface what's missing — that result is itself the answer (the SOP is broken, fix it as a separate ticket).
Environment
postgres-restore-test)s3://postgres-wal/onminio.minio.svc.cluster.local:9000(in-cluster MinIO)kubectl get backup -n postgres): 8 daily backups completed clean, 2026-04-14 through 2026-04-21sop-postgres-restore(slug exists; needs to be walked end-to-end)Pre-flight (run BEFORE the 4h timer starts)
Gate the drill on all five checks. Any failure → file as a separate ticket and abort.
cnpg-s3-credsreadable inpostgresns —kubectl get secret cnpg-s3-creds -n postgresreturns without error.kubectl get cluster -n postgres pal-e-postgres -o jsonpath='{.spec.imageName}'. This EXACT value must be used asimageNamein the scratch cluster.kubectl get deployment -n cnpg-system cnpg-controller-manager -o jsonpath='{.spec.template.spec.containers[*].image}'. Note in validation note; Barman Cloud Plugin is deprecated in CNPG 1.28, removed in 1.29.kubectl get ns postgres-restore-testreturns NotFound.phase: completed—kubectl get backup -n postgresshows ≥1 completed row.No Prod Writes (hard constraint)
This drill is 100% read-only on prod. The agent does NOT:
pg_switch_wal()(SOP Gotcha #3 — the drill uses last-archived WAL, not "now")pal-e-postgrespostgresnspal-e-postgres-rwAll mutation happens in
postgres-restore-testns only.Scratch Namespace Setup
Before bootstrapping the scratch cluster, apply these deltas to the YAML template in SOP Step 2:
metadata.namespace:postgres-restore-test(NOTpostgres— SOP default restores alongside prod, which is dangerous)metadata.name:pal-e-postgres-restore-testspec.imageName: exact value from Pre-flight check #2 (must match prod Pg major)spec.externalClusters[0].serverName: keep aspal-e-postgres(per SOP Gotcha #2 — this matches the MinIO object path; changing it breaks the restore)Copy the
cnpg-s3-credssecret into the scratch ns first (SOP Prerequisite #3):Baseline Capture (Lucas runs on prod, ONCE, before dispatching drill agent)
To satisfy the sample-row AC without the agent ever connecting psql to prod:
Lucas runs read-only queries against each prod DB (paledocs, twitch2kwager, basketball_test) — for each, pick a core/high-activity table and capture
SELECT COUNT(*), MAX(updated_at)(or equivalent timestamp column). Paste outputs into the validation note BEFORE dispatch.The drill agent runs the EXACT same queries against the RESTORED cluster and compares. Full-match = PASS sample-row AC. No psql connections from the agent to prod.
Acceptance Criteria
validation-postgres-restore-2026-04-XXpublished with PASS verdict OR explicit FAIL + gap listsop-postgres-restoreimprovement ticketsop-postgres-restoreSOP updated with any clarifications discovered during the drill (including scratch-ns YAML delta and no-prod-writes clause, if not already there)postgres-restore-testcleaned up after drill (no orphaned PVCs)pal-e-platform#297confirming blocking gate satisfied (or NOT)pal-e-postgrescluster across the entire drill (confirmed via absence of psql sessions in agent logs)Out of Scope
plan-pal-e-backupPhase 2 (daily pg_dump → Backblaze B2 for all 4 DBs). Companion ticket #299 closed 2026-04-21; scope belongs to the plan.arch-cnpgnote (deferred;arch:cnpglabel resolves toproject-pal-e-platformDeployment diagram for now — downstream scope ticket)Related
pal-e-platform#297— blocked by this; #297's hard gate requires this drill to be PASS before CNPG-touching worksop-postgres-restore— the procedure being validatedplan-pal-e-backupPhase 2 — canonical scope for off-cluster backup destination (replaces closed #299)feedback_validate_before_done.md— never trust untested guaranteesfeedback_funnel_requires_auth.md— postgres holds PII (journal entries, user data)feedback_never_write_prod_db.md— prod is read-only; scratch namespace for the drillRefined 2026-04-21 per
review-1065-2026-04-21— 5 [BODY] edits: pre-flight checklist, scratch-ns YAML delta, baseline capture mechanism, PITR 30-min target rule, no-prod-writes clause.Scope Review: NEEDS_REFINEMENT
Review note:
review-1065-2026-04-21Scope is strong — lineage to #297 is airtight, spike template is complete, story/arch labels both resolve. Five concrete body edits needed before todo→next_up because latent footguns around prod read-only discipline + missing pre-flight prereqs could blow the 4h time-box or (worst case) touch prod.
[BODY] edits required:
cnpg-s3-credssecret, prod Pg image tag (SOP Gotcha #1 is the #1 failure mode), CNPG operator version, scratch ns not pre-existing, at least onephase: completedbackup.namespace: postgres(prod ns!). Ticket specifies scratch nspostgres-restore-testbut does NOT show adjusted YAML, does NOT document copyingcnpg-s3-credsinto scratch ns, does NOT preserveexternalClusters.serverName: pal-e-postgres(SOP Gotcha #2). High risk of accidental prod-ns restore on verbatim copy-paste.SELECT COUNT(*), MAX(updated_at)baseline BEFORE drill. Agent queries restored cluster only. Agent never touches prodpal-e-postgres-rw.pg_switch_wal()is a write; ticket must explicitly SKIP it. Drill is 100% read-only on prod, all mutation in scratch ns.[SCOPE] downstream (not blocking): No
arch-cnpgnote exists in pal-e-docs;arch:cnpglabel resolves to project-page Architecture section. Codify convention or create the note — file separately.No decomposition needed. No label changes. No repo change needed. Once the five [BODY] edits land on #298, verdict upgrades to READY.
Full analysis (template completeness, traceability, file targets, blast radius, AC assessment): see review note
review-1065-2026-04-21in pal-e-docs.Scope Review R2: APPROVED
Review note:
review-1065-2026-04-21-r2All five round-1
[BODY]edits landed cleanly and closed their respective gaps. No regressions. Ready to advancetodo→next_up.Round-1 gaps closed:
pg_switch_walcalled out)Verified out-of-band:
forgejo_admin/pal-e-platform#299closed 2026-04-21, scope routed toplan-pal-e-backupPhase 2 ("Database Backups") as canonicalarch-cnpgnote deferral acceptable (explicitly captured in Out of Scope)Non-blocking nits captured in the review note (Pg 17.x phrasing, quarterly re-run ticket suggestion) — do not require refinement.
Downstream
[SCOPE]carry-over: filearch-cnpgnote or convention clarifier when convenient. Not blocking this ticket.Pre-flight snapshot — 2026-04-21 (Ava, ahead of drill dispatch)
All 5 pre-flight gates are GREEN. Recording exact values so the drill agent does not re-derive them.
cnpg-s3-credsinpostgresnsghcr.io/cloudnative-pg/postgresql:17.4-1← drill agent MUST use this exact value forspec.imageNamein scratch clusterghcr.io/cloudnative-pg/cloudnative-pg:1.28.1postgres-restore-testpal-e-postgres-daily-20260414..21 020000)Finding for SOP gap list
The ticket body's pre-flight #3 suggested
kubectl get deployment -n cnpg-system cnpg-controller-manager ...— actual deployment name in this cluster iscnpg-cloudnative-pg. The drill agent should record this discrepancy and updatesop-postgres-restoreif it has the same stale name.Operator-version consideration (non-blocker)
CNPG operator is at 1.28.1. Per SOP Gotcha #4, Barman Cloud Plugin is deprecated in 1.28 and removed in 1.29 — we are currently on the last version where native
barmanObjectStoreworks. Native path is fine for this drill; upgrade to the barman-plugin architecture is a separate concern (phase-postgres-4a-barman-plugin-migration).Next step
Waiting on Lucas's baseline capture against prod (see ticket body's "Baseline Capture" section) before dispatching the drill agent.
Baseline captured — drill ready to dispatch
Baseline note:
validation-postgres-restore-2026-04-21(created,in-progress)Summary for drill agent
2026-04-21T17:43:13Z2026-04-21T17:43:13Z(exact baseline time), not the ticket body's generic "30 min ago." Gives perfect comparator. Rationale + WAL-archival pre-check in the validation note.Full baseline + pre-flight snapshot in the validation note. Dispatching drill agent next.
Verdict: PASS
Drill completed 2026-04-21T17:53Z. Full results in
validation-postgres-restore-2026-04-21.Summary
2026-04-21T17:43:13Zreproduces baseline values EXACTLY for paledocs (8 tables) and twitch2kwager (5 of 6 tables;game_completebaseline-capture anomaly — not a restore bug; see validation note).sop-postgres-restore. 1 spun out aspal-e-platform#300(real-DR swap hardening — out of drill scope).All 9 ticket AC met
validation-postgres-restore-2026-04-21published with PASS verdictsop-postgres-restoreupdated in-place (all prereq, YAML deltas, netpol footgun, Step 4 expansion, Gotcha #1 rewrite, deployment-name fix, last-tested date)postgres-restore-testcleaned up (NotFound; PVCs gone; netpol reverted)pal-e-platform#297posted (gate satisfied)pal-e-postgres-rw, nopg_switch_wal(), no DDL/DMLBlocker gaps that would have broken a cold SOP read
:17.4-1bundles barman-cloud 3.13.0 which is incompatible with CNPG 1.28.1 (errorsunrecognized arguments). Restore MUST use:17tag. SOP Gotcha #1 was correct but the prod-vs-restore-tag distinction was not explicit. Now fixed.default-deny-ingressinminions only allows 7 namespaces; a scratch-ns restore pod gets blocked atbarman-cloud-backup-listwith a cryptic endpoint-URL error. SOP now documents the patch-and-revert pattern.Key timing values
Time-box was 4 hours; drill completed in ~7 minutes of actual work once the two blockers were identified. The bulk of SOP revision happened in parallel.