SOP harden: sop-postgres-restore Step 5 (real-DR swap) needs PDB + ArgoCD lock + service-collision runbook


forgejo_admin commented

2026-04-21 17:57:40 +00:00

Contributor

Type

SOP hardening (discovered scope)

Lineage

Surfaced during pal-e-platform#298 restore drill (2026-04-21). Drill was PASS, but gap #7 in validation-postgres-restore-2026-04-21 is out of scope for the drill itself: Step 5 "Swap (if replacing production)" in sop-postgres-restore describes the happy-path sequence but doesn't address three real-world footguns.

Repo

forgejo_admin/pal-e-platform

User Story

As the on-call engineer running a real P0 Postgres recovery, I need sop-postgres-restore Step 5 (Swap) to cover the three operational footguns that block a clean cutover — so that a cluster-restore incident doesn't trigger a second incident on top of the first.

Context

Step 5 currently says: scale app to 0, delete old cluster, rename or repoint, scale app back up. Real-world swap requires more:

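1. App PodDisruptionBudgets: a PDB only gates voluntary evictions made through the Eviction API (for example `kubectl drain`). It does not block a direct `kubectl scale --replicas=0`, and `kubectl scale` has no `--ignore-pdb` flag. Confirm on the scratch cluster whether the app's PDB interferes with the quiesce path actually used. If it does, delete the PDB first or, for a node drain, use `kubectl drain --disable-eviction`.
2. ArgoCD sync lock: if pal-e-postgres is ArgoCD-managed with automated sync and self-heal enabled, ArgoCD will re-create the deleted cluster on its next reconcile. Disable automated sync on the Application before the delete (for example `argocd app set <app> --sync-policy none`). The `argocd.argoproj.io/sync-options: Prune=false` annotation is not sufficient on its own: it is set on individual resources and only stops ArgoCD from pruning them, not from re-creating a resource that is still declared in Git.
3. Service-name collisions: CNPG creates `<cluster>-rw`, `<cluster>-ro`, and `<cluster>-r` services for each cluster. A restored cluster that reuses the old name cannot coexist with the old one in the same namespace, and Kubernetes objects cannot be renamed in place. The swap must therefore either re-create the restored cluster under the old name after the old one is deleted, or atomically cut the app's DATABASE_URL over to the new cluster's services (no graceful handoff window).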
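
To make the service-name point concrete, here is a minimal sketch that derives the three Service names from a cluster name and checks two clusters for overlap. The restore cluster name `pal-e-postgres-restore` is a hypothetical placeholder, and the helper names are illustrative only; the suffixes are the ones listed in the ticket.

```shell
# Print the three Service names CNPG derives from a Cluster name.
cnpg_services() {
  local cluster="$1"
  printf '%s\n' "${cluster}-rw" "${cluster}-ro" "${cluster}-r"
}

# Return 0 when two cluster names would yield at least one identical Service name.
services_collide() {
  local a b
  for a in $(cnpg_services "$1"); do
    for b in $(cnpg_services "$2"); do
      if [ "$a" = "$b" ]; then
        return 0
      fi
    done
  done
  return 1
}

cnpg_services pal-e-postgres
# "pal-e-postgres-restore" is a hypothetical name for the restored cluster.
if services_collide pal-e-postgres pal-e-postgres-restore; then
  echo "collision"
else
  echo "no collision"
fi
```

Under these assumptions a restore cluster with a distinct name gets distinct services, so the practical decision is whether to reuse the old name or to repoint DATABASE_URL.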

Without these steps in the runbook, a P0 swap has a real chance of a second incident on top of the first.

File Targets

- pal-e-docs note sop-postgres-restore (slug): Step 5b section expansion
- No code changes required for this ticket; doc-only

Test Expectations

- SOP contains new Step 5b.1, 5b.2, 5b.3 subsections covering each footgun with verified kubectl/argocd commands
- Pre-swap checklist (app quiesced? ArgoCD suspended? DATABASE_URL cutover method decided?) present
- Post-swap checklist (write traffic? alerts green? backup job re-enrolled with new cluster name?) present
- Optional stretch: dry-run the swap against a throwaway prod-clone cluster. If pursued, spin as a separate ticket.
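
The pre-swap checklist could be enforced as a small gate before any destructive step. The sketch below is only an illustration: the function name, its three arguments, and the accepted values (`none` for the sync policy, `rename` or `repoint` for the cutover method) are assumptions, and the operator is assumed to collect the values by hand or from read-only queries.

```shell
# Hypothetical pre-swap gate: prints one FAIL line per unmet checklist item.
# Arguments: <app replicas> <argocd sync policy> <cutover method>
preswap_ready() {
  local app_replicas="$1" argocd_policy="$2" cutover_method="$3"
  local failed=0
  if [ "$app_replicas" != "0" ]; then
    echo "FAIL: app not quiesced (replicas=$app_replicas)"
    failed=1
  fi
  if [ "$argocd_policy" != "none" ]; then
    echo "FAIL: ArgoCD automated sync still enabled"
    failed=1
  fi
  case "$cutover_method" in
    rename|repoint) ;;
    *)
      echo "FAIL: DATABASE_URL cutover method not decided"
      failed=1
      ;;
  esac
  if [ "$failed" -eq 0 ]; then
    echo "PASS: pre-swap checklist"
  fi
  return "$failed"
}

preswap_ready 0 none repoint
preswap_ready 2 automated undecided || echo "swap blocked"
```

A post-swap gate could follow the same shape with the three post-swap questions as inputs.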

Constraints

- 100% read-only on prod pal-e-postgres during authoring
- Every command in the new section must be verified against a live (non-prod) CNPG cluster before merging
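
One way to back these constraints with tooling is a guard that refuses to run unless the active kubectl context is on an explicit allowlist. This is a sketch under stated assumptions: the context names in `SCRATCH_CONTEXTS` are hypothetical, and the real scratch cluster's context name would need to be substituted.

```shell
# Hypothetical allowlist of scratch contexts; replace with the real names.
SCRATCH_CONTEXTS="${SCRATCH_CONTEXTS:-scratch-cnpg kind-cnpg-scratch}"

# Return 0 only when the given context name is on the allowlist.
is_scratch_context() {
  local ctx="$1" allowed
  for allowed in $SCRATCH_CONTEXTS; do
    if [ "$ctx" = "$allowed" ]; then
      return 0
    fi
  done
  return 1
}

# Read the active kubectl context and refuse anything not allowlisted.
require_scratch() {
  local ctx
  ctx="$(kubectl config current-context 2>/dev/null || true)"
  if is_scratch_context "$ctx"; then
    echo "OK: context '$ctx' is allowlisted"
  else
    echo "REFUSING: context '${ctx:-unknown}' is not an allowlisted scratch cluster" >&2
    return 1
  fi
}
```

Calling `require_scratch || exit 1` at the top of any verification script keeps an accidental prod context from getting past the first line. An allowlist is used rather than a "not prod" check so that an unknown context fails closed.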

Acceptance Criteria

- [ ] sop-postgres-restore Step 5b expanded into three subsections (5b.1 Quiesce, 5b.2 Cut over, 5b.3 Resume)
- [ ] Pre-swap checklist added and reviewed
- [ ] Post-swap checklist added and reviewed
- [ ] Every kubectl/argocd command verified against a scratch CNPG cluster (not prod), with evidence pasted into PR body
- [ ] Cross-link from validation-postgres-restore-2026-04-21 gap #7 updated to point at the resolved SOP revision
- [ ] No AC item touches prod

Checklist

- [ ] Draft Step 5b.1 Quiesce (PDB handling, ArgoCD suspend)
- [ ] Draft Step 5b.2 Cut over (service rename vs DATABASE_URL atomic switch)
- [ ] Draft Step 5b.3 Resume (unpause ArgoCD, scale app, verify write + metrics)
- [ ] Pre-swap and post-swap checklists authored
- [ ] Verify every command on scratch cluster
- [ ] Update SOP + cross-link
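
As a starting point for the three drafts, the sequence might look like the dry-run sketch below. Nothing here has been verified against a CNPG cluster, which the ticket requires before merging. The namespace, Deployment name, ArgoCD Application name, restore cluster name, replica count, and connection URL are all hypothetical placeholders, and the cut-over step assumes DATABASE_URL is a plain environment variable on the Deployment rather than a Secret reference.

```shell
# All values below are hypothetical placeholders, not the real pal-e names.
NS="${NS:-pal-e}"
APP_DEPLOY="${APP_DEPLOY:-pal-e-app}"
ARGO_APP="${ARGO_APP:-pal-e-postgres}"
NEW_CLUSTER="${NEW_CLUSTER:-pal-e-postgres-restore}"
REPLICAS="${REPLICAS:-2}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only (default), 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

quiesce() {   # 5b.1: inspect PDBs, stop ArgoCD automated sync, scale the app down
  run kubectl get pdb -n "$NS"
  run argocd app set "$ARGO_APP" --sync-policy none
  run kubectl scale deployment "$APP_DEPLOY" -n "$NS" --replicas=0
}

cut_over() {  # 5b.2: repoint the app at the new cluster's -rw service
  run kubectl set env "deployment/$APP_DEPLOY" -n "$NS" \
    "DATABASE_URL=postgresql://${NEW_CLUSTER}-rw:5432/app"
}

resume() {    # 5b.3: scale up, check the cluster, then re-enable automated sync
  run kubectl scale deployment "$APP_DEPLOY" -n "$NS" --replicas="$REPLICAS"
  run kubectl get clusters.postgresql.cnpg.io "$NEW_CLUSTER" -n "$NS"
  # Re-enable only after Git declares the new cluster, or ArgoCD may revert the swap.
  run argocd app set "$ARGO_APP" --sync-policy automated
}
```

With the default `DRY_RUN=1`, running `quiesce; cut_over; resume` only prints the seven commands, which makes the output easy to paste into the PR body for review before anything is executed on the scratch cluster.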

Related

- pal-e-platform#298: parent drill ticket (PASS verdict)
- validation-postgres-restore-2026-04-21: drill results, gap #7
- sop-postgres-restore: SOP being hardened
- feedback_validate_before_done.md: never trust untested runbooks

forgejo_admin referenced this issue

2026-04-21 17:58:32 +00:00

P1: validate sop-postgres-restore via dry-run drill — backup we've never tested = no backup #298

SOP harden: sop-postgres-restore Step 5 (real-DR swap) needs PDB + ArgoCD lock + service-collision runbook #300