SOP harden: sop-postgres-restore Step 5 (real-DR swap) needs PDB + ArgoCD lock + service-collision runbook #300

Open
opened 2026-04-21 17:57:40 +00:00 by forgejo_admin · 0 comments

Type

SOP hardening (discovered scope)

Lineage

Surfaced during pal-e-platform#298 restore drill (2026-04-21). Drill was PASS, but gap #7 in validation-postgres-restore-2026-04-21 is out of scope for the drill itself: Step 5 "Swap (if replacing production)" in sop-postgres-restore describes the happy-path sequence but doesn't address three real-world footguns.

Repo

forgejo_admin/pal-e-platform

User Story

As the on-call engineer running a real P0 Postgres recovery, I need sop-postgres-restore Step 5 (Swap) to cover the three operational footguns that block a clean cutover — so that a cluster-restore incident doesn't trigger a second incident on top of the first.

Context

Step 5 currently says: scale app to 0, delete old cluster, rename or repoint, scale app back up. Real-world swap requires more:

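  1. App PodDisruptionBudgets: apps with PDBs may stall the quiesce step. PDBs gate the eviction API (for example kubectl drain), not a direct scale to replicas=0, and kubectl has no --ignore-pdb flag. If an eviction-based step is blocked, either delete the PDB first or use kubectl drain --disable-eviction; confirm the actual behavior on a scratch cluster.
  2. ArgoCD sync lock: if pal-e-postgres is ArgoCD-managed with automated sync and self-heal, ArgoCD will recreate the deleted cluster immediately. Need to disable automated sync on the Application before the delete. The argocd.argoproj.io/sync-options: Prune=false annotation is set on individual resources and only stops pruning, so it likely does not prevent recreation on its own; verify on a scratch cluster.
  3. Service-name collisions: CNPG creates -rw, -ro, -r services per cluster, named after the cluster. The new cluster's services will collide with the old cluster's unless one set is renamed first; otherwise the app's DATABASE_URL must cut over atomically (no graceful handoff window).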

Without these steps in the runbook, a P0 swap has a real chance of a second incident on top of the first.
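
The quiesce portion of the swap could be scripted roughly as follows. This is a minimal sketch, assuming the app is a Deployment and the ArgoCD Application shares the cluster's name. The namespace, Deployment name, and Application name are hypothetical placeholders. The script only prints commands unless DRY_RUN=0 is set, and none of it has been verified against a CNPG cluster, which this ticket requires before anything lands in the SOP.

```shell
#!/usr/bin/env bash
# Sketch only: prints the quiesce commands by default (DRY_RUN=1).
# Every name below is a hypothetical placeholder, not a value from the SOP.
set -eu

DRY_RUN="${DRY_RUN:-1}"
APP_NS="${APP_NS:-my-app}"              # hypothetical app namespace
APP_DEPLOY="${APP_DEPLOY:-my-app}"      # hypothetical Deployment name
ARGO_APP="${ARGO_APP:-pal-e-postgres}"  # assumed ArgoCD Application name

# Echo the command in dry-run mode; execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

quiesce() {
  # 1. Turn off automated sync so ArgoCD does not recreate deleted resources.
  run argocd app set "$ARGO_APP" --sync-policy none
  # 2. List PDBs so the operator knows which ones exist before proceeding.
  run kubectl get pdb -n "$APP_NS"
  # 3. Scale the app to zero replicas.
  run kubectl scale "deployment/$APP_DEPLOY" --replicas=0 -n "$APP_NS"
  # 4. List pods so the operator can confirm none remain.
  run kubectl get pods -n "$APP_NS"
}
```

Calling `quiesce` with the defaults prints four commands, one per line, each prefixed with `+`. ArgoCD sync is disabled first so that nothing deleted later in the swap is recreated mid-incident.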

File Targets

  • pal-e-docs note sop-postgres-restore (slug) — Step 5b section expansion
  • No code changes required for this ticket; doc-only

Test Expectations

  • SOP contains new Step 5b.1, 5b.2, 5b.3 subsections covering each footgun with verified kubectl/argocd commands
  • Pre-swap checklist (app quiesced? ArgoCD suspended? DATABASE_URL cutover method decided?) present
  • Post-swap checklist (write traffic? alerts green? backup job re-enrolled with new cluster name?) present
  • Optional stretch: dry-run the swap against a throwaway prod-clone cluster. If pursued, spin it off as a separate ticket.
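
Part of the post-swap checklist could be backed by read-only inspection commands, sketched below. It assumes backups are driven by CNPG ScheduledBackup resources, which this ticket does not confirm, and the namespace and new cluster name are hypothetical. The "write traffic" and "alerts green" items remain manual checks. As written, the script only prints commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only: prints read-only post-swap checks by default (DRY_RUN=1).
# Namespace and cluster name are hypothetical placeholders.
set -eu

DRY_RUN="${DRY_RUN:-1}"
DB_NS="${DB_NS:-postgres}"                            # hypothetical namespace
NEW_CLUSTER="${NEW_CLUSTER:-pal-e-postgres-restore}"  # hypothetical new cluster name

# Echo the command in dry-run mode; execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# CNPG names its services <cluster>-rw, <cluster>-ro and <cluster>-r.
cnpg_services() {
  for suffix in rw ro r; do
    echo "$1-$suffix"
  done
}

post_swap_checks() {
  # The new Cluster resource exists and reports its status.
  run kubectl get clusters.postgresql.cnpg.io "$NEW_CLUSTER" -n "$DB_NS"
  # Each of the three per-cluster services exists under the expected name.
  for svc in $(cnpg_services "$NEW_CLUSTER"); do
    run kubectl get service "$svc" -n "$DB_NS"
  done
  # Assumption: backups use ScheduledBackup; the operator checks that one
  # of the listed resources targets the new cluster name.
  run kubectl get scheduledbackups.postgresql.cnpg.io -n "$DB_NS"
}
```

Calling `post_swap_checks` with the defaults prints five `kubectl get` commands. All of them are read-only, in line with the constraint that authoring never writes to prod.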

Constraints

  • 100% read-only on prod pal-e-postgres during authoring
  • Every command in the new section must be verified against a live (non-prod) CNPG cluster before merging

Acceptance Criteria

  • sop-postgres-restore Step 5b expanded into three subsections (5b.1 Quiesce, 5b.2 Cut over, 5b.3 Resume)
  • Pre-swap checklist added and reviewed
  • Post-swap checklist added and reviewed
  • Every kubectl/argocd command verified against a scratch CNPG cluster (not prod) — evidence pasted into PR body
  • Cross-link from validation-postgres-restore-2026-04-21 gap #7 updated to point at the resolved SOP revision
  • No AC item touches prod

Checklist

  • Draft Step 5b.1 Quiesce (PDB handling, ArgoCD suspend)
  • Draft Step 5b.2 Cut over (service rename vs DATABASE_URL atomic switch)
  • Draft Step 5b.3 Resume (unpause ArgoCD, scale app, verify write + metrics)
  • Pre-swap and post-swap checklists authored
  • Verify every command on scratch cluster
  • Update SOP + cross-link

Related

  • pal-e-platform#298 — parent drill ticket (PASS verdict)
  • validation-postgres-restore-2026-04-21 — drill results, gap #7
  • sop-postgres-restore — SOP being hardened
  • feedback_validate_before_done.md — never trust untested runbooks