ldraney/pal-e-platform

P0: pal-e-services terraform state drifted from cluster reality — prod postgres in blast radius #297

Open

opened 2026-04-21 03:00:06 +00:00 by forgejo_admin · 4 comments

forgejo_admin commented

2026-04-21 03:00:06 +00:00

Contributor

Type

Bug

Lineage

Discovered during tofu plan -var-file=k3s.tfvars on pal-e-services while adding the pal-e-docs Keycloak realm (pal-e-services#58 / #59). Plan showed 10 to add, 11 to change, 0 to destroy — but 5 of the "creates" already exist in the cluster, including the prod CNPG postgres cluster. Applying the full plan could destroy or duplicate production data.

Repo

Tracking lives on forgejo_admin/pal-e-platform. Code lands across two repos:

Execution Repos:

forgejo_admin/pal-e-platform — SOP and convention updates (service-onboarding-sop, sop-platform-tf-changes).
forgejo_admin/pal-e-services — terraform tofu import, configuration removal, k3s.tfvars adjustments. Branch naming should reflect the repo, e.g. 297-tf-drift-cnpg-import on pal-e-services and 297-tf-drift-sop-update on pal-e-platform.

What Broke

Terraform state for forgejo_admin/pal-e-services has diverged from live cluster reality on at least 5 resources. Something (Helm? kubectl apply? ArgoCD auto-sync? manual kubectl create?) provisioned these resources outside terraform, and they were never imported into state.

Resources terraform plan wants to CREATE that already exist:

| Resource | Cluster reality | Risk |
|---|---|---|
| `kubernetes_manifest.cnpg_cluster` | `clusters.postgresql.cnpg.io/pal-e-postgres` in `postgres` ns, 49d old, healthy, primary `pal-e-postgres-1` serving traffic | CRITICAL: prod postgres, 3+ live databases (paledocs, twitch2kwager, basketball_test) + user data |
| `kubernetes_manifest.cnpg_scheduled_backup` | `scheduledbackups.postgresql.cnpg.io/pal-e-postgres-daily`, last backup ~63min ago | Backup schedule + MinIO target |
| `argocd_application.service["pal-e-mail"]` | `argocd/pal-e-mail` Application, Synced + Healthy | pal-e-mail is "ARCHIVED" per memory but the ArgoCD app is alive; contradiction must be resolved (resurrect canonically OR remove from tf config) |
| `kubernetes_secret_v1.harbor_creds["pal-e-app"]` | Secret exists in `pal-e-app` ns, 8d old | Container image pull creds |
| `harbor_robot_account.service_ci["playme2k"]` | unverified Harbor-side | CI push creds for playme2k |

Resources terraform plan wants to UPDATE (mostly cosmetic + credential cleanup):

10× kubernetes_secret_v1.harbor_creds[*] — strip argocd.argoproj.io/instance label (ArgoCD re-adds on next sync → churn loop)
1× kubernetes_ingress_v1.service_funnel["pal-e-app"] — same label strip
2× of the harbor_creds (playme2k, westside-ai-assistant) include credential data rotation — currently placeholder, tf would fill real robot creds

Why this is P0

1. CNPG manifests touch the prod postgres. Applying a `kubernetes_manifest` resource for a cluster that already exists typically no-ops if specs match, but if the tf-rendered spec differs from the live cluster (operator version drift, field defaults, etc.), the CNPG operator could reconcile. Worst case: cluster recreation and data loss.
2. CNPG operator reconcile semantics are non-trivial. A `kubernetes_manifest` resource against an operator-reconciled CRD (CNPG Cluster) does NOT have the same import semantics as a plain k8s resource. The CNPG operator re-projects fields the terraform manifest does not declare. The header comment at `cnpg.tf:58-60` already acknowledges this ("CNPG operator manages all other fields... including them here would cause perpetual plan drift"). After import, expect non-trivial plan output until either the terraform manifest precisely matches operator-projected state for the managed fields, OR we explicitly accept perpetual no-op drift on operator-managed subtrees. This is the technical source of the "import worked but we now have a forever-drift loop" risk.
3. Anyone running `tofu apply` without `-target` to make an unrelated change lights this fuse. Phase 28 AC explicitly required "tofu plan on existing state shows zero changes (import complete)"; that guarantee has been lost.
4. Any future onboarding PR (following `service-onboarding-sop`) will bring this drift along by default.

Repro Steps

1. `cd ~/pal-e-services/terraform && tofu plan -lock=false -var-file=k3s.tfvars`
2. Observe: `Plan: 10 to add, 11 to change, 0 to destroy.`
3. For each "create" line, verify in cluster with `kubectl get <kind> -A | grep <name>`: at least 4 of the 5 creates already exist as live resources.
4. For each "update" line, observe that most strip the `argocd.argoproj.io/instance` label, which ArgoCD re-adds automatically.
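
The observe-and-verify steps above can be made repeatable by saving the plan text and pulling the counts and the "create" addresses out of it. The following is a minimal sketch, assuming the plan was captured with `-no-color`. The helper names `plan_counts` and `plan_creates` are hypothetical and do not exist in pal-e-services.

```shell
# Sketch only: both helpers read saved `tofu plan -no-color` text on stdin.
# Helper names are hypothetical, not part of any repo.

# Print "<add> <change> <destroy>" taken from the plan summary line.
plan_counts() {
  sed -n 's/^Plan: \([0-9][0-9]*\) to add, \([0-9][0-9]*\) to change, \([0-9][0-9]*\) to destroy\.$/\1 \2 \3/p'
}

# Print the address of every resource the plan wants to create,
# one per line, ready to be checked against the cluster by hand.
plan_creates() {
  sed -n 's/^ *# \(.*\) will be created$/\1/p'
}

# Example (not executed here):
#   tofu plan -no-color -lock=false -var-file=k3s.tfvars > plan.txt
#   plan_counts  < plan.txt
#   plan_creates < plan.txt
```

Working from saved text keeps the check read-only: nothing here touches state or the cluster, and the list from `plan_creates` is only a starting point for the manual `kubectl get` verification.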

Expected Behavior

tofu plan shows 0 to add, 0 to change, 0 to destroy on a clean state. Every live cluster resource that should be terraform-managed is imported into terraform state. Every stale resource in terraform state that has been archived is removed from configuration.
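
As an illustration of what the zero-diff expectation could look like as a scripted check, here is a small sketch that inspects saved plan text for the "No changes." line. `require_zero_diff` is a hypothetical helper name, and the check assumes the plan was captured with `-no-color`.

```shell
# Sketch only: succeed when saved `tofu plan -no-color` text reports no changes.
# `require_zero_diff` is a hypothetical helper name.
require_zero_diff() {
  if grep -q '^No changes\.'; then
    echo "zero-diff: OK"
  else
    echo "zero-diff: FAILED, plan contains changes" >&2
    return 1
  fi
}

# Example (not executed here):
#   tofu plan -no-color -var-file=k3s.tfvars | require_zero_diff
```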

Environment

Cluster: prod (single cluster, multi-namespace)
Terraform repo: forgejo_admin/pal-e-services
State: local (per bootstrap convention)
Affected namespaces: postgres, argocd, pal-e-app, playme2k, westside-ai-assistant, basketball-api, gcal-scheduler, mcd-tracker, mcd-tracker-app, pal-e-docs, pal-e-mail, platform-validation, westsidekingsandqueens
Sibling cluster to verify is OUT of scope: clusters.postgresql.cnpg.io/woodpecker-db (in woodpecker ns, 37d old, healthy). Likely managed by pal-e-platform (not pal-e-services). Investigation must confirm this is NOT in the pal-e-services drift list before proceeding.

Acceptance Criteria

Drift investigation note created. A pal-e-docs note drift-investigation-2026-04-20 (note_type: doc, tags: drift,investigation) is published with a per-resource decision table: for each of the 5 "create" drifts, decide (a) tofu import (adopt the live resource into tf state), (b) remove from configuration (resource is archived/should not exist), or (c) kubectl delete followed by tofu apply, so that terraform creates and owns a fresh resource. The decision is recorded with its rationale per row.
Per-resource UPDATE decisions recorded in same note. For each of the 11 "update" drifts, decide whether the tf-rendered manifest or the cluster-side value should win (ArgoCD labels vs tf, credential rotation correctness).
Import plan executed one resource at a time, with verification gate. After EACH tofu import, run tofu plan and confirm the imported resource shows zero diff OR only operator-managed-field drift before moving to the next import. CNPG cluster import goes FIRST and ALONE.
No pod restart on pal-e-postgres-1 during reconciliation. If any step would cause restart, pause and confirm backup restorability via sop-postgres-restore dry-run BEFORE proceeding. This is a hard gate.
Zero-diff verification complete. tofu plan -var-file=k3s.tfvars returns No changes. Your infrastructure matches the configuration. (zero-diff gate per Phase 28).
service-onboarding-sop updated. Add a new "Plan-diff check" row to the Pre-Deploy Validation Checklist: "Run tofu plan and confirm zero-diff OR the diff is exclusively the intended change. If unintended changes appear, STOP and file a drift ticket before applying."
sop-platform-tf-changes updated. Add a new bullet under "What NOT to Do": "No tofu apply when plan diff includes unintended resources. Use -target to isolate or file a drift ticket."
Re-run keycloak onboarding plan (pal-e-services#58) → confirm 5-resource plan shows 5 creates + 0 other changes.
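
The import-one-verify-one gate in the criteria above could be scripted roughly as follows. This is a sketch under stated assumptions, not the agreed procedure: the helper names are hypothetical, the `apiVersion=...,kind=...,namespace=...,name=...` import ID is the format I recall the hashicorp/kubernetes provider documenting for `kubernetes_manifest` (confirm it against the installed provider version), the API version `postgresql.cnpg.io/v1` is assumed because the ticket only names the group, and scoping the follow-up plan with `-target` is my own choice to isolate the imported resource.

```shell
# Sketch only. Helper names are hypothetical.

# Build a kubernetes_manifest import ID (format assumed, see lead-in).
manifest_import_id() {
  printf 'apiVersion=%s,kind=%s,namespace=%s,name=%s\n' "$1" "$2" "$3" "$4"
}

# Import exactly one resource, then plan only that resource.
# Returns 0 on zero diff, 2 when a diff remains (needs human review),
# 1 on import or plan error.
import_one() {
  io_addr="$1"; io_id="$2"; io_rc=0
  tofu import -var-file=k3s.tfvars "$io_addr" "$io_id" || return 1
  tofu plan -var-file=k3s.tfvars -detailed-exitcode -target="$io_addr" || io_rc=$?
  case "$io_rc" in
    0) echo "OK: $io_addr shows zero diff; safe to move to the next import" ;;
    2) echo "REVIEW: $io_addr still has a diff; confirm it is operator-managed drift only" ;;
    *) echo "ERROR: plan failed for $io_addr" ;;
  esac
  return "$io_rc"
}

# Example (not executed here), CNPG cluster first and alone:
#   import_one kubernetes_manifest.cnpg_cluster \
#     "$(manifest_import_id postgresql.cnpg.io/v1 Cluster postgres pal-e-postgres)"
```

Exit code 2 is surfaced instead of being treated as a hard failure because the criteria allow operator-managed-field drift after import; a person still has to read the diff before the next import starts.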

Scope Boundary

This ticket is investigation + import/cleanup PRs. Does NOT include:

Schema changes to CNPG (DB migrations are out of scope)
Moving from local tf state to remote backend (separate ticket — see Related: phase-platform-17b-tf-state-governance may absorb this in the future)
Migrating manual-tool-managed resources to IaC (separate tickets per tool, e.g. Helm-managed → tf-managed)

Interim Safety Protocol (in effect NOW)

This protocol supersedes sop-platform-tf-changes for pal-e-services until the zero-diff gate is restored. sop-platform-tf-changes defines pal-e-services as plan-and-apply-before-merge — DO NOT follow that pattern until this ticket closes.

Until the zero-diff gate is restored:

No tofu apply on pal-e-services without -target. Every apply must isolate the specific intended resource.
pal-e-services#59 (pal-e-docs Keycloak realm) will be applied with -target per Path B documented in that PR.
Any other onboarding work that would touch pal-e-services should pause and link this ticket as a blocker, OR follow the same -target workaround pattern with explicit Lucas approval.
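
One way to make the "no apply without `-target`" rule hard to skip by accident is a thin wrapper around `tofu apply`. The sketch below is illustrative only: `guarded_apply` is a hypothetical name, the exit code 64 is arbitrary, and nothing like it exists in pal-e-services today.

```shell
# Sketch only: refuse `tofu apply` unless at least one -target is given.
# `guarded_apply` is a hypothetical wrapper name.
guarded_apply() {
  ga_has_target=no
  for ga_arg in "$@"; do
    case "$ga_arg" in
      -target=*|-target) ga_has_target=yes ;;
    esac
  done
  if [ "$ga_has_target" = no ]; then
    echo "refusing: tofu apply on pal-e-services requires -target until #297 closes" >&2
    return 64
  fi
  tofu apply "$@"
}

# Example (not executed here; the resource address is a placeholder):
#   guarded_apply -var-file=k3s.tfvars -target='<resource address>'
```

The wrapper only checks that a target was supplied; it does not check that the target is the intended resource, so the plan still needs to be read before approving.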

Related

pal-e-services#58 / #59: the PR that surfaced this; currently using -target workaround per Interim Safety Protocol
phase-pal-e-platform-28-keycloak-smtp — Phase 28 AC ("zero-diff plan gate") now violated
phase-platform-17b-tf-state-governance — canonical phase for tf-state-governance work; this P0 ticket may be the trigger that kicks off 17b execution. Open scope decision (Ava): does this ticket absorb 17b, or does 17b remain as the remote-state-backend epic and this stays standalone? Likely standalone — 17b's primary scope is remote backend migration, this is import/reconciliation.
service-onboarding-sop — step 5 says "Requires Lucas approval before apply" — needs stronger drift-detection clause (see AC above)
sop-platform-tf-changes — needs "No apply when plan diff includes unintended resources" bullet (see AC above)
sop-postgres-restore — required for the no-pod-restart AC (backup-restore dry-run before any CNPG step)
feedback_never_alter_prod_directly.md — no direct prod writes
feedback_never_write_prod_db.md — postgres is in blast radius
feedback_enterprise_no_workarounds.md — fix it right, don't hack around
deployment-lessons — past terraform-drift fixes in the lessons-learned note

forgejo_admin commented

2026-04-21 03:05:42 +00:00

Author

Contributor

Scope Review: NEEDS_REFINEMENT

Review note: review-1064-2026-04-20

Scope is fundamentally correct and P0 severity is justified — live cluster verification confirms every "create" drift claimed. Ticket is well-written on the whole. Refinements required before todo→next_up, then route to decomposition.

Required fixes

[LABEL] story:platform-bootstrap is not a valid story key on project-pal-e-platform. Replace with story:superuser-deploy (strongest fit — success metric for that story is exactly what's broken) or story:superuser-onboard-service.
[BODY] Phase reference phase-platform-28-keycloak-declarative-onboarding does not exist. Actual slug: phase-pal-e-platform-28-keycloak-smtp. Also add phase-platform-17b-tf-state-governance to Related — this ticket is essentially its P0 trigger.
[BODY] Add explicit "Execution Repos" note: SOP updates land in pal-e-platform; tofu import/config cleanup lands in pal-e-services.
[BODY] AC #1 and #2 need an explicit location for per-resource decisions — recommend a drift-investigation-2026-04-20 pal-e-docs note.
[BODY] AC #3 needs per-resource zero-diff gate (import-one-verify-one, not batch).
[BODY] AC #5 needs to split into two explicit SOP updates (service-onboarding-sop checklist row + sop-platform-tf-changes What-NOT-to-Do bullet), not "or."
[BODY] Add AC: "No pod restart on pal-e-postgres-1 during reconciliation; if any step would cause restart, confirm backup restorability via sop-postgres-restore dry-run first."
[BODY] Call out CNPG operator reconcile semantics in "Why this is P0" — the technical source of forever-drift risk on kubernetes_manifest against operator-reconciled CRDs. cnpg.tf:58-60 header comment already acknowledges this.
[BODY] Reconcile Interim Safety Protocol vs sop-platform-tf-changes: the protocol supersedes the SOP's plan-and-apply-before-merge pattern for pal-e-services until zero-diff is restored — state that explicitly.
[BODY] Flag woodpecker-db CNPG cluster in scope: verify during investigation that it's pal-e-platform-managed and NOT in the drift list.

Decomposition

[DECOMPOSE] 5-minute-rule fails (>5 file targets across 2 repos, 6 ACs, CNPG import is careful verify-after-each-step work). After BODY fixes, route to skill-decompose-ticket for a child board. Recommended 6 sub-tickets: (1) drift investigation note, (2) CNPG cluster+backup import alone, (3) harbor_creds+argocd_application drift bulk, (4) pal-e-mail keep-or-remove decision, (5) SOP updates, (6) zero-diff verification + lift safety protocol.

Open scope question

[SCOPE] Ava to decide: is this ticket the P0 expression of phase-platform-17b-tf-state-governance, or does 17b remain as the remote-state-backend epic separately? If former, update 17b to reference #297.

Full review in note review-1064-2026-04-20.

forgejo_admin commented

2026-04-21 03:08:27 +00:00

Author

Contributor

Body refined per /review-ticket round 1 verdict NEEDS_REFINEMENT (review note review-1064-2026-04-20). Changes applied to body:

[BODY] Phase reference fixed: phase-platform-28-keycloak-declarative-onboarding (didn't exist) → phase-pal-e-platform-28-keycloak-smtp. Added phase-platform-17b-tf-state-governance to Related as the canonical parent phase.
[BODY] Added "Execution Repos" subsection clarifying SOP updates land in pal-e-platform, code lands in pal-e-services. Branch naming guidance included.
[BODY] Tightened AC #1/#2: per-resource decisions go in a new pal-e-docs note drift-investigation-2026-04-20, with a concrete decision table.
[BODY] Strengthened AC #3: import + plan + verify, one resource at a time. CNPG cluster goes first and alone. No batch imports.
[BODY] Added new AC: no pod restart on pal-e-postgres-1 during reconciliation. If a step would cause restart, pause and run sop-postgres-restore dry-run first. Hard gate.
[BODY] Split AC #5/#6 into two explicit deliverables: service-onboarding-sop Pre-Deploy Validation Checklist gets a "Plan-diff check" row; sop-platform-tf-changes gets a "What NOT to Do" bullet. Both, not one-or-the-other.
[BODY] Added CNPG operator reconcile semantics as a distinct "Why P0" item (now item 2). Names the technical source of perpetual-drift risk (operator re-projects fields tf doesn't declare).
[BODY] Added explicit supersedes-statement to Interim Safety Protocol: this protocol overrides sop-platform-tf-changes for pal-e-services until zero-diff is restored.
[BODY] Flagged woodpecker-db CNPG cluster (in woodpecker ns) for verification — investigation must confirm it is NOT in pal-e-services drift list.

[LABEL] Board item #1064 label fixed: story:platform-bootstrap (didn't exist) → story:superuser-deploy (the user story that owns the "tofu plan/apply succeeds without manual intervention" success metric — the metric this ticket is restoring).

[SCOPE] Open question stated explicitly in Related section: does this absorb phase-platform-17b-tf-state-governance or stand alongside? My read: standalone. 17b's primary scope is remote-state-backend migration; this is import/reconciliation. Documented as such; will refine if challenged.

[DECOMPOSE] Will route to decomposition after this re-review APPROVES — review recommends 6 sub-tickets on a child board (drift-investigation note, CNPG import alone, harbor_creds + ArgoCD batch, pal-e-mail decision, SOP updates, capstone zero-diff).

Re-running /review-ticket next.

forgejo_admin commented

2026-04-21 03:11:23 +00:00

Author

Contributor

Scope Review (R2): APPROVED

Review note: review-1064-2026-04-20-r2

All 10 round-1 items resolved (1 [LABEL] + 9 [BODY]):

[LABEL] story key now story:superuser-deploy (verified on project-pal-e-platform user-stories table)
[BODY] phase ref fixed to phase-pal-e-platform-28-keycloak-smtp; phase-platform-17b-tf-state-governance in Related (standalone, not absorbing 17b)
[BODY] Execution Repos subsection added (pal-e-platform = SOP, pal-e-services = code)
[BODY] AC #1/#2 anchor on drift-investigation-2026-04-20 pal-e-docs note
[BODY] AC #3 requires per-resource one-at-a-time import + plan gate (CNPG first and alone)
[BODY] new AC #4: no pal-e-postgres-1 pod restart, sop-postgres-restore dry-run gate
[BODY] AC #5 split into two explicit ACs (service-onboarding-sop AND sop-platform-tf-changes)
[BODY] CNPG operator reconcile semantics now distinct P0 item (#2 in Why P0)
[BODY] Interim Safety Protocol explicitly supersedes sop-platform-tf-changes
[BODY] woodpecker-db CNPG cluster flagged in Environment as out-of-scope-pending-verification

No new issues surfaced in r2. Ready for post-approval pipeline.

Next step (per round 1 [DECOMPOSE]): caller routes #1064 to skill-decompose-ticket to create child board with the 6 sub-tickets recommended in review-1064-2026-04-20. Parent #1064 stays on board-pal-e-platform as the P0 tracker.


forgejo_admin referenced this issue

2026-04-21 03:16:27 +00:00

P1: validate sop-postgres-restore via dry-run drill — backup we've never tested = no backup #298

forgejo_admin referenced this issue

2026-04-21 03:17:05 +00:00

P2: off-cluster postgres backup destination — same-cluster MinIO is not DR #299

forgejo_admin referenced this issue

2026-04-21 12:07:18 +00:00

P1: validate sop-postgres-restore via dry-run drill — backup we've never tested = no backup #298

forgejo_admin referenced this issue

2026-04-21 12:08:57 +00:00

P2: off-cluster postgres backup destination — same-cluster MinIO is not DR #299

forgejo_admin referenced this issue

2026-04-21 12:26:54 +00:00

P2: off-cluster postgres backup destination — same-cluster MinIO is not DR #299

forgejo_admin commented

2026-04-21 17:58:38 +00:00

Author

Contributor

## Blocking gate satisfied: restore-drill PASS

`pal-e-platform#298` restore drill completed PASS on 2026-04-21T17:53Z. Full results in `validation-postgres-restore-2026-04-21`.

The #297 hard gate ("if any step would cause `pal-e-postgres-1` pod restart, run sop-postgres-restore dry-run first") is now satisfied:

- **Dry-run works.** Scratch restore cluster bootstraps, WAL replays, verification queries return correct data.
- **SOP is accurate.** 10 gaps were found during the drill; 9 fixed in-place on `sop-postgres-restore`. 1 spun out as `pal-e-platform#300` (real-DR swap hardening; does not block drift-reconcile).
- **Restore timing is 55s** for current paledocs+twitch2kwager-scale data. An in-place `pal-e-postgres-1` restart during drift reconcile now has a known recovery path if things go sideways.

**#297 is UNBLOCKED.** Proceed with terraform drift reconcile when ready.

cc: `@pal-e-platform#298` for the full verdict, execution log, and gap list.
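
The "no `pal-e-postgres-1` pod restart" condition can be verified by comparing a snapshot of the pod taken before an import step with one taken after. A rough sketch in Python, assuming both snapshots come from `kubectl get pod pal-e-postgres-1 -n postgres -o json`; the helper names `pod_fingerprint` and `pod_restarted` are hypothetical and not part of `sop-postgres-restore`:

```python
def pod_fingerprint(pod: dict) -> tuple[str, int]:
    """Reduce a Pod object to (uid, total container restarts).

    The uid changes when the pod is deleted and recreated; restartCount
    increases when a container restarts inside the same pod.
    """
    uid = pod["metadata"]["uid"]
    statuses = pod.get("status", {}).get("containerStatuses", [])
    restarts = sum(cs.get("restartCount", 0) for cs in statuses)
    return uid, restarts


def pod_restarted(before: dict, after: dict) -> bool:
    """True if the pod was recreated or any container restarted between snapshots."""
    return pod_fingerprint(before) != pod_fingerprint(after)
```

If `pod_restarted` returns `True` after a step, that step touched the primary and the restore path validated in the drill becomes the fallback.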
