P0: pal-e-services terraform state drifted from cluster reality — prod postgres in blast radius #297
Reference
ldraney/pal-e-platform#297
Type
Bug
Lineage
Discovered during tofu plan -var-file=k3s.tfvars on pal-e-services while adding the pal-e-docs Keycloak realm (pal-e-services#58 / #59). The plan showed 10 to add, 11 to change, 0 to destroy, but 5 of the "creates" already exist in the cluster, including the prod CNPG postgres cluster. Applying the full plan could destroy or duplicate production data.
Repo
Tracking lives on forgejo_admin/pal-e-platform. Code lands across two repos:
Execution Repos:
- forgejo_admin/pal-e-platform: SOP and convention updates (service-onboarding-sop, sop-platform-tf-changes).
- forgejo_admin/pal-e-services: terraform tofu import, configuration removal, k3s.tfvars adjustments.
Branch naming should reflect the repo, e.g. 297-tf-drift-cnpg-import on pal-e-services and 297-tf-drift-sop-update on pal-e-platform.
What Broke
Terraform state for forgejo_admin/pal-e-services has diverged from live cluster reality on at least 5 resources. Something (Helm? kubectl apply? ArgoCD auto-sync? manual kubectl create?) provisioned these resources outside terraform, and they were never imported into state.
Resources terraform plan wants to CREATE that already exist:
- kubernetes_manifest.cnpg_cluster: clusters.postgresql.cnpg.io/pal-e-postgres in postgres ns, 49d old, healthy, primary pal-e-postgres-1 serving traffic
- kubernetes_manifest.cnpg_scheduled_backup: scheduledbackups.postgresql.cnpg.io/pal-e-postgres-daily, last backup ~63min ago
- argocd_application.service["pal-e-mail"]: argocd/pal-e-mail Application, Synced + Healthy
- kubernetes_secret_v1.harbor_creds["pal-e-app"]: pal-e-app ns, 8d old
- harbor_robot_account.service_ci["playme2k"]
Resources terraform plan wants to UPDATE (mostly cosmetic + credential cleanup):
- kubernetes_secret_v1.harbor_creds[*]: strip argocd.argoproj.io/instance label (ArgoCD re-adds on next sync → churn loop)
- kubernetes_ingress_v1.service_funnel["pal-e-app"]: same label strip
- […] (playme2k, westside-ai-assistant) include credential data rotation; currently placeholder, tf would fill real robot creds
Why this is P0
- kubernetes_manifest resource for a cluster that already exists typically no-ops if specs match, but if the tf-rendered spec differs from the live cluster (operator version drift, field defaults, etc.), the CNPG operator could reconcile. Worst case: cluster recreation, data loss.
- kubernetes_manifest resource against an operator-reconciled CRD (CNPG Cluster) does NOT have the same import semantics as a plain k8s resource. The CNPG operator re-projects fields the terraform manifest does not declare. The header comment at cnpg.tf:58-60 already acknowledges this ("CNPG operator manages all other fields... including them here would cause perpetual plan drift"). After import, expect non-trivial plan output until either the terraform manifest precisely matches operator-projected state for the managed fields, OR we explicitly accept perpetual no-op drift on operator-managed subtrees. This is the technical source of the "import worked but we now have a forever-drift loop" risk.
- tofu apply without -target to make an unrelated change lights this fuse. Phase 28 AC explicitly required "tofu plan on existing state shows zero changes (import complete)"; that guarantee has been lost.
- […] (service-onboarding-sop) will bring this drift along by default.
Repro Steps
- cd ~/pal-e-services/terraform && tofu plan -lock=false -var-file=k3s.tfvars
- Plan: 10 to add, 11 to change, 0 to destroy.
- kubectl get <kind> -A | grep <name> → at least 4 of the 5 creates already exist as live resources.
- […] argocd.argoproj.io/instance label that ArgoCD re-adds automatically.
Expected Behavior
tofu plan shows 0 to add, 0 to change, 0 to destroy on a clean state. Every live cluster resource that should be terraform-managed is imported into terraform state. Every stale resource in terraform state that has been archived is removed from configuration.
Environment
- forgejo_admin/pal-e-services
- postgres, argocd, pal-e-app, playme2k, westside-ai-assistant, basketball-api, gcal-scheduler, mcd-tracker, mcd-tracker-app, pal-e-docs, pal-e-mail, platform-validation, westsidekingsandqueens
- clusters.postgresql.cnpg.io/woodpecker-db (in woodpecker ns, 37d old, healthy). Likely managed by pal-e-platform (not pal-e-services). Investigation must confirm this is NOT in the pal-e-services drift list before proceeding.
Acceptance Criteria
- drift-investigation-2026-04-20 (note_type: doc, tags: drift, investigation) is published with a per-resource decision table: for each of the 5 "create" drifts, decide (a) tofu import (adopt the live resource into tf state), (b) remove from configuration (resource is archived/should not exist), or (c) kubectl delete + tofu apply create (tf should own fresh). Decision recorded with rationale per row.
- […] tofu import, run tofu plan and confirm the imported resource shows zero diff OR only operator-managed-field drift before moving to the next import. CNPG cluster import goes FIRST and ALONE.
- […] pal-e-postgres-1 during reconciliation. If any step would cause restart, pause and confirm backup restorability via sop-postgres-restore dry-run BEFORE proceeding. This is a hard gate.
- tofu plan -var-file=k3s.tfvars returns "No changes. Your infrastructure matches the configuration." (zero-diff gate per Phase 28).
- service-onboarding-sop updated. Add a new "Plan-diff check" row to the Pre-Deploy Validation Checklist: "Run tofu plan and confirm zero-diff OR the diff is exclusively the intended change. If unintended changes appear, STOP and file a drift ticket before applying."
- sop-platform-tf-changes updated. Add a new bullet under "What NOT to Do": "No tofu apply when plan diff includes unintended resources. Use -target to isolate or file a drift ticket."
- […] (pal-e-services#58) → confirm 5-resource plan shows 5 creates + 0 other changes.
Scope Boundary
This ticket is investigation + import/cleanup PRs. Does NOT include:
- […] (phase-platform-17b-tf-state-governance may absorb this in the future)
Interim Safety Protocol (in effect NOW)
This protocol supersedes sop-platform-tf-changes for pal-e-services until the zero-diff gate is restored. sop-platform-tf-changes defines pal-e-services as plan-and-apply-before-merge; DO NOT follow that pattern until this ticket closes.
Until the zero-diff gate is restored:
- No tofu apply on pal-e-services without -target. Every apply must isolate the specific intended resource.
- pal-e-services#59 (pal-e-docs Keycloak realm) will be applied with -target per Path B documented in that PR.
- […] pal-e-services should pause and link this ticket as a blocker, OR follow the same -target workaround pattern with explicit Lucas approval.
Related
- pal-e-services#58/#59: the PR that surfaced this; currently using -target workaround per Interim Safety Protocol
- phase-pal-e-platform-28-keycloak-smtp: Phase 28 AC ("zero-diff plan gate") now violated
- phase-platform-17b-tf-state-governance: canonical phase for tf-state-governance work; this P0 ticket may be the trigger that kicks off 17b execution. Open scope decision (Ava): does this ticket absorb 17b, or does 17b remain as the remote-state-backend epic and this stays standalone? Likely standalone: 17b's primary scope is remote backend migration, this is import/reconciliation.
- service-onboarding-sop: step 5 says "Requires Lucas approval before apply"; needs stronger drift-detection clause (see AC above)
- sop-platform-tf-changes: needs "No apply when plan diff includes unintended resources" bullet (see AC above)
- sop-postgres-restore: required for the no-pod-restart AC (backup-restore dry-run before any CNPG step)
- feedback_never_alter_prod_directly.md: no direct prod writes
- feedback_never_write_prod_db.md: postgres is in blast radius
- feedback_enterprise_no_workarounds.md: fix it right, don't hack around
- deployment-lessons: past terraform-drift fixes in the lessons-learned note
Scope Review: NEEDS_REFINEMENT
Review note: review-1064-2026-04-20
Scope is fundamentally correct and P0 severity is justified: live cluster verification confirms every "create" drift claimed. Ticket is well-written on the whole. Refinements required before todo→next_up, then route to decomposition.
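The live-cluster verification referred to here corresponds to the repro step kubectl get <kind> -A | grep <name>. A minimal sketch of a stricter version of that check follows. resource_listed is a hypothetical helper name, not something from the ticket, and it assumes the default kubectl get -A table layout (header row, NAMESPACE in column 1, NAME in column 2). It compares the name exactly, so a lookup for pal-e-postgres does not also match longer names such as pal-e-postgres-daily.

```shell
# Hypothetical helper (not from the ticket): succeeds if NAME appears in the
# NAME column of `kubectl get <kind> -A` output read from stdin.
# Assumes the default table layout: header row, NAMESPACE in column 1,
# NAME in column 2.
resource_listed() {
  awk -v name="$1" 'NR > 1 && $2 == name { found = 1 } END { exit !found }'
}

# Intended use against a live cluster (not executed here):
#   kubectl get clusters.postgresql.cnpg.io -A | resource_listed pal-e-postgres \
#     && echo "exists live: import candidate, not a create"
```

An exact column match is used instead of grep because a substring hit on a similarly named resource would wrongly count as confirmation that the resource exists.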
Required fixes
- story:platform-bootstrap is not a valid story key on project-pal-e-platform. Replace with story:superuser-deploy (strongest fit: success metric for that story is exactly what's broken) or story:superuser-onboard-service.
- phase-platform-28-keycloak-declarative-onboarding does not exist. Actual slug: phase-pal-e-platform-28-keycloak-smtp. Also add phase-platform-17b-tf-state-governance to Related; this ticket is essentially its P0 trigger.
- […] drift-investigation-2026-04-20 pal-e-docs note.
- […] pal-e-postgres-1 during reconciliation; if any step would cause restart, confirm backup restorability via sop-postgres-restore dry-run first."
- […] kubernetes_manifest against operator-reconciled CRDs. cnpg.tf:58-60 header comment already acknowledges this.
- […] sop-platform-tf-changes: the protocol supersedes the SOP's plan-and-apply-before-merge pattern for pal-e-services until zero-diff is restored; state that explicitly.
- […] woodpecker-db CNPG cluster in scope: verify during investigation that it's pal-e-platform-managed and NOT in the drift list.
Decomposition
[DECOMPOSE] 5-minute-rule fails (>5 file targets across 2 repos, 6 ACs, CNPG import is careful verify-after-each-step work). After BODY fixes, route to skill-decompose-ticket for a child board. Recommended 6 sub-tickets: (1) drift investigation note, (2) CNPG cluster+backup import alone, (3) harbor_creds+argocd_application drift bulk, (4) pal-e-mail keep-or-remove decision, (5) SOP updates, (6) zero-diff verification + lift safety protocol.
Open scope question
[SCOPE] Ava to decide: is this ticket the P0 expression of phase-platform-17b-tf-state-governance, or does 17b remain as the remote-state-backend epic separately? If former, update 17b to reference #297.
Full review in note review-1064-2026-04-20.
Body refined per /review-ticket round 1 verdict NEEDS_REFINEMENT (review note review-1064-2026-04-20). Changes applied to body:
- phase-platform-28-keycloak-declarative-onboarding (didn't exist) → phase-pal-e-platform-28-keycloak-smtp. Added phase-platform-17b-tf-state-governance to Related as the canonical parent phase.
- […] drift-investigation-2026-04-20, with a concrete decision table.
- […] pal-e-postgres-1 during reconciliation. If a step would cause restart, pause and run sop-postgres-restore dry-run first. Hard gate.
- service-onboarding-sop Pre-Deploy Validation Checklist gets a "Plan-diff check" row; sop-platform-tf-changes gets a "What NOT to Do" bullet. Both, not one-or-the-other.
- […] sop-platform-tf-changes for pal-e-services until zero-diff is restored.
- […] woodpecker-db CNPG cluster (in woodpecker ns) for verification: investigation must confirm it is NOT in pal-e-services drift list.
[LABEL] Board item #1064 label fixed: story:platform-bootstrap (didn't exist) → story:superuser-deploy (the user story that owns the "tofu plan/apply succeeds without manual intervention" success metric, which is the metric this ticket is restoring).
[SCOPE] Open question stated explicitly in Related section: does this absorb phase-platform-17b-tf-state-governance or stand alongside? My read: standalone. 17b's primary scope is remote-state-backend migration; this is import/reconciliation. Documented as such; will refine if challenged.
[DECOMPOSE] Will route to decomposition after this re-review APPROVES. Review recommends 6 sub-tickets on a child board (drift-investigation note, CNPG import alone, harbor_creds + ArgoCD batch, pal-e-mail decision, SOP updates, capstone zero-diff).
Re-running /review-ticket next.
Scope Review (R2): APPROVED
Review note: review-1064-2026-04-20-r2
All 10 round-1 items resolved (1 [LABEL] + 9 [BODY]):
- story:superuser-deploy (verified on project-pal-e-platform user-stories table)
- phase-pal-e-platform-28-keycloak-smtp; phase-platform-17b-tf-state-governance in Related (standalone, not absorbing 17b)
- […] drift-investigation-2026-04-20 pal-e-docs note
- […] pal-e-postgres-1 pod restart, sop-postgres-restore dry-run gate
No new issues surfaced in r2. Ready for post-approval pipeline.
Next step (per round 1 [DECOMPOSE]): caller routes #1064 to skill-decompose-ticket to create child board with the 6 sub-tickets recommended in review-1064-2026-04-20. Parent #1064 stays on board-pal-e-platform as the P0 tracker.
Blocking gate satisfied: restore-drill PASS
pal-e-platform#298 restore drill completed PASS on 2026-04-21T17:53Z. Full results in validation-postgres-restore-2026-04-21.
The #297 hard gate ("if any step would cause pal-e-postgres-1 pod restart, run sop-postgres-restore dry-run first") is now satisfied:
- […] sop-postgres-restore. 1 spun out as pal-e-platform#300 (real-DR swap hardening; does not block drift-reconcile).
- pal-e-postgres-1 restart during drift reconcile now has a known recovery path if things go sideways.
#297 is UNBLOCKED. Proceed with terraform drift reconcile when ready.
cc: @pal-e-platform#298 for the full verdict, execution log, and gap list.
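For the zero-diff gate named in the acceptance criteria, one possible scripted check is sketched below. It assumes tofu plan is run with -detailed-exitcode, under which exit code 0 means no changes, 1 means an error, and 2 means a non-empty diff. plan_gate and plan_counts are hypothetical helper names, not part of the ticket or the SOPs. plan_counts assumes the summary line has exactly the "Plan: N to add, N to change, N to destroy." shape quoted in this ticket and prints nothing for any other shape.

```shell
# Hypothetical helpers (not from the ticket) for a zero-diff gate.

# Map a `tofu plan -detailed-exitcode` exit status to a verdict.
# 0 = no changes, 2 = changes present, anything else = plan failed.
plan_gate() {
  case "$1" in
    0) echo "zero-diff" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Extract "add change destroy" counts from a plan summary line.
# Prints nothing if the line does not have the assumed shape.
plan_counts() {
  printf '%s\n' "$1" | sed -n \
    's/^Plan: \([0-9][0-9]*\) to add, \([0-9][0-9]*\) to change, \([0-9][0-9]*\) to destroy\.$/\1 \2 \3/p'
}

# Intended use (not executed here; needs the real repo and state):
#   cd ~/pal-e-services/terraform
#   tofu plan -lock=false -var-file=k3s.tfvars -detailed-exitcode -no-color >plan.out
#   verdict=$(plan_gate "$?")
#   [ "$verdict" = "zero-diff" ] || { echo "gate: $verdict"; grep '^Plan:' plan.out; }
```

Keying the gate on the exit code instead of the printed summary keeps it independent of output formatting; the summary parser is only there to report how far from zero the plan is.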