[POST-INCIDENT] postgres NetworkPolicy missing pal-e-docs after pal-e-production→pal-e-app rename (#287) #334
Reference
ldraney/pal-e-platform#334
Type
Bug
Repo
forgejo_admin/pal-e-platform
Lineage
Latent regression introduced by commit c6a138d ("infra: rename pal-e-production → pal-e-app in monitoring + network policies (#287)", merged 2026-04-29). Surfaced 2026-05-04 ~12:25 UTC when ArgoCD reconciled and the pal-e-docs pod restarted, breaking the long-lived postgres connection that had been masking the bug for 5 days. Filed retroactively per feedback_never_edit_without_ticket after a P0 hotfix at 12:38 UTC.
What Broke
The default-deny-ingress NetworkPolicy in the postgres namespace did not include the pal-e-docs namespace in its ingress allow list. After the rename in #287 swapped pal-e-production → pal-e-app, the implicit coverage of pal-e-docs (which lives in its own namespace, separate from pal-e-app) was dropped.
Symptoms when the bug surfaced:
- pal-e-docs-6c7fdd96d7-fll8h pod entered CrashLoopBackOff with psycopg2.OperationalError: connection to server at "pal-e-postgres-rw.postgres.svc.cluster.local" (10.43.239.89), port 5432 failed: Connection refused
- pal-e-docs-6b55489545-v4jvv stuck Pending because archbox was at 110/110 pod cap (kubelet --max-pods=110), so the new pod could not preempt the failing one
- https://pal-e-docs.tail5b443a.ts.net/ returned HTTP 502 from the Tailscale funnel
- mcp__pal-e-docs__* MCP calls returned 502, blocking all docs reads/writes platform-wide
- gate-validation-done.sh and check-board-advance hooks (which depend on pal-e-docs) effectively offline
NetworkPolicies are stateful at the connection level: they only block NEW connections. The running pod kept its long-lived asyncpg connection alive across the NP change, masking the regression for 5 days until the pod restart.
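Because only NEW connections are blocked, a check that reuses an existing connection cannot reveal a broken allow list. A minimal sketch of a fresh-connection probe follows, assuming it is run from inside a pod in the consumer namespace. The function name can_open_new_connection is hypothetical and not part of the platform tooling; it tests TCP reachability only, not postgres authentication.

```python
import socket


def can_open_new_connection(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a brand-new TCP connection to (host, port) succeeds.

    Every call opens its own socket, so the result reflects what a freshly
    restarted pod would see rather than the state of a long-lived connection.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Hypothetical usage from a pod in the pal-e-docs namespace:
#   can_open_new_connection("pal-e-postgres-rw.postgres.svc.cluster.local", 5432)
```

A probe like this says nothing about why a connection fails; a False result still needs to be traced back to the NetworkPolicy, DNS, or the service itself.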
Repro Steps
1. Run a pod that connects to pal-e-postgres-rw.postgres.svc.cluster.local:5432 from a namespace NOT in the postgres NP allow list.
2. Restart the pod (kubectl rollout restart deploy/<name> -n <ns>).
3. Observe Connection refused on the postgres connect.
To repro the original chain end-to-end (DON'T do this on prod; for documentation only):
1. Remove pal-e-docs from kubernetes_manifest.netpol_postgres ingress in terraform/network-policies.tf and tofu apply.
2. Restart the pal-e-docs deployment.
Expected Behavior
The pal-e-docs pod connects to pal-e-postgres-rw.postgres.svc.cluster.local:5432 successfully and serves the API. Pod restarts (rolling, ArgoCD reconcile, OOM, etc.) do not cause platform-wide docs outages.
Environment
- pal-e-postgres-rw in postgres namespace (CNPG cluster pal-e-postgres)
- kubernetes_manifest.netpol_postgres in terraform/network-policies.tf
- c6a138d on pal-e-platform main
- review-ticket agent (board item 1145, westside-admin#28) attempted mcp__pal-e-docs__create_note, got 502, debugged through the connection-refused chain, escalated as P0
Fix Applied (P0 hotfix, done at 12:38 UTC)
One line, the pal-e-docs entry, added to terraform/network-policies.tf:netpol_postgres ingress (between basketball-api and cnpg-system).
Applied via tofu apply -lock=false -target=kubernetes_manifest.netpol_postgres -auto-approve. Plan reported 0 to add, 1 to change, 0 to destroy. Post-apply verified: NP allow list now has 5 entries (pal-e-app, basketball-api, pal-e-docs, cnpg-system, monitoring). Then kubectl -n pal-e-docs delete pod pal-e-docs-6c7fdd96d7-fll8h to free the pod-cap slot. New pod pal-e-docs-6b55489545-v4jvv Ready in ~60s. API roundtrip via mcp__pal-e-docs__get_note_toc(sop-validation) confirmed.
The .tf edit is currently on a dirty working tree; it needs to be committed and pushed as a follow-up PR (separate from this issue) so the change persists across tofu re-runs.
Acceptance Criteria
- Follow-up PR: fix(netpol): re-add pal-e-docs to postgres NP allow list (closes #N) where N = this issue.
- Audit terraform/network-policies.tf: does each consumer namespace appear in the right allow list? Look for any namespace that consumes a backing service whose allow list was last touched by #287. Capture findings in this issue.
- Add feedback_netpol_stateful_connections.md (or extend feedback_argocd_sync_lag.md): "NetworkPolicies are stateful: broken rules don't surface until the next pod restart. Test post-apply by deliberately restarting a pod in each namespace touched by the change."
- Pod cap on archbox: (a) raise --max-pods, (b) add a second node, or (c) prune low-priority pods. Open as separate ticket if scope grows.
Lessons (for lessons-learned note when this is closed)
Related
- c6a138d ("infra: rename pal-e-production → pal-e-app in monitoring + network policies (#287)")
- feedback_argocd_sync_lag (related: post-apply verification rigor)
- sop-incident-response