[POST-INCIDENT] postgres NetworkPolicy missing pal-e-docs after pal-e-production→pal-e-app rename (#287) #334

Open
opened 2026-05-05 01:29:59 +00:00 by forgejo_admin · 0 comments
Contributor

Type

Bug

Repo

forgejo_admin/pal-e-platform

Lineage

Latent regression introduced by commit c6a138d ("infra: rename pal-e-production → pal-e-app in monitoring + network policies (#287)", merged 2026-04-29). Surfaced 2026-05-04 ~12:25 UTC when ArgoCD reconciled and the pal-e-docs pod restarted, breaking the long-lived postgres connection that had been masking the bug for 5 days. Filed retroactively per feedback_never_edit_without_ticket after a P0 hotfix at 12:38 UTC.

What Broke

The default-deny-ingress NetworkPolicy in the postgres namespace did not include the pal-e-docs namespace in its ingress allow list. After the rename in #287 swapped pal-e-production → pal-e-app, the implicit coverage of pal-e-docs (which lives in its own namespace, separate from pal-e-app) was dropped.

Symptoms when the bug surfaced:

  • pal-e-docs-6c7fdd96d7-fll8h pod entered CrashLoopBackOff with psycopg2.OperationalError: connection to server at "pal-e-postgres-rw.postgres.svc.cluster.local" (10.43.239.89), port 5432 failed: Connection refused
  • Replacement pod pal-e-docs-6b55489545-v4jvv stuck Pending because archbox was at 110/110 pod cap (kubelet --max-pods=110), so the new pod could not preempt the failing one
  • https://pal-e-docs.tail5b443a.ts.net/ returned HTTP 502 from the Tailscale funnel
  • Every mcp__pal-e-docs__* MCP call returned 502, blocking all docs reads/writes platform-wide
  • gate-validation-done.sh and check-board-advance hooks (which depend on pal-e-docs) were effectively offline

NetworkPolicies are stateful at the connection level — they only block NEW connections. The running pod kept its long-lived asyncpg connection alive across the NP change, masking the regression for 5 days until the pod restart.
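
Because only new connections are evaluated against the policy, a healthy existing pool says nothing about whether the rule is still intact. The following is a minimal sketch, in Python, of a fresh-connection probe that could be run from a pod in the consumer namespace. The function name probe_new_connection is hypothetical and not part of the platform. In this incident the rejection showed up as "Connection refused"; other NetworkPolicy implementations may drop packets silently instead, which would show up here as a timeout.

```python
import socket


def probe_new_connection(host: str, port: int, timeout: float = 3.0) -> str:
    """Open a brand-new TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <detail>".
    """
    try:
        # create_connection always performs a new TCP handshake, so the
        # result reflects the policy as it is enforced right now.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Covers DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

For example, probe_new_connection("pal-e-postgres-rw.postgres.svc.cluster.local", 5432) called from inside the consumer namespace would be expected to return "open" only when that namespace is in the allow list.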

Repro Steps

  1. Have a namespace consume pal-e-postgres-rw.postgres.svc.cluster.local:5432 from a namespace NOT in the postgres NP allow list.
  2. Confirm the existing connection works (long-lived asyncpg pool stays open).
  3. Restart the consumer pod (kubectl rollout restart deploy/<name> -n <ns>).
  4. Observe the new pod fail with Connection refused on the postgres connect.

To repro the original chain end-to-end (DON'T do this on prod — for documentation only):

  1. Remove pal-e-docs from kubernetes_manifest.netpol_postgres ingress in terraform/network-policies.tf and tofu apply.
  2. Restart the pal-e-docs deployment.
  3. New pod hits Connection refused → CrashLoopBackOff.
  4. If pod cap is saturated, the failing replicaset cannot be evicted by the new one, escalating outage duration.
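
Step 4 depends on whether the node has a free pod slot at restart time. As a rough sketch, the headroom could be computed in Python from two JSON documents: the output of kubectl get node archbox -o json and of kubectl get pods --all-namespaces --field-selector spec.nodeName=archbox -o json. The function name pod_headroom is hypothetical, and the sketch assumes that pods in a terminal phase (Succeeded or Failed) do not occupy a slot.

```python
from typing import Any

# Assumption: pods in these phases no longer count against the pod cap.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def pod_headroom(node: dict[str, Any], pod_list: dict[str, Any]) -> int:
    """Return how many more pods the node can accept (0 means saturated)."""
    # status.allocatable.pods is reported as a string, e.g. "110".
    capacity = int(node["status"]["allocatable"]["pods"])
    active = sum(
        1
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") not in TERMINAL_PHASES
    )
    return capacity - active
```

A result of 0, as on archbox during the incident, would mean that a replacement pod stays Pending until an existing pod is removed.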

Expected Behavior

The pal-e-docs pod connects to pal-e-postgres-rw.postgres.svc.cluster.local:5432 successfully and serves the API. Pod restarts (rolling, ArgoCD reconcile, OOM, etc.) do not cause platform-wide docs outages.

Environment

  • Cluster: k3s on archbox
  • Affected namespace: pal-e-docs
  • Affected service: pal-e-postgres-rw in postgres namespace (CNPG cluster pal-e-postgres)
  • NetworkPolicy resource: kubernetes_manifest.netpol_postgres in terraform/network-policies.tf
  • Suspect commit: c6a138d on pal-e-platform main
  • Detected by: review-ticket agent (board item 1145, westside-admin#28) attempted mcp__pal-e-docs__create_note, got 502, debugged through the connection-refused chain, escalated as P0
  • Hotfix applied at: 2026-05-04 12:38 UTC
  • Outage duration: ~7 minutes (12:31 first 502 to 12:38 first 200)

Fix Applied (P0 hotfix — done at 12:38 UTC)

One line added to terraform/network-policies.tf:netpol_postgres ingress (between basketball-api and cnpg-system):

{ from = [{ namespaceSelector = { matchLabels = { "kubernetes.io/metadata.name" = "pal-e-docs" } } }] },

Applied via tofu apply -lock=false -target=kubernetes_manifest.netpol_postgres -auto-approve. Plan reported 0 to add, 1 to change, 0 to destroy. Post-apply verified: NP allow list now has 5 entries (pal-e-app, basketball-api, pal-e-docs, cnpg-system, monitoring). Then kubectl -n pal-e-docs delete pod pal-e-docs-6c7fdd96d7-fll8h to free the pod-cap slot. New pod pal-e-docs-6b55489545-v4jvv Ready in ~60s. API roundtrip via mcp__pal-e-docs__get_note_toc(sop-validation) confirmed.

The .tf edit is currently on a dirty working tree; needs to be committed + pushed as a follow-up PR (separate from this issue) so the change persists across tofu re-runs.
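
The post-apply check of the allow list described above could be scripted rather than read by eye. The sketch below, in Python, takes the JSON of a NetworkPolicy (for example from kubectl -n postgres get networkpolicy default-deny-ingress -o json, assuming the live object carries that name) and lists the namespaces it admits. The function names are hypothetical, and the sketch only understands selectors of the form used in this file: matchLabels on kubernetes.io/metadata.name. It ignores matchExpressions and pod selectors.

```python
from typing import Any, Iterable

NAME_LABEL = "kubernetes.io/metadata.name"


def allowed_namespaces(policy: dict[str, Any]) -> set[str]:
    """Collect namespace names admitted by the policy's ingress rules."""
    names: set[str] = set()
    for rule in policy.get("spec", {}).get("ingress") or []:
        for peer in rule.get("from") or []:
            selector = peer.get("namespaceSelector") or {}
            name = (selector.get("matchLabels") or {}).get(NAME_LABEL)
            if name:
                names.add(name)
    return names


def missing_namespaces(policy: dict[str, Any], required: Iterable[str]) -> set[str]:
    """Return the required namespaces that the policy does not admit."""
    return set(required) - allowed_namespaces(policy)
```

Calling missing_namespaces with the five expected entries (pal-e-app, basketball-api, pal-e-docs, cnpg-system, monitoring) should return an empty set after the hotfix.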

Acceptance Criteria

  • [x] Hotfix: add pal-e-docs to postgres NP allow list (applied via tofu)
  • [x] Recovery: delete crashlooping pod to free pod-cap slot
  • [x] Verification: confirm pal-e-docs API roundtrip works
  • [ ] Commit + PR the .tf edit so the NP allow list change persists. PR title: fix(netpol): re-add pal-e-docs to postgres NP allow list (closes #N) where N = this issue.
  • [ ] Audit other NPs for similar gaps: review every NP in terraform/network-policies.tf and check that each consumer namespace appears in the right allow list. Look for any namespace that consumes a backing service whose allow list was last touched by #287. Capture findings in this issue.
  • [ ] Add a CI check: lint/test that every namespace defined in this terraform repo appears in at least one NP allow list, OR explicitly opts out via a comment marker. Catches future "rename misses a namespace" bugs at PR time, not 5 days later in prod.
  • [ ] Document the failure mode: write feedback_netpol_stateful_connections.md (or extend feedback_argocd_sync_lag.md): "NetworkPolicies are stateful: broken rules don't surface until the next pod restart. Test post-apply by deliberately restarting a pod in each namespace touched by the change."
  • [ ] Investigate node pod-cap pressure: archbox at 110/110 turned a fast-recoverable bug into a longer outage because the new pod couldn't schedule until the failing one was deleted manually. Decide between (a) raise kubelet --max-pods, (b) add a second node, or (c) prune low-priority pods. Open as separate ticket if scope grows.
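
The CI check in the criteria above could start as a simple text lint. The sketch below, in Python, assumes that allow-list entries use the "kubernetes.io/metadata.name" = "<namespace>" form seen in the hotfix line, that the set of defined namespaces is gathered separately (how namespaces are declared in this repo is not shown here), and that opt-outs use a comment marker of the form # netpol-lint: skip <namespace>. The marker syntax and the function name are hypothetical.

```python
import re
from typing import Iterable

# Matches allow-list entries such as:
#   "kubernetes.io/metadata.name" = "pal-e-docs"
ALLOW_RE = re.compile(r'"kubernetes\.io/metadata\.name"\s*=\s*"([^"]+)"')

# Hypothetical opt-out marker: "# netpol-lint: skip <namespace>"
OPT_OUT_RE = re.compile(r"#\s*netpol-lint:\s*skip\s+([A-Za-z0-9-]+)")


def uncovered_namespaces(defined: Iterable[str], netpol_tf_text: str) -> list[str]:
    """Return defined namespaces that are neither allowed nor opted out."""
    allowed = set(ALLOW_RE.findall(netpol_tf_text))
    opted_out = set(OPT_OUT_RE.findall(netpol_tf_text))
    return sorted(set(defined) - allowed - opted_out)
```

A CI job could fail whenever the returned list is non-empty. This version only checks "appears in at least one allow list"; a stricter variant would check a per-service list of consumers, in line with the checklist idea in the lessons.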

Lessons (for lessons-learned note when this is closed)

  • NPs are connection-stateful. A broken rule may be invisible for days. Test post-apply by deliberately restarting a pod in each namespace touched by the change.
  • Mass renames in NP files need a checklist of all consumers per service, not just the renamed namespace.
  • Pod-cap saturation makes self-healing fragile. Even a routine restart can fail to find a slot.
  • The review-ticket agent pipeline served as an effective monitoring substitute — it surfaced the platform issue before any human noticed.

Related

  • Suspect commit: c6a138d ("infra: rename pal-e-production → pal-e-app in monitoring + network policies (#287)")
  • Recovery review session: 2026-05-04 PM review by Ava (multi-step session)
  • Detection agent: review-ticket spawn for board item 1145 (westside-admin#28)
  • feedback_argocd_sync_lag (related: post-apply verification rigor)
  • sop-incident-response