Platform cleanup: resolve 15 active alerts + stabilize CI #109

Closed
opened 2026-03-18 16:39:45 +00:00 by forgejo_admin · 2 comments

Lineage

plan-pal-e-platform — platform hardening

Repo

forgejo_admin/pal-e-platform + forgejo_admin/pal-e-deployments

User Story

As a platform operator
I want zero non-Watchdog alerts firing
So that real incidents are visible and the platform doesn't cry wolf

Context

As of 2026-03-18, 15 alerts are active. Most are from stale deployments, non-critical dev namespaces, and metric collection gaps — not real outages. Alert fatigue trains us to ignore real incidents.

Active Alerts (triage)

Fix immediately:

  • KubeDeploymentRolloutStuck: westsidekingsandqueens — old SSR image can't pull
  • KubeDeploymentReplicasMismatch: westsidekingsandqueens — same root cause
  • KubePodNotReady: westside-app-84cd9ff5f6 — stale pod
  • EndpointDown: westside-app — blackbox fails on broken prod

Investigate:

  • EndpointDown: keycloak — pod running, OIDC returns 200, probe may be checking the wrong path
  • OOMKilled: argocd-repo-server — needs memory limit bump
  • TargetDown: postgres (9187) — CNPG metrics exporter, PodMonitor fix pending
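For the Postgres target, the pending PodMonitor fix may not need a hand-written PodMonitor at all: CNPG can generate one for its 9187 exporter from the Cluster resource. A minimal sketch, with the cluster name and namespace assumed (not confirmed by this issue):

```yaml
# Sketch only: CNPG generates the PodMonitor for its :9187 metrics
# exporter when monitoring is enabled on the Cluster resource.
# Cluster name and namespace below are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres
  namespace: postgres
spec:
  monitoring:
    enablePodMonitor: true
```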

Clean up:

  • capacitor-dev crash loop + restart storm — stale dev namespace
  • palworld job failed — game server cron
  • KubeJobFailed: postgres — backup/WAL jobs

File Targets

  • pal-e-deployments/overlays/westsidekingsandqueens/prod/kustomization.yaml — update image tag to SPA build
  • terraform/main.tf — ArgoCD repo-server memory limits, blackbox probe paths
  • pal-e-deployments/overlays/capacitor-dev/ — clean up or fix
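The westside image bump is most likely a plain image-tag override in the prod overlay. A sketch, with the base path and image name assumed and the tag left as a placeholder for the PR #37 SHA (not reproduced here):

```yaml
# pal-e-deployments/overlays/westsidekingsandqueens/prod/kustomization.yaml (sketch)
# The resources path and image name are assumptions; newTag must be the
# squash-merged SHA from PR #37 per the constraints below.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: westside-app
    newTag: <squash-merged-sha-from-pr-37>
```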

Acceptance Criteria

  • Only Watchdog alert firing
  • westside-app prod serves SPA
  • ArgoCD repo-server not OOMing
  • Postgres metrics target UP
  • CI clone succeeds first attempt (PR #108)

Test Expectations

  • curl -s 'https://alertmanager.tail5b443a.ts.net/api/v2/alerts?active=true' | jq '.[].labels.alertname' returns only Watchdog
  • curl -s https://westsidekingsandqueens.tail5b443a.ts.net returns SPA HTML
  • 5 consecutive CI pipelines clone successfully
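The first expectation can be scripted as a pass/fail check. A sketch of the filtering logic, run here against a hard-coded sample payload rather than the live Alertmanager (the array-of-alerts shape matches the v2 API; the sample values are illustrative):

```shell
# Sample shaped like GET /api/v2/alerts?active=true output; in practice
# this would come from the curl command above.
payload='[{"labels":{"alertname":"Watchdog","severity":"none"}}]'

# Pass only if every active alert is named Watchdog.
names=$(printf '%s' "$payload" | jq -r '.[].labels.alertname' | sort -u)
if [ "$names" = "Watchdog" ]; then
  echo "PASS: only Watchdog firing"
else
  echo "FAIL: active alerts: $names"
fi
```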

Constraints

  • Don't delete namespaces with persistent data (postgres PVCs)
  • Don't change Forgejo external URL
  • Westside prod update must use the squash-merged SHA from PR #37

Checklist

  • Westside prod image updated
  • Stale pods cleaned
  • ArgoCD memory bumped
  • Keycloak probe investigated
  • capacitor-dev stabilized
  • CI TLS fix merged (PR #108)
  • Alert count verified

Related

  • Bug #107 — TLS clone fix (PR #108 in review)
  • Issue #99 — observability
  • PR #106 — blackbox probe for westside-dev
  • OTel evaluation deferred — add prometheus-fastapi-instrumentator first

Priority Blocker: CI Bootstrap Problem

PR #108 (TLS clone fix) is merged to main but NOT applied. The fix changes Woodpecker Helm values, which requires a tofu apply. But CI can't run tofu apply because CI itself can't clone (the TLS bug).

First action for next session: manually apply the Woodpecker Helm release to break the bootstrap loop. This requires the woodpecker_db_password and woodpecker_encryption_key terraform variables which are CI-only secrets (stored in Woodpecker CI secrets, not local env). Options:

  1. Extract from k8s: kubectl get secret -n woodpecker woodpecker-db-credentials -o jsonpath='{.data.password}' | base64 -d
  2. Or: kubectl edit statefulset woodpecker-server -n woodpecker to patch the WOODPECKER_FORGEJO_URL env var directly as a temporary fix
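One mechanical detail of option 1 worth noting: jsonpath returns the secret value base64-encoded, so the base64 -d step is required. A self-contained illustration of that decode step with a stand-in value (the real password is, of course, not shown here):

```shell
# Stand-in for what `kubectl get secret ... -o jsonpath='{.data.password}'`
# would print: the value arrives base64-encoded.
encoded='cGFsLWUtc2VjcmV0'

# Same decode step as option 1.
password=$(printf '%s' "$encoded" | base64 -d)
echo "$password"   # → pal-e-secret
```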

Once CI is unblocked, the remaining alert cleanup can flow through normal PRs.


Update: CI Bootstrap Partially Resolved

What worked

  • Manually patched Woodpecker server StatefulSet with kubectl set env to point WOODPECKER_FORGEJO_URL at http://forgejo-http.forgejo.svc.cluster.local:80
  • Server restarted, API calls to Forgejo now go internal
  • pal-e-platform manual pipeline #119: clone succeeded on first attempt

What still fails

  • basketball-api pipelines #33, #34: clone still fails with TLS EOF
  • The clone STEP uses the git URL from Forgejo's API (clone_url field), which is based on Forgejo's ROOT_URL — still the external funnel URL
  • PR #108 fix only affects server-side API traffic, NOT the clone plugin's git operations

Root cause

Two separate traffic paths:

  1. Server → Forgejo API (OAuth, PR queries, repo sync) — FIXED by internal URL
  2. Clone plugin → Forgejo git (git fetch) — STILL external, STILL TLS flaky

Next actions (for this issue)

  1. Add a custom clone step in .woodpecker.yaml using the internal URL: http://forgejo-http.forgejo.svc.cluster.local:80/forgejo_admin/{repo}.git
  2. OR change Forgejo ROOT_URL to internal (breaks external git clone/push — needs careful evaluation)
  3. OR add CoreDNS rewrite: forgejo.tail5b443a.ts.net → internal service IP (transparent to all consumers)
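Option 1 could look roughly like this in a repo's .woodpecker.yaml. A sketch: the clone image and the exact CI_* variable names should be checked against the Woodpecker docs before relying on them:

```yaml
# Sketch of option 1: disable the default clone plugin and fetch over
# the internal service URL instead. Image and variable names assumed.
skip_clone: true
steps:
  - name: clone
    image: alpine/git
    commands:
      - git clone http://forgejo-http.forgejo.svc.cluster.local:80/forgejo_admin/${CI_REPO_NAME}.git .
      - git checkout ${CI_COMMIT_SHA}
```

For comparison, option 3 would be a single rewrite rule in the CoreDNS Corefile (e.g. `rewrite name forgejo.tail5b443a.ts.net forgejo-http.forgejo.svc.cluster.local`), which keeps every consumer, including the clone plugin, untouched.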