Platform cleanup: resolve 15 active alerts + stabilize CI #109
Reference
forgejo_admin/pal-e-platform#109
Lineage
- Plan: plan-pal-e-platform — platform hardening
- Repos: forgejo_admin/pal-e-platform + forgejo_admin/pal-e-deployments

User Story
As a platform operator
I want zero non-Watchdog alerts firing
So that real incidents are visible and the platform doesn't cry wolf
Context
As of 2026-03-18, 15 alerts are active. Most are from stale deployments, non-critical dev namespaces, and metric collection gaps — not real outages. Alert fatigue trains us to ignore real incidents.
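The zero-non-Watchdog goal can be checked mechanically. A minimal sketch, factored so the filter can be exercised without a live Alertmanager (the host in the comment comes from this issue's test expectations; the canned JSON response and the `check_watchdog_only` helper name are made up for illustration, and `jq` remains the more robust tool for real responses):

```shell
# Extract alertname values from an Alertmanager v2 /alerts response and
# print anything that is not the always-firing Watchdog. Assumes compact
# JSON, which is what the API returns by default.
check_watchdog_only() {
  grep -o '"alertname":"[^"]*"' | cut -d'"' -f4 | grep -v '^Watchdog$'
}

# Live usage (host from this issue):
#   curl -s 'https://alertmanager.tail5b443a.ts.net/api/v2/alerts?active=true' | check_watchdog_only
# Canned example: one stale-pod alert still firing besides Watchdog.
printf '%s' '[{"labels":{"alertname":"Watchdog"}},{"labels":{"alertname":"KubePodNotReady"}}]' \
  | check_watchdog_only
```

An empty result from the live pipe would mean the acceptance criterion is met.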
Active Alerts (triage)
Fix immediately:
- KubeDeploymentRolloutStuck: westsidekingsandqueens — old SSR image can't pull
- KubeDeploymentReplicasMismatch: westsidekingsandqueens — same root cause
- KubePodNotReady: westside-app-84cd9ff5f6 — stale pod
- EndpointDown: westside-app — blackbox fails on broken prod

Investigate:
- EndpointDown: keycloak — pod running, OIDC returns 200, probe may check wrong path
- OOMKilled: argocd-repo-server — needs memory limit bump
- TargetDown: postgres (9187) — CNPG metrics exporter, PodMonitor fix pending

Clean up:
- capacitor-dev crash loop + restart storm — stale dev namespace
- palworld job failed — game server cron
- KubeJobFailed: postgres — backup/WAL jobs

File Targets
- pal-e-deployments/overlays/westsidekingsandqueens/prod/kustomization.yaml — update image tag to SPA build
- terraform/main.tf — ArgoCD repo-server memory limits, blackbox probe paths
- pal-e-deployments/overlays/capacitor-dev/ — clean up or fix

Acceptance Criteria
Test Expectations
- `curl -s 'https://alertmanager.tail5b443a.ts.net/api/v2/alerts?active=true' | jq '.[].labels.alertname'` returns only Watchdog
- `curl -s https://westsidekingsandqueens.tail5b443a.ts.net` returns SPA HTML

Constraints
Checklist
Related
- prometheus-fastapi-instrumentator first

Priority Blocker: CI Bootstrap Problem
PR #108 (TLS clone fix) is merged to main but NOT applied. The fix changes Woodpecker Helm values, which need `tofu apply`. But CI can't run `tofu apply` because CI itself can't clone (the TLS bug).

First action for next session: manually apply the Woodpecker Helm release to break the bootstrap loop. This requires the `woodpecker_db_password` and `woodpecker_encryption_key` terraform variables, which are CI-only secrets (stored in Woodpecker CI secrets, not local env). Options:

- `kubectl get secret -n woodpecker woodpecker-db-credentials -o jsonpath='{.data.password}' | base64 -d`
- `kubectl edit deployment woodpecker-server -n woodpecker` to patch the FORGEJO_URL env var directly as a temporary fix

Once CI is unblocked, the remaining alert cleanup can flow through normal PRs.
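To make the first option concrete: `kubectl` returns secret data base64-encoded, so the trailing pipe simply reverses that encoding. A sketch, with a made-up encoded value standing in for the real password:

```shell
# Real command (names from this issue; runs against the cluster):
#   kubectl get secret -n woodpecker woodpecker-db-credentials \
#     -o jsonpath='{.data.password}' | base64 -d
# What the final pipe stage does, shown on a made-up value:
printf '%s' 'aHVudGVyMg==' | base64 -d   # prints: hunter2
```

The decoded value can then be fed to a manual `helm upgrade` of the Woodpecker release.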
Update: CI Bootstrap Partially Resolved
What worked
- `kubectl set env` to point `WOODPECKER_FORGEJO_URL` at `http://forgejo-http.forgejo.svc.cluster.local:80`

What still fails
- The clone step: the agent clones using the repo metadata (the `clone_url` field), which is based on Forgejo's `ROOT_URL` — still the external funnel URL
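One way to see exactly what clone URL agents will receive is to ask Forgejo's repository API, since `clone_url` there is derived from `ROOT_URL`. A sketch (the internal service DNS name and repo are from this issue; the `FORGEJO_TOKEN` variable and the canned response are assumptions):

```shell
# Live check against the in-cluster Forgejo (API token is an assumption):
#   curl -s -H "Authorization: token $FORGEJO_TOKEN" \
#     'http://forgejo-http.forgejo.svc.cluster.local:80/api/v1/repos/forgejo_admin/pal-e-platform' \
#     | jq -r .clone_url
# Canned response illustrating the failure mode: clone_url still carries
# the external funnel hostname even when queried over the internal path.
printf '%s' '{"clone_url":"https://forgejo.tail5b443a.ts.net/forgejo_admin/pal-e-platform.git"}' \
  | grep -o '"clone_url":"[^"]*"' | cut -d'"' -f4
```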
Two separate traffic paths:
Next actions (for this issue)
- Override the clone URL in `.woodpecker.yaml` using the internal URL: `http://forgejo-http.forgejo.svc.cluster.local:80/forgejo_admin/{repo}.git`
- Change Forgejo's `ROOT_URL` to the internal URL (breaks external git clone/push — needs careful evaluation)
- DNS/hosts override: point `forgejo.tail5b443a.ts.net` at the internal service IP (transparent to all consumers)
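For the first option, a hedged sketch of what the `.woodpecker.yaml` override might look like. This assumes Woodpecker's `plugin-git` accepts a `remote` setting; the plugin image and setting name are assumptions, not verified against this cluster's Woodpecker version:

```yaml
# Hypothetical clone override in .woodpecker.yaml: force the agent to
# clone via the in-cluster service instead of the ROOT_URL-derived clone_url.
clone:
  git:
    image: woodpeckerci/plugin-git
    settings:
      remote: http://forgejo-http.forgejo.svc.cluster.local:80/forgejo_admin/pal-e-platform.git
```

Note that the repo name is hard-coded per repository here, so the `{repo}` placeholder from the list above would need to be expanded in each pipeline that adopts this workaround.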