Platform cleanup: resolve 15 active alerts + stabilize CI #109

Closed
opened 2026-03-18 16:39:45 +00:00 by forgejo_admin · 2 comments

Lineage

plan-pal-e-platform — platform hardening

Repo

forgejo_admin/pal-e-platform + forgejo_admin/pal-e-deployments

User Story

As a platform operator
I want zero non-Watchdog alerts firing
So that real incidents are visible and the platform doesn't cry wolf

Context

As of 2026-03-18, 15 alerts are active. Most are from stale deployments, non-critical dev namespaces, and metric collection gaps — not real outages. Alert fatigue trains us to ignore real incidents.

Active Alerts (triage)

Fix immediately:

  • KubeDeploymentRolloutStuck: westsidekingsandqueens — old SSR image can't pull
  • KubeDeploymentReplicasMismatch: westsidekingsandqueens — same root cause
  • KubePodNotReady: westside-app-84cd9ff5f6 — stale pod
  • EndpointDown: westside-app — blackbox fails on broken prod

Investigate:

  • EndpointDown: keycloak — pod running, OIDC returns 200, probe may be checking the wrong path
  • OOMKilled: argocd-repo-server — needs memory limit bump
  • TargetDown: postgres (9187) — CNPG metrics exporter, PodMonitor fix pending
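For the Postgres target, the pending PodMonitor fix may not need a hand-written PodMonitor at all: CNPG can generate one for its 9187 exporter from the Cluster resource. A minimal sketch, with the cluster name and namespace assumed (not confirmed by this issue):

```yaml
# Sketch only: CNPG generates the PodMonitor for its :9187 metrics
# exporter when monitoring is enabled on the Cluster resource.
# Cluster name and namespace below are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres
  namespace: postgres
spec:
  monitoring:
    enablePodMonitor: true
```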

Clean up:

  • capacitor-dev crash loop + restart storm — stale dev namespace
  • palworld job failed — game server cron
  • KubeJobFailed: postgres — backup/WAL jobs

File Targets

  • pal-e-deployments/overlays/westsidekingsandqueens/prod/kustomization.yaml — update image tag to SPA build
  • terraform/main.tf — ArgoCD repo-server memory limits, blackbox probe paths
  • pal-e-deployments/overlays/capacitor-dev/ — clean up or fix
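The westside image bump is most likely a plain image-tag override in the prod overlay. A sketch, with the base path and image name assumed and the tag left as a placeholder for the PR #37 SHA (not reproduced here):

```yaml
# pal-e-deployments/overlays/westsidekingsandqueens/prod/kustomization.yaml (sketch)
# The resources path and image name are assumptions; newTag must be the
# squash-merged SHA from PR #37 per the constraints below.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: westside-app
    newTag: <squash-merged-sha-from-pr-37>
```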

Acceptance Criteria

  • Only Watchdog alert firing
  • westside-app prod serves SPA
  • ArgoCD repo-server not OOMing
  • Postgres metrics target UP
  • CI clone succeeds first attempt (PR #108)

Test Expectations

  • curl -s 'https://alertmanager.tail5b443a.ts.net/api/v2/alerts?active=true' | jq '.[].labels.alertname' returns only Watchdog
  • curl -s https://westsidekingsandqueens.tail5b443a.ts.net returns SPA HTML
  • 5 consecutive CI pipelines clone successfully
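The first expectation can be scripted as a pass/fail check. A sketch of the filtering logic, run here against a hard-coded sample payload rather than the live Alertmanager (the array-of-alerts shape matches the v2 API; the sample values are illustrative):

```shell
# Sample shaped like GET /api/v2/alerts?active=true output; in practice
# this would come from the curl command above.
payload='[{"labels":{"alertname":"Watchdog","severity":"none"}}]'

# Pass only if every active alert is named Watchdog.
names=$(printf '%s' "$payload" | jq -r '.[].labels.alertname' | sort -u)
if [ "$names" = "Watchdog" ]; then
  echo "PASS: only Watchdog firing"
else
  echo "FAIL: active alerts: $names"
fi
```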

Constraints

  • Don't delete namespaces with persistent data (postgres PVCs)
  • Don't change Forgejo external URL
  • Westside prod update must use the squash-merged SHA from PR #37

Checklist

  • Westside prod image updated
  • Stale pods cleaned
  • ArgoCD memory bumped
  • Keycloak probe investigated
  • capacitor-dev stabilized
  • CI TLS fix merged (PR #108)
  • Alert count verified

Related

  • Bug #107 — TLS clone fix (PR #108 in review)
  • Issue #99 — observability
  • PR #106 — blackbox probe for westside-dev
  • OTel evaluation deferred — add prometheus-fastapi-instrumentator first

Priority Blocker: CI Bootstrap Problem

PR #108 (TLS clone fix) is merged to main but NOT applied. The fix changes Woodpecker Helm values, which requires a tofu apply. But CI can't run tofu apply because CI itself can't clone (the TLS bug).

First action for next session: manually apply the Woodpecker Helm release to break the bootstrap loop. This requires the woodpecker_db_password and woodpecker_encryption_key terraform variables which are CI-only secrets (stored in Woodpecker CI secrets, not local env). Options:

  1. Extract from k8s: kubectl get secret -n woodpecker woodpecker-db-credentials -o jsonpath='{.data.password}' | base64 -d
  2. Or: kubectl edit statefulset woodpecker-server -n woodpecker to patch the WOODPECKER_FORGEJO_URL env var directly as a temporary fix
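One mechanical detail of option 1 worth noting: jsonpath returns the secret value base64-encoded, so the base64 -d step is required. A self-contained illustration of that decode step with a stand-in value (the real password is, of course, not shown here):

```shell
# Stand-in for what `kubectl get secret ... -o jsonpath='{.data.password}'`
# would print: the value arrives base64-encoded.
encoded='cGFsLWUtc2VjcmV0'

# Same decode step as option 1.
password=$(printf '%s' "$encoded" | base64 -d)
echo "$password"   # → pal-e-secret
```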

Once CI is unblocked, the remaining alert cleanup can flow through normal PRs.


Update: CI Bootstrap Partially Resolved

What worked

  • Manually patched Woodpecker server StatefulSet with kubectl set env to point WOODPECKER_FORGEJO_URL at http://forgejo-http.forgejo.svc.cluster.local:80
  • Server restarted, API calls to Forgejo now go internal
  • pal-e-platform manual pipeline #119: clone succeeded on first attempt

What still fails

  • basketball-api pipelines #33, #34: clone still fails with TLS EOF
  • The clone STEP uses the git URL from Forgejo's API (clone_url field), which is based on Forgejo's ROOT_URL — still the external funnel URL
  • PR #108 fix only affects server-side API traffic, NOT the clone plugin's git operations

Root cause

Two separate traffic paths:

  1. Server → Forgejo API (OAuth, PR queries, repo sync) — FIXED by internal URL
  2. Clone plugin → Forgejo git (git fetch) — STILL external, STILL TLS flaky

Next actions (for this issue)

  1. Add a custom clone step in .woodpecker.yaml using the internal URL: http://forgejo-http.forgejo.svc.cluster.local:80/forgejo_admin/{repo}.git
  2. OR change Forgejo ROOT_URL to internal (breaks external git clone/push — needs careful evaluation)
  3. OR add CoreDNS rewrite: forgejo.tail5b443a.ts.net → internal service IP (transparent to all consumers)
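Option 1 could look roughly like this in a repo's .woodpecker.yaml. A sketch: the clone image and the exact CI_* variable names should be checked against the Woodpecker docs before relying on them:

```yaml
# Sketch of option 1: disable the default clone plugin and fetch over
# the internal service URL instead. Image and variable names assumed.
skip_clone: true
steps:
  - name: clone
    image: alpine/git
    commands:
      - git clone http://forgejo-http.forgejo.svc.cluster.local:80/forgejo_admin/${CI_REPO_NAME}.git .
      - git checkout ${CI_COMMIT_SHA}
```

For comparison, option 3 would be a single rewrite rule in the CoreDNS Corefile (e.g. `rewrite name forgejo.tail5b443a.ts.net forgejo-http.forgejo.svc.cluster.local`), which keeps every consumer, including the clone plugin, untouched.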