Bug: platform-validation OOMKilled at 64Mi + stale OOMKilled alert rule #171
Reference: `forgejo_admin/pal-e-platform#171`
**Type:** Bug

**Lineage:** standalone — discovered during AlertManager triage 2026-03-26

**Repo:** `forgejo_admin/pal-e-platform` (resource limits in terraform or kustomize; alert rule in PrometheusRule)

## What Broke
Two related issues:

1. **OOMKilled:** the `platform-validation` container was OOM-killed on 2026-03-21 with a 64Mi memory limit. The pod restarted and has been stable for 4 days, but the limit is too tight and the kill will likely recur.
2. **Stale alert rule:** the `OOMKilled` alert rule fires on `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0` with `for: 0m`. It therefore fires for as long as the pod's *last* termination was an OOMKill — even after the pod recovers. The alert has been firing at critical since 2026-03-24 despite the pod being healthy for 4 days.

## Repro Steps
1. `kubectl get pod -n platform-validation -o jsonpath='{.items[0].spec.containers[0].resources}'` → `limits.memory: 64Mi`
2. `kubectl describe pod -n platform-validation -l app=platform-validation` → `Last State: OOMKilled`
3. `OOMKilled` alert (critical) still firing

## Expected Behavior
Memory limit is sufficient to prevent OOMKill. Alert rule auto-resolves after pod recovers.
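One way to verify the auto-resolve behavior is to query Prometheus's built-in `ALERTS` synthetic series (a sketch — the alertname comes from this ticket; `ALERTS` and `alertstate` are standard Prometheus labels):

```promql
# Once fixed, this should return no series while the pod is healthy:
ALERTS{alertname="OOMKilled", alertstate="firing"}
```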
## Environment

- `OOMKilled` alert (critical), firing since 2026-03-24

## Acceptance Criteria

- Alert rule updated (e.g. `increase(kube_pod_container_status_restarts_total{reason="OOMKilled"}[15m]) > 0`)
- `OOMKilled` alert clears

## Related
- `project-pal-e-platform` — project
- `story:superuser-observe` — user story (alert rule)
- `story:superuser-deploy` — user story (resource limits)
- `arch:prometheus` — architecture component (alert rule)

## Scope Review: NEEDS_REFINEMENT
Review note: `review-388-2026-03-26`

Three issues found that must be resolved before this ticket is agent-ready:

1. Wrong repo: the memory limit lives in `pal-e-deployments/overlays/platform-validation/prod/deployment-patch.yaml` (lines 26-27), not in this repo.
2. File Targets section missing entirely.
3. Invalid PromQL: `increase(kube_pod_container_status_restarts_total{reason="OOMKilled"}[15m]) > 0` — `kube_pod_container_status_restarts_total` does not have a `reason` label, so this would silently match nothing.

## Scope Correction (post-review)
Per review `review-388-2026-03-26`, correcting repo, file targets, and PromQL.

### Repo Correction

Two repos, not one:

- `forgejo_admin/pal-e-deployments` (kustomize overlay, not terraform)
- `forgejo_admin/pal-e-platform` (terraform PrometheusRule)

### File Targets
Memory limit (`pal-e-deployments`):

- `overlays/platform-validation/prod/deployment-patch.yaml` lines 22-27
- `limits.memory` from `64Mi` to `128Mi`

Alert rule (`pal-e-platform`):

- `terraform/main.tf` lines 230-242 (OOMKilled rule in `additionalPrometheusRules`)
- Current: `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0` with `for: 0m`
- Change to `for: 15m` so the alert auto-resolves after pod recovery instead of firing forever on historical state

### PromQL Correction
The originally proposed replacement expression is INVALID: `increase(kube_pod_container_status_restarts_total{reason="OOMKilled"}[15m]) > 0` — `kube_pod_container_status_restarts_total` does NOT carry a `reason` label in kube-state-metrics.

Correct fix: keep the existing expression and just add `for: 15m`. The alert then fires on an OOMKill but auto-resolves 15 minutes after recovery (when the pod's last termination reason is no longer OOMKilled after a successful run). Standard Prometheus pattern.

### Acceptance Criteria (updated)
- `for: 15m` in `pal-e-platform`
- `OOMKilled` alert clears within 15 minutes
- `tofu plan` shows only the `for:` duration change

## Scope Review v2: NEEDS_REFINEMENT
Review note: `review-388-2026-03-26` (updated)

Second review pass after the scope correction comment. File targets verified, PromQL correction validated. Three issues remain:

1. The issue body still points at `pal-e-platform` only. An agent reading the body would miss the `pal-e-deployments` memory bump entirely. Either update the body or create a companion issue.
2. `kube_pod_container_status_last_terminated_reason` persists after pod recovery. The alert clears in this case only because the memory bump triggers a deployment rollout (new pod, no OOM history). `for: 15m` alone does not resolve staleness.
3. Constraints (`tofu plan -lock=false`, two separate PRs, ArgoCD sync) and Checklist sections are still absent.

## Refinement: alert behavior correction + cross-repo clarity
Per review `review-388-2026-03-26`:

### Alert Behavior Correction

`kube_pod_container_status_last_terminated_reason` persists indefinitely after recovery — `for: 15m` alone does NOT clear staleness. In THIS case the alert clears because bumping memory triggers a deployment rollout (new pod = no OOM history). For future OOMKills without a redeploy, the alert would still persist.
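A sketch of the rule shape under discussion (rendered here as PrometheusRule YAML for readability — the actual definition lives in terraform; the `expr` and `for` values come from this thread):

```yaml
- alert: OOMKilled
  expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
  for: 15m
  # The gauge stays at 1 until the container terminates again for a different
  # reason or the pod is replaced, so `for: 15m` only delays firing; it does
  # not make the alert resolve on recovery alone. A rollout (new pod, no OOM
  # history) drops the series and clears the alert.
```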
changes(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[1h]) > 0— fires only when a NEW OOMKill occurs in the last hour. Dev agent should evaluate both approaches.Cross-Repo Clarity
Two separate PRs needed:

- PR 1: `overlays/platform-validation/prod/deployment-patch.yaml`
- PR 2: `terraform/main.tf` lines 230-241

### Test Expectations (added)
- `kustomize build overlays/platform-validation/prod/` shows 128Mi
- `tofu plan -lock=false` shows only the alert rule change
- `kubectl describe pod -n platform-validation` shows `limits.memory=128Mi`

## Scope Review v3: NEEDS_REFINEMENT
Review note: `review-388-2026-03-26-v2`

Re-review after two scope correction comments. The corrections address the original concerns (wrong repo, invalid PromQL, alert staleness). Three issues remain before this is agent-ready:

1. The issue body still points at `pal-e-platform` only. An agent reading the body would miss the `pal-e-deployments` memory bump entirely. Consolidate corrections into the body.
2. Wrong line numbers: `terraform/main.tf` lines 230-241 are cited, but the OOMKilled rule is actually at lines 263-274. Verified: line 264 (alert name), line 265 (expr), line 266 (`for = "0m"`).
3. Consider the `changes()` expression approach, which is order-independent.

Optional: add User Story, Constraints (`tofu plan -lock=false`, ArgoCD sync ordering), and Checklist sections. Note `kubectl kustomize`, not `kustomize`, for test commands.

## Refinement v3: line numbers + PR ordering + consolidated scope
Per review `review-388-2026-03-26-v2`:

### Line Number Correction

The OOMKilled alert rule is at lines 263-274, NOT 230-241. Verified:

- `alert = "OOMKilled"`
- `expr = "kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"} > 0"`
- `for = "0m"` ← change to `for = "15m"`

### PR Ordering
The `pal-e-deployments` PR (memory bump) must merge first. The rollout from 64Mi→128Mi creates a new pod with a clean OOM history, which clears the stale alert. If the alert-rule PR merges first without the rollout, the alert persists until the next deployment.
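The memory bump itself is a small strategic-merge patch; a sketch of what `deployment-patch.yaml` plausibly contains (the Deployment/container names and surrounding structure are assumptions — only the 64Mi→128Mi limit change comes from this ticket):

```yaml
# Sketch only — names and structure assumed, not copied from pal-e-deployments.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: platform-validation  # assumed name
spec:
  template:
    spec:
      containers:
        - name: platform-validation  # assumed name
          resources:
            limits:
              memory: 128Mi  # was 64Mi
```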
Sequence: `for: "0m"` → `for: "15m"` → merge → `tofu apply`

### Consolidated Scope (replaces all previous comments)
Two repos, two PRs, ordered:

**PR 1 (`pal-e-deployments`, merge first):**

- `overlays/platform-validation/prod/deployment-patch.yaml` lines 22-27
- `limits.memory: 64Mi` → `limits.memory: 128Mi`
- `kustomize build overlays/platform-validation/prod/` shows 128Mi
- `kubectl describe pod -n platform-validation` shows 128Mi after ArgoCD sync

**PR 2 (`pal-e-platform`, merge second):**

- `terraform/main.tf` lines 263-274 (OOMKilled rule)
- `for = "0m"` → `for = "15m"`
- `tofu plan -lock=false` shows only the `for:` duration change

## Scope Review v4: NEEDS_REFINEMENT
Review note: `review-388-2026-03-26-v3`

The v3 correction comment fixes the line numbers (263-274 verified), PR ordering (`pal-e-deployments` first), and the alert staleness explanation. All file targets verified against the codebase. One blocking issue remains:

- The issue body still points at `pal-e-platform` only, has no File Targets, lists invalid PromQL in its acceptance criteria, and references no line numbers. An agent reading the body would produce incorrect work. Consolidate the v3 correction into the body.

Minor notes:

- `kustomize build` should be `kubectl kustomize` — standalone kustomize is not installed.
- Constraints should be documented: `tofu plan -lock=false`, two separate PRs, ArgoCD sync ordering.
- The `PodRestartStorm` rule (lines 252-262) also uses `for = "0m"` — same staleness class. Not in scope, but worth a follow-up ticket.

## Agent Instructions: Consolidated Scope
IGNORE THE ISSUE BODY ABOVE — it contains stale information (wrong repo, wrong line numbers, invalid PromQL).

USE THIS AS THE AUTHORITATIVE SPEC:

Two PRs, ordered:

**PR 1 (`pal-e-deployments`, merge first):**

- `overlays/platform-validation/prod/deployment-patch.yaml` lines 22-27
- `limits.memory: 64Mi` → `limits.memory: 128Mi`
- `kubectl kustomize overlays/platform-validation/prod/` shows 128Mi
- `kubectl describe pod -n platform-validation` shows 128Mi after ArgoCD sync

**PR 2 (`pal-e-platform`, merge second):**

- `terraform/main.tf` lines 263-274 (OOMKilled alert rule)
- `for = "0m"` → `for = "15m"`
- `tofu plan -lock=false` shows only the `for:` duration change

## Constraints
- `tofu plan -lock=false` (don't hold the state lock)

## Discovered Scope (separate ticket)

- `PodRestartStorm` rule at lines 252-262 also uses `for = "0m"` — same staleness class
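For the follow-up ticket, the fix is the same `for` duration change already verified for the OOMKilled rule. A sketch of the corrected block (the `alert`/`expr`/`for` values are quoted earlier in this thread; the surrounding `additionalPrometheusRules` structure is assumed, not copied from `terraform/main.tf`):

```terraform
# Sketch only — surrounding structure assumed.
{
  alert = "OOMKilled"
  expr  = "kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"} > 0"
  for   = "15m" # was "0m"
}
```

The same `for = "15m"` pattern would apply to `PodRestartStorm`, whose expression is not quoted in this thread.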