fix: OOMKilled alert for: 0m -> 15m #185
forgejo_admin/pal-e-platform!185
Summary
The OOMKilled alert rule fires on `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}`, which persists indefinitely after pod recovery. With `for: 0m`, the alert fires forever on historical OOM state even after the pod is healthy for days. Changing to `for: 15m` allows the alert to auto-resolve after a deployment rollout clears the stale metric.
Changes
- `terraform/main.tf` line 266: `for = "0m"` changed to `for = "15m"` on the OOMKilled alert rule in `helm_release.kube_prometheus_stack`
Test Plan
- `tofu fmt` passes
- `tofu validate` passes
- `tofu plan -lock=false` shows only the `for:` duration change on `helm_release.kube_prometheus_stack` (0 to add, 1 to change, 0 to destroy)
tofu plan Output
Review Checklist
Discovered Scope
- `PodRestartStorm` rule at lines 252-262 also uses `for = "0m"` -- same staleness class. Separate ticket needed.
Related
Review: APPROVED
Diff: 1 file, 1 line changed.
`for = "0m"` to `for = "15m"` on the OOMKilled alert rule at line 266 of `terraform/main.tf`.
Verified:
- `for:` duration -- no other lines touched
- `tofu fmt` passes
- `tofu validate` passes
- `tofu plan -lock=false` shows 0 to add, 1 to change, 0 to destroy -- only the `helm_release.kube_prometheus_stack` values diff
No findings. Clean single-line fix matching the issue spec exactly.
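For orientation, the reviewed change might look roughly like the sketch below. This is a hypothetical excerpt, not the actual contents of `terraform/main.tf`: only the resource name `helm_release.kube_prometheus_stack`, the metric selector, and the `for` values come from this PR. The chart coordinates, the `additionalPrometheusRulesMap` layout, the group and alert names, the `== 1` comparison, and the severity label are assumptions made for illustration.

```hcl
# Hypothetical excerpt. Only the resource name, the metric selector, and the
# `for` values are taken from the PR; everything else is assumed.
resource "helm_release" "kube_prometheus_stack" {
  name       = "kube-prometheus-stack"                              # assumed
  repository = "https://prometheus-community.github.io/helm-charts" # assumed
  chart      = "kube-prometheus-stack"                              # assumed

  values = [yamlencode({
    additionalPrometheusRulesMap = {
      "custom-alerts" = { # assumed map key
        groups = [{
          name = "pod-health" # assumed group name
          rules = [{
            alert = "OOMKilled" # assumed alert name
            # Selector from the PR; the "== 1" comparison is assumed.
            expr = "kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"} == 1"
            # The one-line change: was "0m" before this PR.
            for = "15m"
            labels = {
              severity = "warning" # assumed
            }
          }]
        }]
      }
    }
  })]
}
```

Because the rule is embedded in the Helm release values, a change like this shows up in `tofu plan` as an in-place update of the release rather than as a separate resource, which is consistent with the "0 to add, 1 to change, 0 to destroy" result noted above.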
PR #185 Review
DOMAIN REVIEW
Tech stack: Terraform (OpenTofu) / Helm / Prometheus alerting rules
This is a single-line change to the `kube_prometheus_stack` Helm release values in `terraform/main.tf`. The OOMKilled alert rule's `for` field is changed from `"0m"` to `"15m"`.
Technical assessment:
- `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}` metric persists the last termination reason even after pod recovery. With `for: 0m`, this fires instantly and never resolves -- the PR body correctly diagnoses this staleness class.
- `for: 15m` means the condition must hold for 15 continuous minutes before firing. After a deployment rollout clears the stale metric (pod restarts reset `last_terminated_reason`), the alert resolves within one evaluation cycle.
- `tofu plan` output confirms the change is isolated to the `for` duration on the OOMKilled rule with 0 to add, 1 to change, 0 to destroy.
Discovered scope properly tracked: The PR body correctly identifies that `PodRestartStorm` (line 252-254) has the same `for = "0m"` pattern and defers it to a separate ticket. Good discipline.
BLOCKERS
None.
NITS
None. The change is minimal and well-scoped.
SOP COMPLIANCE
- Branch `171-oomkilled-alert-for-duration` references #171
- `tofu plan -lock=false` output included per PR conventions
- `tofu fmt` and `tofu validate` confirmed passing
PROCESS OBSERVATIONS
VERDICT: APPROVED