Bug: platform-validation OOMKilled at 64Mi + stale OOMKilled alert rule #171

Closed
opened 2026-03-26 15:22:53 +00:00 by forgejo_admin · 8 comments

Type

Bug

Lineage

standalone — discovered during AlertManager triage 2026-03-26

Repo

forgejo_admin/pal-e-platform (resource limits in terraform or kustomize), alert rule in PrometheusRule

What Broke

Two related issues:

  1. OOMKilled: platform-validation container was OOM-killed on 2026-03-21 with a 64Mi memory limit. The pod restarted and has been stable for 4 days, but the limit is too tight and the OOMKill is likely to recur.

  2. Stale alert rule: The OOMKilled alert rule fires on kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0 with for: 0m. This means it fires forever as long as the pod's last termination was OOM — even after the pod recovers. The alert has been firing since 2026-03-24 (critical) despite the pod being healthy for 4 days.
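For reference, a minimal sketch of how the stale rule would look as a rendered PrometheusRule (the expr, for: 0m, and critical severity come from this issue; the group name and label layout are assumptions):

```yaml
# Sketch only: group name and label structure are assumed, not taken
# from the actual PrometheusRule in this repo.
groups:
  - name: platform-alerts            # assumed group name
    rules:
      - alert: OOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
        for: 0m                      # fires immediately, and keeps firing on historical state
        labels:
          severity: critical
```

Because the metric reports the last termination reason, the expression stays true after recovery, which is why the alert never resolves on its own.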

Repro Steps

  1. kubectl get pod -n platform-validation -o jsonpath='{.items[0].spec.containers[0].resources}' → limits.memory: 64Mi
  2. kubectl describe pod -n platform-validation -l app=platform-validation → Last State: OOMKilled
  3. Check AlertManager: OOMKilled alert (critical) still firing

Expected Behavior

Memory limit is sufficient to prevent OOMKill. Alert rule auto-resolves after pod recovers.

Environment

  • Cluster/namespace: platform-validation
  • Current resources: requests 32Mi, limits 64Mi
  • Related alerts: OOMKilled (critical), firing since 2026-03-24

Acceptance Criteria

  • Memory limit bumped to 128Mi (or appropriate based on actual usage)
  • OOMKilled alert rule redesigned to check recent restart count rather than historical termination reason (e.g. increase(kube_pod_container_status_restarts_total{reason="OOMKilled"}[15m]) > 0)
  • Current OOMKilled alert clears
  • No regression on platform-validation functionality
  • project-pal-e-platform — project
  • story:superuser-observe — user story (alert rule), story:superuser-deploy — user story (resource limits)
  • arch:prometheus — architecture component (alert rule)
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-388-2026-03-26

Three issues found that must be resolved before this ticket is agent-ready:

  • Cross-repo mismatch: Memory limit lives in pal-e-deployments/overlays/platform-validation/prod/deployment-patch.yaml (line 26-27), not in this repo. File Targets section missing entirely.
  • Invalid PromQL: The proposed replacement increase(kube_pod_container_status_restarts_total{reason="OOMKilled"}[15m]) > 0 is invalid — kube_pod_container_status_restarts_total does not have a reason label. Would silently match nothing.
  • Missing template sections: User Story, File Targets, Test Expectations, Constraints, and Checklist are all absent.
Author
Owner

Scope Correction (post-review)

Per review review-388-2026-03-26, correcting repo, file targets, and PromQL.

Repo Correction

Two repos, not one:

  1. Memory limit → forgejo_admin/pal-e-deployments (kustomize overlay, not terraform)
  2. Alert rule → forgejo_admin/pal-e-platform (terraform PrometheusRule)

File Targets

Memory limit (pal-e-deployments):

  • overlays/platform-validation/prod/deployment-patch.yaml lines 22-27
  • Change limits.memory from 64Mi to 128Mi
  • platform-validation is the only service at 64Mi — all others are 128-256Mi
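As a sketch, the patched section of deployment-patch.yaml might look like this (the surrounding structure and container name are assumptions; only the limits.memory change from 64Mi to 128Mi is specified here):

```yaml
# overlays/platform-validation/prod/deployment-patch.yaml (hypothetical layout)
spec:
  template:
    spec:
      containers:
        - name: platform-validation   # assumed container name
          resources:
            requests:
              memory: 32Mi            # unchanged
            limits:
              memory: 128Mi           # was 64Mi, the limit that caused the OOMKill
```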

Alert rule (pal-e-platform):

  • terraform/main.tf lines 230-242 (OOMKilled rule in additionalPrometheusRules)
  • Current: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0 with for: 0m
  • Fix: Add for: 15m so the alert auto-resolves after pod recovery instead of firing forever on historical state

PromQL Correction

The originally proposed replacement expression is INVALID:
increase(kube_pod_container_status_restarts_total{reason="OOMKilled"}[15m]) > 0 — kube_pod_container_status_restarts_total does NOT carry a reason label in kube-state-metrics.

Correct fix: Keep the existing expression, just add for: 15m. This means the alert fires on OOMKill but auto-resolves 15 minutes after recovery (once the pod's last termination reason is no longer OOMKilled because it has been running successfully). Standard Prometheus pattern.
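Sketched as a rendered rule, this fix keeps the expression and only adds the for: duration (group and label names are illustrative):

```yaml
groups:
  - name: platform-alerts            # assumed group name
    rules:
      - alert: OOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
        for: 15m                     # was 0m: the alert now stays pending for 15m before firing
        labels:
          severity: critical
```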

Acceptance Criteria (updated)

  • Memory limit bumped to 128Mi in pal-e-deployments
  • OOMKilled alert rule updated with for: 15m in pal-e-platform
  • ArgoCD syncs platform-validation successfully
  • Current OOMKilled alert clears within 15 minutes
  • tofu plan shows only the for: duration change
Author
Owner

Scope Review v2: NEEDS_REFINEMENT

Review note: review-388-2026-03-26 (updated)

Second review pass after scope correction comment. File targets verified, PromQL correction validated. Three issues remain:

  • Cross-repo scope not agent-readable: Corrections live only in a comment. Issue body still says pal-e-platform only. Agent reading the body would miss the pal-e-deployments memory bump entirely. Either update the body or create a companion issue.
  • Misleading acceptance criterion: "Alert clears within 15 minutes" is incorrect — kube_pod_container_status_last_terminated_reason persists after pod recovery. The alert clears in this case only because the memory bump triggers a deployment rollout (new pod, no OOM history). for: 15m alone does not resolve staleness.
  • Missing template sections: Test Expectations, Constraints (tofu plan -lock=false, two separate PRs, ArgoCD sync), and Checklist still absent.
Author
Owner

Refinement: alert behavior correction + cross-repo clarity

Per review review-388-2026-03-26:

Alert Behavior Correction

kube_pod_container_status_last_terminated_reason persists indefinitely after recovery — for: 15m alone does NOT clear staleness. In THIS case the alert clears because bumping memory triggers a deployment rollout (new pod = no OOM history). For future OOMKills without a redeploy, the alert would still persist.

Consider alternative: change expression to track recent OOM events only, e.g. changes(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[1h]) > 0 — fires only when a NEW OOMKill occurs in the last hour. Dev agent should evaluate both approaches.
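Sketched as a rule, the changes()-based alternative would look roughly like this (group and label names are illustrative; the expression is the one proposed above):

```yaml
groups:
  - name: platform-alerts            # assumed group name
    rules:
      - alert: OOMKilled
        # changes() counts value changes of the series over the window, so
        # this only fires when a new OOMKill occurred within the last hour
        expr: changes(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[1h]) > 0
        for: 0m
        labels:
          severity: critical
```

This variant resolves on its own once the window slides past the last OOM event, without depending on a redeploy.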

Cross-Repo Clarity

Two separate PRs needed:

  1. pal-e-deployments PR: memory limit 64Mi → 128Mi in overlays/platform-validation/prod/deployment-patch.yaml
  2. pal-e-platform PR: OOMKilled alert rule update in terraform/main.tf lines 230-241

Test Expectations (added)

  • pal-e-deployments: kustomize build overlays/platform-validation/prod/ shows 128Mi
  • pal-e-platform: tofu plan -lock=false shows only the alert rule change
  • Post-merge: kubectl describe pod -n platform-validation shows limits.memory=128Mi
  • Post-merge: OOMKilled alert clears in AlertManager (due to rollout)
Author
Owner

Scope Review v3: NEEDS_REFINEMENT

Review note: review-388-2026-03-26-v2

Re-review after two scope correction comments. The corrections address the original concerns (wrong repo, invalid PromQL, alert staleness). Three issues remain before this is agent-ready:

  • Issue body is stale: All corrections live in comments only. The body still says repo is pal-e-platform only. An agent reading the body would miss the pal-e-deployments memory bump entirely. Consolidate corrections into the body.
  • Wrong line numbers: Comment #2 says terraform/main.tf lines 230-241 but the OOMKilled rule is actually at lines 263-274. Verified: line 264 (alert name), line 265 (expr), line 266 (for = "0m").
  • PR ordering dependency undocumented: The memory bump triggers a rollout that clears the stale alert metric. If the alert rule PR merges first without the memory bump, the alert stays stale. Either document the ordering or pick the changes() expression approach which is order-independent.

Optional: add User Story, Constraints (tofu plan -lock=false, ArgoCD sync ordering), Checklist sections. Note: use kubectl kustomize, not standalone kustomize, in test commands.

Author
Owner

Refinement v3: line numbers + PR ordering + consolidated scope

Per review review-388-2026-03-26-v2:

Line Number Correction

OOMKilled alert rule is at lines 263-274, NOT 230-241. Verified:

  • Line 264: alert = "OOMKilled"
  • Line 265: expr = "kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"} > 0"
  • Line 266: for = "0m" ← change to for = "15m"

PR Ordering

pal-e-deployments PR (memory bump) must merge first. The rollout from 64Mi→128Mi creates a new pod with clean OOM history, which clears the stale alert. If the alert rule PR merges first without the rollout, the alert persists until the next deployment.

Sequence:

  1. PR #1: pal-e-deployments — memory 64Mi→128Mi → merge → ArgoCD deploys → new pod
  2. PR #2: pal-e-platform — OOMKilled for: "0m" → for: "15m" → merge → tofu apply

Consolidated Scope (replaces all previous comments)

Two repos, two PRs, ordered:

PR 1 (pal-e-deployments, merge first):

  • File: overlays/platform-validation/prod/deployment-patch.yaml lines 22-27
  • Change: limits.memory: 64Mi → limits.memory: 128Mi
  • Test: kustomize build overlays/platform-validation/prod/ shows 128Mi
  • Verify: kubectl describe pod -n platform-validation shows 128Mi after ArgoCD sync

PR 2 (pal-e-platform, merge second):

  • File: terraform/main.tf lines 263-274 (OOMKilled rule)
  • Change: for = "0m" → for = "15m"
  • Test: tofu plan -lock=false shows only the for: duration change
  • Verify: OOMKilled alert clears in AlertManager
Author
Owner

Scope Review v3: NEEDS_REFINEMENT

Review note: review-388-2026-03-26-v3

v3 correction comment fixes line numbers (263-274 verified), PR ordering (pal-e-deployments first), and alert staleness explanation. All file targets verified against codebase. One blocking issue remains, along with minor notes:

  • BLOCKING: Issue body is stale. Body still says repo is pal-e-platform only, has no File Targets, lists invalid PromQL in acceptance criteria, and references no line numbers. An agent reading the body would produce incorrect work. Consolidate v3 correction into the body.
  • Minor: Test command wrong. kustomize build should be kubectl kustomize -- standalone kustomize is not installed.
  • Minor: Missing Constraints section. tofu plan -lock=false, two separate PRs, ArgoCD sync ordering should be documented.
  • Blast radius note: PodRestartStorm rule (lines 252-262) also uses for = "0m" -- same staleness class. Not in scope but worth a follow-up ticket.
Author
Owner

Agent Instructions: Consolidated Scope

IGNORE THE ISSUE BODY ABOVE — it contains stale information (wrong repo, wrong line numbers, invalid PromQL).

USE THIS AS THE AUTHORITATIVE SPEC:

Two PRs, ordered:

PR 1 (pal-e-deployments, merge first):

  • File: overlays/platform-validation/prod/deployment-patch.yaml lines 22-27
  • Change: limits.memory: 64Mi → limits.memory: 128Mi
  • Test: kubectl kustomize overlays/platform-validation/prod/ shows 128Mi
  • Verify: kubectl describe pod -n platform-validation shows 128Mi after ArgoCD sync

PR 2 (pal-e-platform, merge second):

  • File: terraform/main.tf lines 263-274 (OOMKilled alert rule)
  • Change: for = "0m" → for = "15m"
  • Test: tofu plan -lock=false shows only the for: duration change
  • Verify: OOMKilled alert clears in AlertManager

Constraints

  • tofu plan -lock=false (don't hold state lock)
  • Two separate PRs on two separate repos
  • PR 1 must merge and ArgoCD sync before PR 2 matters (rollout clears stale metric)

Discovered Scope (separate ticket)

  • PodRestartStorm rule at lines 252-262 also uses for = "0m" — same staleness class
Reference
forgejo_admin/pal-e-platform#171