Re-add pod-failure coverage after disabling kubernetesApps default rules #325

New issue

Open

opened 2026-05-02 14:51:17 +00:00 by forgejo_admin · 0 comments

forgejo_admin commented

2026-05-02 14:51:17 +00:00

Contributor

Type

Bug

Lineage

Standalone — discovered 2026-05-01 during alert-state audit. Caused by forgejo_admin/pal-e-platform #290.

Repo

forgejo_admin/pal-e-platform

What Broke

PR #290 disabled the entire kubernetesApps default rule family (KubePodNotReady, KubeContainerWaiting, KubeJobFailed, KubeDeploymentReplicasMismatch) to cut duplicate-namespace noise. We threw out real signal with the noise: pods stuck in ImagePullBackOff, CrashLoopBackOff, or Init are now silent.

Currently silent failures:

westside-ai-assistant-8586c7c767-7xv6c — ImagePullBackOff for 27 days, no alert
default/basketball-api-65f46d6ddd-5gm4s — Init:0/1 for 20 days, no alert

Repro Steps

kubectl get pods -A | grep -E 'ImagePullBackOff|CrashLoopBackOff|Init:' → 2+ results
kubectl exec -n monitoring prometheus-... -- wget -qO- 'http://localhost:9090/api/v1/alerts' → no Kube* alerts firing
Confirm rule group disabled: kubectl get prometheusrule -n monitoring | grep kubernetes-apps → empty

Expected Behavior

A custom rule (or a tightly-scoped re-enable of a subset of the helm defaults) covers ImagePullBackOff / CrashLoopBackOff lasting longer than 15m, without re-introducing the cross-namespace duplicate flapping that motivated the original disable.

Environment

Cluster: pal-e, namespace monitoring
File: terraform/modules/monitoring/main.tf, kube-prometheus-stack-platform-alerts PrometheusRule, pod-health group
Helm release: kube-prometheus-stack with defaultRules.rules.kubernetesApps = false

Acceptance Criteria

At least one rule fires when a pod is in ImagePullBackOff for >15m (severity warning)
At least one rule fires when a pod is CrashLoopBackOff for >15m (severity warning)
Severity escalates to critical at 1h
No duplicate-namespace flapping — verified by running up{...} and kube_pod_container_status_waiting_reason{...} against current cluster state and confirming only the two known-bad pods would fire
Optional scoping: limit to known-good namespaces via label selector

pal-e-platform — project
forgejo_admin/pal-e-platform #290 — origin of the disable
alert-report-2026-05-01 — alert snapshot

### Type Bug ### Lineage Standalone — discovered 2026-05-01 during alert-state audit. Caused by `forgejo_admin/pal-e-platform #290`. ### Repo `forgejo_admin/pal-e-platform` ### What Broke PR #290 disabled the entire `kubernetesApps` default rule family (`KubePodNotReady`, `KubeContainerWaiting`, `KubeJobFailed`, `KubeDeploymentReplicasMismatch`) to cut duplicate-namespace noise. We threw out real signal with the noise: pods stuck in `ImagePullBackOff`, `CrashLoopBackOff`, or `Init` are now silent. Currently silent failures: - `westside-ai-assistant-8586c7c767-7xv6c` — `ImagePullBackOff` for 27 days, no alert - `default/basketball-api-65f46d6ddd-5gm4s` — `Init:0/1` for 20 days, no alert ### Repro Steps 1. `kubectl get pods -A | grep -E 'ImagePullBackOff|CrashLoopBackOff|Init:'` → 2+ results 2. `kubectl exec -n monitoring prometheus-... -- wget -qO- 'http://localhost:9090/api/v1/alerts'` → no `Kube*` alerts firing 3. Confirm rule group disabled: `kubectl get prometheusrule -n monitoring | grep kubernetes-apps` → empty ### Expected Behavior A custom rule (or a tightly-scoped re-enable of a subset of the helm defaults) covers `ImagePullBackOff` / `CrashLoopBackOff` lasting longer than 15m, without re-introducing the cross-namespace duplicate flapping that motivated the original disable. ### Environment - Cluster: pal-e, namespace `monitoring` - File: `terraform/modules/monitoring/main.tf`, `kube-prometheus-stack-platform-alerts` PrometheusRule, `pod-health` group - Helm release: `kube-prometheus-stack` with `defaultRules.rules.kubernetesApps = false` ### Acceptance Criteria - [ ] At least one rule fires when a pod is in `ImagePullBackOff` for >15m (severity warning) - [ ] At least one rule fires when a pod is `CrashLoopBackOff` for >15m (severity warning) - [ ] Severity escalates to critical at 1h - [ ] No duplicate-namespace flapping — verified by running `up{...}` and `kube_pod_container_status_waiting_reason{...}` against current cluster state and confirming only the two known-bad pods would fire - [ ] Optional scoping: limit to known-good namespaces via label selector ### Related - `pal-e-platform` — project - `forgejo_admin/pal-e-platform #290` — origin of the disable - `alert-report-2026-05-01` — alert snapshot