Re-add pod-failure coverage after disabling kubernetesApps default rules #325
Reference
ldraney/pal-e-platform#325
Type
Bug
Lineage
Standalone: discovered 2026-05-01 during alert-state audit. Caused by forgejo_admin/pal-e-platform #290.

Repo
forgejo_admin/pal-e-platform

What Broke
PR #290 disabled the entire `kubernetesApps` default rule family (`KubePodNotReady`, `KubeContainerWaiting`, `KubeJobFailed`, `KubeDeploymentReplicasMismatch`) to cut duplicate-namespace noise. We threw out real signal with the noise: pods stuck in `ImagePullBackOff`, `CrashLoopBackOff`, or `Init` are now silent.

Currently silent failures:
- `westside-ai-assistant-8586c7c767-7xv6c`: `ImagePullBackOff` for 27 days, no alert
- `default/basketball-api-65f46d6ddd-5gm4s`: `Init:0/1` for 20 days, no alert

Repro Steps
1. `kubectl get pods -A | grep -E 'ImagePullBackOff|CrashLoopBackOff|Init:'` → 2+ results
2. `kubectl exec -n monitoring prometheus-... -- wget -qO- 'http://localhost:9090/api/v1/alerts'` → no `Kube*` alerts firing
3. `kubectl get prometheusrule -n monitoring | grep kubernetes-apps` → empty

Expected Behavior
A custom rule (or a tightly scoped re-enable of a subset of the Helm defaults) covers `ImagePullBackOff`/`CrashLoopBackOff` lasting longer than 15m, without re-introducing the cross-namespace duplicate flapping that motivated the original disable.

Environment
- `monitoring`
- `terraform/modules/monitoring/main.tf`, `kube-prometheus-stack-platform-alerts` PrometheusRule, `pod-health` group
- `kube-prometheus-stack` with `defaultRules.rules.kubernetesApps = false`

Acceptance Criteria
- Alert on `ImagePullBackOff` for >15m (severity warning)
- Alert on `CrashLoopBackOff` for >15m (severity warning)
- Rules validated by evaluating `up{...}` and `kube_pod_container_status_waiting_reason{...}` against current cluster state and confirming only the two known-bad pods would fire

Related
- `pal-e-platform`: project
- forgejo_admin/pal-e-platform #290: origin of the disable
- `alert-report-2026-05-01`: alert snapshot
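
As a starting point for the Expected Behavior and Acceptance Criteria above, here is a minimal sketch of what the two rules might look like as a plain PrometheusRule manifest. The resource name, namespace, and group name are taken from the Environment section. Everything else is an assumption rather than what the repo uses: the alert names, the `release` label, the 5m smoothing window, the inclusion of `ErrImagePull`, and the annotation text. The `max by (...)` aggregation is a guess at how to avoid the duplicate flapping, and only helps if the duplicates came from repeated series for the same container. The rule would still need to be translated into however `terraform/modules/monitoring/main.tf` defines the PrometheusRule.

```yaml
# Sketch only. Names marked "hypothetical" are illustrative, not from the repo.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-prometheus-stack-platform-alerts
  namespace: monitoring
  labels:
    # Assumption: the Prometheus ruleSelector matches on the Helm release label.
    release: kube-prometheus-stack
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodImagePullBackOff  # hypothetical alert name
          # max_over_time smooths over scrapes where the container briefly
          # reports ErrImagePull between back-off periods, so the 15m "for"
          # timer is not reset. max by (...) collapses duplicate series into
          # one alert per container.
          expr: |
            max by (namespace, pod, container) (
              max_over_time(
                kube_pod_container_status_waiting_reason{reason=~"ImagePullBackOff|ErrImagePull"}[5m]
              )
            ) >= 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) cannot pull its image for more than 15 minutes"
        - alert: PodCrashLoopBackOff  # hypothetical alert name
          expr: |
            max by (namespace, pod, container) (
              max_over_time(
                kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]
              )
            ) >= 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) has been in CrashLoopBackOff for more than 15 minutes"
```

For the dry-run criterion, each `expr` can be pasted into the Prometheus expression browser without the `for` clause to see which pods currently match. Note that kube-state-metrics reports init containers through a separate metric, `kube_pod_init_container_status_waiting_reason`, so a pod stuck at `Init:0/1` would likely not match these expressions unless a similar rule is added for that metric.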