todo: alerting rules for pod CrashLoopBackOff and downtime #57
Reference
forgejo_admin/basketball-api#57
$NEW_BODY
Scope Review: NEEDS_REFINEMENT
Review note: review-70-2026-03-27
Critical repo placement issue and several template gaps:
- terraform/modules/monitoring/main.tf (existing pattern at lines 387+ and 684+). Should be re-filed or repo field updated.
- ### Type header: Should be Feature.
- terraform/modules/monitoring/main.tf -- add PrometheusRule resource following existing blackbox_alerts pattern.
- KubePodCrashLooping alert by default. Verify whether it's already active and just not routing, before creating a custom rule.

Issue body updated per scope review corrections.
Superseded -- Closing
This issue requested alerting rules for CrashLoopBackOff and downtime. All requested alerting already exists in pal-e-platform/terraform/modules/monitoring/main.tf:
- PodRestartStorm (line 120) -- Fires on >3 restarts in 15 minutes. Covers the CrashLoopBackOff detection use case.
- OOMKilled (line 131) -- Fires when a container is OOMKilled.
- EndpointDown (line 405) -- Blackbox probe fires on probe_success == 0 for >2 minutes. Covers the downtime detection use case.
- EndpointSlowResponse -- Fires on probe_duration_seconds > 5s for >5 minutes.
- KubePodCrashLooping -- Built-in kube-prometheus-stack rule (defaultRules.create = true by default).

Notification routing is fully configured: Telegram (primary) + Slack (secondary) via Alertmanager.
The scope review (review-70-2026-03-27) also flagged this as:
- KubePodCrashLooping rule.
- $NEW_BODY bug (session 2026-03-28).

Action: Removing board item #70 from board-westside-basketball and closing this issue as superseded.
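
For readers without access to pal-e-platform, here is a rough sketch of how rules with the thresholds listed above might be declared as a PrometheusRule in Terraform. It is not the actual contents of main.tf. The resource name, namespace, group name, severity labels, and the exact PromQL expressions are assumptions. The metric names are the standard ones exported by kube-state-metrics and the blackbox exporter, and the real module may express the rules differently.

```hcl
# Hypothetical sketch only; names and expressions are illustrative assumptions,
# not copied from pal-e-platform/terraform/modules/monitoring/main.tf.
resource "kubernetes_manifest" "example_pod_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "example-pod-alerts" # assumed name
      namespace = "monitoring"         # assumed namespace
    }
    spec = {
      groups = [{
        name = "example.rules" # assumed group name
        rules = [
          {
            # More than 3 container restarts within a 15 minute window.
            alert  = "PodRestartStorm"
            expr   = "increase(kube_pod_container_status_restarts_total[15m]) > 3"
            for    = "0m"
            labels = { severity = "warning" }
          },
          {
            # Blackbox probe failing continuously for more than 2 minutes.
            alert  = "EndpointDown"
            expr   = "probe_success == 0"
            for    = "2m"
            labels = { severity = "critical" }
          },
          {
            # Probe taking longer than 5 seconds for more than 5 minutes.
            alert  = "EndpointSlowResponse"
            expr   = "probe_duration_seconds > 5"
            for    = "5m"
            labels = { severity = "warning" }
          },
        ]
      }]
    }
  }
}
```

Whether a rule like this fires is separate from whether it reaches Telegram or Slack. Routing is decided by the Alertmanager configuration, which is why the scope review asked to verify that KubePodCrashLooping was active but not routing before adding a custom rule.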