todo: alerting rules for pod CrashLoopBackOff and downtime #57

Closed
opened 2026-03-13 15:20:24 +00:00 by forgejo_admin · 3 comments

$NEW_BODY

$NEW_BODY
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-70-2026-03-27

Critical repo placement issue and several template gaps:

  • WRONG REPO: Issue filed on basketball-api but all work is in pal-e-platform. PrometheusRule CRDs live in terraform/modules/monitoring/main.tf (existing pattern at lines 387+ and 684+). Should be re-filed or repo field updated.
  • Missing ### Type header: Should be Feature.
  • Vague file targets: Replace with terraform/modules/monitoring/main.tf -- add PrometheusRule resource following existing blackbox_alerts pattern.
  • Stale info: "notification channel TBD" -- Telegram is already configured via Alertmanager.
  • Potential duplicate: kube-prometheus-stack ships with KubePodCrashLooping alert by default. Verify whether it's already active and just not routing, before creating a custom rule.
  • Scope is platform-wide: A CrashLoopBackOff alert rule would fire for ALL pods, not just basketball-api. This is better, but the ticket should acknowledge it.
  • Board labels missing: Needs type:feature, arch:monitoring, repo:pal-e-platform.
## Scope Review: NEEDS_REFINEMENT Review note: `review-70-2026-03-27` Critical repo placement issue and several template gaps: - **WRONG REPO:** Issue filed on basketball-api but all work is in **pal-e-platform**. PrometheusRule CRDs live in `terraform/modules/monitoring/main.tf` (existing pattern at lines 387+ and 684+). Should be re-filed or repo field updated. - **Missing `### Type` header:** Should be `Feature`. - **Vague file targets:** Replace with `terraform/modules/monitoring/main.tf` -- add PrometheusRule resource following existing `blackbox_alerts` pattern. - **Stale info:** "notification channel TBD" -- Telegram is already configured via Alertmanager. - **Potential duplicate:** kube-prometheus-stack ships with `KubePodCrashLooping` alert by default. Verify whether it's already active and just not routing, before creating a custom rule. - **Scope is platform-wide:** A CrashLoopBackOff alert rule would fire for ALL pods, not just basketball-api. This is better, but the ticket should acknowledge it. - **Board labels missing:** Needs type:feature, arch:monitoring, repo:pal-e-platform.
Author
Owner

Issue body updated per scope review corrections.

Issue body updated per scope review corrections.
Author
Owner

Superseded -- Closing

This issue requested alerting rules for CrashLoopBackOff and downtime. All requested alerting already exists in pal-e-platform/terraform/modules/monitoring/main.tf:

  • PodRestartStorm (line 120) -- Fires on >3 restarts in 15 minutes. Covers the CrashLoopBackOff detection use case.
  • OOMKilled (line 131) -- Fires when a container is OOMKilled.
  • EndpointDown (line 405) -- Blackbox probe fires on probe_success == 0 for >2 minutes. Covers the downtime detection use case.
  • EndpointSlowResponse -- Fires on probe_duration_seconds > 5s for >5 minutes.
  • KubePodCrashLooping -- Built-in kube-prometheus-stack rule (defaultRules.create = true by default).

Notification routing is fully configured: Telegram (primary) + Slack (secondary) via Alertmanager.

The scope review (review-70-2026-03-27) also flagged this as:

  • WRONG REPO -- All monitoring work lives in pal-e-platform, not basketball-api.
  • Potential duplicate of built-in KubePodCrashLooping rule.
  • Issue body corrupted by the $NEW_BODY bug (session 2026-03-28).

Action: Removing board item #70 from board-westside-basketball and closing this issue as superseded.

## Superseded -- Closing This issue requested alerting rules for CrashLoopBackOff and downtime. All requested alerting already exists in `pal-e-platform/terraform/modules/monitoring/main.tf`: - **`PodRestartStorm`** (line 120) -- Fires on >3 restarts in 15 minutes. Covers the CrashLoopBackOff detection use case. - **`OOMKilled`** (line 131) -- Fires when a container is OOMKilled. - **`EndpointDown`** (line 405) -- Blackbox probe fires on `probe_success == 0` for >2 minutes. Covers the downtime detection use case. - **`EndpointSlowResponse`** -- Fires on probe_duration_seconds > 5s for >5 minutes. - **`KubePodCrashLooping`** -- Built-in kube-prometheus-stack rule (defaultRules.create = true by default). Notification routing is fully configured: Telegram (primary) + Slack (secondary) via Alertmanager. The scope review (`review-70-2026-03-27`) also flagged this as: - **WRONG REPO** -- All monitoring work lives in pal-e-platform, not basketball-api. - Potential duplicate of built-in `KubePodCrashLooping` rule. - Issue body corrupted by the `$NEW_BODY` bug (session 2026-03-28). **Action:** Removing board item #70 from board-westside-basketball and closing this issue as superseded.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
forgejo_admin/basketball-api#57
No description provided.