Disable default kube-prometheus-stack alerting rules #408

Merged
ldraney merged 1 commit from 95-disable-default-alert-rules into main 2026-06-05 03:28:25 +00:00

Summary

  • Disable all 17 default alerting rule groups (~95 rules) that are noise for a solo-dev cluster
  • Keep recording rules intact (dashboards depend on them)
  • Add Watchdog heartbeat to custom platform-alerts

Changes

  • terraform/modules/monitoring/main.tf: Set all alerting rule toggles to false in defaultRules.rules, add Watchdog rule group to additionalPrometheusRules

Test Plan

  • terraform plan shows only PrometheusRule changes
  • After apply: alert rule count drops from ~123 to ~30
  • Watchdog still fires in AlertManager
  • Grafana dashboards still load (recording rules preserved)
  • Custom PrometheusRules unaffected (blackbox, embedding, payment-pipeline, gmail-oauth)

Review Checklist

  • No secrets committed
  • No unnecessary file changes
  • Commit messages are descriptive

Related Notes

  • ldraney/landscaping-assistant #95 — parent tracking issue
  • ldraney/pal-e-platform #407 — platform-side issue
  • ldraney/landscaping-assistant #17 — follow-up: app-specific PrometheusRule alerts

Closes #407

Disable default kube-prometheus-stack alerting rules, add Watchdog
All checks were successful
ci/woodpecker/push/terraform Pipeline was successful
ci/woodpecker/pr/terraform Pipeline was successful
ci/woodpecker/pull_request_closed/terraform Pipeline was successful
38106c8ee1
Disable all 17 default alerting rule groups (~95 rules) via individual
defaultRules.rules toggles. These are designed for multi-team k8s
operations and are noise for a solo-dev cluster. Recording rules
preserved (create = true) since Grafana dashboards depend on them.

Add Watchdog heartbeat to custom platform-alerts to verify the
alerting pipeline stays functional (was previously in the now-disabled
general rules group).

Custom PrometheusRule resources (blackbox, embedding, payment-pipeline,
gmail-oauth) are unaffected — they're separate Terraform resources.

Closes ldraney/landscaping-assistant#95
Refs ldraney/pal-e-platform#407

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

PR #408 Review

DOMAIN REVIEW

Tech stack: Terraform (HCL) -- Helm release configuration for kube-prometheus-stack. Domain checklist: Terraform style, k8s alerting patterns, Helm values correctness.

defaultRules toggle approach: The pattern of setting create = true (preserving recording rules for Grafana dashboards) while individually toggling alerting groups to false is the correct upstream approach. The kube-prometheus-stack chart uses per-group boolean toggles under defaultRules.rules, and this PR uses them as designed. Good.
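A minimal sketch of the toggle pattern described above, assuming the chart is deployed through a `helm_release` resource whose values are passed via `yamlencode`. The resource name, namespace, and the two groups shown are illustrative assumptions; the PR's actual list of 17 toggles is not reproduced here.

```hcl
# Hypothetical sketch: resource name and namespace are assumptions, not taken from the PR.
resource "helm_release" "kube_prometheus_stack" {
  name       = "kube-prometheus-stack"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = "monitoring"

  values = [yamlencode({
    defaultRules = {
      # Leave rule creation on so the recording rules are still rendered.
      create = true

      rules = {
        # Alerting groups are switched off one at a time.
        general              = false # the upstream Watchdog lives in this group
        nodeExporterAlerting = false
        # (remaining alerting toggles omitted from this sketch)

        # Recording-only groups are not listed, so they keep the chart default:
        # kubePrometheusNodeRecording, kubeSchedulerRecording, nodeExporterRecording
      }
    }
  })]
}
```

Leaving the recording groups unlisted rather than setting them to `true` keeps the diff limited to the alerting toggles, at the cost of relying on the chart defaults.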

Watchdog heartbeat replacement: The original Watchdog lives in the general rule group which is now disabled. Adding a replacement Watchdog with vector(1) and severity = "none" to platform-alerts is correct. This preserves AlertManager dead-man's-switch functionality. The expression, labels, and annotations all match the upstream Watchdog pattern.
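For illustration, a replacement Watchdog along these lines could be expressed as the Helm values fragment below. This is a sketch, not the PR's actual code: the `locals` wrapper, the group name `watchdog`, and the annotation text are assumptions; only the alert name, `vector(1)`, `severity = "none"`, and the `platform-alerts` rule name come from the discussion above.

```hcl
locals {
  # Hypothetical values fragment; merge into the chart values before yamlencode().
  watchdog_values = {
    additionalPrometheusRules = [{
      name = "platform-alerts"
      groups = [{
        name = "watchdog" # assumed group name
        rules = [{
          alert = "Watchdog"
          # vector(1) always returns a sample, so the alert is always firing.
          expr = "vector(1)"
          # No "for" clause: the alert fires on the first evaluation.
          labels = {
            severity = "none"
          }
          annotations = {
            # Assumed wording, not copied from the upstream rule.
            summary = "Heartbeat alert; it should always be firing while the alerting pipeline works."
          }
        }]
      }]
    }]
  }
}
```

If the alert ever stops arriving at AlertManager, that absence is the signal that Prometheus or the alert route is broken.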

Potential gap -- missing alerting groups: The PR disables 17 groups but the kube-prometheus-stack chart (v55+) defines additional alerting rule groups that are not toggled off here:

| Group key | Has alerting rules? | Disabled in PR? | Notes |
|-----------|---------------------|-----------------|-------|
| kubeApiserverAvailability | Yes | No | API server availability alerts |
| kubeApiserverBurnrate | Yes | No | Burn-rate alerts for API server error budget |
| kubeApiserverHistogram | Yes | No | API server latency alerts |
| node | Yes | No | Distinct from nodeExporterAlerting -- contains NodeRAIDDegraded, NodeClockNotSynchronising, etc. |
| etcd | Yes | No | Likely moot if etcd component disabled |
| kubeControllerManager | Possibly | No | Likely moot if component disabled |
| kubeSchedulerAlerting | Possibly | No | Moot -- scheduler component is disabled in context |

The first four (kubeApiserverAvailability, kubeApiserverBurnrate, kubeApiserverHistogram, node) may still generate alerting rules since their parent components are not explicitly disabled. If the goal is truly "disable ALL default alerting rules," these are gaps. Worth verifying whether these fire in the current cluster -- if they do, they should be added; if they don't (e.g., because the API server metrics are not exposed in this cluster setup), it is a non-issue but still worth toggling for explicitness.

Recording rule groups correctly left alone: kubePrometheusNodeRecording, kubeSchedulerRecording, nodeExporterRecording are recording-only groups and are correctly omitted from the disable list since create = true keeps them active for dashboard queries.

Comment quality: The inline comment explaining the rationale ("designed for multi-team k8s operations, noise for solo-dev cluster") is clear and provides good context for future readers. The cross-references to additionalPrometheusRules and separate TF resources are helpful.

BLOCKERS

None. No secrets, no security regressions, no unvalidated input. This is pure Helm values configuration.

NITS

  1. Potentially incomplete disable list: As noted above, kubeApiserverAvailability, kubeApiserverBurnrate, kubeApiserverHistogram, and node are not toggled to false. If any of these fire alerts in the current cluster, they represent noise the PR intends to eliminate. Consider adding them for completeness, or document why they are intentionally left enabled.

  2. Alphabetical ordering: The 17 toggles are alphabetically sorted -- good. If the missing groups are added, maintain this ordering.

  3. Watchdog for duration: The upstream Watchdog typically does not use a for clause (fires immediately), which matches this implementation. No issue, just confirming intentional alignment.

SOP COMPLIANCE

  • Branch named after issue -- 95-disable-default-alert-rules references the parent tracking issue (#95 in landscaping-assistant). Acceptable since this is a cross-repo change tracked by a parent issue. The pal-e-platform issue is #407.
  • PR body follows template -- Summary, Changes, Test Plan, Review Checklist, Related Notes all present.
  • Related section references parent tracking issue and follow-up issue.
  • No secrets committed -- pure HCL configuration changes.
  • No unnecessary file changes -- a single changed file, tightly scoped.
  • Commit message is descriptive -- not visible in diff but PR title is clear.

PROCESS OBSERVATIONS

  • Change failure risk: Low. This is a configuration-only change to Helm values. The terraform plan test plan item is appropriate -- it should show only PrometheusRule resource updates, no destructive changes.
  • Rollback: Trivial -- revert the commit and re-apply. Recording rules and custom alerts are untouched.
  • Follow-up tracking: The PR correctly references landscaping-assistant #17 for app-specific PrometheusRule alerts, showing the noise reduction is part of a deliberate strategy (remove defaults, add targeted custom alerts).
  • Test plan coverage: The 5-item test plan covers the key verification points. Watchdog firing confirmation and Grafana dashboard validation are the critical checks.

VERDICT: APPROVED

Clean, well-scoped change. The potential gap in 4 additional alerting rule groups (kubeApiserverAvailability, kubeApiserverBurnrate, kubeApiserverHistogram, node) is a nit -- worth verifying but not a blocker since the PR achieves its stated goal of eliminating the primary noise sources.

ldraney deleted branch 95-disable-default-alert-rules 2026-06-05 03:28:25 +00:00