Disable default kube-prometheus-stack alerting rules

ldraney commented

2026-06-04 12:40:44 +00:00

Author

Owner

PR #408 Review

DOMAIN REVIEW

Tech stack: Terraform (HCL) -- Helm release configuration for kube-prometheus-stack. Domain checklist: Terraform style, k8s alerting patterns, Helm values correctness.

defaultRules toggle approach: The pattern of setting create = true (preserving recording rules for Grafana dashboards) while individually toggling alerting groups to false is the correct upstream approach. The kube-prometheus-stack chart uses per-group boolean toggles under defaultRules.rules, and this PR uses them as designed. Good.

Watchdog heartbeat replacement: The original Watchdog lives in the general rule group which is now disabled. Adding a replacement Watchdog with vector(1) and severity = "none" to platform-alerts is correct. This preserves AlertManager dead-man's-switch functionality. The expression, labels, and annotations all match the upstream Watchdog pattern.

Potential gap -- missing alerting groups: The PR disables 17 groups but the kube-prometheus-stack chart (v55+) defines additional alerting rule groups that are not toggled off here:

Group key	Has alerting rules?	Disabled in PR?	Notes
`kubeApiserverAvailability`	Yes	No	API server availability alerts
`kubeApiserverBurnrate`	Yes	No	Burn-rate alerts for API server error budget
`kubeApiserverHistogram`	Yes	No	API server latency alerts
`node`	Yes	No	Distinct from `nodeExporterAlerting` -- contains NodeRAIDDegraded, NodeClockNotSynchronising, etc.
`etcd`	Yes	No	Likely moot if etcd component disabled
`kubeControllerManager`	Possibly	No	Likely moot if component disabled
`kubeSchedulerAlerting`	Possibly	No	Moot -- scheduler component is disabled in context

The first four (kubeApiserverAvailability, kubeApiserverBurnrate, kubeApiserverHistogram, node) may still generate alerting rules since their parent components are not explicitly disabled. If the goal is truly "disable ALL default alerting rules," these are gaps. Worth verifying whether these fire in the current cluster -- if they do, they should be added; if they don't (e.g., because the API server metrics are not exposed in this cluster setup), it is a non-issue but still worth toggling for explicitness.

Recording rule groups correctly left alone: kubePrometheusNodeRecording, kubeSchedulerRecording, nodeExporterRecording are recording-only groups and are correctly omitted from the disable list since create = true keeps them active for dashboard queries.

Comment quality: The inline comment explaining the rationale ("designed for multi-team k8s operations, noise for solo-dev cluster") is clear and provides good context for future readers. The cross-references to additionalPrometheusRules and separate TF resources are helpful.

BLOCKERS

None. No secrets, no security regressions, no unvalidated input. This is pure Helm values configuration.

NITS

Potentially incomplete disable list: As noted above, kubeApiserverAvailability, kubeApiserverBurnrate, kubeApiserverHistogram, and node are not toggled to false. If any of these fire alerts in the current cluster, they represent noise the PR intends to eliminate. Consider adding them for completeness, or document why they are intentionally left enabled.
Alphabetical ordering: The 17 toggles are alphabetically sorted -- good. If the missing groups are added, maintain this ordering.
Watchdog for duration: The upstream Watchdog typically does not use a for clause (fires immediately), which matches this implementation. No issue, just confirming intentional alignment.

SOP COMPLIANCE

Branch named after issue -- 95-disable-default-alert-rules references the parent tracking issue (#95 in landscaping-assistant). Acceptable since this is a cross-repo change tracked by a parent issue. The pal-e-platform issue is #407.
PR body follows template -- Summary, Changes, Test Plan, Review Checklist, Related Notes all present.
Related section references parent tracking issue and follow-up issue.
No secrets committed -- pure HCL configuration changes.
No unnecessary file changes -- single file, 1 changed file, tightly scoped.
Commit message is descriptive -- not visible in diff but PR title is clear.

PROCESS OBSERVATIONS

Change failure risk: Low. This is a configuration-only change to Helm values. The terraform plan test plan item is appropriate -- it should show only PrometheusRule resource updates, no destructive changes.
Rollback: Trivial -- revert the commit and re-apply. Recording rules and custom alerts are untouched.
Follow-up tracking: The PR correctly references landscaping-assistant #17 for app-specific PrometheusRule alerts, showing the noise reduction is part of a deliberate strategy (remove defaults, add targeted custom alerts).
Test plan coverage: The 5-item test plan covers the key verification points. Watchdog firing confirmation and Grafana dashboard validation are the critical checks.

VERDICT: APPROVED

Clean, well-scoped change. The potential gap in 4 additional alerting rule groups (kubeApiserverAvailability, kubeApiserverBurnrate, kubeApiserverHistogram, node) is a nit -- worth verifying but not a blocker since the PR achieves its stated goal of eliminating the primary noise sources.

## PR #408 Review ### DOMAIN REVIEW **Tech stack**: Terraform (HCL) -- Helm release configuration for kube-prometheus-stack. Domain checklist: Terraform style, k8s alerting patterns, Helm values correctness. **defaultRules toggle approach**: The pattern of setting `create = true` (preserving recording rules for Grafana dashboards) while individually toggling alerting groups to `false` is the correct upstream approach. The kube-prometheus-stack chart uses per-group boolean toggles under `defaultRules.rules`, and this PR uses them as designed. Good. **Watchdog heartbeat replacement**: The original Watchdog lives in the `general` rule group which is now disabled. Adding a replacement Watchdog with `vector(1)` and `severity = "none"` to `platform-alerts` is correct. This preserves AlertManager dead-man's-switch functionality. The expression, labels, and annotations all match the upstream Watchdog pattern. **Potential gap -- missing alerting groups**: The PR disables 17 groups but the kube-prometheus-stack chart (v55+) defines additional alerting rule groups that are not toggled off here: | Group key | Has alerting rules? | Disabled in PR? | Notes | |-----------|-------------------|-----------------|-------| | `kubeApiserverAvailability` | Yes | No | API server availability alerts | | `kubeApiserverBurnrate` | Yes | No | Burn-rate alerts for API server error budget | | `kubeApiserverHistogram` | Yes | No | API server latency alerts | | `node` | Yes | No | Distinct from `nodeExporterAlerting` -- contains NodeRAIDDegraded, NodeClockNotSynchronising, etc. | | `etcd` | Yes | No | Likely moot if etcd component disabled | | `kubeControllerManager` | Possibly | No | Likely moot if component disabled | | `kubeSchedulerAlerting` | Possibly | No | Moot -- scheduler component is disabled in context | The first four (`kubeApiserverAvailability`, `kubeApiserverBurnrate`, `kubeApiserverHistogram`, `node`) may still generate alerting rules since their parent components are not explicitly disabled. If the goal is truly "disable ALL default alerting rules," these are gaps. Worth verifying whether these fire in the current cluster -- if they do, they should be added; if they don't (e.g., because the API server metrics are not exposed in this cluster setup), it is a non-issue but still worth toggling for explicitness. **Recording rule groups correctly left alone**: `kubePrometheusNodeRecording`, `kubeSchedulerRecording`, `nodeExporterRecording` are recording-only groups and are correctly omitted from the disable list since `create = true` keeps them active for dashboard queries. **Comment quality**: The inline comment explaining the rationale ("designed for multi-team k8s operations, noise for solo-dev cluster") is clear and provides good context for future readers. The cross-references to `additionalPrometheusRules` and separate TF resources are helpful. ### BLOCKERS None. No secrets, no security regressions, no unvalidated input. This is pure Helm values configuration. ### NITS 1. **Potentially incomplete disable list**: As noted above, `kubeApiserverAvailability`, `kubeApiserverBurnrate`, `kubeApiserverHistogram`, and `node` are not toggled to `false`. If any of these fire alerts in the current cluster, they represent noise the PR intends to eliminate. Consider adding them for completeness, or document why they are intentionally left enabled. 2. **Alphabetical ordering**: The 17 toggles are alphabetically sorted -- good. If the missing groups are added, maintain this ordering. 3. **Watchdog `for` duration**: The upstream Watchdog typically does not use a `for` clause (fires immediately), which matches this implementation. No issue, just confirming intentional alignment. ### SOP COMPLIANCE - [x] Branch named after issue -- `95-disable-default-alert-rules` references the parent tracking issue (#95 in landscaping-assistant). Acceptable since this is a cross-repo change tracked by a parent issue. The pal-e-platform issue is #407. - [x] PR body follows template -- Summary, Changes, Test Plan, Review Checklist, Related Notes all present. - [x] Related section references parent tracking issue and follow-up issue. - [x] No secrets committed -- pure HCL configuration changes. - [x] No unnecessary file changes -- single file, 1 changed file, tightly scoped. - [x] Commit message is descriptive -- not visible in diff but PR title is clear. ### PROCESS OBSERVATIONS - **Change failure risk**: Low. This is a configuration-only change to Helm values. The `terraform plan` test plan item is appropriate -- it should show only PrometheusRule resource updates, no destructive changes. - **Rollback**: Trivial -- revert the commit and re-apply. Recording rules and custom alerts are untouched. - **Follow-up tracking**: The PR correctly references `landscaping-assistant #17` for app-specific PrometheusRule alerts, showing the noise reduction is part of a deliberate strategy (remove defaults, add targeted custom alerts). - **Test plan coverage**: The 5-item test plan covers the key verification points. Watchdog firing confirmation and Grafana dashboard validation are the critical checks. ### VERDICT: APPROVED Clean, well-scoped change. The potential gap in 4 additional alerting rule groups (`kubeApiserverAvailability`, `kubeApiserverBurnrate`, `kubeApiserverHistogram`, `node`) is a nit -- worth verifying but not a blocker since the PR achieves its stated goal of eliminating the primary noise sources.

2026-06-04 12:39:18 +00:00

Summary

Disable all 17 default alerting rule groups (~95 rules) that are noise for a solo-dev cluster
Keep recording rules intact (dashboards depend on them)
Add Watchdog heartbeat to custom platform-alerts

Changes

terraform/modules/monitoring/main.tf: Set all alerting rule toggles to false in defaultRules.rules, add Watchdog rule group to additionalPrometheusRules

Test Plan

terraform plan shows only PrometheusRule changes
After apply: alert rule count drops from ~123 to ~30
Watchdog still fires in AlertManager
Grafana dashboards still load (recording rules preserved)
Custom PrometheusRules unaffected (blackbox, embedding, payment-pipeline, gmail-oauth)

Review Checklist

No secrets committed
No unnecessary file changes
Commit messages are descriptive

ldraney/landscaping-assistant #95 — parent tracking issue
ldraney/pal-e-platform #407 — platform-side issue
ldraney/landscaping-assistant #17 — follow-up: app-specific PrometheusRule alerts

Closes #407

## Summary - Disable all 17 default alerting rule groups (~95 rules) that are noise for a solo-dev cluster - Keep recording rules intact (dashboards depend on them) - Add Watchdog heartbeat to custom platform-alerts ## Changes - `terraform/modules/monitoring/main.tf`: Set all alerting rule toggles to `false` in `defaultRules.rules`, add Watchdog rule group to `additionalPrometheusRules` ## Test Plan - [ ] `terraform plan` shows only PrometheusRule changes - [ ] After apply: alert rule count drops from ~123 to ~30 - [ ] Watchdog still fires in AlertManager - [ ] Grafana dashboards still load (recording rules preserved) - [ ] Custom PrometheusRules unaffected (blackbox, embedding, payment-pipeline, gmail-oauth) ## Review Checklist - [ ] No secrets committed - [ ] No unnecessary file changes - [ ] Commit messages are descriptive ## Related Notes - `ldraney/landscaping-assistant #95` — parent tracking issue - `ldraney/pal-e-platform #407` — platform-side issue - `ldraney/landscaping-assistant #17` — follow-up: app-specific PrometheusRule alerts Closes #407

ldraney added 1 commit

Disable default kube-prometheus-stack alerting rules, add Watchdog

ci/woodpecker/push/terraform Pipeline was successful

Details

ci/woodpecker/pr/terraform Pipeline was successful

ci/woodpecker/pull_request_closed/terraform Pipeline was successful

38106c8ee1

Disable all 17 default alerting rule groups (~95 rules) via individual
defaultRules.rules toggles. These are designed for multi-team k8s
operations and are noise for a solo-dev cluster. Recording rules
preserved (create = true) since Grafana dashboards depend on them.

Add Watchdog heartbeat to custom platform-alerts to verify the
alerting pipeline stays functional (was previously in the now-disabled
general rules group).

Custom PrometheusRule resources (blackbox, embedding, payment-pipeline,
gmail-oauth) are unaffected — they're separate Terraform resources.

Closes ldraney/landscaping-assistant#95
Refs ldraney/pal-e-platform#407

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Group key	Has alerting rules?	Disabled in PR?	Notes
`kubeApiserverAvailability`	Yes	No	API server availability alerts
`kubeApiserverBurnrate`	Yes	No	Burn-rate alerts for API server error budget
`kubeApiserverHistogram`	Yes	No	API server latency alerts
`node`	Yes	No	Distinct from `nodeExporterAlerting` -- contains NodeRAIDDegraded, NodeClockNotSynchronising, etc.
`etcd`	Yes	No	Likely moot if etcd component disabled
`kubeControllerManager`	Possibly	No	Likely moot if component disabled
`kubeSchedulerAlerting`	Possibly	No	Moot -- scheduler component is disabled in context