Disable default kube-prometheus-stack alerting rules #408
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
ldraney/pal-e-platform!408
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "95-disable-default-alert-rules"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Changes
terraform/modules/monitoring/main.tf: Set all alerting rule toggles tofalseindefaultRules.rules, add Watchdog rule group toadditionalPrometheusRulesTest Plan
terraform planshows only PrometheusRule changesReview Checklist
Related Notes
ldraney/landscaping-assistant #95— parent tracking issueldraney/pal-e-platform #407— platform-side issueldraney/landscaping-assistant #17— follow-up: app-specific PrometheusRule alertsCloses #407
PR #408 Review
DOMAIN REVIEW
Tech stack: Terraform (HCL) -- Helm release configuration for kube-prometheus-stack. Domain checklist: Terraform style, k8s alerting patterns, Helm values correctness.
defaultRules toggle approach: The pattern of setting
create = true(preserving recording rules for Grafana dashboards) while individually toggling alerting groups tofalseis the correct upstream approach. The kube-prometheus-stack chart uses per-group boolean toggles underdefaultRules.rules, and this PR uses them as designed. Good.Watchdog heartbeat replacement: The original Watchdog lives in the
generalrule group which is now disabled. Adding a replacement Watchdog withvector(1)andseverity = "none"toplatform-alertsis correct. This preserves AlertManager dead-man's-switch functionality. The expression, labels, and annotations all match the upstream Watchdog pattern.Potential gap -- missing alerting groups: The PR disables 17 groups but the kube-prometheus-stack chart (v55+) defines additional alerting rule groups that are not toggled off here:
kubeApiserverAvailabilitykubeApiserverBurnratekubeApiserverHistogramnodenodeExporterAlerting-- contains NodeRAIDDegraded, NodeClockNotSynchronising, etc.etcdkubeControllerManagerkubeSchedulerAlertingThe first four (
kubeApiserverAvailability,kubeApiserverBurnrate,kubeApiserverHistogram,node) may still generate alerting rules since their parent components are not explicitly disabled. If the goal is truly "disable ALL default alerting rules," these are gaps. Worth verifying whether these fire in the current cluster -- if they do, they should be added; if they don't (e.g., because the API server metrics are not exposed in this cluster setup), it is a non-issue but still worth toggling for explicitness.Recording rule groups correctly left alone:
kubePrometheusNodeRecording,kubeSchedulerRecording,nodeExporterRecordingare recording-only groups and are correctly omitted from the disable list sincecreate = truekeeps them active for dashboard queries.Comment quality: The inline comment explaining the rationale ("designed for multi-team k8s operations, noise for solo-dev cluster") is clear and provides good context for future readers. The cross-references to
additionalPrometheusRulesand separate TF resources are helpful.BLOCKERS
None. No secrets, no security regressions, no unvalidated input. This is pure Helm values configuration.
NITS
Potentially incomplete disable list: As noted above,
kubeApiserverAvailability,kubeApiserverBurnrate,kubeApiserverHistogram, andnodeare not toggled tofalse. If any of these fire alerts in the current cluster, they represent noise the PR intends to eliminate. Consider adding them for completeness, or document why they are intentionally left enabled.Alphabetical ordering: The 17 toggles are alphabetically sorted -- good. If the missing groups are added, maintain this ordering.
Watchdog
forduration: The upstream Watchdog typically does not use aforclause (fires immediately), which matches this implementation. No issue, just confirming intentional alignment.SOP COMPLIANCE
95-disable-default-alert-rulesreferences the parent tracking issue (#95 in landscaping-assistant). Acceptable since this is a cross-repo change tracked by a parent issue. The pal-e-platform issue is #407.PROCESS OBSERVATIONS
terraform plantest plan item is appropriate -- it should show only PrometheusRule resource updates, no destructive changes.landscaping-assistant #17for app-specific PrometheusRule alerts, showing the noise reduction is part of a deliberate strategy (remove defaults, add targeted custom alerts).VERDICT: APPROVED
Clean, well-scoped change. The potential gap in 4 additional alerting rule groups (
kubeApiserverAvailability,kubeApiserverBurnrate,kubeApiserverHistogram,node) is a nit -- worth verifying but not a blocker since the PR achieves its stated goal of eliminating the primary noise sources.