Clean up AlertManager: disable default kube-prometheus-stack rules #95

Closed
opened 2026-06-04 12:12:28 +00:00 by ldraney · 0 comments

Type

Feature

Lineage

Standalone — discovered during observability audit session 2026-06-04. Prerequisite to #17.

Repo

ldraney/pal-e-platform

User Story

As the platform operator
I want to disable the ~95 default kube-prometheus-stack alert rules that are noise for a solo-dev cluster
So that AlertManager only fires for actionable, custom alerts I've written

Context

The cluster has 123 alert rules across 18 groups. ~95 are kube-prometheus-stack defaults designed for multi-team Kubernetes operations (alertmanager internals, kubelet health, 26 node-exporter rules, 23 Prometheus self-monitoring rules, API server SLOs). These never fire for real issues on this cluster and dilute attention from the ~28 custom rules that do.

Currently 11 alerts firing — all from custom rules or general health checks:

  • EndpointDown (2x): stale blackbox probes (platform-validation, playme2k)
  • GmailOAuthTokenExpired + ExpiringSoon: basketball-api, 86d old token
  • MacAgentDown + TargetDown (3x): Mac laptop offline
  • OOMKilled (3x): argocd-image-updater, woodpecker-db, harbor-portal-proxy
  • Watchdog: heartbeat (expected)

File Targets

Files the agent should modify:

  • pal-e-platform: kube-prometheus-stack Helm values — set defaultRules.create: false or selectively disable rule groups
  • pal-e-platform: review/remove stale blackbox probes for platform-validation and playme2k
  • pal-e-platform: review OOMKilled pods — raise memory limits or investigate root cause
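
Because the Constraints require Watchdog to keep firing, selectively disabling rule groups may be safer than a blanket `defaultRules.create: false`. In upstream kube-prometheus the Watchdog rule is understood to ship in the `general` group (alongside TargetDown), so that group would need to stay on unless this cluster defines its own Watchdog. A minimal sketch of generating such an override follows. The group keys in `KNOWN_GROUPS` and the helper `build_values` are illustrative assumptions, not taken from the repo; key names under `defaultRules.rules` vary between chart versions and should be checked against the installed chart's values.yaml.

```python
import json

# Assumption: these keys mirror entries under defaultRules.rules in the
# installed chart's values.yaml. The list is illustrative and incomplete;
# any group not listed here stays at the chart's own default.
KNOWN_GROUPS = [
    "alertmanager",
    "general",
    "kubelet",
    "kubeApiserverSlos",
    "nodeExporterAlerting",
    "prometheus",
]


def build_values(keep, known_groups=KNOWN_GROUPS):
    """Return a values override that enables only the groups in `keep`."""
    unknown = set(keep) - set(known_groups)
    if unknown:
        raise ValueError(f"unknown rule groups: {sorted(unknown)}")
    return {
        "defaultRules": {
            # Rule creation stays on so the kept groups are still rendered.
            "create": True,
            "rules": {name: name in keep for name in known_groups},
        }
    }


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be saved and passed to `helm -f`.
    print(json.dumps(build_values(keep={"general"}), indent=2))
```

Keeping `create: true` and switching groups off one by one also makes the change easy to review: the diff lists exactly which default groups were dropped.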

Files the agent should NOT touch:

  • Custom PrometheusRule Terraform resources (blackbox-alerts, embedding-alerts, gmail-oauth, platform-alerts, payment-pipeline-alerts) — these stay
  • Anything in landscaping-assistant repo — app-specific alerts are #17

Acceptance Criteria

  • Default kube-prometheus-stack alert rules disabled (rule count drops from 123 to ~28)
  • Custom PrometheusRules still active and functioning
  • Stale EndpointDown alerts resolved (probes removed or endpoints fixed)
  • OOMKilled pods triaged (limits raised or root cause documented)
  • AlertManager shows only actionable alerts after cleanup

Test Expectations

  • kubectl get prometheusrules -n monitoring shows only custom rules
  • Prometheus /api/v1/rules alert count is ~28 (custom only)
  • Watchdog alert still fires (proves pipeline is healthy)
  • No regression in Telegram/Slack notification delivery
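
The rule-count and Watchdog expectations above could be checked with a short script. This is a minimal sketch, assuming Prometheus is reachable at `http://localhost:9090` (for example through a port-forward); the URL and the helper names `fetch_rules` and `summarize_alert_rules` are illustrative, not taken from the repo. The response shape follows the Prometheus HTTP API for `/api/v1/rules`.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Assumption: Prometheus is port-forwarded locally; adjust to the real address.
PROMETHEUS_URL = "http://localhost:9090"


def fetch_rules(base_url=PROMETHEUS_URL):
    """Fetch the raw /api/v1/rules payload from Prometheus."""
    with urlopen(f"{base_url}/api/v1/rules", timeout=10) as resp:
        return json.load(resp)


def summarize_alert_rules(payload):
    """Count alerting rules per group, ignoring recording rules."""
    per_group = Counter()
    watchdog_firing = False
    for group in payload["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("type") != "alerting":
                continue
            per_group[group["name"]] += 1
            if rule.get("name") == "Watchdog" and rule.get("state") == "firing":
                watchdog_firing = True
    return {
        "total": sum(per_group.values()),
        "per_group": dict(per_group),
        "watchdog_firing": watchdog_firing,
    }


if __name__ == "__main__":
    summary = summarize_alert_rules(fetch_rules())
    print(json.dumps(summary, indent=2, sort_keys=True))
```

After cleanup, `total` should land near 28 and `watchdog_firing` should be true; the per-group breakdown shows which default groups, if any, are still loaded.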

Constraints

  • Do NOT delete custom PrometheusRule resources — only disable the Helm-managed defaults
  • Keep Watchdog alert as the pipeline heartbeat
  • MacAgentDown: silence rather than delete if Mac agent will come back online

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • #17 — Add PrometheusRule alerts for error rate, latency, and availability (next step after cleanup)
  • #43 — Observability & DORA metrics stack [PARENT]