Time-window MacAgentDown alert (laptop offline outside work hours is expected) #326

Open
opened 2026-05-02 14:51:30 +00:00 by forgejo_admin · 0 comments
Contributor

Type

Bug

Lineage

Standalone — discovered 2026-05-01 during alert-state audit.

Repo

forgejo_admin/pal-e-platform

What Broke

Three alerts fire continuously for a single offline laptop, which trains on-call to ignore Mac alerts. That is dangerous because the same alerts will be ignored when the Mac is genuinely broken (Apple developer enrollment lapse, cert expiry, dead disk). Currently firing:

  • MacAgentDown (critical) firing 16d
  • TargetDown (warning, our custom rule, instance=lucass-macbook-air-1) firing 16d
  • TargetDown (warning, helm-default aggregate) firing 34d

The Mac is a personal laptop and is expected to be offline outside ~8am–9pm MST weekdays.

Repro Steps

  1. Close laptop overnight.
  2. Five minutes later: 3 alerts firing for the same root cause.
  3. Repeat every weekday for 16+ days.

Expected Behavior

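  • Alerts fire only during expected-online windows (proposed: 8am–9pm MST Monday–Friday, i.e. 15:00–04:00 UTC; the window crosses midnight UTC, so it runs from Monday 15:00 UTC to Saturday 04:00 UTC with a daily 04:00–15:00 UTC gap), or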
  • Alerts route to a non-paging Alertmanager receiver outside those windows, or
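  • Alerts have for: 6h so transient offlines never trigger. Note that the weekday overnight gap (9pm–8am MST) is 11 hours, so for: 6h alone would not cover a full night and would delay work-hours detection well past 5 minutes.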
  • The redundant TargetDown (helm-default vs our custom version) is de-duplicated.
  • The rule still fires promptly (within 5 minutes, matching the acceptance criteria) when the Mac is broken during work hours.
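The proposed window is easy to get wrong because it crosses midnight UTC: Friday evening MST falls on Saturday UTC. The sketch below is one way to express the window as a predicate, using the same conventions as PromQL's hour() and day_of_week() (UTC, with 0 = Sunday). The function names are hypothetical, and it assumes MST means a fixed UTC-7 offset with no daylight saving; if the laptop follows MDT in summer, the bounds shift by an hour.

```python
from datetime import datetime, timezone


def promql_day_of_week(ts: datetime) -> int:
    """Day of week as PromQL's day_of_week() reports it: 0 = Sunday .. 6 = Saturday."""
    # Python's weekday() uses 0 = Monday, so shift by one and wrap.
    return (ts.weekday() + 1) % 7


def in_online_window(ts: datetime) -> bool:
    """True if ts (timezone-aware) falls inside 8am-9pm MST, Monday-Friday.

    Assumes MST is a fixed UTC-7 offset, so the window is 15:00-04:00 UTC.
    """
    ts = ts.astimezone(timezone.utc)
    dow = promql_day_of_week(ts)
    # 15:00-23:59 UTC on Mon-Fri is 8am-4:59pm MST on the same weekday.
    same_day_part = ts.hour >= 15 and 1 <= dow <= 5
    # 00:00-03:59 UTC on Tue-Sat is 5pm-8:59pm MST on the previous weekday.
    after_midnight_part = ts.hour < 4 and 2 <= dow <= 6
    return same_day_part or after_midnight_part


if __name__ == "__main__":
    samples = [
        ("Mon 15:00 UTC (Mon 8am MST)", datetime(2026, 5, 4, 15, 0, tzinfo=timezone.utc)),
        ("Sat 03:59 UTC (Fri 8:59pm MST)", datetime(2026, 5, 9, 3, 59, tzinfo=timezone.utc)),
        ("Sat 04:00 UTC (Fri 9pm MST)", datetime(2026, 5, 9, 4, 0, tzinfo=timezone.utc)),
        ("Mon 03:00 UTC (Sun 8pm MST)", datetime(2026, 5, 4, 3, 0, tzinfo=timezone.utc)),
    ]
    for label, ts in samples:
        print(f"{label}: in window = {in_online_window(ts)}")
```

A naive filter of hour() >= 15 or hour() < 4 combined with day_of_week() between 1 and 5 would wrongly include Sunday evening MST (Monday 00:00–04:00 UTC) and drop Friday evening MST, which is why the two halves carry different day ranges.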

Environment

  • Cluster: pal-e, namespace monitoring
  • File: terraform/modules/monitoring/main.tf, kube-prometheus-stack-platform-alerts PrometheusRule, mac-agent-health group
  • Pattern reference: WebhookStale already uses hour()/day_of_week() filters
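As a rough illustration of the hour()/day_of_week() pattern, the sketch below wraps an arbitrary alert expression in a work-hours gate and prints the resulting PromQL. The helper name and the up{job="mac-agent"} selector are hypothetical placeholders; the actual MacAgentDown and WebhookStale expressions are not shown in this issue and may be structured differently. It assumes the 15:00–04:00 UTC window proposed above.

```python
def gate_to_work_hours(alert_expr: str) -> str:
    """Wrap a PromQL alert expression so it only returns series in the window.

    hour() and day_of_week() evaluate in UTC and return a single label-less
    sample; day_of_week() uses 0 = Sunday. The `and on()` join keeps every
    series from alert_expr whenever the window side is non-empty.
    """
    window = (
        # Mon-Fri 15:00-23:59 UTC (8am-4:59pm MST, same weekday)
        "(hour() >= 15 and day_of_week() >= 1 and day_of_week() <= 5)"
        " or "
        # Tue-Sat 00:00-03:59 UTC (5pm-8:59pm MST, previous weekday)
        "(hour() < 4 and day_of_week() >= 2 and day_of_week() <= 6)"
    )
    return f"({alert_expr}) and on() ({window})"


if __name__ == "__main__":
    # Hypothetical selector; substitute the real MacAgentDown expression.
    print(gate_to_work_hours('up{job="mac-agent"} == 0'))
```

Because the gated expression returns nothing outside the window, any pending for: timer resets there, so an alert that starts overnight does not carry over into the morning already satisfied.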

Acceptance Criteria

  • MacAgentDown does not fire outside work hours when laptop is closed
  • MacAgentDown still fires within 5 minutes when laptop is broken during work hours
  • Redundant TargetDown (custom or helm-default, pick one) is removed
  • PR description includes the test PromQL that proves both behaviors

Related

  • pal-e-platform: project
  • alert-report-2026-05-01: alert snapshot
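The two MacAgentDown acceptance criteria can be sanity-checked offline before writing the test PromQL. The following is a minimal simulation, not the rule itself: it assumes a for: duration of 5 minutes, a hypothetical 1-minute evaluation interval, the proposed 15:00–04:00 UTC window, and a fixed UTC-7 offset for MST. All names are illustrative.

```python
from datetime import datetime, timedelta, timezone

FOR_DURATION = timedelta(minutes=5)  # assumed `for:` value (the 5-minute criterion)
STEP = timedelta(minutes=1)          # hypothetical rule evaluation interval


def in_window(ts: datetime) -> bool:
    """Proposed window in UTC, using PromQL's day_of_week() numbering (0 = Sunday)."""
    dow = (ts.weekday() + 1) % 7
    return (ts.hour >= 15 and 1 <= dow <= 5) or (ts.hour < 4 and 2 <= dow <= 6)


def first_fire(down_from: datetime, down_until: datetime):
    """Return the first evaluation time at which the gated alert would fire.

    The target is treated as down on [down_from, down_until). The alert is
    pending while the gated expression keeps returning a result, and the
    pending timer resets whenever the window closes. Returns None if the
    alert never fires.
    """
    pending_since = None
    ts = down_from
    while ts < down_until:
        if in_window(ts):
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= FOR_DURATION:
                return ts
        else:
            pending_since = None
        ts += STEP
    return None


if __name__ == "__main__":
    utc = timezone.utc
    # Laptop closed from Mon 9pm MST until Tue 8am MST.
    overnight = first_fire(datetime(2026, 5, 5, 4, 0, tzinfo=utc),
                           datetime(2026, 5, 5, 15, 0, tzinfo=utc))
    # Mac breaks at Mon 11am MST and stays down for an hour.
    broken = first_fire(datetime(2026, 5, 4, 18, 0, tzinfo=utc),
                        datetime(2026, 5, 4, 19, 0, tzinfo=utc))
    print("overnight:", overnight)
    print("work hours:", broken)
```

One limitation the simulation makes visible: a laptop closed before 9pm MST, or still closed after 8am MST, is inside the window and would fire after 5 minutes.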
Reference
ldraney/pal-e-platform#326