observability: reduce alert noise + add payment pipeline signals (#290) #329
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
ldraney/pal-e-platform!329
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "290-payment-pipeline-observability"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
WebhookErrorRate,WebhookStale).pal-e-app-golden-signalspattern.Changes
terraform/modules/monitoring/main.tf— dropwestside-dev,pal-e-app,mac-agentfrom blackboxtargets(all expected-down).terraform/modules/monitoring/main.tf—EndpointDown: critical → warning,for2m → 5m.terraform/modules/monitoring/main.tf— disabledefaultRules.kubeStateMetricsandkubernetesApps.terraform/modules/monitoring/main.tf— add Alertmanagerinhibit_rulesso critical suppresses warning onalertname+namespace.terraform/modules/monitoring/main.tf— newpayment-pipeline-alertsPrometheusRule withWebhookErrorRateandWebhookStale.terraform/modules/monitoring/main.tf— new ConfigMap wiringbasketball-api-golden-signals.jsonfor Grafana.terraform/dashboards/basketball-api-golden-signals.json— new dashboard (372 lines).terraform/specs/2026-04-10-westside-admin-design.md— removed (one-shot spec, no longer relevant).Test Plan
tofu fmt -recursive -checkcleantofu validate→Success! The configuration is valid.payment-pipeline-alertsPrometheusRule already deployed to cluster —WebhookStaleis firing as designed (~16:00 UTC 2026-05-01)basketball-api-golden-signalsConfigMap renders in GrafanaEndpointDownno longer fires for any expected-down targetinhibit_rulesconfig reload picked up the changeReview Checklist
salt/pillar/secrets/platform.slsadmin_app_db_password change is unrelated work, left out of this PR)Related Notes
Closes #290
forgejo_admin/pal-e-platform #290— issue this PR closespal-e-platform— projectalert-report-2026-05-01— full alert state snapshot at PR time, identifies follow-up tickets #322–#328 andpal-e-apiOOM ticketDiscovered Scope (separate tickets follow)
pal-e-platform #322— Gmail OAuth auto-reauth cron not running (P1, westside email broken)pal-e-platform #323— Fix Alertmanager inhibit rule for Gmail OAuth duplicatespal-e-platform #324— Add probes for westside-contracts, westside-email, westside-ai-assistantpal-e-platform #325— Re-add tighter pod-failure coverage post-disablepal-e-platform #326— Time-window MacAgentDownpal-e-platform #327— Raise argocd memory limitpal-e-platform #328— Westside-unified Grafana dashboardpal-e-api #TBD— pal-e-docs container OOMKilled debugAddresses the 31-firing-alerts fatigue identified during the Apr 13 checkout outage by removing expected-down blackbox targets, silencing kube-state-metrics default duplicates, and adding revenue-critical Stripe webhook alerts + a basketball-api golden signals dashboard. Changes: - Remove blackbox targets westside-dev, pal-e-app, mac-agent (all expected-down; generated critical EndpointDown alerts indistinguishable from real outages). Mac-agent uptime still covered by MacAgentDown. - Downgrade EndpointDown critical -> warning, raise "for" 2m -> 5m to suppress flaps. - Disable kube-prometheus-stack defaultRules.kubeStateMetrics and kubernetesApps (duplicate namespace noise). Our PodRestartStorm, OOMKilled, TargetDown rules provide the real signal. - Add alertmanager inhibit rule so critical alerts suppress their warning counterparts on the same alertname+namespace. - New PrometheusRule payment-pipeline-alerts with: - WebhookErrorRate (warning, 5m) on increase(webhook_errors_total[5m]) - WebhookStale (warning, 10m) when checkout.session.completed timestamp is >30min stale during business hours (9am-9pm MST weekdays) - New basketball-api-golden-signals dashboard wired via ConfigMap, following the pal-e-app-golden-signals pattern. Uses only metrics basketball-api currently exposes (basketball_api_up, webhook_*). tofu validate passes. tofu plan -lock=false shows only the expected resource additions/updates under module.monitoring plus pre-existing drift from previously-merged PRs (database/ops). Closes #290 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>PR #329 Review
DOMAIN REVIEW
Stack identified: Terraform/OpenTofu + kube-prometheus-stack (Helm) + blackbox-exporter + Alertmanager + Grafana JSON dashboard + PrometheusRule CRDs against k3s.
1. WebhookStale PromQL business-hours filter — boundary bugs
Stated intent: "9am-9pm MST weekdays". MST = UTC-7, so business hours in UTC are 16:00 through 04:00 next day.
The hour clause
(hour() >= 16 or hour() < 4)is correct for the time-of-day wraparound, but the day clause(day_of_week() > 0 and day_of_week() < 6)is evaluated against the UTC day, while the business hours wrap past midnight UTC. This produces two off-by-one defects in the 8pm–9pm MST hour:6<6=false)0>0=false)So you'll silently miss the last hour of every Friday's checkout window, and you'll get false alerts during the last hour of every Sunday evening. Fix is to widen the day check to also cover the boundary: e.g.
((hour() >= 16 and day_of_week() > 0 and day_of_week() < 6) or (hour() < 4 and day_of_week() > 1 and day_of_week() < 7))(Mon morning–Sat morning UTC = Sun evening–Fri evening MST).Also flagged in PR body: the rule already fired around 16:00 UTC today on first deploy. That is the expected behavior of this expression at every business-day rollover whenever no checkout has happened in the last 30 minutes — i.e. every Monday morning will fire until the first webhook of the day, plus most early-morning weekdays. Consider whether the alert should require
webhook_last_received_timestampto be a recent value at all (e.g.webhook_last_received_timestamp > time() - 86400to avoid alerting when the metric has just become stale because the day rolled over).2. Inhibit rule scope — correct for stated purpose
equal = ["alertname", "namespace"]withseverity=critical→severity=warningcorrectly suppresses same-named warning copies of an alert that has a critical sibling firing in the same namespace. This handles the cross-namespace duplicate problem the PR sets out to fix. As acknowledged, it does not dedupeGmailOAuthTokenExpiredvsGmailOAuthTokenExpiringSoonbecause those are differentalertnamevalues — that's why ticket #323 exists. Scope is correct for what it was designed to do.One non-blocking note: alerts that don't carry a
namespacelabel (e.g.MacAgentDown,DiskPressure) won't participate in inhibition because theequaljoin requires a namespace match. That's fine; those alerts don't have warning duplicates anyway.3. defaultRules.kubernetesApps disable — acceptable bridge
Real signal lost:
KubeContainerWaiting(ImagePullBackOff stalls),KubePodCrashLooping(sustained crash detection longer-window thanPodRestartStorm),KubeDeploymentReplicasMismatch,KubeJobFailed, etc.Compensating controls retained: custom
PodRestartStorm(3 restarts/15m), customOOMKilled(15m sustained),TargetDown(up==0).PodRestartStormcovers most CrashLoopBackOff cases. Acceptable as a bridge with #325 tracked, but the gap should not linger — ImagePullBackOff stalls (e.g. Harbor 401, missing tag) are silent in this state until something else catches them. Recommend #325 prioritized within a sprint, not "eventually."4. Dashboard metrics references — all confirmed in basketball-api
Verified directly in
/home/ldraney/basketball-api/src/basketball_api/routes/webhooks.pyand/home/ldraney/basketball-api/src/basketball_api/routes/health.py:basketball_api_up(Gauge, set on every/metricsscrape)webhook_received_total{event_type}(Counter)webhook_processed_total{event_type}(Counter)webhook_errors_total{event_type, error_type}(Counter)webhook_last_received_timestamp{event_type}(Gauge)/metricsendpoint is mounted on the health router (no prefix) → exposeshttp://basketball-api.basketball-api.svc.cluster.local:8000/metrics. Panel queries are correct. The error-ratio panel correctly usesclamp_minto avoid div-by-zero. The "Webhook Error Ratio" panel's threshold steps are at unit "percent" with* 100already applied in the expression — display is correct.Minor dashboard observation (nit): "Seconds Since Last Webhook (any event)" panel uses
time() - max(webhook_last_received_timestamp)without filtering byevent_type. Since the gauge is per-event-type,maxover all event_types gives "most-recently-active event" which is what you want for a "did anything come in" signal. Correct.5. Spec file deletion in monitoring PR — scope creep nit
Deleting
terraform/specs/2026-04-10-westside-admin-design.mdin a monitoring PR violatesfeedback_stay_in_scope/ "no unnecessary file changes" PR convention. It's low-risk (one obsolete one-shot doc) but it muddles the changeset. Should have been a separate housekeeping PR or punted to a "spec graveyard cleanup" ticket. Not blocking.BLOCKERS
None that meet the BLOCKER criteria (no auth code, no unvalidated input, no secrets, no DRY violation in security paths). The WebhookStale boundary bugs (item 1 above) are correctness defects in new alerting logic but they fail-soft (under-alert one window, over-alert another) and not safety-critical. Recommend fixing in this PR or as an immediate follow-up before merge — see "CHANGES REQUESTED" below.
NITS
WebhookErrorRatealert usesincrease(webhook_errors_total[5m]) > 0 for 5m— that's effectively "any error sustained over a 10-minute window." Probably OK for a warning, but be aware it under-counts bursts: a single burst of 50 errors at t=0 followed by silence will fire at t=5m and self-resolve at t=10m. If you want sustained-error detection, this works; if you want any-burst detection, usefor: 0m.SOP COMPLIANCE
290-payment-pipeline-observability)tofu fmt -recursive -checkcleantofu validatecleanpayment-pipeline-alertsPrometheusRule already deployed and firing in cluster (perfeedback_tofu_validate_not_k8s_api, validate alone is insufficient — apply against state is required for new k8s resources, and that bar is met here)salt/pillar/secrets/platform.slsexplicitly excluded per PR body)PROCESS OBSERVATIONS
VERDICT: NOT APPROVED
WebhookStale has two correctness bugs in its day-of-week filter (Friday 8-9pm MST coverage gap, Sunday 8-9pm MST false-positive) and a guaranteed daily false-positive at every business-day morning rollover. These should be fixed before merge — either in this PR or a quick fix-up commit. Everything else (inhibit rule, defaultRules disable, dashboard, blackbox target removals, EndpointDown demotion) is sound. Re-request review after the PromQL is corrected.
Review-fix: applied QA's day-of-week split + cushion
Pushed
6501a2daddressing the verdict from comment 15860.Fix 1 — day-of-week boundary off-by-one (applied QA's exact suggestion):
Fix 2 — rollover cushion (
for: 10m → 60m):The 16:00 UTC rollover into business hours naturally has >30min staleness from overnight. A 60-min sustained-staleness check covers normal morning gaps for an active user base.
Not fully addressed in this PR — follow-up ticket created:
The "no first checkout yet today" case still flaps for genuinely low-traffic mornings. Proper fix requires a precondition like "webhook seen within the past 24h" rather than just an hour-of-day filter. Tracked as a separate ticket so this PR doesn't grow into a redesign.
Verification:
tofu fmt -recursive -checkcleantofu validate→ success6501a2dRe-requesting review.
PR #329 Re-Review (after
6501a2d)Re-review of the WebhookStale day-of-week and 16:00 UTC rollover findings from comment 15860.
DOMAIN REVIEW
Stack: Terraform (OpenTofu) + kube-prometheus-stack + Alertmanager + PrometheusRule CRDs + Grafana ConfigMap dashboards.
WebhookStale PromQL — verified correct:
Both half-day windows now match the intended business hours (Mon-Fri 9am-9pm MST):
day_of_week() in {1,2,3,4,5}(Mon-Fri). The> 0 and < 6strict-inequality form yields exactly that set. Correctly drops Sun (0) and Sat (6). ✓day_of_week() in {2,3,4,5,6}(Tue-Sat). The> 1 and < 7strict-inequality form yields exactly that set. Correctly drops Mon (1) — which would have been Sun MST evening, off-hours — and Sun (0). ✓This matches the suggested fix exactly. The previously flagged off-by-one defects (Mon UTC late-night incorrectly included; Fri MST late-evening incorrectly excluded) are both resolved.
for: 60mcushion — sufficient as a partial mitigation:At the daily 16:00 UTC rollover (9am MST = start of business hours), if
webhook_last_received_timestampis stale from overnight, the staleness condition will be true for the entire first hour. Withfor: 60m, the alert only pages if nocheckout.session.completedarrives by 17:00 UTC = 10am MST. On a normal-traffic day this almost never fires; on a genuinely-empty morning it is a true positive (revenue funnel verifiably idle for an hour during prime business hours). Acceptable for the noise-reduction goal, with #330 tracking the more sophisticated "first checkout of day" precondition.Adjacent rules — no regression:
WebhookErrorRate: unchanged, stillincrease(webhook_errors_total[5m]) > 0for 5m at warning. ✓inhibit_rules: unchanged, critical→warning suppression onalertname+namespace. ✓EndpointDown: unchanged from prior review (warning + 5m, expected-down targets removed). ✓Comment block quality — strong.
The inline comment documents:
MST = UTC-7, business hours UTC16:00-04:00)day_of_week()semantics (0=Sun, 1=Mon, ..., 6=Sat)for: 60mcushion purposeThis is the kind of comment future maintainers actually want — explains why the expression is non-obvious, not just what it does.
BLOCKERS
None. Both prior blockers (off-by-one on each half-day window) are fixed.
NITS
see #330) would let future maintainers find the follow-up without grep. Non-blocking; the discovered-scope section of the PR body lists adjacent tickets but stops at #328, and #330 was filed after this PR opened so its absence in the PR body is expected.for: 60mthe page only fires after 60 minutes of continuous staleness. Consider tightening to "60+ minutes" so on-call sees an accurate elapsed-time floor. Non-blocking; the staleness threshold (> 1800) is still 30min, so the message is technically true at the moment the staleness clock starts.SOP COMPLIANCE
290-payment-pipeline-observabilityfollows conventionsalt/pillar/secrets/platform.slsleft out as noted)tofu fmtandtofu validateclean per PR bodyPROCESS OBSERVATIONS
VERDICT: APPROVED
Applied both non-blocking nits from the approval:
Doc-only changes,
tofu validateclean. Approval should hold.View command line instructions
Checkout
From your project repository, check out a new branch and test the changes.