Reduce alert noise + add payment pipeline observability

forgejo_admin commented

2026-04-13 19:46:16 +00:00

Contributor

Type

Feature

Lineage

Standalone — discovered during investigation of 100% checkout failure rate on basketball-api /checkout/first-payment (Apr 13). 31 alerts firing, zero signal about the revenue-critical failure.

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want alerts that surface revenue-critical failures and suppress known noise
So that a broken payment pipeline triggers a notification instead of hiding in 31 firing alerts

Context

The Westside monthly payment checkout has been returning 409 to every parent since at least 18:08 Apr 13. No alert fired because:

31 alerts are currently firing — complete alert fatigue
No payment pipeline alerts exist
No basketball-api golden signals dashboard exists
Expected-down endpoints (westside-dev, pal-e-app, mac-agent) generate critical alerts indistinguishable from real outages
kube-state-metrics produces duplicate alerts across namespaces

Basketball-api currently exposes only: basketball_api_up, webhook_received_total, webhook_processed_total, webhook_errors_total, webhook_last_received_timestamp. No per-endpoint HTTP metrics — that requires a separate basketball-api ticket to add Prometheus middleware.

File Targets

Files to modify:

terraform/modules/monitoring/main.tf — remove/inhibit noisy blackbox targets (westside-dev, pal-e-app, mac-agent), add payment webhook alert rules
terraform/dashboards/basketball-api-golden-signals.json — new file, Grafana dashboard for webhook metrics and API uptime

Files to NOT touch:

terraform/dashboards/pal-e-app-golden-signals.json — unrelated app
terraform/dashboards/dora-dashboard.json — unrelated
Any basketball-api code — separate repo, separate tickets
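
If the blackbox targets are kept for the uptime dashboard rather than removed outright, one option is to exclude the expected-down endpoints in the alert expression itself. The fragment below is a minimal sketch of that idea, not the rule currently in `main.tf`. The object name, group name, alert name, `for` duration, and the assumption that the probed URL ends up in the `instance` label are all hypothetical and should be checked against the existing probe configuration.

```yaml
# Sketch only: names and label layout are assumptions, not taken from the repo.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-endpoint-down          # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: blackbox.rules              # hypothetical group name
      rules:
        - alert: EndpointDown           # hypothetical alert name
          # probe_success is the blackbox exporter's 0/1 probe result.
          # PromQL regex matchers are fully anchored, hence the leading
          # and trailing ".*". Assumes the target URL is in "instance".
          expr: |
            probe_success{instance!~".*(westside-dev|pal-e-app|mac-agent).*"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Blackbox probe failing for {{ $labels.instance }}"
```

With this approach the probes keep running, so the `probe_success` series for the excluded endpoints still exist and panels in uptime-dashboard.json that reference them would keep rendering.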

Acceptance Criteria

When westside-dev/pal-e-app/mac-agent are down, no critical alert fires (silenced or removed)
When webhook_errors_total increases for 5 minutes, a warning alert fires
When webhook_last_received_timestamp is stale for 30+ minutes during business hours, a warning alert fires
Grafana has a basketball-api dashboard showing webhook receive/process/error rates
Firing alert count drops from 31 to fewer than 5, all of them real alerts
tofu plan shows only the expected changes
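
The two webhook criteria above could be expressed roughly as follows. This is a sketch under stated assumptions rather than the final rules: the object, group, and alert names are hypothetical, `webhook_last_received_timestamp` is assumed to hold a Unix timestamp in seconds, and the 14:00 to 22:00 UTC window is only a placeholder because the ticket does not define "business hours".

```yaml
# Sketch only: names and the business-hours window are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: basketball-api-payment-webhooks           # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: basketball-api.payment-webhooks       # hypothetical group name
      rules:
        - alert: PaymentWebhookErrorsIncreasing   # hypothetical alert name
          # The error counter grew within the trailing 5m window, and that
          # has stayed true for 5m before the alert moves to firing.
          expr: increase(webhook_errors_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "basketball-api webhook errors are increasing"
        - alert: PaymentWebhookStale              # hypothetical alert name
          # Assumes the gauge is a Unix timestamp in seconds (1800s = 30m).
          # hour() is evaluated in UTC; adjust the window to local hours.
          expr: |
            (time() - webhook_last_received_timestamp > 1800)
            and on() (hour() >= 14 and hour() < 22)
          labels:
            severity: warning
          annotations:
            summary: "No payment webhook received for 30+ minutes"
```

Depending on how the Prometheus operator selects rules in this cluster, the object may also need a matching label (for example a `release` label). If the timestamp gauge starts at 0 before the first webhook arrives, the staleness rule would fire after a restart, which may or may not be desirable.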

Test Expectations

tofu validate passes
tofu plan -lock=false shows expected resource changes (PrometheusRule updates, ConfigMap additions, blackbox target removals)
After apply: kubectl get prometheusrules -n monitoring shows updated rules
After apply: Grafana dashboard accessible and rendering data from basketball-api ServiceMonitor

Constraints

Must run tofu plan -lock=false (state lock blocks CI)
Blackbox target changes must not break existing uptime-dashboard.json references
Alert routing stays Telegram-only for now
Dashboard must use existing ServiceMonitor scrape (30s interval, port 8000, /metrics)
Per-endpoint HTTP metrics (status code by path) are NOT available yet — dashboard/alerts limited to webhook counters and uptime probe
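
Given those constraints, the dashboard can only be built from the webhook counters and `basketball_api_up`. The ConfigMap below is a rough sketch of what that might look like once rendered, assuming Grafana picks up dashboards through a sidecar that watches for a `grafana_dashboard` label. The ConfigMap name, the label value, the panel layout, the 5m rate window, and the use of the default datasource are all assumptions, not the repo's actual configuration.

```yaml
# Sketch only: ConfigMap name, discovery label, and panels are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: basketball-api-golden-signals-dashboard   # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"                        # assumed sidecar label
data:
  basketball-api-golden-signals.json: |
    {
      "title": "basketball-api golden signals",
      "uid": "basketball-api-golden-signals",
      "panels": [
        {
          "id": 1,
          "type": "timeseries",
          "title": "Webhook rates (per second)",
          "gridPos": {"h": 8, "w": 16, "x": 0, "y": 0},
          "targets": [
            {"refId": "A", "expr": "sum(rate(webhook_received_total[5m]))", "legendFormat": "received"},
            {"refId": "B", "expr": "sum(rate(webhook_processed_total[5m]))", "legendFormat": "processed"},
            {"refId": "C", "expr": "sum(rate(webhook_errors_total[5m]))", "legendFormat": "errors"}
          ]
        },
        {
          "id": 2,
          "type": "stat",
          "title": "API up",
          "gridPos": {"h": 8, "w": 8, "x": 16, "y": 0},
          "targets": [
            {"refId": "A", "expr": "min(basketball_api_up)"}
          ]
        }
      ]
    }
```

A 5m rate window spans ten samples at the 30s scrape interval, which keeps the lines smooth without hiding short error bursts. Checkout error panels are deliberately absent because the per-endpoint HTTP metrics do not exist yet.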

Checklist

PR opened
tofu validate passes
tofu plan -lock=false reviewed
No unrelated changes
README roadmap updated if applicable

Related

pal-e-platform — project this affects
basketball-api needs a separate ticket: add prometheus-fastapi-instrumentator for per-endpoint HTTP metrics (prerequisite for CheckoutErrorRate alert)
basketball-api needs a separate ticket: fix stale pending order 409 bug
basketball-api needs a separate ticket: exclude $0 monthly_fee from blast
basketball-api needs a separate ticket: implement recurring billing for May 1
