Reduce alert noise + add payment pipeline observability #290

Closed
opened 2026-04-13 19:46:16 +00:00 by forgejo_admin · 0 comments

Type

Feature

Lineage

Standalone — discovered during investigation of 100% checkout failure rate on basketball-api /checkout/first-payment (Apr 13). 31 alerts firing, zero signal about the revenue-critical failure.

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want alerts that surface revenue-critical failures and suppress known noise
So that a broken payment pipeline triggers a notification instead of hiding in 31 firing alerts

Context

The Westside monthly payment checkout has been returning 409 to every parent since at least 18:08 Apr 13. No alert fired because:

  1. 31 alerts are currently firing — complete alert fatigue
  2. No payment pipeline alerts exist
  3. No basketball-api golden signals dashboard exists
  4. Expected-down endpoints (westside-dev, pal-e-app, mac-agent) generate critical alerts indistinguishable from real outages
  5. kube-state-metrics produces duplicate alerts across namespaces

Basketball-api currently exposes only: basketball_api_up, webhook_received_total, webhook_processed_total, webhook_errors_total, webhook_last_received_timestamp. No per-endpoint HTTP metrics — that requires a separate basketball-api ticket to add Prometheus middleware.

File Targets

Files to modify:

  • terraform/modules/monitoring/main.tf — remove/inhibit noisy blackbox targets (westside-dev, pal-e-app, mac-agent), add payment webhook alert rules
  • terraform/dashboards/basketball-api-golden-signals.json — new file, Grafana dashboard for webhook metrics and API uptime

Files to NOT touch:

  • terraform/dashboards/pal-e-app-golden-signals.json — unrelated app
  • terraform/dashboards/dora-dashboard.json — unrelated
  • Any basketball-api code — separate repo, separate tickets

Acceptance Criteria

  • When westside-dev/pal-e-app/mac-agent are down, no critical alert fires (silenced or removed)
  • When webhook_errors_total increases for 5 minutes, a warning alert fires
  • When webhook_last_received_timestamp is stale for 30+ minutes during business hours, a warning alert fires
  • Grafana has a basketball-api dashboard showing webhook receive/process/error rates
  • Firing alert count drops from 31 to fewer than 5, all of them real alerts
  • tofu plan shows only the expected changes
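
A minimal sketch of how the two webhook criteria above might look as a PrometheusRule in terraform/modules/monitoring/main.tf. It assumes rules are managed through a kubernetes_manifest resource, that webhook_last_received_timestamp is a Unix timestamp in seconds, and that "business hours" means 14:00 to 23:00 UTC. The resource name, alert names, release label, and hour window are hypothetical and not taken from the repo.

```hcl
# Sketch only: resource, rule, and label names are assumptions.
# Metric names come from the ticket.
resource "kubernetes_manifest" "basketball_api_webhook_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "basketball-api-webhook-alerts"
      namespace = "monitoring"
      labels = {
        # Assumption: has to match the ruleSelector of the running Prometheus.
        release = "kube-prometheus-stack"
      }
    }
    spec = {
      groups = [{
        name = "basketball-api.webhooks"
        rules = [
          {
            # Fires when the error counter has kept increasing for 5 minutes.
            alert  = "PaymentWebhookErrors"
            expr   = "increase(webhook_errors_total[5m]) > 0"
            "for"  = "5m"
            labels = { severity = "warning" }
            annotations = {
              summary = "basketball-api payment webhook errors are increasing"
            }
          },
          {
            # Fires when no webhook has arrived for 30+ minutes (1800 s),
            # gated to an assumed 14:00-23:00 UTC business-hours window.
            alert  = "PaymentWebhookStale"
            expr   = "(time() - webhook_last_received_timestamp > 1800) and on() (hour() >= 14) and on() (hour() < 23)"
            "for"  = "5m"
            labels = { severity = "warning" }
            annotations = {
              summary = "No payment webhook received for 30+ minutes during business hours"
            }
          }
        ]
      }]
    }
  }
}
```

The "for" key is quoted because a bare for can be read as the start of a for-expression inside an HCL object. hour() evaluates in UTC, so the window has to be shifted to whatever local business hours are intended.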

Test Expectations

  • tofu validate passes
  • tofu plan -lock=false shows expected resource changes (PrometheusRule updates, ConfigMap additions, blackbox target removals)
  • After apply: kubectl get prometheusrules -n monitoring shows updated rules
  • After apply: Grafana dashboard accessible and rendering data from basketball-api ServiceMonitor
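
The "ConfigMap additions" in the plan could come from a resource like the sketch below. It assumes dashboards are picked up by a Grafana sidecar that watches for a grafana_dashboard label (the kube-prometheus-stack default) and that the module lives at terraform/modules/monitoring. The resource name and label value are assumptions, not confirmed from the repo.

```hcl
# Sketch only: resource name and sidecar label are assumptions.
resource "kubernetes_config_map" "basketball_api_golden_signals" {
  metadata {
    name      = "basketball-api-golden-signals"
    namespace = "monitoring"
    labels = {
      # Assumption: label the Grafana dashboard sidecar is configured to watch.
      grafana_dashboard = "1"
    }
  }

  data = {
    # Path is relative to terraform/modules/monitoring, per the File Targets.
    "basketball-api-golden-signals.json" = file("${path.module}/../../dashboards/basketball-api-golden-signals.json")
  }
}
```

Inside the dashboard JSON, panels would likely be limited to queries such as rate(webhook_received_total[5m]), rate(webhook_processed_total[5m]), rate(webhook_errors_total[5m]), and basketball_api_up, since those are the only series the ticket lists.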

Constraints

  • Must run tofu plan -lock=false (state lock blocks CI)
  • Blackbox target changes must not break existing uptime-dashboard.json references
  • Alert routing stays Telegram-only for now
  • Dashboard must use existing ServiceMonitor scrape (30s interval, port 8000, /metrics)
  • Per-endpoint HTTP metrics (status code by path) are NOT available yet — dashboard/alerts limited to webhook counters and uptime probe
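
One way to meet the blackbox constraint is to keep probing the expected-down targets (so uptime-dashboard.json keeps its series) and exclude them only from the critical alert expression. The sketch below assumes the probe alert is built on the blackbox exporter's probe_success metric and that each target's instance label contains the names from the ticket. The local names, alert name, and regex are hypothetical.

```hcl
# Sketch only: local names, alert name, and label matching are assumptions.
locals {
  # Endpoints the ticket lists as expected to be down.
  expected_down_targets = ["westside-dev", "pal-e-app", "mac-agent"]

  # PromQL regexes are fully anchored, hence the leading and trailing ".*".
  expected_down_regex = ".*(${join("|", local.expected_down_targets)}).*"

  # Rule object that could be placed in an existing PrometheusRule group.
  endpoint_down_rule = {
    alert  = "EndpointDown"
    expr   = "probe_success{instance!~\"${local.expected_down_regex}\"} == 0"
    "for"  = "5m"
    labels = { severity = "critical" }
  }
}
```

Removing the targets from the probe list instead would also silence the alerts, but it drops their series from the uptime dashboard, which is the risk the constraint calls out.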

Checklist

  • PR opened
  • tofu validate passes
  • tofu plan -lock=false reviewed
  • No unrelated changes
  • README roadmap updated if applicable

Related

  • pal-e-platform: project this affects
  • basketball-api needs a separate ticket: add prometheus-fastapi-instrumentator for per-endpoint HTTP metrics (prerequisite for CheckoutErrorRate alert)
  • basketball-api needs a separate ticket: fix stale pending order 409 bug
  • basketball-api needs a separate ticket: exclude $0 monthly_fee from blast
  • basketball-api needs a separate ticket: implement recurring billing for May 1
forgejo_admin 2026-04-14 02:52:44 +00:00