Add Stripe webhook alerting rules and basketball-api Grafana dashboard #272

Open
opened 2026-04-06 15:31:35 +00:00 by forgejo_admin · 0 comments

Type

Bug

Lineage

Discovered during forgejo_admin/basketball-api#340 investigation. Stripe webhooks were silently failing for weeks with zero alerting. Depends on forgejo_admin/basketball-api#350 (webhook metrics must exist before platform can alert on them).

Repo

forgejo_admin/pal-e-platform

What Broke

No observability on Stripe webhook delivery. When webhooks failed (for multiple reasons — secret mismatch, SDK crash, deploy downtime), there was zero alerting. 7+ payments ($870+) went unrecorded. Issue only discovered when a user emailed about a different bug. basketball-api currently exposes only basketball_api_up on /metrics — no request rates, no webhook counters, no error tracking.

Repro Steps

  1. Stripe webhook delivery fails for any reason
  2. No PrometheusRule fires
  3. No Grafana panel shows the failure
  4. Nobody knows until a user complains

Expected Behavior

Platform detects and alerts within minutes when:

  • Webhook error rate > 0 sustained for 5 minutes
  • No webhooks received in 24 hours (staleness)
  • pending jersey orders exist > 24 hours without jersey_option set (stale checkout)

Grafana dashboard shows webhook success/failure rate and event type breakdown.
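
For illustration, the first two conditions could be expressed as a PrometheusRule roughly like the sketch below. It uses the alert names, expressions, durations, and severity listed in the acceptance criteria of this issue. The resource name and group name are hypothetical, and the metrics only exist once forgejo_admin/basketball-api#350 lands. The stale-checkout condition is left out because this issue names no metric for it.

```yaml
# Sketch only, not the final manifest.
# Assumptions: the metadata name and group name are made up for illustration;
# the cluster may also require extra labels on this object so that the
# Prometheus ruleSelector picks it up.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: basketball-api-webhooks        # hypothetical
  namespace: monitoring
spec:
  groups:
    - name: basketball-api.webhooks    # hypothetical
      rules:
        # Any webhook errors, sustained for 5 minutes.
        - alert: WebhookErrorRateHigh
          expr: rate(webhook_errors_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
        # No webhook received for more than 24 hours (86400 seconds).
        - alert: WebhookStale
          expr: time() - webhook_last_received_timestamp > 86400
          for: 1h
          labels:
            severity: warning
```

One thing to check when wiring this up: WebhookStale only evaluates while webhook_last_received_timestamp is present, so how basketball-api initialises that gauge after a restart (absent versus 0) affects whether the alert stays silent or fires early.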

Environment

  • Cluster/namespace: prod / monitoring (Prometheus, Grafana)
  • ServiceMonitor: basketball-api exists, scrapes /metrics
  • Current metrics: only basketball_api_up gauge
  • Depends on: forgejo_admin/basketball-api#350 adding webhook_received_total, webhook_processed_total, webhook_errors_total, webhook_last_received_timestamp

Acceptance Criteria

  • PrometheusRule WebhookErrorRateHigh: rate(webhook_errors_total[5m]) > 0 for 5m, severity warning
  • PrometheusRule WebhookStale: time() - webhook_last_received_timestamp > 86400 for 1h, severity warning
  • Grafana dashboard: basketball-api webhook panel (received/processed/errors over time, by event_type)
  • Alerts fire in test when metrics are simulated
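
One possible way to cover the "alerts fire in test" criterion is a promtool unit test with simulated series. The sketch below assumes the two rules above are saved as a plain Prometheus rule file (top-level groups key) named webhook_rules.yml, with only the severity label and no annotations. Both file names are hypothetical, and the series carry no extra labels for simplicity.

```yaml
# Sketch only. Run with: promtool test rules webhook_rules_test.yml
# Assumption: webhook_rules.yml is a hypothetical file holding the two rules
# from the acceptance criteria in plain Prometheus rule-file format.
rule_files:
  - webhook_rules.yml

evaluation_interval: 1m

tests:
  # Errors increase by 1 every minute, so the rate stays above 0.
  - interval: 1m
    input_series:
      - series: 'webhook_errors_total'
        values: '0+1x20'
    alert_rule_test:
      - eval_time: 10m
        alertname: WebhookErrorRateHigh
        exp_alerts:
          - exp_labels:
              severity: warning

  # Last-received timestamp stays at 0 while test time advances past 24h + 1h.
  - interval: 1m
    input_series:
      - series: 'webhook_last_received_timestamp'
        values: '0x1600'
    alert_rule_test:
      - eval_time: 26h
        alertname: WebhookStale
        exp_alerts:
          - exp_labels:
              severity: warning
```

If the real rules carry annotations or the metrics carry labels such as event_type, the expected alerts would need matching exp_annotations and exp_labels entries.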

Related

  • project-westside-basketball: project this affects
  • forgejo_admin/basketball-api#350: dependency (metrics must exist first)
  • platform-architecture: monitoring module
  • Key files: terraform/modules/monitoring/main.tf (PrometheusRules), terraform/dashboards/ (Grafana dashboards)