Add Stripe webhook delivery monitoring and alerting #347

Open
opened 2026-04-06 13:56:38 +00:00 by forgejo_admin · 0 comments

Type

Bug

Lineage

Discovered during forgejo_admin/basketball-api#343 and #346 investigation. Stripe silently stopped delivering webhooks for weeks — 7+ payments ($870+) went unrecorded. No alert, no monitoring, no awareness until a user emailed about a different issue.

Repo

forgejo_admin/basketball-api (health endpoint + metrics) and ldraney/pal-e-platform (monitoring/alerting rules)

What Broke

There is zero observability on Stripe webhook delivery. When Stripe gave up delivering webhooks after repeated failures, we had no way to know. This caused:

  • 7 jersey payments unrecorded in database
  • Parents charged but orders not tracked
  • Issue only discovered when a user reported a different bug

Repro Steps

  1. Stripe webhook delivery fails (for any reason — pod downtime, network, etc.)
  2. Stripe retries with exponential backoff over 72 hours
  3. After exhausting retries, Stripe stops delivering and may disable the endpoint
  4. No alert fires. No metric shows the failure. No one knows until a user complains.

Expected Behavior

Platform should detect and alert when:

  • Webhook endpoint returns non-2xx responses
  • pending_webhooks count is > 0 for events older than 1 hour
  • Stripe disables the webhook endpoint
  • Database has pending jersey orders older than 24 hours with no jersey_option set (stale checkout)
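
The two time-based conditions above (undelivered events older than 1 hour, pending orders older than 24 hours) could be checked with plain filters like the sketch below. This is only an illustration, not the project's implementation. It assumes events are dicts carrying Stripe's `created` (Unix timestamp) and `pending_webhooks` fields, and that orders are dicts with `status`, `jersey_option`, and `created_at` keys. The order field names are guesses, since the real schema is not shown in this issue.

```python
from datetime import datetime, timedelta, timezone


def stuck_events(events, now, max_age=timedelta(hours=1)):
    """Return events that still have undelivered webhooks and are older than max_age.

    `events` is assumed to be a list of dicts shaped like Stripe Event objects.
    """
    cutoff = now - max_age
    return [
        e for e in events
        if e["pending_webhooks"] > 0
        and datetime.fromtimestamp(e["created"], tz=timezone.utc) < cutoff
    ]


def stale_orders(orders, now, max_age=timedelta(hours=24)):
    """Return pending orders older than max_age that have no jersey_option.

    The keys `status`, `jersey_option`, and `created_at` are hypothetical.
    """
    cutoff = now - max_age
    return [
        o for o in orders
        if o["status"] == "pending"
        and o.get("jersey_option") is None
        and o["created_at"] < cutoff
    ]
```

Passing `now` in explicitly keeps both checks deterministic, so they can be unit tested without touching Stripe or the database.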

Environment

  • Cluster/namespace: prod / basketball-api
  • Stripe webhook: we_1T9I5sR9SdzWqVXM1WBWMDBv
  • Monitoring stack: Prometheus + Grafana (pal-e-platform)

Acceptance Criteria

  • basketball-api exposes a /health/webhooks endpoint that checks Stripe webhook status (queries recent events for pending_webhooks > 0)
  • Prometheus scrapes the health endpoint
  • Grafana alert fires when webhook delivery is failing
  • Stale checkout detection: alert when pending orders exist > 24h with no jersey_option
  • Runbook: what to do when the alert fires (reset endpoint, manual sync from Stripe)
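
For Prometheus to scrape the results of these checks, they need to be served in the Prometheus text exposition format. A minimal sketch of rendering that payload follows. The metric names (`stripe_webhook_stuck_events`, `stripe_webhook_endpoint_enabled`, `jersey_orders_stale`) are placeholders invented for illustration, and whether this is served from `/health/webhooks` or a separate metrics path is left open.

```python
def render_metrics(stuck_event_count, stale_order_count, endpoint_enabled):
    """Render three gauges in the Prometheus text exposition format.

    All metric names here are hypothetical, not existing project metrics.
    """
    gauges = [
        (
            "stripe_webhook_stuck_events",
            "Stripe events with pending_webhooks > 0 older than 1 hour.",
            stuck_event_count,
        ),
        (
            "stripe_webhook_endpoint_enabled",
            "1 if the Stripe webhook endpoint is enabled, 0 if disabled.",
            1 if endpoint_enabled else 0,
        ),
        (
            "jersey_orders_stale",
            "Pending jersey orders older than 24 hours with no jersey_option.",
            stale_order_count,
        ),
    ]
    lines = []
    for name, help_text, value in gauges:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The exposition format expects the payload to end with a newline.
    return "\n".join(lines) + "\n"
```

With gauges like these, the alert rules can fire on simple expressions such as a stuck-event count above zero or an enabled flag equal to zero.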

Related

  • project-westside-basketball — project this affects
  • forgejo_admin/basketball-api#340 — original symptom
  • forgejo_admin/basketball-api#343 — funnel investigation
  • forgejo_admin/basketball-api#346 — RollingUpdate fix
  • Key files: basketball-api/src/basketball_api/routes/webhooks.py, pal-e-platform/terraform/modules/monitoring/