Alert rule + dashboard panel for pending orders with expired Stripe sessions #295

Open
opened 2026-04-17 16:31:53 +00:00 by forgejo_admin · 0 comments
Contributor

Type

Feature

Lineage

Part of the 2026-04-17 Utah Invitational stranded-orders response. Blocked by basketball-api ticket that emits basketball_api_pending_orders_with_expired_session gauge.

Repo

forgejo_admin/pal-e-platform

User Story

As the on-call human, I get a warning-level (not critical) alert when any pending payment order's Stripe session has been expired for more than 15 minutes, so we can regenerate and resend before a parent complains.

Context

Existing payment-pipeline alerts live on branch 290-payment-pipeline-observability (commit 432e24e):

  • WebhookErrorRate — warning at 5m
  • WebhookStale — warning at 10m

Neither fires for the Utah Invitational failure mode. An expired session never triggers a webhook, so "no webhooks" looks normal. We need a direct gauge-based alert.

File Targets

  • monitoring/prometheusrules/payment-pipeline-alerts.yaml — add rule TournamentOrdersWithExpiredSessions (warning, for: 15m)
  • monitoring/dashboards/basketball-api-golden-signals.json (or its ConfigMap definition) — add a stat panel showing current value of basketball_api_pending_orders_with_expired_session, plus a time-series showing history
  • monitoring/prometheusrules/payment-pipeline-alerts.yaml — also consider a warning MonthlyOrdersWithExpiredSessions with its own threshold
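The two dashboard panels named in the file targets could be generated roughly as follows. This is a minimal sketch, not the existing dashboard's layout: the datasource uid, panel ids, titles, grid positions, and the `sum by (category)` aggregation are all assumptions chosen for illustration. The 24h window is left to the dashboard time range rather than set on the panel.

```python
import json

METRIC = "basketball_api_pending_orders_with_expired_session"


def build_panels(datasource_uid: str = "prometheus") -> list:
    """Return a stat panel and a time-series panel for the gauge.

    Every id, title, uid, and position here is a placeholder (assumption),
    not a value taken from basketball-api-golden-signals.
    """
    datasource = {"type": "prometheus", "uid": datasource_uid}
    # Assumes the gauge carries a `category` label, as the alert expression does.
    target = {
        "expr": f"sum by (category) ({METRIC})",
        "legendFormat": "{{category}}",
        "refId": "A",
    }
    stat = {
        "id": 101,
        "type": "stat",
        "title": "Pending orders with expired Stripe session",
        "datasource": datasource,
        "gridPos": {"h": 6, "w": 6, "x": 0, "y": 0},
        "targets": [target],
        # Show the most recent non-null sample as the current value.
        "options": {"reduceOptions": {"calcs": ["lastNotNull"], "fields": "", "values": False}},
    }
    history = {
        "id": 102,
        "type": "timeseries",
        "title": "Expired-session pending orders (history)",
        "datasource": datasource,
        "gridPos": {"h": 6, "w": 18, "x": 6, "y": 0},
        "targets": [target],
    }
    return [stat, history]


if __name__ == "__main__":
    # Print JSON that could be pasted into the dashboard's "panels" array.
    print(json.dumps(build_panels(), indent=2))
```

Generating the panels from one shared target keeps the stat and the history view on the same query, so they cannot drift apart.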

Alert expression template:

- alert: TournamentOrdersWithExpiredSessions
  expr: basketball_api_pending_orders_with_expired_session{category="tournament"} > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "{{ $value }} tournament orders have expired Stripe sessions"
    description: "Parents clicking these links see the 'checkout session expired' page. Regen via scripts/regenerate_tournament_orders.py --product-ids {relevant} --commit, then resend per blast SOP."
    runbook_url: "https://forgejo.pal-e.com/forgejo_admin/basketball-api/src/branch/main/docs/tournament-billing-runbook.md"

Files NOT to touch:

  • tenant-defaults or other Prometheus rules files unrelated to payment pipeline
  • Grafana config outside the basketball-api-golden-signals dashboard

Acceptance Criteria

  • TournamentOrdersWithExpiredSessions rule in payment-pipeline-alerts.yaml, warning severity, 15m window
  • MonthlyOrdersWithExpiredSessions rule, same pattern
  • Alerts inhibit correctly alongside existing inhibit rule (same namespace)
  • Dashboard panel on basketball-api-golden-signals shows current gauge value
  • Dashboard panel shows 24h trend
  • YAML parse validated (yaml.safe_load) before commit — see feedback_yaml_parse_validation.md
  • tofu plan -lock=false output in PR shows rule+dashboard changes only
  • tofu fmt + tofu validate clean
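The YAML parse check and the first two acceptance criteria could be combined into one pre-commit script along these lines. This is a sketch under assumptions: it accepts either a plain rule file (top-level `groups`) or a PrometheusRule resource (`spec.groups`), since the ticket does not say which layout the file uses, and the function names are hypothetical. Because the ticket gives the monthly rule "its own threshold", only the tournament rule is checked for `for: 15m`.

```python
METRIC = "basketball_api_pending_orders_with_expired_session"

# Alert name -> required `for` value (None means "not checked here").
EXPECTED = {
    "TournamentOrdersWithExpiredSessions": "15m",
    "MonthlyOrdersWithExpiredSessions": None,
}


def check_rules(doc) -> list:
    """Return a list of problems found in an already-parsed rules document."""
    if not isinstance(doc, dict):
        return ["document is empty or not a mapping"]
    # Assumption: rules live under `groups` or under `spec.groups`.
    groups = doc.get("groups") or doc.get("spec", {}).get("groups", [])
    found = {}
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" in rule:
                found[rule["alert"]] = rule

    problems = []
    for name, required_for in EXPECTED.items():
        rule = found.get(name)
        if rule is None:
            problems.append(f"{name}: rule missing")
            continue
        if rule.get("labels", {}).get("severity") != "warning":
            problems.append(f"{name}: severity must be warning")
        if required_for is not None and rule.get("for") != required_for:
            problems.append(f"{name}: for must be {required_for}")
        if METRIC not in str(rule.get("expr", "")):
            problems.append(f"{name}: expr does not reference {METRIC}")
    return problems


def load_and_check(path: str) -> list:
    """Parse the file with yaml.safe_load, then run the structural checks."""
    import yaml  # PyYAML (third-party); imported lazily so check_rules has no dependency

    with open(path, encoding="utf-8") as handle:
        return check_rules(yaml.safe_load(handle))
```

A run such as `load_and_check("monitoring/prometheusrules/payment-pipeline-alerts.yaml")` returning an empty list would indicate that the file parses and both rules are present with warning severity. It complements `promtool check rules` and does not replace it.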

Test Expectations

  • Local promtool check rules monitoring/prometheusrules/payment-pipeline-alerts.yaml passes
  • Dashboard JSON passes Grafana validation (can port-forward Grafana pod + load locally)
  • After deploy, Prometheus UI shows rule in "OK" state initially, transitions to "pending" → "firing" when basketball-api metric exceeds 0
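The state transitions expected after deploy can be modelled in a few lines. This is a simplified sketch of how a `for: 15m` clause behaves, not Prometheus's actual implementation: it ignores options such as `keep_firing_for`, and the 5-minute evaluation interval in the example is an assumption. The sketch's "inactive" state corresponds to what the ticket calls "OK".

```python
FOR_SECONDS = 15 * 60  # mirrors `for: 15m` in the alert rule


def alert_states(samples, for_seconds=FOR_SECONDS):
    """Map rule evaluations to alert states.

    `samples` is a list of (timestamp_seconds, gauge_value) pairs, one per
    rule evaluation. The alert is pending from the first evaluation where
    the gauge is above 0 and fires once it has stayed above 0 for
    `for_seconds`. Any evaluation at 0 resets it to inactive.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > 0:
            if active_since is None:
                active_since = timestamp
            elapsed = timestamp - active_since
            states.append("firing" if elapsed >= for_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical evaluations every 5 minutes; 24 is the count quoted in the checklist.
    samples = [(0, 0), (300, 24), (600, 24), (900, 24), (1200, 24), (1500, 0)]
    print(alert_states(samples))
    # ['inactive', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

The model also shows why a brief dip to 0 matters: a single evaluation at 0 restarts the 15-minute clock.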

Constraints

  • Warning severity only — this is not page-wake-Lucas material
  • Follow existing pattern in payment-pipeline-alerts.yaml for consistency
  • Use same kube-prometheus-stack label set as existing rules so Alertmanager routes correctly
  • ArgoCD auto-sync applies — verify via argocd app sync monitoring or rely on branch merge → reconcile

Checklist

  • PR opened with tofu plan output
  • Promtool check passes
  • ArgoCD shows sync healthy post-merge
  • Alert fires in production (since we currently have 18 + 6 = 24 expired-session orders)
  • Alert clears after Ticket A recovery runs

Related

  • project-pal-e-platform
  • Blocks: none
  • Blocked by: basketball-api ticket emitting basketball_api_pending_orders_with_expired_session
Reference
ldraney/pal-e-platform#295