Emit pending_orders_with_expired_session gauge metric #487

Open
opened 2026-04-17 16:31:31 +00:00 by forgejo_admin · 1 comment

Type

Feature

Lineage

Standalone — discovered 2026-04-17 during Utah Invitational stranded-orders investigation. First observability gap we want closed so we never learn about expired links from an angry parent again.

Repo

forgejo_admin/basketball-api

User Story

As the Ava main session, I see a Prometheus metric that counts Orders in pending status whose Stripe Checkout Session has status=expired, so an alert rule in pal-e-platform can page before parents complain.

Context

Current observability (branch 290-payment-pipeline-observability in pal-e-platform) added WebhookErrorRate and WebhookStale alerts, plus a basketball-api golden-signals dashboard. Neither would have caught the Utah Invitational incident: no webhook fires on a dead link, and WebhookStale only alerts on cluster-wide silence during business hours.

We need a direct signal: "there exist N pending orders whose checkout sessions are dead." That metric combined with an alert (Ticket C2) closes the gap.

File Targets

  • src/basketball_api/metrics.py (or wherever Prometheus metrics are currently registered — confirm pattern by reading existing basketball_api_up, webhook_errors_total, etc.) — register new Gauge basketball_api_pending_orders_with_expired_session
  • Background task or endpoint handler that periodically recomputes the gauge. Preferred: APScheduler job running every 5 minutes. If no scheduler exists, use FastAPI startup event + asyncio task.
  • Query pattern:
    1. Select all Order rows where status='pending' AND stripe_checkout_session_id IS NOT NULL
    2. For each, call stripe.checkout.Session.retrieve(session_id) OR consult a cached session table
    3. Count those where session.status == 'expired'
    4. Set gauge value; label by product.category if cheap

Optimization: direct Stripe retrieve per order is 18+ API calls every 5 min for our current volume — acceptable. Revisit if it grows.
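
A minimal sketch of the four-step query pattern above, under stated assumptions: `PendingOrder`, `count_expired_by_category`, and `apply_counts` are hypothetical names, not code from basketball-api. `retrieve_session` stands in for `stripe.checkout.Session.retrieve`, and `gauge` is assumed to behave like a `prometheus_client` Gauge declared with a `category` label.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

logger = logging.getLogger(__name__)

# Categories named in the acceptance criteria; seeded with 0 so a label
# drops back to zero once every parent in that category has paid.
CATEGORIES = ("tournament", "monthly", "jersey")


@dataclass
class PendingOrder:
    """Hypothetical stand-in for the Order rows selected in step 1."""
    stripe_checkout_session_id: str
    category: str


def count_expired_by_category(
    orders: Iterable[PendingOrder],
    retrieve_session: Callable[[str], object],
) -> Optional[Dict[str, int]]:
    """Steps 2 and 3: look up each session and count the expired ones.

    Returns None if any lookup fails, so the caller can leave the gauge
    at its last value instead of publishing a partial count.
    """
    counts = {category: 0 for category in CATEGORIES}
    try:
        for order in orders:
            session = retrieve_session(order.stripe_checkout_session_id)
            if session.status == "expired":
                counts[order.category] = counts.get(order.category, 0) + 1
    except Exception as exc:  # assumption: real code would narrow this to Stripe's error class
        logger.warning("expired-session refresh failed: %s", exc)
        return None
    return counts


def apply_counts(gauge, counts: Optional[Dict[str, int]]) -> None:
    """Step 4: publish per-category counts; do nothing after a failed refresh."""
    if counts is None:
        return
    for category, value in counts.items():
        gauge.labels(category=category).set(value)
```

Keeping the counting separate from the gauge write means a mid-refresh Stripe failure never publishes a half-finished number, which is one way to satisfy the "leave the gauge at its last value" criterion.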

Files NOT to touch:

  • Existing webhook handlers
  • Order model migrations

Acceptance Criteria

  • basketball_api_pending_orders_with_expired_session gauge exposed at /metrics
  • Gauge includes at minimum a category label (tournament, monthly, jersey)
  • Gauge refreshes at least every 5 minutes
  • Refresh is resilient: Stripe API failures log a warning, do not crash the service, leave the gauge at its last value
  • Local run: current production state surfaces 18 (tournament) + 6 (monthly) = 24 or close
  • Integration test with mocked Stripe client asserts the gauge increments for expired sessions

Test Expectations

  • Unit test: gauge refresh function with mocked DB + Stripe client — 2 pending orders, 1 expired → gauge=1
  • Unit test: Stripe API failure mid-refresh logs warning, does not crash
  • Run command: pytest tests/test_metrics_expired_sessions.py -v
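
The two unit tests above could look roughly like the sketch below. It is illustrative only: `refresh_gauge` is a hypothetical, unlabeled stand-in for the function under test, the session IDs are made up, and both the gauge and the Stripe lookup are `unittest.mock` objects rather than the project's real fixtures.

```python
import logging
from unittest import mock

logger = logging.getLogger("expired_sessions_sketch")


def refresh_gauge(gauge, session_ids, retrieve_session):
    """Hypothetical function under test: set the gauge to the expired-session count."""
    try:
        expired = sum(
            1
            for session_id in session_ids
            if retrieve_session(session_id).status == "expired"
        )
    except Exception as exc:  # assumption: real code would catch Stripe's error class
        logger.warning("refresh failed, keeping last value: %s", exc)
        return
    gauge.set(expired)


def test_two_pending_one_expired_sets_gauge_to_one():
    gauge = mock.Mock()
    retrieve = mock.Mock(
        side_effect=[mock.Mock(status="expired"), mock.Mock(status="open")]
    )
    refresh_gauge(gauge, ["cs_test_a", "cs_test_b"], retrieve)
    gauge.set.assert_called_once_with(1)


def test_stripe_failure_logs_warning_and_keeps_last_value():
    gauge = mock.Mock()
    # The first lookup succeeds, the second fails mid-refresh.
    retrieve = mock.Mock(
        side_effect=[mock.Mock(status="expired"), ConnectionError("stripe unreachable")]
    )
    with mock.patch.object(logger, "warning") as warn:
        refresh_gauge(gauge, ["cs_test_a", "cs_test_b"], retrieve)
    warn.assert_called_once()
    gauge.set.assert_not_called()
```

Both tests take no fixtures, so pytest can collect them as written.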

Constraints

  • Do NOT add a new DB table unless truly necessary. Prefer in-memory refresh.
  • Prefer Gauge over Counter (count can go down as parents pay)
  • Match existing metrics style in basketball-api (module path, naming prefix)
  • Follow feedback_yaml_parse_validation.md — no YAML touched here, but any ServiceMonitor/PodMonitor CRD changes must be parse-validated

Checklist

  • PR opened
  • Tests pass in CI
  • Gauge visible at /metrics on deployed pod
  • Logged one clean refresh cycle in Kubernetes logs

Related

  • project-pal-e-platform
  • Blocks: Ticket C2 (alert rule consuming this metric)
  • Blocked by: none
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-1024-2026-04-17

Ticket is structurally complete (all Feature-template sections present, AC verifiable, blocks pal-e-platform#295 which is open) but over-specifies implementation and has stale file targets. Per Lucas's 2026-04-17 "tickets are not solution specs" policy, the body should shrink toward User Story + Context + AC.

Blocking refinements ([BODY]):

  • src/basketball_api/metrics.py does not exist. Live metrics are module-local in routes/health.py (basketball_api_up) and routes/webhooks.py (webhook_errors_total etc.). Replace the metrics.py target with a pointer to the existing pattern.
  • "Preferred: APScheduler job every 5 minutes" — APScheduler is not a current dependency (grep confirms zero matches across the repo). The existing FastAPI lifespan in src/basketball_api/main.py:38 is the right home. Flip the preference: asyncio-in-lifespan primary, no new runtime deps.
  • Remove the 4-step numbered query algorithm from File Targets — AC already expresses the outcome.
  • Remove the contradictory "consult a cached session table" branch — conflicts with the "no new DB table" Constraint.
  • Drop redundant Constraints (Prefer Gauge over Counter, Match existing metrics style) — AC wording + the File Targets pointer cover them.
  • Add AC: metric name basketball_api_pending_orders_with_expired_session and label category (tournament/monthly/jersey) must match pal-e-platform#295 verbatim — cross-repo contract.
  • Add Context note: AC-5's "18 + 6 = 24" snapshot is valid against current main; re-measure after sibling #488 (30-day TTL) deploys, since #488 will drain this gauge for new orders.
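
A rough sketch of the asyncio-in-lifespan preference, using only the standard library. `refresh_loop` and `make_lifespan` are hypothetical names, and the real change would extend the existing lifespan in `src/basketball_api/main.py` rather than replace it. The refresh callable is assumed to be blocking (a synchronous Stripe client), so it runs in a worker thread.

```python
import asyncio
import contextlib
import logging

logger = logging.getLogger(__name__)

REFRESH_SECONDS = 300  # "at least every 5 minutes" per the acceptance criteria


async def refresh_loop(refresh, interval=REFRESH_SECONDS):
    """Call refresh() repeatedly; a failed cycle is logged and the loop continues."""
    while True:
        try:
            # Run the blocking refresh off the event loop thread.
            await asyncio.to_thread(refresh)
        except Exception:
            logger.warning("gauge refresh failed", exc_info=True)
        await asyncio.sleep(interval)


def make_lifespan(refresh, interval=REFRESH_SECONDS):
    """Build a lifespan context manager that owns the background refresh task."""

    @contextlib.asynccontextmanager
    async def lifespan(app):
        task = asyncio.create_task(refresh_loop(refresh, interval))
        try:
            yield
        finally:
            # Stop the loop on shutdown and wait for the cancellation to land.
            task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await task

    return lifespan


# Usage, assuming FastAPI is installed and my_refresh is the refresh callable:
#   app = FastAPI(lifespan=make_lifespan(my_refresh))
```

This adds no runtime dependency, which is the point of flipping the preference away from APScheduler.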

Blocking label fix ([LABEL]):

  • Retag board item #1024: story:observability → story:WS-S11, arch:payment-pipeline → arch:dataflow-westside-basketball. Matches the reconciliation applied to sibling #488 (board item #1023) and #486 (board item #1022) earlier today. Current labels are off-taxonomy — project-westside-basketball uses the WS-S{N} scheme, and no arch-payment-pipeline note exists in pal-e-docs.

Follow-up ([SCOPE], non-blocking):

  • Complete payment-pipeline story-cluster reconciliation: sibling #489 still carries story:payment-reliability + arch:stripe-checkout. One more board-item retag to close the loop.

Decomposition: not needed — 1-2 modules + 1 test file, ~4-5 min of agent work, within the 5-minute rule. Single Dev pass after refinement.

See review-1024-2026-04-17 in pal-e-docs for the full audit.
