Emit pending_orders_with_expired_session gauge metric #487
Type
Feature
Lineage
Standalone — discovered 2026-04-17 during Utah Invitational stranded-orders investigation. First observability gap we want closed so we never learn about expired links from an angry parent again.
Repo
forgejo_admin/basketball-api
User Story
As the Ava main session, I see a Prometheus metric that counts Orders in pending status whose Stripe Checkout Session has status=expired, so an alert rule in pal-e-platform can page before parents complain.
Context
Current observability (branch 290-payment-pipeline-observability in pal-e-platform) added WebhookErrorRate and WebhookStale alerts, plus a basketball-api golden-signals dashboard. Neither would have caught the Utah Invitational incident: no webhook fires on a dead link, and WebhookStale only alerts on cluster-wide silence during business hours.
We need a direct signal: "there exist N pending orders whose checkout sessions are dead." That metric combined with an alert (Ticket C2) closes the gap.
File Targets
src/basketball_api/metrics.py (or wherever Prometheus metrics are currently registered; confirm the pattern by reading the existing basketball_api_up, webhook_errors_total, etc.): register new Gauge basketball_api_pending_orders_with_expired_session
Order rows where status='pending' AND stripe_checkout_session_id IS NOT NULL
stripe.checkout.Session.retrieve(session_id) OR consult a cached session table
session.status == 'expired'
product.category if cheap
Optimization: direct Stripe retrieve per order is 18+ API calls every 5 min for our current volume, which is acceptable. Revisit if it grows.
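The query-and-filter steps listed under File Targets can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the repo's code: PendingOrder, CATEGORIES, count_expired_by_category, and the get_session_status callable are hypothetical names introduced for illustration. In practice the callable would wrap stripe.checkout.Session.retrieve(session_id) or a cached session table, and the resulting per-category counts would be set on the labeled gauge.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional

# Categories named in the Acceptance Criteria. Reporting zeros explicitly lets
# the gauge fall back to 0 once stranded orders are resolved.
CATEGORIES = ("tournament", "monthly", "jersey")


@dataclass
class PendingOrder:
    """Hypothetical stand-in for an Order row (not the real ORM model)."""
    status: str
    stripe_checkout_session_id: Optional[str]
    category: str


def count_expired_by_category(
    orders: Iterable[PendingOrder],
    get_session_status: Callable[[str], str],
) -> Dict[str, int]:
    """Count pending orders whose checkout session status is 'expired'."""
    counts = Counter({category: 0 for category in CATEGORIES})
    for order in orders:
        # Mirrors: status='pending' AND stripe_checkout_session_id IS NOT NULL
        if order.status != "pending" or order.stripe_checkout_session_id is None:
            continue
        if get_session_status(order.stripe_checkout_session_id) == "expired":
            counts[order.category] += 1
    return dict(counts)


# Usage with an in-memory lookup standing in for the Stripe call.
orders = [
    PendingOrder("pending", "cs_a", "tournament"),
    PendingOrder("pending", "cs_b", "jersey"),
    PendingOrder("paid", "cs_c", "monthly"),
    PendingOrder("pending", None, "monthly"),
]
statuses = {"cs_a": "expired", "cs_b": "open", "cs_c": "expired"}
result = count_expired_by_category(orders, statuses.__getitem__)
```

Keeping the status lookup behind a callable means tests can run without network access, and the choice between a direct Stripe retrieve and a cached table stays open.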
Files NOT to touch:
Acceptance Criteria
basketball_api_pending_orders_with_expired_session gauge exposed at /metrics
category label (tournament, monthly, jersey)
Test Expectations
pytest tests/test_metrics_expired_sessions.py -v
Constraints
feedback_yaml_parse_validation.md: no YAML touched here, but any ServiceMonitor/PodMonitor CRD changes must be parse-validated
Checklist
/metrics on deployed pod
Related
project-pal-e-platform
Scope Review: NEEDS_REFINEMENT
Review note: review-1024-2026-04-17
Ticket is structurally complete (all Feature-template sections present, AC verifiable, blocks pal-e-platform#295 which is open) but over-specifies implementation and has stale file targets. Per Lucas's 2026-04-17 "tickets are not solution specs" policy, the body should shrink toward User Story + Context + AC.
Blocking refinements ([BODY]):
src/basketball_api/metrics.py does not exist. Live metrics are module-local in routes/health.py (basketball_api_up) and routes/webhooks.py (webhook_errors_total etc.). Replace the metrics.py target with a pointer to the existing pattern.
grep confirms zero matches across the repo). The existing FastAPI lifespan in src/basketball_api/main.py:38 is the right home. Flip the preference: asyncio-in-lifespan primary, no new runtime deps.
Prefer Gauge over Counter, Match existing metrics style): AC wording + the File Targets pointer cover them.
basketball_api_pending_orders_with_expired_session and label category (tournament/monthly/jersey) must match pal-e-platform#295 verbatim (cross-repo contract).
Blocking label fix ([LABEL]):
story:observability → story:WS-S11, arch:payment-pipeline → arch:dataflow-westside-basketball. Matches the reconciliation applied to sibling #488 (board item #1023) and #486 (board item #1022) earlier today. Current labels are off-taxonomy: project-westside-basketball uses the WS-S{N} scheme, and no arch-payment-pipeline note exists in pal-e-docs.
Follow-up ([SCOPE], non-blocking):
story:payment-reliability + arch:stripe-checkout. One more board-item retag to close the loop.
Decomposition: not needed. 1-2 modules + 1 test file, ~4-5 min of agent work, within the 5-minute rule. Single Dev pass after refinement.
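For the asyncio-in-lifespan preference noted in the review, one possible shape is sketched below using only the standard library. It is an illustration under assumptions, not the code at src/basketball_api/main.py:38: make_lifespan, refresh_forever, and the refresh callable are hypothetical names, and the 300-second default simply reflects the "every 5 min" cadence mentioned in the ticket. The returned function has the (app) signature that FastAPI's lifespan parameter expects; in the real service the task start and cancel would presumably be folded into the existing lifespan.

```python
import asyncio
import logging
from contextlib import asynccontextmanager, suppress
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


async def refresh_forever(
    refresh: Callable[[], Awaitable[None]], interval_s: float
) -> None:
    """Call refresh() repeatedly, sleeping interval_s between calls."""
    while True:
        try:
            await refresh()
        except Exception:
            # A failed refresh (e.g. an upstream API error) is logged and
            # retried on the next tick instead of killing the loop.
            logger.exception("gauge refresh failed")
        await asyncio.sleep(interval_s)


def make_lifespan(refresh: Callable[[], Awaitable[None]], interval_s: float = 300.0):
    """Build a lifespan context manager that runs the refresh loop."""

    @asynccontextmanager
    async def lifespan(app):
        task = asyncio.create_task(refresh_forever(refresh, interval_s))
        try:
            yield
        finally:
            # Stop the background loop on shutdown and wait for it to exit.
            task.cancel()
            with suppress(asyncio.CancelledError):
                await task

    return lifespan


async def _demo() -> int:
    calls = 0

    async def fake_refresh() -> None:
        nonlocal calls
        calls += 1

    lifespan = make_lifespan(fake_refresh, interval_s=0.01)
    async with lifespan(app=None):
        await asyncio.sleep(0.05)
    return calls


demo_calls = asyncio.run(_demo())
```

This keeps the refresh off the request path and adds no runtime dependency, which is the trade-off the review asks for.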
See review-1024-2026-04-17 in pal-e-docs for the full audit.