WebhookStale: add "first webhook of day" precondition for low-traffic mornings #330
Reference
ldraney/pal-e-platform#330
Type
Feature
Lineage
Standalone — discovered 2026-05-02 during PR #329 review-fix loop.
Repo
forgejo_admin/pal-e-platform
User Story
As an oncall engineer, I want WebhookStale to fire only when there's evidence Stripe stopped delivering webhooks (vs. simply no checkout activity yet that day), so that low-traffic mornings don't generate false-positive alerts that train me to ignore real outages.
Context
PR #329 added WebhookStale. The first iteration had two off-by-one bugs (fixed in 6501a2d) plus a fundamental design wrinkle:
The rule fires when time() - webhook_last_received_timestamp > 1800s AND business hours. On a low-traffic day (no checkout for hours overnight), staleness at 16:00 UTC (start of business hours) is automatically >30min. The for: 60m cushion in 6501a2d mitigates this for active days, but doesn't fix it for days with literally no morning checkouts.
Proper fix: only fire if a webhook has been seen recently AND there's a 30-min gap. That distinguishes "Stripe is sending events but we missed one" (real fire) from "no one's bought anything today yet" (expected).
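To make the distinction above concrete, here is a minimal Python model of the two conditions. It is only a sketch of the decision logic, not the PromQL that would go into the rule, and it ignores the for: 60m clause. The function names, the in_business_hours flag, and the choice to express times as seconds since midnight UTC are all assumptions made for illustration; "seen recently" is interpreted here as "seen since business hours opened today", which is one possible reading of the title.

```python
STALE_AFTER_S = 1800  # 30-minute gap threshold from the existing rule


def current_rule(now: float, last_received: float, in_business_hours: bool) -> bool:
    """Existing condition: gap longer than 30 min AND business hours."""
    return in_business_hours and (now - last_received) > STALE_AFTER_S


def first_webhook_rule(now: float, last_received: float,
                       in_business_hours: bool, business_day_start: float) -> bool:
    """Candidate (hypothetical): also require a webhook since business hours opened."""
    seen_today = last_received >= business_day_start
    return in_business_hours and seen_today and (now - last_received) > STALE_AFTER_S


# Times are seconds since midnight UTC, purely for readability.
OPEN = 16 * 3600  # 16:00 UTC, start of business hours per the issue

# Quiet morning: last webhook at 06:00, rule evaluated at 16:05.
quiet = dict(now=OPEN + 300, last_received=6 * 3600, in_business_hours=True)
# Real gap: a webhook arrived at 16:10, nothing since, rule evaluated at 16:50.
gap = dict(now=OPEN + 3000, last_received=OPEN + 600, in_business_hours=True)

print(current_rule(**quiet), first_webhook_rule(**quiet, business_day_start=OPEN))
print(current_rule(**gap), first_webhook_rule(**gap, business_day_start=OPEN))
```

In this model the quiet-morning case fires under the current condition but not under the candidate, while the real-gap case fires under both. One trade-off worth noting: if Stripe stops delivering before the first webhook of the day arrives, the candidate stays silent.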
File Targets
Files to modify:
- terraform/modules/monitoring/main.tf: payment-pipeline-alerts PrometheusRule, WebhookStale rule (search for WebhookStale)
Files NOT to touch:
- WebhookErrorRate rule (different signal, different design)
- basketball-api-golden-signals.json dashboard (panels are correct)
Acceptance Criteria
- WebhookStale does not fire at 16:00 UTC weekdays when there are simply no checkouts yet that day (verified by checking PromQL evaluation against historical data with low-traffic mornings)
- WebhookStale STILL fires within ~30 min of Stripe genuinely failing to deliver webhooks during business hours
Test Expectations
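One way to exercise both acceptance criteria before touching the real rule is to replay recorded webhook timestamps through a candidate condition. The harness below is a hedged sketch: replay, candidate, and the sample timestamps are hypothetical, times are seconds since midnight UTC, and the for: clause is not modelled. Real verification would still need the PromQL evaluated against historical data, as the criteria state.

```python
from bisect import bisect_right
from typing import Callable, Iterable, List, Sequence

# A rule takes (now, last_received) and says whether the condition holds.
Rule = Callable[[float, float], bool]


def replay(rule: Rule, webhook_ts: Sequence[float],
           eval_ts: Iterable[float]) -> List[float]:
    """Return the evaluation times at which `rule` would be true.

    webhook_ts must be sorted ascending.
    """
    firing = []
    for now in eval_ts:
        idx = bisect_right(webhook_ts, now)  # count of webhooks at or before `now`
        if idx == 0:
            continue  # no webhook seen yet, nothing to compare against
        if rule(now, webhook_ts[idx - 1]):
            firing.append(now)
    return firing


OPEN = 16 * 3600  # 16:00 UTC, start of business hours per the issue


def candidate(now: float, last_received: float) -> bool:
    """Hypothetical rule: a webhook was seen since OPEN and the gap exceeds 30 min."""
    return last_received >= OPEN and (now - last_received) > 1800


evals = range(OPEN, OPEN + 7201, 600)            # every 10 min for two hours
quiet_morning = [6 * 3600]                       # last webhook at 06:00
outage = [6 * 3600, OPEN + 400, OPEN + 1000]     # deliveries stop after 16:16:40

print(replay(candidate, quiet_morning, evals))
print(replay(candidate, outage, evals))
```

With these sample inputs the quiet morning produces no firing evaluations, and the outage case first evaluates true at OPEN + 3000, which is 2000 seconds after the last delivered webhook.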
Constraints
- webhook_received_total / webhook_processed_total / webhook_errors_total / webhook_last_received_timestamp consumers
- (time() - webhook_last_received_timestamp{...} < 86400) as a precondition, but verify this doesn't make the rule silent during a sustained 24h+ outage
Checklist
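The concern about the 86400-second precondition can be checked with simple arithmetic. The sketch below models the combined condition as an age band (older than 30 minutes, newer than 24 hours) and sweeps a few ages; in_band and its parameter names are assumptions for illustration, and business hours and the for: clause are left out.

```python
def in_band(age_s: float, stale_after_s: float = 1800,
            max_age_s: float = 86400) -> bool:
    """True when the last webhook is older than the gap but newer than the cap."""
    return stale_after_s < age_s < max_age_s


for hours in (0.25, 1, 10, 23, 25, 48):
    print(f"{hours:>5}h since last webhook -> condition true: {in_band(hours * 3600)}")
```

Under this model the condition stops being true once the last webhook is more than 24 hours old, so an alert that relies on it alone would go quiet during a sustained 24h+ outage. A 10-hour overnight gap also still falls inside the band, so the 86400 cap by itself would not suppress the quiet-morning case.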
Related
- pal-e-platform: project
- forgejo_admin/pal-e-platform #290: origin of the rule
- forgejo_admin/pal-e-platform #329: PR where the rollover was first observed
- alert-report-2026-05-01: alert snapshot