WebhookStale: add "first webhook of day" precondition for low-traffic mornings #330

Open
opened 2026-05-02 15:00:05 +00:00 by forgejo_admin · 0 comments
Contributor

Type

Feature

Lineage

Standalone — discovered 2026-05-02 during PR #329 review-fix loop.

Repo

forgejo_admin/pal-e-platform

User Story

As an oncall engineer, I want WebhookStale to fire only when there's evidence Stripe stopped delivering webhooks (vs. simply no checkout activity yet that day), so that low-traffic mornings don't generate false-positive alerts that train me to ignore real outages.

Context

PR #329 added WebhookStale. The first iteration had two off-by-one bugs (fixed in 6501a2d) plus a fundamental design wrinkle:

The rule fires when time() - webhook_last_received_timestamp > 1800 (seconds) and the evaluation falls within business hours. After a quiet night with no checkouts for hours, staleness at 16:00 UTC (the start of business hours) already exceeds 30 minutes. The for: 60m cushion added in 6501a2d covers days where a checkout arrives within the first hour, but not days with no morning checkouts at all.

Proper fix: only fire if a webhook has been seen recently AND there's a 30-min gap. That distinguishes "Stripe is sending events but we missed one" (real fire) from "no one's bought anything today yet" (expected).
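
The false positive described above can be reproduced with a small timeline simulation. This is a rough sketch, not the rule itself: the one-minute evaluation interval, the simulation horizon, the timestamps, and the helper name first_firing_time are all assumptions made for illustration. Only the 1800 s threshold, the 60-minute for: duration, and the 16:00 UTC opening time come from the issue.

```python
STALE_AFTER_S = 1800   # staleness threshold from the rule
FOR_S = 3600           # the `for: 60m` pending duration
STEP_S = 60            # assumed evaluation interval
HOUR = 3600


def first_firing_time(webhook_times, start, end):
    """Return the first evaluation time (seconds) at which the alert fires, else None."""
    pending_since = None
    for now in range(start, end + 1, STEP_S):
        seen = [t for t in webhook_times if t <= now]
        stale = bool(seen) and (now - max(seen)) > STALE_AFTER_S
        if not stale:
            pending_since = None      # condition cleared, pending state resets
            continue
        if pending_since is None:
            pending_since = now       # condition just became true
        if now - pending_since >= FOR_S:
            return now                # held for the full `for` duration
    return None


OPEN = 16 * HOUR        # 16:00 UTC, start of business hours per the issue
HORIZON = 20 * HOUR     # simulation cut-off only, not the real closing time
LAST_NIGHT = -2 * HOUR  # hypothetical last webhook at 22:00 UTC the previous day

# Quiet day: nothing until a first checkout at 19:00 UTC (hypothetical).
quiet_day = [LAST_NIGHT, 19 * HOUR]
# Active day: one webhook every 20 minutes from 16:20 UTC onward (hypothetical).
active_day = [LAST_NIGHT] + list(range(OPEN + 1200, HORIZON + 1, 1200))

quiet_result = first_firing_time(quiet_day, OPEN, HORIZON)
active_result = first_firing_time(active_day, OPEN, HORIZON)
print(quiet_result, active_result)
```

Under these assumptions the quiet day fires at 17:00 UTC even though nothing is broken, while the active day never fires because the first webhook resets the pending state before the hour is up.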

File Targets

Files to modify:

  • terraform/modules/monitoring/main.tf: payment-pipeline-alerts PrometheusRule, WebhookStale rule (search for WebhookStale)

Files NOT to touch:

  • WebhookErrorRate rule (different signal, different design)
  • basketball-api-golden-signals.json dashboard (panels are correct)

Acceptance Criteria

  • WebhookStale does not fire at 16:00 UTC weekdays when there are simply no checkouts yet that day (verified by checking PromQL evaluation against historical data with low-traffic mornings)
  • WebhookStale STILL fires within ~30 min of Stripe genuinely failing to deliver webhooks during business hours
  • Solution preserves the day-of-week + UTC-midnight handling already in place
  • Comment on the rule explaining the precondition logic

Test Expectations

  • In PR description: include the PromQL evaluation showing the rule's behavior across a day with: (a) genuine outage, (b) low-traffic morning with first checkout at noon, (c) high-traffic morning, (d) outside business hours
  • After merge: monitor for one full week and confirm no daily false-positives at 16:00 UTC
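
The four scenarios listed above could be checked with a small harness before writing the PromQL. This sketch assumes one possible reading of the "first webhook of day" precondition, namely that the last webhook must have arrived at or after the start of today's business hours. That reading, the Sample type, the fires function, and every timestamp are hypothetical. The in_hours flag stands in for the existing day-of-week and UTC-midnight logic, which is not modelled here.

```python
from dataclasses import dataclass

STALE_AFTER_S = 1800  # 30-minute gap from the rule
H = 3600
OPEN = 16 * H         # 16:00 UTC, start of business hours per the issue


@dataclass
class Sample:
    now: int            # evaluation time, seconds since 00:00 UTC
    last_received: int  # value of webhook_last_received_timestamp
    in_hours: bool      # outcome of the existing business-hours check


def fires(s: Sample) -> bool:
    seen_today = s.last_received >= OPEN                  # hypothetical precondition
    stale = (s.now - s.last_received) > STALE_AFTER_S     # existing staleness test
    return s.in_hours and seen_today and stale


scenarios = {
    # Webhooks flowed until 17:00 UTC, then stopped; evaluated at 18:00 UTC.
    "(a) genuine outage": Sample(OPEN + 2 * H, OPEN + H, True),
    # Last webhook was 18 h before opening; evaluated before the first checkout.
    "(b) quiet morning, before first checkout": Sample(OPEN + 2 * H, OPEN - 18 * H, True),
    # A webhook arrived five minutes before evaluation.
    "(c) high-traffic morning": Sample(OPEN + 2 * H, OPEN + 2 * H - 300, True),
    # Evaluated at 14:00 UTC, before opening.
    "(d) outside business hours": Sample(OPEN - 2 * H, OPEN - 6 * H, False),
}

results = {name: fires(s) for name, s in scenarios.items()}
for name, fired in results.items():
    print(f"{name}: {'fires' if fired else 'silent'}")
```

With this reading only scenario (a) fires. One trade-off worth checking in the PR: a precondition of this shape cannot detect an outage that begins before the day's first webhook arrives.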

Constraints

  • PromQL must be evaluatable in a single rule expression (don't require a recording rule unless necessary)
  • Don't break existing webhook_received_total/webhook_processed_total/webhook_errors_total/webhook_last_received_timestamp consumers
  • Possible approach: add (time() - webhook_last_received_timestamp{...} < 86400) as a precondition, but verify this doesn't make the rule silent during a sustained 24h+ outage
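
The 86400-second window in the last bullet can be sanity-checked with plain arithmetic before touching the rule. The sketch below is an assumption-laden illustration: the function name eligible and the sample ages are hypothetical, and business hours and the for: duration are deliberately left out so that only the two bounds on staleness are visible.

```python
STALE_AFTER_S = 1800     # existing 30-minute threshold
RECENT_WITHIN_S = 86400  # proposed 24-hour precondition
HOUR = 3600


def eligible(age_s: int) -> bool:
    """True when the staleness test and the proposed precondition both hold."""
    return STALE_AFTER_S < age_s < RECENT_WITHIN_S


# Hypothetical ages of the last received webhook at evaluation time.
ages = {
    "10 min since last webhook": 10 * 60,
    "45 min gap": 45 * 60,
    "18 h overnight gap": 18 * HOUR,
    "25 h sustained outage": 25 * HOUR,
}

verdicts = {label: eligible(age) for label, age in ages.items()}
for label, ok in verdicts.items():
    print(f"{label}: {'eligible to fire' if ok else 'suppressed'}")
```

Two things stand out under these assumptions. A 25-hour outage is suppressed, which is the silence the bullet warns about. An 18-hour overnight gap is still eligible, so a 24-hour window on its own may not remove the 16:00 UTC false positive; a tighter window or a different precondition may be needed.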

Checklist

  • PR opened
  • tofu validate + fmt clean
  • No unrelated changes

Related

  • pal-e-platform — project
  • forgejo_admin/pal-e-platform #290 — origin of the rule
  • forgejo_admin/pal-e-platform #329 — PR where the rollover was first observed
  • alert-report-2026-05-01 — alert snapshot