Reduce alert noise + add payment pipeline observability #290
Reference
ldraney/pal-e-platform#290
Type
Feature
Lineage
Standalone: discovered during investigation of a 100% checkout failure rate on basketball-api /checkout/first-payment (Apr 13). 31 alerts were firing, with zero signal about the revenue-critical failure.
Repo
forgejo_admin/pal-e-platform
User Story
As a platform operator
I want alerts that surface revenue-critical failures and suppress known noise
So that a broken payment pipeline triggers a notification instead of hiding in 31 firing alerts
Context
The Westside monthly payment checkout has been returning 409 to every parent since at least 18:08 Apr 13. No alert fired because:
Basketball-api currently exposes only: basketball_api_up, webhook_received_total, webhook_processed_total, webhook_errors_total, webhook_last_received_timestamp. There are no per-endpoint HTTP metrics; adding them requires a separate basketball-api ticket to add Prometheus middleware.
File Targets
Files to modify:
terraform/modules/monitoring/main.tf: remove/inhibit noisy blackbox targets (westside-dev, pal-e-app, mac-agent), add payment webhook alert rules
terraform/dashboards/basketball-api-golden-signals.json: new file, Grafana dashboard for webhook metrics and API uptime
Files to NOT touch:
terraform/dashboards/pal-e-app-golden-signals.json: unrelated app
terraform/dashboards/dora-dashboard.json: unrelated
Acceptance Criteria
When webhook_errors_total increases for 5 minutes, a warning alert fires
When webhook_last_received_timestamp is stale for 30+ minutes during business hours, a warning alert fires
tofu plan shows only the expected changes
Test Expectations
tofu validate passes
tofu plan -lock=false shows expected resource changes (PrometheusRule updates, ConfigMap additions, blackbox target removals)
kubectl get prometheusrules -n monitoring shows updated rules
Constraints
tofu plan -lock=false (state lock blocks CI)
/metrics)
Checklist
tofu validate passes
tofu plan -lock=false reviewed
Related
pal-e-platform: project this affects
prometheus-fastapi-instrumentator for per-endpoint HTTP metrics (prerequisite for CheckoutErrorRate alert)
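Implementation sketch
The two webhook alerts from the acceptance criteria might look roughly like the following in terraform/modules/monitoring/main.tf. This is a minimal sketch, not the module's actual layout. The resource name, rule name, alert names, and annotations are hypothetical. It assumes the rules are managed as a PrometheusRule through the kubernetes provider's kubernetes_manifest resource, that webhook_last_received_timestamp holds Unix seconds, and that "business hours" means 14:00 to 23:00 UTC (a placeholder window to adjust).

```hcl
# Hypothetical resource; the existing module may define rules another way
# (for example through Helm chart values), in which case only the expr
# strings below carry over.
resource "kubernetes_manifest" "payment_webhook_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "payment-webhook-alerts" # assumed name
      namespace = "monitoring"
    }
    spec = {
      groups = [{
        name = "payment-webhook" # assumed group name
        rules = [
          {
            # Fires when the error counter has kept increasing for 5 minutes.
            alert       = "PaymentWebhookErrorsIncreasing" # assumed alert name
            expr        = "increase(webhook_errors_total[5m]) > 0"
            "for"       = "5m" # key quoted because `for` is an HCL keyword
            labels      = { severity = "warning" }
            annotations = { summary = "Payment webhook errors are increasing" }
          },
          {
            # Fires when no webhook has arrived for 30+ minutes (1800 s),
            # but only inside the assumed business-hours window.
            # hour() returns the current hour of the day in UTC.
            alert       = "PaymentWebhookStale" # assumed alert name
            expr        = "(time() - webhook_last_received_timestamp > 1800) and on() (hour() >= 14 and hour() < 23)"
            "for"       = "1m"
            labels      = { severity = "warning" }
            annotations = { summary = "No payment webhook received for 30+ minutes" }
          }
        ]
      }]
    }
  }
}
```

If the rules are defined this way, kubectl get prometheusrules -n monitoring should list payment-webhook-alerts after apply. The staleness threshold lives in the expression rather than in "for", so the alert reflects time since the last webhook rather than time since the rule first matched.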