observability: reduce alert noise + add payment pipeline signals #291
No reviewers
Labels
No labels
domain:backend
domain:devops
domain:frontend
status:approved
status:in-progress
status:needs-fix
status:qa
type:bug
type:devops
type:feature
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
ldraney/pal-e-platform!291
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "290-payment-pipeline-observability"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Addresses the 31-firing-alerts fatigue that hid the Apr 13 checkout
outage. Removes expected-down blackbox targets, silences the noisiest
kube-state-metrics default rules, and adds revenue-critical Stripe
webhook alerts plus a basketball-api golden signals dashboard using
only the metrics that service currently exposes.
Changes
terraform/modules/monitoring/main.tfwestside-dev,pal-e-app,mac-agent(expected-down; were generating critical
EndpointDownalertsindistinguishable from real outages). Mac-agent uptime still
covered by the existing
MacAgentDownrule onup{job="mac-node-exporter"}.EndpointDowncritical->warning, raiseforfrom 2m -> 5m to suppress brief flaps.
kube-prometheus-stackdefaultRules.kubeStateMetricsand
kubernetesApps(the duplicate cross-namespace noise). OurPodRestartStorm,OOMKilled, andTargetDownrules carry thereal signal.
inhibit_rulesso a firingcriticalsuppressesits
warningtwin on the samealertname+namespace.PrometheusRulepayment-pipeline-alerts:WebhookErrorRate(warning, 5m):increase(webhook_errors_total[5m]) > 0WebhookStale(warning, 10m):time() - webhook_last_received_timestamp{event_type="checkout.session.completed"} > 1800gated on business hours (9am-9pm MST weekdays) via
hour() >= 16 or hour() < 4andday_of_week()1-5.kubernetes_config_map_v1.basketball_api_dashboardwiringthe dashboard JSON (mirrors the
pal-e-app-golden-signalspattern).terraform/dashboards/basketball-api-golden-signals.json(new)5m error count, received/processed rates by event_type, error
rate, and error ratio. Uses only the currently-exposed metrics
(
basketball_api_up,webhook_received_total,webhook_processed_total,webhook_errors_total,webhook_last_received_timestamp).tofu plan Output
tofu validatepasses.tofu plan -lock=falsesummary (trimmed tomodule.monitoring + pre-existing drift from previously merged PRs):
All monitoring changes match what the ticket scoped. The two
database/opsentries are unrelated drift from earlier mergedPRs, not touched by this branch.
Test Plan
tofu fmtcleantofu validatepassestofu plan -lock=falseshows only expected additions/updateskubectl get prometheusrules -n monitoring payment-pipeline-alertsbasketball_api_uppanel goes redEndpointDowncritical alerts for westside-dev / pal-e-app / mac-agentReview Checklist
tofu fmtruntofu validatepassestofu plan -lock=falsereviewed, only expected changes in module.monitoringpal-e-app-golden-signals.jsonwiring patternembedding_alerts/blackbox_alertspattern290-payment-pipeline-observability)Closes #290in PR bodyRelated Notes
kubernetes_config_map_v1.pal_e_production_dashboardand
kubernetes_manifest.embedding_alertsresources in the same moduleprometheus-fastapi-instrumentatorfor per-endpoint HTTP metrics(prerequisite for a future
CheckoutErrorRatealert)$0monthly_fee from blastAddresses the 31-firing-alerts fatigue identified during the Apr 13 checkout outage by removing expected-down blackbox targets, silencing kube-state-metrics default duplicates, and adding revenue-critical Stripe webhook alerts + a basketball-api golden signals dashboard. Changes: - Remove blackbox targets westside-dev, pal-e-app, mac-agent (all expected-down; generated critical EndpointDown alerts indistinguishable from real outages). Mac-agent uptime still covered by MacAgentDown. - Downgrade EndpointDown critical -> warning, raise "for" 2m -> 5m to suppress flaps. - Disable kube-prometheus-stack defaultRules.kubeStateMetrics and kubernetesApps (duplicate namespace noise). Our PodRestartStorm, OOMKilled, TargetDown rules provide the real signal. - Add alertmanager inhibit rule so critical alerts suppress their warning counterparts on the same alertname+namespace. - New PrometheusRule payment-pipeline-alerts with: - WebhookErrorRate (warning, 5m) on increase(webhook_errors_total[5m]) - WebhookStale (warning, 10m) when checkout.session.completed timestamp is >30min stale during business hours (9am-9pm MST weekdays) - New basketball-api-golden-signals dashboard wired via ConfigMap, following the pal-e-app-golden-signals pattern. Uses only metrics basketball-api currently exposes (basketball_api_up, webhook_*). tofu validate passes. tofu plan -lock=false shows only the expected resource additions/updates under module.monitoring plus pre-existing drift from previously-merged PRs (database/ops). Closes #290 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>PR #291 Review
DOMAIN REVIEW
Stack: OpenTofu/Terraform + Helm (kube-prometheus-stack) + Alertmanager + PrometheusRule CRD + Grafana dashboard JSON.
Blackbox target removal (verified safe):
westside-dev,pal-e-app,mac-agentremoved. Remaining probed targets:minio,basketball-api,westside-app,platform-validation,playme2k— all real production services retained.mac-agentuptime still covered byMacAgentDownrule atterraform/modules/monitoring/main.tf:222(up{job="mac-node-exporter"} == 0). Confirmed — no gap.pal-e-appuptime: note that removing blackbox for pal-e-app means the only remaining liveness signal is k8sTargetDown/PodRestartStorm. Acceptable given the redirect 3xx was the noise source, but worth flagging as a tradeoff (pal-e-app has no dedicated "down" alert equivalent to MacAgentDown).kube-prometheus-stack defaults:
defaultRules.rules.kubeStateMetrics = falseandkubernetesApps = falsedisable dozens of bundled rules includingKubeDeploymentReplicasMismatch,KubePodNotReady,KubeJobFailed,KubePodCrashLooping,KubeContainerWaiting,KubeDaemonSetRolloutStuck,KubeStatefulSetReplicasMismatch, and more. The PR assertsPodRestartStorm/OOMKilled/TargetDowncover the real signal.PodRestartStorm+OOMKilledexist in this module andTargetDownis in thekubePrometheusGeneralgroup (still enabled). Acceptable, but this is a wide silence — recommend monitoring firing-alert count for 1 week post-apply to confirm no critical signal lost (e.g., a stuck rollout won't page).Alertmanager inhibit rule:
source: severity=criticalinhibitstarget: severity=warningwhenalertname+namespacematch. Scoped correctly — will NOT over-silence because it requires samealertname, so a criticalPodRestartStormwon't suppress a warningWebhookErrorRate. Safe.PrometheusRule
payment-pipeline-alerts:WebhookErrorRate:increase(webhook_errors_total[5m]) > 0for 5m. Fires on any sustained error. Reasonable for warning severity.WebhookStale: time-gate expression verified:hour() >= 16 or hour() < 4= 16:00–04:00 UTC = 9am MST – 9pm MST (MST = UTC-7). Correct.day_of_week() > 0 and day_of_week() < 6= 1..5 = Mon–Fri. Correct.and on()chaining is valid PromQL (instant-vector scalars have no labels).release: kube-prometheus-stack— correct for Prometheus operator rule selector.Dashboard JSON:
panels,templating(DS_PROMETHEUS datasource var),schemaVersion: 39,uid: basketball-api-golden-signals. Mirrors existingpal-e-app-golden-signalspattern.basketball_api_up,webhook_received_total,webhook_processed_total,webhook_errors_total,webhook_last_received_timestamp).clamp_min(..., 0.0001)to prevent divide-by-zero — good defensive PromQL.grafana_dashboard = "1"on the ConfigMap matches the sidecar discovery convention used by the existing pal-e-app dashboard. Will auto-load.BLOCKERS
None.
NITS
WebhookStalecomment says "9am-9pm MST" but MST is typically UTC-7 only half the year. If the team operates on America/Denver (observes DST), the expression is off by 1 hour in DST months. Consider either widening the window (e.g.hour() >= 15 or hour() < 5) or documenting this as Arizona-style fixed MST. Not a blocker.EndpointDowndowngrade: now warning + 5m. Combined with inhibit rule, legitimate endpoint outages on a critical-paired alertname could be masked. ButEndpointDowndoesn't have a critical twin, so inhibit won't affect it. Fine.increase(...) > 0for 5m alerts even on a single transient error. If noisy in practice, bump to> 2or use rate-based threshold. Deferred to post-apply tuning.paledocs_db_url,embedding_worker_metrics) — confirmed out of scope for this PR and called out in the body. Good discipline.SOP COMPLIANCE
290-payment-pipeline-observability(matches{issue}-{kebab-purpose})Closes #290in bodytofu fmt+tofu validateclaimed cleantofu plan -lock=falseoutput included (4 add, 3 change, 0 destroy)pal_e_production_dashboard+embedding_alertspatterns (convention compliance)PROCESS OBSERVATIONS
feedback_validate_before_done, this PR is not done untiltofu applyruns AND post-apply checks pass (PrometheusRule present, dashboard renders, alert count drops). Recommend validation ticket in the board after merge.kubernetesAppsgroup is the single biggest scope item. Monitor for 1 week; if a real incident is missed, restore selected rules individually rather than the whole group.VERDICT: APPROVED