observability: reduce alert noise + add payment pipeline signals #291

Merged
forgejo_admin merged 1 commit from 290-payment-pipeline-observability into main 2026-04-14 02:53:02 +00:00
Contributor

Summary

Addresses the 31-firing-alerts fatigue that hid the Apr 13 checkout
outage. Removes expected-down blackbox targets, silences the noisiest
kube-state-metrics default rules, and adds revenue-critical Stripe
webhook alerts plus a basketball-api golden signals dashboard using
only the metrics that service currently exposes.

Changes

  • terraform/modules/monitoring/main.tf
    • Remove blackbox targets westside-dev, pal-e-app, mac-agent
      (expected-down; were generating critical EndpointDown alerts
      indistinguishable from real outages). Mac-agent uptime still
      covered by the existing MacAgentDown rule on up{job="mac-node-exporter"}.
    • Downgrade EndpointDown critical -> warning, raise for
      from 2m -> 5m to suppress brief flaps.
    • Disable kube-prometheus-stack defaultRules.kubeStateMetrics
      and kubernetesApps (the duplicate cross-namespace noise). Our
      PodRestartStorm, OOMKilled, and TargetDown rules carry the
      real signal.
    • Add alertmanager inhibit_rules so a firing critical suppresses
      its warning twin on the same alertname+namespace.
    • New PrometheusRule payment-pipeline-alerts:
      • WebhookErrorRate (warning, 5m): increase(webhook_errors_total[5m]) > 0
      • WebhookStale (warning, 10m): time() - webhook_last_received_timestamp{event_type="checkout.session.completed"} > 1800
        gated on business hours (9am-9pm MST weekdays) via
        hour() >= 16 or hour() < 4 and day_of_week() 1-5.
    • New kubernetes_config_map_v1.basketball_api_dashboard wiring
      the dashboard JSON (mirrors the pal-e-app-golden-signals pattern).
  • terraform/dashboards/basketball-api-golden-signals.json (new)
    • API uptime stat, webhook-age stats (any + checkout.session.completed),
      5m error count, received/processed rates by event_type, error
      rate, and error ratio. Uses only the currently-exposed metrics
      (basketball_api_up, webhook_received_total, webhook_processed_total,
      webhook_errors_total, webhook_last_received_timestamp).

tofu plan Output

tofu validate passes. tofu plan -lock=false summary (trimmed to
module.monitoring + pre-existing drift from previously merged PRs):

# module.monitoring.helm_release.blackbox_exporter will be updated in-place
# module.monitoring.helm_release.kube_prometheus_stack will be updated in-place
# module.monitoring.kubernetes_config_map_v1.basketball_api_dashboard will be created
# module.monitoring.kubernetes_manifest.blackbox_alerts will be updated in-place
# module.monitoring.kubernetes_manifest.payment_pipeline_alerts will be created
# module.database.kubernetes_secret_v1.paledocs_db_url will be created       (pre-existing drift)
# module.ops.kubernetes_service_v1.embedding_worker_metrics will be created  (pre-existing drift)
Plan: 4 to add, 3 to change, 0 to destroy.

All monitoring changes match what the ticket scoped. The two
database / ops entries are unrelated drift from earlier merged
PRs, not touched by this branch.

Test Plan

  • tofu fmt clean
  • tofu validate passes
  • tofu plan -lock=false shows only expected additions/updates
  • dashboard JSON parses
  • After apply: kubectl get prometheusrules -n monitoring payment-pipeline-alerts
  • After apply: Grafana -> "basketball-api Golden Signals" renders data
  • After apply: firing alert count drops below 5 real alerts
  • After apply: kill basketball-api pod, confirm basketball_api_up panel goes red
  • After apply: verify no EndpointDown critical alerts for westside-dev / pal-e-app / mac-agent

Review Checklist

  • tofu fmt run
  • tofu validate passes
  • tofu plan -lock=false reviewed, only expected changes in module.monitoring
  • No unrelated files touched (pal-e-app-golden-signals.json, dora-dashboard.json untouched)
  • Dashboard follows existing pal-e-app-golden-signals.json wiring pattern
  • PrometheusRule follows existing embedding_alerts / blackbox_alerts pattern
  • Branch named per convention (290-payment-pipeline-observability)
  • Closes #290 in PR body
  • Forgejo issue: Closes #290
  • Pattern reference: existing kubernetes_config_map_v1.pal_e_production_dashboard
    and kubernetes_manifest.embedding_alerts resources in the same module
  • Follow-up tickets (basketball-api repo, out of scope here):
    • Add prometheus-fastapi-instrumentator for per-endpoint HTTP metrics
      (prerequisite for a future CheckoutErrorRate alert)
    • Fix stale pending-order 409 bug
    • Exclude $0 monthly_fee from blast
    • Implement recurring billing for May 1
## Summary Addresses the 31-firing-alerts fatigue that hid the Apr 13 checkout outage. Removes expected-down blackbox targets, silences the noisiest kube-state-metrics default rules, and adds revenue-critical Stripe webhook alerts plus a basketball-api golden signals dashboard using only the metrics that service currently exposes. ## Changes - `terraform/modules/monitoring/main.tf` - Remove blackbox targets `westside-dev`, `pal-e-app`, `mac-agent` (expected-down; were generating critical `EndpointDown` alerts indistinguishable from real outages). Mac-agent uptime still covered by the existing `MacAgentDown` rule on `up{job="mac-node-exporter"}`. - Downgrade `EndpointDown` `critical` -> `warning`, raise `for` from 2m -> 5m to suppress brief flaps. - Disable `kube-prometheus-stack` `defaultRules.kubeStateMetrics` and `kubernetesApps` (the duplicate cross-namespace noise). Our `PodRestartStorm`, `OOMKilled`, and `TargetDown` rules carry the real signal. - Add alertmanager `inhibit_rules` so a firing `critical` suppresses its `warning` twin on the same `alertname`+`namespace`. - New `PrometheusRule` `payment-pipeline-alerts`: - `WebhookErrorRate` (warning, 5m): `increase(webhook_errors_total[5m]) > 0` - `WebhookStale` (warning, 10m): `time() - webhook_last_received_timestamp{event_type="checkout.session.completed"} > 1800` gated on business hours (9am-9pm MST weekdays) via `hour() >= 16 or hour() < 4` and `day_of_week()` 1-5. - New `kubernetes_config_map_v1.basketball_api_dashboard` wiring the dashboard JSON (mirrors the `pal-e-app-golden-signals` pattern). - `terraform/dashboards/basketball-api-golden-signals.json` (new) - API uptime stat, webhook-age stats (any + checkout.session.completed), 5m error count, received/processed rates by event_type, error rate, and error ratio. Uses only the currently-exposed metrics (`basketball_api_up`, `webhook_received_total`, `webhook_processed_total`, `webhook_errors_total`, `webhook_last_received_timestamp`). ## tofu plan Output `tofu validate` passes. `tofu plan -lock=false` summary (trimmed to module.monitoring + pre-existing drift from previously merged PRs): ``` # module.monitoring.helm_release.blackbox_exporter will be updated in-place # module.monitoring.helm_release.kube_prometheus_stack will be updated in-place # module.monitoring.kubernetes_config_map_v1.basketball_api_dashboard will be created # module.monitoring.kubernetes_manifest.blackbox_alerts will be updated in-place # module.monitoring.kubernetes_manifest.payment_pipeline_alerts will be created # module.database.kubernetes_secret_v1.paledocs_db_url will be created (pre-existing drift) # module.ops.kubernetes_service_v1.embedding_worker_metrics will be created (pre-existing drift) Plan: 4 to add, 3 to change, 0 to destroy. ``` All monitoring changes match what the ticket scoped. The two `database` / `ops` entries are unrelated drift from earlier merged PRs, not touched by this branch. ## Test Plan - [x] `tofu fmt` clean - [x] `tofu validate` passes - [x] `tofu plan -lock=false` shows only expected additions/updates - [x] dashboard JSON parses - [ ] After apply: `kubectl get prometheusrules -n monitoring payment-pipeline-alerts` - [ ] After apply: Grafana -> "basketball-api Golden Signals" renders data - [ ] After apply: firing alert count drops below 5 real alerts - [ ] After apply: kill basketball-api pod, confirm `basketball_api_up` panel goes red - [ ] After apply: verify no `EndpointDown` critical alerts for westside-dev / pal-e-app / mac-agent ## Review Checklist - [x] `tofu fmt` run - [x] `tofu validate` passes - [x] `tofu plan -lock=false` reviewed, only expected changes in module.monitoring - [x] No unrelated files touched (pal-e-app-golden-signals.json, dora-dashboard.json untouched) - [x] Dashboard follows existing `pal-e-app-golden-signals.json` wiring pattern - [x] PrometheusRule follows existing `embedding_alerts` / `blackbox_alerts` pattern - [x] Branch named per convention (`290-payment-pipeline-observability`) - [x] `Closes #290` in PR body ## Related Notes - Forgejo issue: Closes #290 - Pattern reference: existing `kubernetes_config_map_v1.pal_e_production_dashboard` and `kubernetes_manifest.embedding_alerts` resources in the same module - Follow-up tickets (basketball-api repo, out of scope here): - Add `prometheus-fastapi-instrumentator` for per-endpoint HTTP metrics (prerequisite for a future `CheckoutErrorRate` alert) - Fix stale pending-order 409 bug - Exclude `$0` monthly_fee from blast - Implement recurring billing for May 1
observability: reduce alert noise + add payment pipeline signals
All checks were successful
ci/woodpecker/pull_request_closed/woodpecker Pipeline was successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
432e24ef2e
Addresses the 31-firing-alerts fatigue identified during the Apr 13
checkout outage by removing expected-down blackbox targets, silencing
kube-state-metrics default duplicates, and adding revenue-critical
Stripe webhook alerts + a basketball-api golden signals dashboard.

Changes:
- Remove blackbox targets westside-dev, pal-e-app, mac-agent (all
  expected-down; generated critical EndpointDown alerts indistinguishable
  from real outages). Mac-agent uptime still covered by MacAgentDown.
- Downgrade EndpointDown critical -> warning, raise "for" 2m -> 5m to
  suppress flaps.
- Disable kube-prometheus-stack defaultRules.kubeStateMetrics and
  kubernetesApps (duplicate namespace noise). Our PodRestartStorm,
  OOMKilled, TargetDown rules provide the real signal.
- Add alertmanager inhibit rule so critical alerts suppress their
  warning counterparts on the same alertname+namespace.
- New PrometheusRule payment-pipeline-alerts with:
    - WebhookErrorRate (warning, 5m) on increase(webhook_errors_total[5m])
    - WebhookStale (warning, 10m) when checkout.session.completed
      timestamp is >30min stale during business hours (9am-9pm MST weekdays)
- New basketball-api-golden-signals dashboard wired via ConfigMap,
  following the pal-e-app-golden-signals pattern. Uses only metrics
  basketball-api currently exposes (basketball_api_up, webhook_*).

tofu validate passes. tofu plan -lock=false shows only the expected
resource additions/updates under module.monitoring plus pre-existing
drift from previously-merged PRs (database/ops).

Closes #290

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
Contributor

PR #291 Review

DOMAIN REVIEW

Stack: OpenTofu/Terraform + Helm (kube-prometheus-stack) + Alertmanager + PrometheusRule CRD + Grafana dashboard JSON.

Blackbox target removal (verified safe):

  • westside-dev, pal-e-app, mac-agent removed. Remaining probed targets: minio, basketball-api, westside-app, platform-validation, playme2k — all real production services retained.
  • mac-agent uptime still covered by MacAgentDown rule at terraform/modules/monitoring/main.tf:222 (up{job="mac-node-exporter"} == 0). Confirmed — no gap.
  • pal-e-app uptime: note that removing blackbox for pal-e-app means the only remaining liveness signal is k8s TargetDown / PodRestartStorm. Acceptable given the redirect 3xx was the noise source, but worth flagging as a tradeoff (pal-e-app has no dedicated "down" alert equivalent to MacAgentDown).

kube-prometheus-stack defaults:

  • defaultRules.rules.kubeStateMetrics = false and kubernetesApps = false disable dozens of bundled rules including KubeDeploymentReplicasMismatch, KubePodNotReady, KubeJobFailed, KubePodCrashLooping, KubeContainerWaiting, KubeDaemonSetRolloutStuck, KubeStatefulSetReplicasMismatch, and more. The PR asserts PodRestartStorm/OOMKilled/TargetDown cover the real signal. PodRestartStorm + OOMKilled exist in this module and TargetDown is in the kubePrometheusGeneral group (still enabled). Acceptable, but this is a wide silence — recommend monitoring firing-alert count for 1 week post-apply to confirm no critical signal lost (e.g., a stuck rollout won't page).

Alertmanager inhibit rule:

  • source: severity=critical inhibits target: severity=warning when alertname+namespace match. Scoped correctly — will NOT over-silence because it requires same alertname, so a critical PodRestartStorm won't suppress a warning WebhookErrorRate. Safe.

PrometheusRule payment-pipeline-alerts:

  • WebhookErrorRate: increase(webhook_errors_total[5m]) > 0 for 5m. Fires on any sustained error. Reasonable for warning severity.
  • WebhookStale: time-gate expression verified:
    • hour() >= 16 or hour() < 4 = 16:00–04:00 UTC = 9am MST – 9pm MST (MST = UTC-7). Correct.
    • day_of_week() > 0 and day_of_week() < 6 = 1..5 = Mon–Fri. Correct.
    • and on() chaining is valid PromQL (instant-vector scalars have no labels).
    • Note: MST vs MDT — Arizona/MST is UTC-7 year-round but if "MST" here actually means Mountain Time with DST, UTC offset flips to -6 in summer, shifting business-hours window by 1 hour. Nit.
  • Labels include release: kube-prometheus-stack — correct for Prometheus operator rule selector.

Dashboard JSON:

  • Valid structure: panels, templating (DS_PROMETHEUS datasource var), schemaVersion: 39, uid: basketball-api-golden-signals. Mirrors existing pal-e-app-golden-signals pattern.
  • PromQL expressions only use metrics confirmed exposed by basketball-api (basketball_api_up, webhook_received_total, webhook_processed_total, webhook_errors_total, webhook_last_received_timestamp).
  • Error ratio panel uses clamp_min(..., 0.0001) to prevent divide-by-zero — good defensive PromQL.
  • Label grafana_dashboard = "1" on the ConfigMap matches the sidecar discovery convention used by the existing pal-e-app dashboard. Will auto-load.

BLOCKERS

None.

NITS

  • MST vs MDT: WebhookStale comment says "9am-9pm MST" but MST is typically UTC-7 only half the year. If the team operates on America/Denver (observes DST), the expression is off by 1 hour in DST months. Consider either widening the window (e.g. hour() >= 15 or hour() < 5) or documenting this as Arizona-style fixed MST. Not a blocker.
  • EndpointDown downgrade: now warning + 5m. Combined with inhibit rule, legitimate endpoint outages on a critical-paired alertname could be masked. But EndpointDown doesn't have a critical twin, so inhibit won't affect it. Fine.
  • WebhookErrorRate fires on 1 error: increase(...) > 0 for 5m alerts even on a single transient error. If noisy in practice, bump to > 2 or use rate-based threshold. Deferred to post-apply tuning.
  • Pre-existing drift in plan output (paledocs_db_url, embedding_worker_metrics) — confirmed out of scope for this PR and called out in the body. Good discipline.

SOP COMPLIANCE

  • Branch named 290-payment-pipeline-observability (matches {issue}-{kebab-purpose})
  • PR body has Summary, Changes, tofu plan Output, Test Plan, Review Checklist, Related
  • Closes #290 in body
  • tofu fmt + tofu validate claimed clean
  • tofu plan -lock=false output included (4 add, 3 change, 0 destroy)
  • No secrets, no .env files
  • No unrelated files touched — only 2 files changed
  • Follows existing pal_e_production_dashboard + embedding_alerts patterns (convention compliance)
  • Post-apply verification items in Test Plan remain unchecked — must be validated after tofu apply before closing issue

PROCESS OBSERVATIONS

  • CFR impact: positive. Reduces false-positive alert volume (31 firing → target <5) which directly improves MTTR signal-to-noise. The Apr 13 checkout outage postmortem lesson is addressed structurally.
  • DF impact: neutral. Pure infra add, no deployment velocity change.
  • LT impact: Webhook error visibility closes a documented observability gap for the payment pipeline.
  • Validation gap: Per feedback_validate_before_done, this PR is not done until tofu apply runs AND post-apply checks pass (PrometheusRule present, dashboard renders, alert count drops). Recommend validation ticket in the board after merge.
  • Widened default-rule silence risk: disabling kubernetesApps group is the single biggest scope item. Monitor for 1 week; if a real incident is missed, restore selected rules individually rather than the whole group.

VERDICT: APPROVED

## PR #291 Review ### DOMAIN REVIEW Stack: OpenTofu/Terraform + Helm (kube-prometheus-stack) + Alertmanager + PrometheusRule CRD + Grafana dashboard JSON. **Blackbox target removal (verified safe):** - `westside-dev`, `pal-e-app`, `mac-agent` removed. Remaining probed targets: `minio`, `basketball-api`, `westside-app`, `platform-validation`, `playme2k` — all real production services retained. - `mac-agent` uptime still covered by `MacAgentDown` rule at `terraform/modules/monitoring/main.tf:222` (`up{job="mac-node-exporter"} == 0`). Confirmed — no gap. - `pal-e-app` uptime: note that removing blackbox for pal-e-app means the only remaining liveness signal is k8s `TargetDown` / `PodRestartStorm`. Acceptable given the redirect 3xx was the noise source, but worth flagging as a tradeoff (pal-e-app has no dedicated "down" alert equivalent to MacAgentDown). **kube-prometheus-stack defaults:** - `defaultRules.rules.kubeStateMetrics = false` and `kubernetesApps = false` disable dozens of bundled rules including `KubeDeploymentReplicasMismatch`, `KubePodNotReady`, `KubeJobFailed`, `KubePodCrashLooping`, `KubeContainerWaiting`, `KubeDaemonSetRolloutStuck`, `KubeStatefulSetReplicasMismatch`, and more. The PR asserts `PodRestartStorm`/`OOMKilled`/`TargetDown` cover the real signal. `PodRestartStorm` + `OOMKilled` exist in this module and `TargetDown` is in the `kubePrometheusGeneral` group (still enabled). Acceptable, but this is a wide silence — recommend monitoring firing-alert count for 1 week post-apply to confirm no critical signal lost (e.g., a stuck rollout won't page). **Alertmanager inhibit rule:** - `source: severity=critical` inhibits `target: severity=warning` when `alertname`+`namespace` match. Scoped correctly — will NOT over-silence because it requires same `alertname`, so a critical `PodRestartStorm` won't suppress a warning `WebhookErrorRate`. Safe. **PrometheusRule `payment-pipeline-alerts`:** - `WebhookErrorRate`: `increase(webhook_errors_total[5m]) > 0` for 5m. Fires on any sustained error. Reasonable for warning severity. - `WebhookStale`: time-gate expression verified: - `hour() >= 16 or hour() < 4` = 16:00–04:00 UTC = 9am MST – 9pm MST (MST = UTC-7). Correct. - `day_of_week() > 0 and day_of_week() < 6` = 1..5 = Mon–Fri. Correct. - `and on()` chaining is valid PromQL (instant-vector scalars have no labels). - Note: MST vs MDT — Arizona/MST is UTC-7 year-round but if "MST" here actually means Mountain Time with DST, UTC offset flips to -6 in summer, shifting business-hours window by 1 hour. Nit. - Labels include `release: kube-prometheus-stack` — correct for Prometheus operator rule selector. **Dashboard JSON:** - Valid structure: `panels`, `templating` (DS_PROMETHEUS datasource var), `schemaVersion: 39`, `uid: basketball-api-golden-signals`. Mirrors existing `pal-e-app-golden-signals` pattern. - PromQL expressions only use metrics confirmed exposed by basketball-api (`basketball_api_up`, `webhook_received_total`, `webhook_processed_total`, `webhook_errors_total`, `webhook_last_received_timestamp`). - Error ratio panel uses `clamp_min(..., 0.0001)` to prevent divide-by-zero — good defensive PromQL. - Label `grafana_dashboard = "1"` on the ConfigMap matches the sidecar discovery convention used by the existing pal-e-app dashboard. Will auto-load. ### BLOCKERS None. ### NITS - **MST vs MDT**: `WebhookStale` comment says "9am-9pm MST" but MST is typically UTC-7 only half the year. If the team operates on America/Denver (observes DST), the expression is off by 1 hour in DST months. Consider either widening the window (e.g. `hour() >= 15 or hour() < 5`) or documenting this as Arizona-style fixed MST. Not a blocker. - **`EndpointDown` downgrade**: now warning + 5m. Combined with inhibit rule, legitimate endpoint outages on a critical-paired alertname could be masked. But `EndpointDown` doesn't have a critical twin, so inhibit won't affect it. Fine. - **WebhookErrorRate fires on 1 error**: `increase(...) > 0` for 5m alerts even on a single transient error. If noisy in practice, bump to `> 2` or use rate-based threshold. Deferred to post-apply tuning. - **Pre-existing drift in plan output** (`paledocs_db_url`, `embedding_worker_metrics`) — confirmed out of scope for this PR and called out in the body. Good discipline. ### SOP COMPLIANCE - [x] Branch named `290-payment-pipeline-observability` (matches `{issue}-{kebab-purpose}`) - [x] PR body has Summary, Changes, tofu plan Output, Test Plan, Review Checklist, Related - [x] `Closes #290` in body - [x] `tofu fmt` + `tofu validate` claimed clean - [x] `tofu plan -lock=false` output included (4 add, 3 change, 0 destroy) - [x] No secrets, no .env files - [x] No unrelated files touched — only 2 files changed - [x] Follows existing `pal_e_production_dashboard` + `embedding_alerts` patterns (convention compliance) - [ ] Post-apply verification items in Test Plan remain unchecked — must be validated after tofu apply before closing issue ### PROCESS OBSERVATIONS - **CFR impact**: positive. Reduces false-positive alert volume (31 firing → target <5) which directly improves MTTR signal-to-noise. The Apr 13 checkout outage postmortem lesson is addressed structurally. - **DF impact**: neutral. Pure infra add, no deployment velocity change. - **LT impact**: Webhook error visibility closes a documented observability gap for the payment pipeline. - **Validation gap**: Per `feedback_validate_before_done`, this PR is not done until `tofu apply` runs AND post-apply checks pass (PrometheusRule present, dashboard renders, alert count drops). Recommend validation ticket in the board after merge. - **Widened default-rule silence risk**: disabling `kubernetesApps` group is the single biggest scope item. Monitor for 1 week; if a real incident is missed, restore selected rules individually rather than the whole group. ### VERDICT: APPROVED
forgejo_admin deleted branch 290-payment-pipeline-observability 2026-04-14 02:53:02 +00:00
Sign in to join this conversation.
No description provided.