Add PrometheusRules, Alertmanager routing, and Alertmanager funnel #35

Merged
forgejo_admin merged 1 commit from 33-alerting-rules-alertmanager-funnel into main 2026-03-14 13:45:43 +00:00

Summary

Adds the platform's first alerting rules, Alertmanager routing configuration, and an Alertmanager Tailscale funnel Ingress. This closes the detection gap that let incidents like the pal-e-docs Alembic crash go unnoticed.

Changes

  • terraform/main.tf -- Added additionalPrometheusRules to kube-prometheus-stack Helm values with four rules: PodRestartStorm (warning), OOMKilled (critical), DiskPressure (critical), TargetDown (warning). Added alertmanager.config with default receiver, Slack receiver (conditional on slack_webhook_url), and routing. Added dynamic "set_sensitive" block to inject Slack webhook URL without exposing it in plan output. Added kubernetes_ingress_v1.alertmanager_funnel following the exact pattern of grafana_funnel.
  • terraform/variables.tf -- Added slack_webhook_url variable (sensitive, default empty string) so Alertmanager works without Slack configured.
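The conditional Slack wiring described above can be sketched as follows. This is a minimal sketch based on the PR description, not the exact code from the diff; the `receivers[1]` path and placeholder chart arguments are assumptions.

```hcl
# variables.tf -- sensitive webhook URL with an empty default, so the
# stack still applies cleanly when Slack is not configured.
variable "slack_webhook_url" {
  type      = string
  default   = ""
  sensitive = true
}

# main.tf -- inject the webhook URL via set_sensitive so it never
# appears in plan output. The dynamic block is emitted only when the
# variable is non-empty; the receivers[1] index assumes the Slack
# receiver is always appended after the default receiver.
resource "helm_release" "kube_prometheus_stack" {
  name  = "kube-prometheus-stack"
  chart = "kube-prometheus-stack" # repository/version elided in this sketch

  dynamic "set_sensitive" {
    for_each = var.slack_webhook_url != "" ? [1] : []
    content {
      name  = "alertmanager.config.receivers[1].slack_configs[0].api_url"
      value = var.slack_webhook_url
    }
  }
}
```

Because the `for_each` condition gates the whole block, an empty `slack_webhook_url` produces no `set_sensitive` entry at all, which is what lets the stack work without Slack configured.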

tofu plan output (targeted)

Plan: 1 to add, 1 to change, 0 to destroy.

  # helm_release.kube_prometheus_stack will be updated in-place
  ~ resource "helm_release" "kube_prometheus_stack" {
        name   = "kube-prometheus-stack"
      ~ values = [~ (sensitive value)]
    }

  # kubernetes_ingress_v1.alertmanager_funnel will be created
  + resource "kubernetes_ingress_v1" "alertmanager_funnel" {
      + metadata {
          + annotations = { "tailscale.com/funnel" = "true" }
          + name        = "alertmanager-funnel"
          + namespace   = "monitoring"
        }
      + spec {
          + ingress_class_name = "tailscale"
          + default_backend { service { name = "kube-prometheus-stack-alertmanager", port { number = 9093 } } }
          + tls { hosts = ["alertmanager"] }
        }
    }

Test Plan

  • tofu validate passes (verified)
  • tofu fmt clean (verified)
  • tofu plan shows 1 add + 1 change (verified)
  • Post-apply: alerting rules visible in Prometheus UI at /rules
  • Post-apply: Alertmanager UI accessible at alertmanager.tail5b443a.ts.net
  • Post-apply: Alertmanager shows default receiver and route config
  • With empty slack_webhook_url: no Slack receiver, alerts still visible in Alertmanager UI
  • With populated slack_webhook_url: Slack receiver present, alerts routed to #alerts

Review Checklist

  • tofu fmt applied
  • tofu validate passes
  • tofu plan output included
  • No unrelated changes
  • Sensitive values use set_sensitive (not embedded in yamlencode)
  • New Ingress follows existing funnel pattern (annotations, depends_on, TLS)

Related

  • Plan: plan-pal-e-platform (Phase 3: Alerting + Deployment Protection)
  • Forgejo issue: #33

Closes #33

Add four alerting rules (PodRestartStorm, OOMKilled, DiskPressure,
TargetDown) via additionalPrometheusRules in kube-prometheus-stack Helm
values. Configure Alertmanager with default + conditional Slack receiver
using set_sensitive for the webhook URL. Add Alertmanager Tailscale
funnel Ingress following the existing grafana_funnel pattern.

Closes #33

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Self-Review

Acceptance Criteria Check

  • Four PrometheusRules: PodRestartStorm (for: 0m, warning), OOMKilled (for: 0m, critical), DiskPressure (for: 5m, critical), TargetDown (for: 5m, warning)
  • Each rule has summary and description annotations, severity label
  • Rules include namespace, pod, instance, or container labels via PromQL label references
  • Alertmanager routing with conditional Slack receiver via set_sensitive dynamic block
  • Empty slack_webhook_url => only default receiver, no Slack route, alerts visible in UI
  • Non-empty slack_webhook_url => slack receiver added at index 1, set_sensitive injects api_url
  • Alertmanager funnel Ingress follows grafana_funnel pattern exactly (annotation, ingress class, TLS, depends_on)
  • Service name kube-prometheus-stack-alertmanager on port 9093 verified against live cluster
  • tofu validate passes
  • tofu fmt clean
  • tofu plan: 1 add, 1 change, 0 destroy
  • Closes #33 in PR body
  • No unrelated changes (2 files, +156 lines)

No issues found. PR is ready for human review.
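The rule shape implied by the acceptance criteria above can be sketched as Helm values built in Terraform. This is a hedged sketch: the `for`, `labels`, and `annotations` structure is from the criteria, but the expression, threshold, and window for PodRestartStorm are assumptions, not the PR's actual values.

```hcl
# Sketch of additionalPrometheusRules as a Terraform local, to be
# passed into the chart via yamlencode(). "for" must be quoted as a
# key in an HCL object literal to avoid clashing with for-expressions.
locals {
  prometheus_rules = {
    additionalPrometheusRules = [{
      name = "platform-alerts"
      groups = [{
        name = "platform"
        rules = [{
          alert = "PodRestartStorm"
          # Hypothetical expression/threshold for illustration only.
          expr  = "increase(kube_pod_container_status_restarts_total[10m]) > 3"
          "for" = "0m"
          labels = { severity = "warning" }
          annotations = {
            summary     = "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting rapidly"
            description = "Container {{ $labels.container }} restarted {{ $value }} times in 10m."
          }
        }]
      }]
    }]
  }
}
```

The Go template placeholders (`{{ $labels.* }}`, `{{ $value }}`) pass through HCL strings untouched because HCL interpolation only triggers on `${`.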


PR #35 Review

BLOCKERS

None.

NITS

  1. OOMKilled alert will stay firing after pod recovers. The metric kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} is a gauge that persists the last termination reason even after the container restarts successfully. This means the alert fires once and stays firing until the pod is deleted or terminates for a different reason. Not a bug -- this is a known trade-off with this metric. A future enhancement could join increase(kube_pod_container_status_restarts_total[5m]) > 0 with the reason gauge (note that the restarts counter carries no reason label, so it cannot be filtered directly), or add a for duration to reduce noise, but the current approach catches the OOM event itself, which is the primary goal.

  2. Alertmanager funnel is publicly accessible with no auth. The tailscale.com/funnel = "true" annotation exposes Alertmanager to the public internet. Alertmanager has no built-in authentication, so anyone with the URL can view alert state, silences, and configuration. This follows the exact same pattern as the existing Grafana funnel, so it is consistent with the current security posture. Worth noting as a future hardening item (e.g., removing funnel annotation and keeping it tailnet-only, or adding basic auth via a reverse proxy).

  3. Slack channel #alerts is hardcoded. Consider extracting it to a variable in the future if multiple channels or environments are needed. Not blocking since this is a single-cluster platform.
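One way to realize the refinement suggested in nit 1 (a sketch, not part of this PR): pair the restart counter with the reason gauge so the alert resolves on its own once restarts stop. The window and the surrounding rule fields are illustrative assumptions.

```hcl
# Hypothetical refined rule object, as it would sit in the rules list
# of the Helm values. Fires only when a restart occurred recently AND
# the last termination reason was OOMKilled; once increase() drops
# back to 0 after the window passes, the alert clears, unlike the
# bare last_terminated_reason gauge.
{
  alert = "OOMKilled"
  expr  = <<-EOT
    increase(kube_pod_container_status_restarts_total[10m]) > 0
    and on (namespace, pod, container)
    kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
  EOT
  "for"  = "0m"
  labels = { severity = "critical" }
}
```

The `on (namespace, pod, container)` matcher is needed because the reason gauge carries an extra `reason` label that the restart counter does not.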

SOP COMPLIANCE

  • Branch named after issue (33-alerting-rules-alertmanager-funnel references issue #33)
  • PR body follows template (Summary, Changes, Test Plan, Related sections present)
  • Related references plan slug (plan-pal-e-platform, Phase 3)
  • tofu plan output included
  • tofu fmt and tofu validate verified per PR body
  • Closes #33 present in PR body
  • No secrets, .env files, or credentials committed
  • No unrelated file changes (only terraform/main.tf and terraform/variables.tf modified)
  • Sensitive variable uses sensitive = true with empty default

TECHNICAL ASSESSMENT

PrometheusRules (4 rules): All PromQL expressions are syntactically correct. Template variables use proper Go template syntax ($labels, $value). for durations are reasonable -- 0m for immediate detection (PodRestartStorm, OOMKilled) and 5m for sustained conditions (DiskPressure, TargetDown). Severity labels are appropriate (critical for OOM/disk, warning for restarts/target-down). The additionalPrometheusRules key is correctly placed at the root of the Helm values.

Alertmanager config: The conditional Slack receiver pattern is sound. The concat guarantees default is always at index [0] and slack at [1], making the receivers[1] index in set_sensitive correct. The for_each condition matches the concat condition, so the index assumption always holds. The dynamic "set_sensitive" block correctly prevents the webhook URL from appearing in plan output.
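The index guarantee described above can be sketched as follows (hypothetical local name; the actual code in the PR may differ):

```hcl
# Receivers list: "default" is always element [0]; "slack" is appended
# as [1] only when a webhook URL is set. The same condition gates the
# dynamic set_sensitive block, so its receivers[1] path can never
# point at a missing element.
locals {
  alertmanager_receivers = concat(
    [{ name = "default" }],
    var.slack_webhook_url != "" ? [{
      name          = "slack"
      slack_configs = [{ channel = "#alerts", send_resolved = true }]
    }] : []
  )
}
```

Keeping the webhook URL out of this structure entirely (and injecting it via set_sensitive) is what prevents it from leaking through the yamlencoded values in plan output.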

Alertmanager funnel: Follows the exact same pattern as grafana_funnel -- same metadata structure, same tailscale.com/funnel annotation, same ingress class, same depends_on triplet. Service name (kube-prometheus-stack-alertmanager) and port (9093) are correct for the chart.

Variable definition: slack_webhook_url with sensitive = true, type = string, default = "" is correctly defined, allowing the stack to work without Slack configured.

VERDICT: APPROVED

forgejo_admin deleted branch 33-alerting-rules-alertmanager-funnel 2026-03-14 13:45:43 +00:00