Add PrometheusRules, Alertmanager routing, and Alertmanager funnel #35
Reference
forgejo_admin/pal-e-platform!35
Summary
Adds the platform's first alerting rules, Alertmanager routing configuration, and an Alertmanager Tailscale funnel Ingress. This closes the detection gap that let incidents like the pal-e-docs Alembic crash go unnoticed.
Changes
terraform/main.tf
- Added `additionalPrometheusRules` to the kube-prometheus-stack Helm values with four rules: PodRestartStorm (warning), OOMKilled (critical), DiskPressure (critical), TargetDown (warning).
- Added `alertmanager.config` with a default receiver, a Slack receiver (conditional on `slack_webhook_url`), and routing.
- Added a `dynamic "set_sensitive"` block to inject the Slack webhook URL without exposing it in plan output.
- Added `kubernetes_ingress_v1.alertmanager_funnel` following the exact pattern of `grafana_funnel`.

terraform/variables.tf
- Added `slack_webhook_url` variable (sensitive, default empty string) so Alertmanager works without Slack configured.

`tofu plan` output (targeted)

Test Plan
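The conditional-receiver and secret-injection changes described above can be sketched roughly like this (an illustrative sketch only; the resource names, values layout, and exact Helm value paths are assumptions, not the PR's actual code):

```hcl
# Sketch only -- names and structure are assumed, not taken from the PR.
locals {
  # concat keeps "default" at index 0; "slack" is appended only when a
  # webhook is configured, so it is always at index 1 when present.
  receivers = concat(
    [{ name = "default" }],
    var.slack_webhook_url != "" ? [{ name = "slack" }] : []
  )
}

resource "helm_release" "kube_prometheus_stack" {
  name = "kube-prometheus-stack"
  # ... chart, repository, namespace, and values omitted ...

  # Inject the webhook URL out-of-band so it never appears in plan output.
  dynamic "set_sensitive" {
    for_each = var.slack_webhook_url != "" ? [1] : []
    content {
      name  = "alertmanager.config.receivers[1].slack_configs[0].api_url"
      value = var.slack_webhook_url
    }
  }
}
```

The `for_each` guard mirrors the `concat` condition, which is what makes the hardcoded `receivers[1]` index safe.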
- `tofu validate` passes (verified)
- `tofu fmt` clean (verified)
- `tofu plan` shows 1 add + 1 change (verified)
- Rules visible at the Prometheus `/rules` page
- Funnel reachable at `alertmanager.tail5b443a.ts.net`
- Alertmanager UI shows the `default` receiver and route config
- Without `slack_webhook_url`: no Slack receiver, alerts still visible in the Alertmanager UI
- With `slack_webhook_url`: Slack receiver present, alerts routed to `#alerts`

Review Checklist
- `tofu fmt` applied
- `tofu validate` passes
- `tofu plan` output included
- Slack webhook injected via `set_sensitive` (not embedded in `yamlencode`)

Related
- `plan-pal-e-platform` (Phase 3: Alerting + Deployment Protection)
- Closes #33
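The funnel Ingress listed under Changes can be pictured roughly as follows (a sketch assuming Tailscale operator conventions; the actual resource name, namespace, hostname, and `depends_on` entries in `terraform/main.tf` may differ):

```hcl
# Illustrative sketch -- mirrors the grafana_funnel pattern described above.
resource "kubernetes_ingress_v1" "alertmanager_funnel" {
  metadata {
    name      = "alertmanager-funnel"
    namespace = "monitoring"
    annotations = {
      # Exposes the ingress on the public internet via Tailscale Funnel.
      "tailscale.com/funnel" = "true"
    }
  }
  spec {
    ingress_class_name = "tailscale"
    default_backend {
      service {
        name = "kube-prometheus-stack-alertmanager"
        port { number = 9093 }
      }
    }
    tls {
      hosts = ["alertmanager"]
    }
  }
  # The PR's depends_on triplet is abbreviated to one entry here.
  depends_on = [helm_release.kube_prometheus_stack]
}
```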
Self-Review
Acceptance Criteria Check
- PodRestartStorm (for: 0m, warning), OOMKilled (for: 0m, critical), DiskPressure (for: 5m, critical), TargetDown (for: 5m, warning)
- Each rule has `summary` and `description` annotations and a `severity` label
- Rules expose `namespace`, `pod`, `instance`, or `container` labels via PromQL label references
- Slack webhook injected via the `set_sensitive` dynamic block
- Without `slack_webhook_url`: only the `default` receiver, no Slack route, alerts visible in the UI
- With `slack_webhook_url`: `slack` receiver added at index 1, `set_sensitive` injects `api_url`
- Funnel Ingress follows the `grafana_funnel` pattern exactly (annotation, ingress class, TLS, `depends_on`)
- Service `kube-prometheus-stack-alertmanager` on port `9093` verified against the live cluster
- `tofu validate` passes
- `tofu fmt` clean
- `tofu plan`: 1 add, 1 change, 0 destroy
- `Closes #33` in PR body

No issues found. PR is ready for human review.
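As an illustration of the rule shape the checklist describes, one entry under `additionalPrometheusRules` could look like this (the PromQL expression and group names are assumptions; only the alert name, severity, `for` duration, and annotation/label structure come from the PR):

```yaml
additionalPrometheusRules:
  - name: platform-alerts        # group name is hypothetical
    groups:
      - name: pods
        rules:
          - alert: PodRestartStorm
            # Hypothetical expression -- the PR's actual PromQL may differ.
            expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting rapidly"
              description: "Container {{ $labels.container }} restarted {{ $value }} times in the last 10m."
```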
PR #35 Review
BLOCKERS
None.
NITS
**OOMKilled alert will stay firing after the pod recovers.** The metric `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}` is a gauge that persists the last termination reason even after the container restarts successfully. This means the alert fires once and stays firing until the pod is deleted or terminates for a different reason. Not a bug -- this is a known trade-off with this metric. A future enhancement could use `increase(kube_pod_container_status_restarts_total{reason="OOMKilled"}[5m]) > 0` or add a `for` duration to reduce noise, but the current approach catches the OOM event, which is the primary goal.

**Alertmanager funnel is publicly accessible with no auth.** The `tailscale.com/funnel = "true"` annotation exposes Alertmanager to the public internet. Alertmanager has no built-in authentication, so anyone with the URL can view alert state, silences, and configuration. This follows the exact same pattern as the existing Grafana funnel, so it is consistent with the current security posture. Worth noting as a future hardening item (e.g., removing the `funnel` annotation and keeping it tailnet-only, or adding basic auth via a reverse proxy).

**Slack channel `#alerts` is hardcoded.** Consider extracting it to a variable in the future if multiple channels or environments are needed. Not blocking, since this is a single-cluster platform.

SOP COMPLIANCE
- Branch `33-alerting-rules-alertmanager-funnel` references issue #33
- Matches the plan (`plan-pal-e-platform`, Phase 3)
- `tofu plan` output included
- `tofu fmt` and `tofu validate` verified per PR body
- `Closes #33` present in PR body
- Scope is contained (only `terraform/main.tf` and `terraform/variables.tf` modified)
- Secret handled correctly: `sensitive = true` with empty default

TECHNICAL ASSESSMENT
**PrometheusRules (4 rules):** All PromQL expressions are syntactically correct. Template variables use proper Go template syntax (`$labels`, `$value`). The `for` durations are reasonable -- `0m` for immediate detection (PodRestartStorm, OOMKilled) and `5m` for sustained conditions (DiskPressure, TargetDown). Severity labels are appropriate (critical for OOM/disk, warning for restarts/target-down). The `additionalPrometheusRules` key is correctly placed at the root of the Helm values.

**Alertmanager config:** The conditional Slack receiver pattern is sound. The `concat` guarantees `default` is always at index `[0]` and `slack` at `[1]`, making the `receivers[1]` index in `set_sensitive` correct. The `for_each` condition matches the `concat` condition, so the index assumption always holds. The `dynamic "set_sensitive"` block correctly prevents the webhook URL from appearing in plan output.

**Alertmanager funnel:** Follows the exact same pattern as `grafana_funnel` -- same metadata structure, same `tailscale.com/funnel` annotation, same ingress class, same `depends_on` triplet. The service name (`kube-prometheus-stack-alertmanager`) and port (`9093`) are correct for the chart.

**Variable definition:** `slack_webhook_url` with `sensitive = true`, `type = string`, `default = ""` is correctly defined, allowing the stack to work without Slack configured.

VERDICT: APPROVED
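As a small follow-up to the hardcoded-channel item under NITS, extracting the channel could be as simple as the following sketch (the variable name is hypothetical):

```hcl
# Hypothetical variable to replace the hardcoded "#alerts" channel.
variable "slack_channel" {
  type        = string
  description = "Slack channel Alertmanager routes alerts to."
  default     = "#alerts"
}
```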