Add PrometheusRules, Alertmanager routing, and Alertmanager funnel #33

Closed
opened 2026-03-14 13:35:14 +00:00 by forgejo_admin · 0 comments

Lineage

plan-pal-e-platform → Phase 3 (Alerting + Deployment Protection)

Repo

forgejo_admin/pal-e-platform

User Story

As the platform operator
I want critical infrastructure alerts to fire and reach me automatically
So that I can detect and respond to incidents in minutes instead of hours, reducing MTTR

Context

The monitoring stack is deployed (Prometheus, Grafana, Alertmanager, Loki) but zero alerting rules exist. Alertmanager is deployed with persistent storage (1Gi) but has no routing configuration. If a pod OOMKills or a disk fills up, nobody knows until something breaks visibly.

The ruleSelectorNilUsesHelmValues = false setting in kube-prometheus-stack leaves the rule selector nil, so Prometheus discovers any PrometheusRule in the cluster rather than only those labeled by the Helm release — we just need to define the rules.

The pal-e-docs Alembic crash on 2026-02-26 went undetected. This is a DORA anti-pattern: we're measuring deployment frequency (via the DORA exporter) but can't improve MTTR because there's no detection layer. This issue closes that gap.

Key technical facts:

  • kube-prometheus-stack v82.0.0 deployed in monitoring namespace
  • Alertmanager has persistent storage but no routing config
  • ruleSelectorNilUsesHelmValues = false — Prometheus discovers any PrometheusRule
  • Grafana has a Tailscale funnel — Alertmanager does not yet
  • All infrastructure managed via Terraform in terraform/main.tf
  • Helm values use yamlencode({...}) pattern throughout

File Targets

Files to modify:

  • terraform/main.tf — add additionalPrometheusRules to kube-prometheus-stack Helm values, add Alertmanager config (receiver + route) to kube-prometheus-stack Helm values, add Alertmanager Tailscale funnel Ingress resource
  • terraform/variables.tf — add slack_webhook_url variable (sensitive, default empty string)
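
A minimal sketch of the new variable (the description text is an assumption; match the existing style in variables.tf):

```hcl
# Sketch — slack_webhook_url in terraform/variables.tf.
variable "slack_webhook_url" {
  description = "Slack incoming-webhook URL for Alertmanager (empty string disables Slack delivery)"
  type        = string
  default     = ""
  sensitive   = true
}
```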

Files NOT to touch:

  • terraform/dashboards/dora-dashboard.json — existing dashboard, unrelated
  • salt/ — host-level config, not relevant to this issue

Acceptance Criteria

  • Four PrometheusRules defined in kube-prometheus-stack Helm values via additionalPrometheusRules:
    • Pod restart storm: increase(kube_pod_container_status_restarts_total[15m]) > 3
    • OOMKilled: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
    • Disk pressure: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
    • Target down: up == 0 for 5m (with for: 5m to avoid flapping)
  • Each rule has: summary and description annotations, severity label (critical or warning)
  • Alertmanager routing configured in kube-prometheus-stack Helm values with a Slack webhook receiver — the Slack webhook URL comes from a sensitive TF variable slack_webhook_url
  • If slack_webhook_url is empty, Alertmanager should still function (alerts visible in UI, no Slack delivery) — use a conditional or null receiver pattern
  • Alertmanager UI accessible via Tailscale funnel Ingress at alertmanager.{tailscale_domain} — follow the same pattern as grafana_funnel in main.tf
  • tofu validate passes
  • tofu fmt applied
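
The criteria above might be wired together roughly as follows — a sketch only: the rule-file name, group name, Slack channel, and surrounding values structure are assumptions, and the fragment must be merged into the existing yamlencode({...}) values of the kube-prometheus-stack release:

```hcl
# Sketch — merge into the existing kube-prometheus-stack helm_release values.
values = [yamlencode({
  additionalPrometheusRules = [{
    name = "pal-e-platform-alerts" # assumed rule-file name
    groups = [{
      name = "infrastructure" # assumed group name
      rules = [
        {
          alert = "TargetDown"
          expr  = "up == 0"
          for   = "5m" # avoid flapping
          labels = { severity = "critical" }
          annotations = {
            summary     = "Target {{ $labels.instance }} is down"
            description = "Prometheus target {{ $labels.instance }} has been unreachable for 5 minutes."
          }
        },
        # Pod restart storm, OOMKilled (for: 0m), and disk pressure rules
        # follow the same shape, using the exprs from the acceptance criteria.
      ]
    }]
  }]
  alertmanager = {
    config = {
      route = { receiver = "slack" }
      receivers = [{
        name = "slack"
        # Include slack_configs only when a webhook is set, so an empty
        # slack_webhook_url still yields a valid (UI-only) Alertmanager.
        slack_configs = var.slack_webhook_url != "" ? [{
          channel = "#alerts" # assumed channel
          api_url = var.slack_webhook_url
        }] : []
      }]
    }
  }
})]
```

Note that per the constraints below, the webhook value itself should flow through set_sensitive rather than appear inline in values; the inline conditional above only illustrates the empty-webhook fallback.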

Test Expectations

  • tofu plan shows expected new/changed resources (PrometheusRules in Helm values, Alertmanager config, new Ingress)
  • Post-apply: rules visible in Prometheus UI at /rules
  • Post-apply: Alertmanager UI accessible at funnel URL
  • Post-apply: Alertmanager shows configured receiver and route

Constraints

  • Follow existing Terraform patterns in main.tf: resource naming conventions, depends_on chains, yamlencode({...}) for Helm values
  • Use additionalPrometheusRules inside the kube-prometheus-stack Helm values (not a separate kubernetes_manifest) — keeps all monitoring config in one Helm release
  • slack_webhook_url must use set_sensitive (like grafana.adminPassword and other secrets)
  • Alertmanager funnel Ingress must follow the exact pattern of existing funnels: tailscale.com/funnel annotation, tailscale ingress class, TLS hosts block, depends_on including tailscale_operator and tailscale_acl
  • Include for duration on alert rules to prevent flapping (e.g., for: 5m on target down, for: 0m on OOMKilled since those are point-in-time events)
  • All alerts should include namespace, pod, or instance labels in the rule so they're useful for debugging
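
Following the existing funnel pattern, the new Ingress might look like this — a sketch: the resource names, service name/port, and depends_on addresses are assumptions to be checked against the actual grafana_funnel resource in main.tf:

```hcl
# Sketch — mirror the existing grafana_funnel resource in main.tf.
resource "kubernetes_ingress_v1" "alertmanager_funnel" {
  metadata {
    name      = "alertmanager-funnel"
    namespace = "monitoring"
    annotations = {
      "tailscale.com/funnel" = "true"
    }
  }

  spec {
    ingress_class_name = "tailscale"

    default_backend {
      service {
        name = "kube-prometheus-stack-alertmanager" # assumed service name
        port {
          number = 9093
        }
      }
    }

    tls {
      hosts = ["alertmanager"] # exposed as alertmanager.{tailscale_domain}
    }
  }

  depends_on = [
    helm_release.tailscale_operator, # assumed resource addresses
    tailscale_acl.main,
  ]
}
```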

Checklist

  • PR opened with Closes #33 in body
  • tofu plan output included in PR description
  • tofu fmt and tofu validate pass
  • No unrelated changes

Related

  • project-pal-e-platform — project
  • phase-observability-3-alerting — phase note in pal-e-docs