Add PrometheusRules, Alertmanager routing, and Alertmanager funnel #33

Closed
opened 2026-03-14 13:35:14 +00:00 by forgejo_admin · 0 comments

Lineage

plan-pal-e-platform → Phase 3 (Alerting + Deployment Protection)

Repo

forgejo_admin/pal-e-platform

User Story

As the platform operator
I want critical infrastructure alerts to fire and reach me automatically
So that I can detect and respond to incidents in minutes instead of hours, reducing MTTR

Context

The monitoring stack is deployed (Prometheus, Grafana, Alertmanager, Loki) but zero alerting rules exist. Alertmanager is deployed with persistent storage (1Gi) but has no routing configuration. If a pod OOMKills or a disk fills up, nobody knows until something breaks visibly.

The ruleSelectorNilUsesHelmValues = false setting in kube-prometheus-stack leaves the rule selector nil, so Prometheus discovers any PrometheusRule in the cluster rather than only those labeled by the Helm release — we just need to define the rules.

The pal-e-docs Alembic crash on 2026-02-26 went undetected. This is a DORA anti-pattern: we're measuring deployment frequency (via the DORA exporter) but can't improve MTTR because there's no detection layer. This issue closes that gap.

Key technical facts:

  • kube-prometheus-stack v82.0.0 deployed in monitoring namespace
  • Alertmanager has persistent storage but no routing config
  • ruleSelectorNilUsesHelmValues = false — Prometheus discovers any PrometheusRule
  • Grafana has a Tailscale funnel — Alertmanager does not yet
  • All infrastructure managed via Terraform in terraform/main.tf
  • Helm values use yamlencode({...}) pattern throughout

File Targets

Files to modify:

  • terraform/main.tf — add additionalPrometheusRules to kube-prometheus-stack Helm values, add Alertmanager config (receiver + route) to kube-prometheus-stack Helm values, add Alertmanager Tailscale funnel Ingress resource
  • terraform/variables.tf — add slack_webhook_url variable (sensitive, default empty string)
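
A minimal sketch of the new variable (the description text is an assumption; match the existing style in variables.tf):

```hcl
# Sketch — slack_webhook_url in terraform/variables.tf.
variable "slack_webhook_url" {
  description = "Slack incoming-webhook URL for Alertmanager (empty string disables Slack delivery)"
  type        = string
  default     = ""
  sensitive   = true
}
```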

Files NOT to touch:

  • terraform/dashboards/dora-dashboard.json — existing dashboard, unrelated
  • salt/ — host-level config, not relevant to this issue

Acceptance Criteria

  • Four PrometheusRules defined in kube-prometheus-stack Helm values via additionalPrometheusRules:
    • Pod restart storm: increase(kube_pod_container_status_restarts_total[15m]) > 3
    • OOMKilled: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0
    • Disk pressure: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
    • Target down: up == 0 for 5m (with for: 5m to avoid flapping)
  • Each rule has: summary and description annotations, severity label (critical or warning)
  • Alertmanager routing configured in kube-prometheus-stack Helm values with a Slack webhook receiver — the Slack webhook URL comes from a sensitive TF variable slack_webhook_url
  • If slack_webhook_url is empty, Alertmanager should still function (alerts visible in UI, no Slack delivery) — use a conditional or null receiver pattern
  • Alertmanager UI accessible via Tailscale funnel Ingress at alertmanager.{tailscale_domain} — follow the same pattern as grafana_funnel in main.tf
  • tofu validate passes
  • tofu fmt applied
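
The criteria above might be wired together roughly as follows — a sketch only: the rule-file name, group name, Slack channel, and surrounding values structure are assumptions, and the fragment must be merged into the existing yamlencode({...}) values of the kube-prometheus-stack release:

```hcl
# Sketch — merge into the existing kube-prometheus-stack helm_release values.
values = [yamlencode({
  additionalPrometheusRules = [{
    name = "pal-e-platform-alerts" # assumed rule-file name
    groups = [{
      name = "infrastructure" # assumed group name
      rules = [
        {
          alert = "TargetDown"
          expr  = "up == 0"
          for   = "5m" # avoid flapping
          labels = { severity = "critical" }
          annotations = {
            summary     = "Target {{ $labels.instance }} is down"
            description = "Prometheus target {{ $labels.instance }} has been unreachable for 5 minutes."
          }
        },
        # Pod restart storm, OOMKilled (for: 0m), and disk pressure rules
        # follow the same shape, using the exprs from the acceptance criteria.
      ]
    }]
  }]
  alertmanager = {
    config = {
      route = { receiver = "slack" }
      receivers = [{
        name = "slack"
        # Include slack_configs only when a webhook is set, so an empty
        # slack_webhook_url still yields a valid (UI-only) Alertmanager.
        slack_configs = var.slack_webhook_url != "" ? [{
          channel = "#alerts" # assumed channel
          api_url = var.slack_webhook_url
        }] : []
      }]
    }
  }
})]
```

Note that per the constraints below, the webhook value itself should flow through set_sensitive rather than appear inline in values; the inline conditional above only illustrates the empty-webhook fallback.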

Test Expectations

  • tofu plan shows expected new/changed resources (PrometheusRules in Helm values, Alertmanager config, new Ingress)
  • Post-apply: rules visible in Prometheus UI at /rules
  • Post-apply: Alertmanager UI accessible at funnel URL
  • Post-apply: Alertmanager shows configured receiver and route

Constraints

  • Follow existing Terraform patterns in main.tf: resource naming conventions, depends_on chains, yamlencode({...}) for Helm values
  • Use additionalPrometheusRules inside the kube-prometheus-stack Helm values (not a separate kubernetes_manifest) — keeps all monitoring config in one Helm release
  • slack_webhook_url must use set_sensitive (like grafana.adminPassword and other secrets)
  • Alertmanager funnel Ingress must follow the exact pattern of existing funnels: tailscale.com/funnel annotation, tailscale ingress class, TLS hosts block, depends_on including tailscale_operator and tailscale_acl
  • Include for duration on alert rules to prevent flapping (e.g., for: 5m on target down, for: 0m on OOMKilled since those are point-in-time events)
  • All alerts should include namespace, pod, or instance labels in the rule so they're useful for debugging
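
Following the existing funnel pattern, the new Ingress might look like this — a sketch: the resource names, service name/port, and depends_on addresses are assumptions to be checked against the actual grafana_funnel resource in main.tf:

```hcl
# Sketch — mirror the existing grafana_funnel resource in main.tf.
resource "kubernetes_ingress_v1" "alertmanager_funnel" {
  metadata {
    name      = "alertmanager-funnel"
    namespace = "monitoring"
    annotations = {
      "tailscale.com/funnel" = "true"
    }
  }

  spec {
    ingress_class_name = "tailscale"

    default_backend {
      service {
        name = "kube-prometheus-stack-alertmanager" # assumed service name
        port {
          number = 9093
        }
      }
    }

    tls {
      hosts = ["alertmanager"] # exposed as alertmanager.{tailscale_domain}
    }
  }

  depends_on = [
    helm_release.tailscale_operator, # assumed resource addresses
    tailscale_acl.main,
  ]
}
```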

Checklist

  • PR opened with Closes #33 in body
  • tofu plan output included in PR description
  • tofu fmt and tofu validate pass
  • No unrelated changes

Related

  • project-pal-e-platform — project
  • phase-observability-3-alerting — phase note in pal-e-docs