# Add PrometheusRules, Alertmanager routing, and Alertmanager funnel (#33)
## Lineage

plan-pal-e-platform → Phase 3 (Alerting + Deployment Protection)

**Repo:** forgejo_admin/pal-e-platform

## User Story
- As the platform operator
- I want critical infrastructure alerts to fire and reach me automatically
- So that I can detect and respond to incidents in minutes instead of hours (MTTR)
## Context

The monitoring stack is deployed (Prometheus, Grafana, Alertmanager, Loki), but zero alerting rules exist. Alertmanager is deployed with persistent storage (1Gi) but has no routing configuration. If a pod OOMKills or a disk fills up, nobody knows until something breaks visibly.
The `ruleSelectorNilUsesHelmValues = false` setting in kube-prometheus-stack means Prometheus will discover PrometheusRules anywhere; we just need to define them.

The pal-e-docs Alembic crash on 2026-02-26 went undetected. This is a DORA anti-pattern: we're measuring deployment frequency (via the DORA exporter) but can't improve MTTR because there's no detection layer. This issue closes that gap.
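As a reference point, here is a minimal sketch of where that setting lives in the Helm values. The resource name and surrounding attributes are assumptions for illustration, not taken from the actual main.tf:

```hcl
# Hypothetical excerpt: ruleSelectorNilUsesHelmValues sits under
# prometheus.prometheusSpec in the kube-prometheus-stack chart values.
resource "helm_release" "kube_prometheus_stack" {
  name      = "kube-prometheus-stack"
  namespace = "monitoring"
  # ...chart/repo/version omitted...

  values = [yamlencode({
    prometheus = {
      prometheusSpec = {
        # false => Prometheus watches PrometheusRules cluster-wide,
        # not only the ones shipped with this Helm release
        ruleSelectorNilUsesHelmValues = false
      }
    }
  })]
}
```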
Key technical facts:

- `monitoring` namespace
- `ruleSelectorNilUsesHelmValues = false`: Prometheus discovers any PrometheusRule
- `terraform/main.tf` uses the `yamlencode({...})` pattern throughout

## File Targets
Files to modify:

- `terraform/main.tf`: add `additionalPrometheusRules` to the kube-prometheus-stack Helm values, add the Alertmanager config (receiver + route) to the same Helm values, and add an Alertmanager Tailscale funnel Ingress resource
- `terraform/variables.tf`: add a `slack_webhook_url` variable (sensitive, default empty string)

Files NOT to touch:

- `terraform/dashboards/dora-dashboard.json`: existing dashboard, unrelated
- `salt/`: host-level config, not relevant to this issue

## Acceptance Criteria
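A sketch of the `variables.tf` addition and one possible null-receiver fallback, so Alertmanager keeps working when no webhook is configured. The `locals` name is hypothetical; only the variable name and its sensitive/empty-default behavior come from this issue:

```hcl
# Sensitive webhook variable; empty string means "no Slack delivery".
variable "slack_webhook_url" {
  description = "Slack incoming-webhook URL for Alertmanager notifications (empty disables Slack delivery)"
  type        = string
  default     = ""
  sensitive   = true
}

locals {
  # Hypothetical conditional: route to a no-op "null" receiver when the
  # webhook is unset, so alerts still appear in the Alertmanager UI.
  alertmanager_receiver = var.slack_webhook_url != "" ? "slack" : "null"
}
```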
- `additionalPrometheusRules` defines alert rules for:
  - Pod restart loops: `increase(kube_pod_container_status_restarts_total[15m]) > 3`
  - OOMKills: `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0`
  - Low disk space: `(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15`
  - Target down: `up == 0` for 5m (with `for: 5m` to avoid flapping)
- Every rule has `summary` and `description` annotations and a `severity` label (critical or warning)
- Alertmanager routes alerts to Slack via `slack_webhook_url`
- If `slack_webhook_url` is empty, Alertmanager should still function (alerts visible in the UI, no Slack delivery); use a conditional or null receiver pattern
- Alertmanager is reachable at `alertmanager.{tailscale_domain}`, following the same pattern as `grafana_funnel` in main.tf
- `tofu validate` passes
- `tofu fmt` applied
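The rules above could be expressed in the Helm values roughly as follows. The expressions and thresholds come from the acceptance criteria; the alert names, group name, and annotation texts are placeholders. This is a fragment of the `values = [yamlencode({...})]` block, not a standalone file:

```hcl
# Hypothetical sketch of additionalPrometheusRules inside the
# kube-prometheus-stack Helm values (chart renders these as PrometheusRules).
additionalPrometheusRules = [{
  name = "pal-e-platform-alerts" # placeholder name
  groups = [{
    name = "infrastructure"
    rules = [
      {
        alert       = "PodRestartLoop"
        expr        = "increase(kube_pod_container_status_restarts_total[15m]) > 3"
        labels      = { severity = "warning" }
        annotations = {
          summary     = "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting"
          description = "More than 3 restarts in the last 15 minutes."
        }
      },
      {
        alert       = "PodOOMKilled"
        expr        = "kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"} > 0"
        "for"       = "0m" # point-in-time event, fire immediately
        labels      = { severity = "critical" }
        annotations = {
          summary     = "Pod {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
          description = "Container terminated with reason OOMKilled."
        }
      },
      {
        alert       = "DiskSpaceLow"
        expr        = "(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15"
        labels      = { severity = "warning" }
        annotations = {
          summary     = "Disk on {{ $labels.instance }} is below 15% free"
          description = "Filesystem {{ $labels.mountpoint }} is running out of space."
        }
      },
      {
        alert       = "TargetDown"
        expr        = "up == 0"
        "for"       = "5m" # debounce scrape blips to avoid flapping
        labels      = { severity = "critical" }
        annotations = {
          summary     = "Target {{ $labels.instance }} is down"
          description = "Scrape target has been unreachable for 5 minutes."
        }
      },
    ]
  }]
}]
```

Note that `"for"` must be quoted as an object key in HCL to avoid being parsed as a for-expression.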
- `tofu plan` shows the expected new/changed resources (PrometheusRules in Helm values, Alertmanager config, new Ingress)
- After apply, the new rules are visible on the Prometheus `/rules` page

## Constraints
- Follow the existing main.tf conventions: resource naming, `depends_on` chains, `yamlencode({...})` for Helm values
- Put `additionalPrometheusRules` inside the kube-prometheus-stack Helm values (not a separate `kubernetes_manifest`); this keeps all monitoring config in one Helm release
- `slack_webhook_url` must use `set_sensitive` (like `grafana.adminPassword` and other secrets)
- The funnel Ingress needs the `tailscale.com/funnel` annotation, the `tailscale` ingress class, a TLS hosts block, and `depends_on` including `tailscale_operator` and `tailscale_acl`
- Use a `for` duration on alert rules to prevent flapping (e.g., `for: 5m` on target down, `for: 0m` on OOMKilled, since those are point-in-time events)
- Keep `namespace`, `pod`, or `instance` labels in the rules so they're useful for debugging

## Checklist
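The funnel constraints above could look roughly like this, mirroring the `grafana_funnel` pattern the issue references. The Alertmanager Service name, port, and the exact `depends_on` resource addresses are assumptions to be checked against main.tf:

```hcl
# Hypothetical Alertmanager funnel Ingress, modeled on grafana_funnel.
resource "kubernetes_ingress_v1" "alertmanager_funnel" {
  metadata {
    name      = "alertmanager-funnel"
    namespace = "monitoring"
    annotations = {
      "tailscale.com/funnel" = "true" # expose via Tailscale Funnel
    }
  }

  spec {
    ingress_class_name = "tailscale"

    default_backend {
      service {
        # Assumed Service name/port for the chart's Alertmanager.
        name = "kube-prometheus-stack-alertmanager"
        port {
          number = 9093
        }
      }
    }

    tls {
      hosts = ["alertmanager.${var.tailscale_domain}"]
    }
  }

  # Assumed resource addresses; match whatever grafana_funnel depends on.
  depends_on = [
    helm_release.tailscale_operator,
    kubernetes_manifest.tailscale_acl,
  ]
}
```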
- [ ] `Closes #33` in the PR body
- [ ] `tofu plan` output included in the PR description
- [ ] `tofu fmt` and `tofu validate` pass

## Related
- `project-pal-e-platform`: project
- `phase-observability-3-alerting`: phase note in pal-e-docs