Add PrometheusRules, Alertmanager routing, and Alertmanager funnel #35
Reference
forgejo_admin/pal-e-platform!35
Summary
Adds the platform's first alerting rules, Alertmanager routing configuration, and an Alertmanager Tailscale funnel Ingress. This closes the detection gap that let incidents like the pal-e-docs Alembic crash go unnoticed.
Changes
terraform/main.tf
- Added `additionalPrometheusRules` to the kube-prometheus-stack Helm values with four rules: PodRestartStorm (warning), OOMKilled (critical), DiskPressure (critical), TargetDown (warning).
- Added `alertmanager.config` with a default receiver, a Slack receiver (conditional on `slack_webhook_url`), and routing.
- Added a `dynamic "set_sensitive"` block to inject the Slack webhook URL without exposing it in plan output.
- Added `kubernetes_ingress_v1.alertmanager_funnel` following the exact pattern of `grafana_funnel`.

terraform/variables.tf
- Added `slack_webhook_url` variable (sensitive, default empty string) so Alertmanager works without Slack configured.

`tofu plan` output (targeted)

Test Plan
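The conditional-receiver and secret-injection changes described above can be sketched roughly like this (an illustrative sketch only; the resource names, values layout, and exact Helm value paths are assumptions, not the PR's actual code):

```hcl
# Sketch only -- names and structure are assumed, not taken from the PR.
locals {
  # concat keeps "default" at index 0; "slack" is appended only when a
  # webhook is configured, so it is always at index 1 when present.
  receivers = concat(
    [{ name = "default" }],
    var.slack_webhook_url != "" ? [{ name = "slack" }] : []
  )
}

resource "helm_release" "kube_prometheus_stack" {
  name = "kube-prometheus-stack"
  # ... chart, repository, namespace, and values omitted ...

  # Inject the webhook URL out-of-band so it never appears in plan output.
  dynamic "set_sensitive" {
    for_each = var.slack_webhook_url != "" ? [1] : []
    content {
      name  = "alertmanager.config.receivers[1].slack_configs[0].api_url"
      value = var.slack_webhook_url
    }
  }
}
```

The `for_each` guard mirrors the `concat` condition, which is what makes the hardcoded `receivers[1]` index safe.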
- `tofu validate` passes (verified)
- `tofu fmt` clean (verified)
- `tofu plan` shows 1 add + 1 change (verified)
- Rules visible at the Prometheus `/rules` page
- Funnel reachable at `alertmanager.tail5b443a.ts.net`
- Alertmanager UI shows the `default` receiver and route config
- Without `slack_webhook_url`: no Slack receiver, alerts still visible in the Alertmanager UI
- With `slack_webhook_url`: Slack receiver present, alerts routed to `#alerts`

Review Checklist
- `tofu fmt` applied
- `tofu validate` passes
- `tofu plan` output included
- Slack webhook injected via `set_sensitive` (not embedded in `yamlencode`)

Related
- `plan-pal-e-platform` (Phase 3: Alerting + Deployment Protection)
- Closes #33
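The funnel Ingress listed under Changes can be pictured roughly as follows (a sketch assuming Tailscale operator conventions; the actual resource name, namespace, hostname, and `depends_on` entries in `terraform/main.tf` may differ):

```hcl
# Illustrative sketch -- mirrors the grafana_funnel pattern described above.
resource "kubernetes_ingress_v1" "alertmanager_funnel" {
  metadata {
    name      = "alertmanager-funnel"
    namespace = "monitoring"
    annotations = {
      # Exposes the ingress on the public internet via Tailscale Funnel.
      "tailscale.com/funnel" = "true"
    }
  }
  spec {
    ingress_class_name = "tailscale"
    default_backend {
      service {
        name = "kube-prometheus-stack-alertmanager"
        port { number = 9093 }
      }
    }
    tls {
      hosts = ["alertmanager"]
    }
  }
  # The PR's depends_on triplet is abbreviated to one entry here.
  depends_on = [helm_release.kube_prometheus_stack]
}
```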
Self-Review
Acceptance Criteria Check
- PodRestartStorm (for: 0m, warning), OOMKilled (for: 0m, critical), DiskPressure (for: 5m, critical), TargetDown (for: 5m, warning)
- Each rule has `summary` and `description` annotations and a `severity` label
- Rules expose `namespace`, `pod`, `instance`, or `container` labels via PromQL label references
- Slack webhook injected via the `set_sensitive` dynamic block
- Without `slack_webhook_url`: only the `default` receiver, no Slack route, alerts visible in the UI
- With `slack_webhook_url`: `slack` receiver added at index 1, `set_sensitive` injects `api_url`
- Funnel Ingress follows the `grafana_funnel` pattern exactly (annotation, ingress class, TLS, `depends_on`)
- Service `kube-prometheus-stack-alertmanager` on port `9093` verified against the live cluster
- `tofu validate` passes
- `tofu fmt` clean
- `tofu plan`: 1 add, 1 change, 0 destroy
- `Closes #33` in PR body

No issues found. PR is ready for human review.
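As an illustration of the rule shape the checklist describes, one entry under `additionalPrometheusRules` could look like this (the PromQL expression and group names are assumptions; only the alert name, severity, `for` duration, and annotation/label structure come from the PR):

```yaml
additionalPrometheusRules:
  - name: platform-alerts        # group name is hypothetical
    groups:
      - name: pods
        rules:
          - alert: PodRestartStorm
            # Hypothetical expression -- the PR's actual PromQL may differ.
            expr: increase(kube_pod_container_status_restarts_total[10m]) > 3
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting rapidly"
              description: "Container {{ $labels.container }} restarted {{ $value }} times in the last 10m."
```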
PR #35 Review
BLOCKERS
None.
NITS
**OOMKilled alert will stay firing after the pod recovers.** The metric `kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}` is a gauge that persists the last termination reason even after the container restarts successfully. This means the alert fires once and stays firing until the pod is deleted or terminates for a different reason. Not a bug -- this is a known trade-off with this metric. A future enhancement could use `increase(kube_pod_container_status_restarts_total{reason="OOMKilled"}[5m]) > 0` or add a `for` duration to reduce noise, but the current approach catches the OOM event, which is the primary goal.

**Alertmanager funnel is publicly accessible with no auth.** The `tailscale.com/funnel = "true"` annotation exposes Alertmanager to the public internet. Alertmanager has no built-in authentication, so anyone with the URL can view alert state, silences, and configuration. This follows the exact same pattern as the existing Grafana funnel, so it is consistent with the current security posture. Worth noting as a future hardening item (e.g., removing the `funnel` annotation and keeping it tailnet-only, or adding basic auth via a reverse proxy).

**Slack channel `#alerts` is hardcoded.** Consider extracting it to a variable in the future if multiple channels or environments are needed. Not blocking, since this is a single-cluster platform.

SOP COMPLIANCE
- Branch `33-alerting-rules-alertmanager-funnel` references issue #33
- Matches the plan (`plan-pal-e-platform`, Phase 3)
- `tofu plan` output included
- `tofu fmt` and `tofu validate` verified per PR body
- `Closes #33` present in PR body
- Scope is contained (only `terraform/main.tf` and `terraform/variables.tf` modified)
- Secret handled correctly: `sensitive = true` with empty default

TECHNICAL ASSESSMENT
**PrometheusRules (4 rules):** All PromQL expressions are syntactically correct. Template variables use proper Go template syntax (`$labels`, `$value`). The `for` durations are reasonable -- `0m` for immediate detection (PodRestartStorm, OOMKilled) and `5m` for sustained conditions (DiskPressure, TargetDown). Severity labels are appropriate (critical for OOM/disk, warning for restarts/target-down). The `additionalPrometheusRules` key is correctly placed at the root of the Helm values.

**Alertmanager config:** The conditional Slack receiver pattern is sound. The `concat` guarantees `default` is always at index `[0]` and `slack` at `[1]`, making the `receivers[1]` index in `set_sensitive` correct. The `for_each` condition matches the `concat` condition, so the index assumption always holds. The `dynamic "set_sensitive"` block correctly prevents the webhook URL from appearing in plan output.

**Alertmanager funnel:** Follows the exact same pattern as `grafana_funnel` -- same metadata structure, same `tailscale.com/funnel` annotation, same ingress class, same `depends_on` triplet. The service name (`kube-prometheus-stack-alertmanager`) and port (`9093`) are correct for the chart.

**Variable definition:** `slack_webhook_url` with `sensitive = true`, `type = string`, `default = ""` is correctly defined, allowing the stack to work without Slack configured.

VERDICT: APPROVED
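As a small follow-up to the hardcoded-channel item under NITS, extracting the channel could be as simple as the following sketch (the variable name is hypothetical):

```hcl
# Hypothetical variable to replace the hardcoded "#alerts" channel.
variable "slack_channel" {
  type        = string
  description = "Slack channel Alertmanager routes alerts to."
  default     = "#alerts"
}
```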