Add PrometheusRule alerts for error rate, latency, and availability #17
Type
Feature
Lineage
Child of ldraney/landscaping-assistant #43 (Observability & DORA metrics stack). Relates to Sloth SLOs (#91): manual threshold alerts now, Sloth auto-generates burn-rate alerts later.
Repo
ldraney/pal-e-platform
User Story
As a platform operator
I want alerts when landscaping-assistant error rate spikes, latency degrades, or availability drops
So that I know when things break without watching dashboards
Context
Alertmanager is wired to Telegram but has no application-level alert rules for landscaping-assistant. The platform has 31 rule groups and 4 existing PrometheusRule resources in terraform/modules/monitoring/main.tf (blackbox, embedding, payment pipeline, gmail oauth). This issue adds threshold-based alerts using the verified yabeda-rails metric names. Later, Sloth (#91) will auto-generate multi-window burn-rate alerts from SLO definitions; these manual rules serve as the initial safety net.
File Targets
Files the agent should modify or create:
terraform/modules/monitoring/main.tf (pal-e-platform): add kubernetes_manifest resource for PrometheusRule CRD, matching the existing pattern (blackbox, embedding, etc.)
Files the agent should NOT touch:
overlays/landscaping-assistant/ (pal-e-deployments): PrometheusRules are platform-level, not per-service kustomize overlays
Verified Metric Names (from yabeda-rails)
rails_requests_total (counter), labels: controller, action, status, format, method
rails_request_duration_bucket/_sum/_count (histogram), same labels
puma_running, puma_max_threads, puma_backlog (gauges)
PromQL for Alert Rules
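A minimal sketch of how the three threshold alerts might look as a kubernetes_manifest resource, using only the verified metric names listed above. The resource name, alert names, thresholds (5% error ratio, 1 second p95), "for" durations, severity labels, and the `up`-based availability check are illustrative assumptions, not values taken from this issue. The sketch also assumes request durations are recorded in seconds and that scraped targets carry a `namespace` label.

```hcl
locals {
  # Label matcher applied to every query, per the Constraints section.
  la_ns = "namespace=\"landscaping-assistant\""
}

# Hypothetical resource name; follow the naming of the existing rules in main.tf.
resource "kubernetes_manifest" "landscaping_assistant_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "landscaping-assistant-alerts" # assumed name
      namespace = "monitoring"
      # If Prometheus uses a ruleSelector, copy the matching labels
      # from one of the existing PrometheusRule resources.
    }
    spec = {
      groups = [{
        name = "landscaping-assistant.rules" # assumed group name
        rules = [
          {
            # Share of requests answered with a 5xx status over 5 minutes.
            alert       = "LandscapingAssistantHighErrorRate"
            expr        = "sum(rate(rails_requests_total{${local.la_ns},status=~\"5..\"}[5m])) / sum(rate(rails_requests_total{${local.la_ns}}[5m])) > 0.05"
            "for"       = "5m"
            labels      = { severity = "critical" }
            annotations = { summary = "5xx error ratio above 5% (assumed threshold)" }
          },
          {
            # p95 request latency estimated from the histogram buckets.
            alert       = "LandscapingAssistantHighLatency"
            expr        = "histogram_quantile(0.95, sum by (le) (rate(rails_request_duration_bucket{${local.la_ns}}[5m]))) > 1"
            "for"       = "10m"
            labels      = { severity = "warning" }
            annotations = { summary = "p95 latency above 1s (assumed threshold)" }
          },
          {
            # Fires when no scrape target in the namespace reports up == 1.
            alert       = "LandscapingAssistantDown"
            expr        = "absent(up{${local.la_ns}} == 1)"
            "for"       = "2m"
            labels      = { severity = "critical" }
            annotations = { summary = "No healthy landscaping-assistant target" }
          },
        ]
      }]
    }
  }
}
```

The key "for" is quoted so that HCL does not read it as the start of a for expression. The matcher is kept in a local so that every expression carries the same namespace filter.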
Acceptance Criteria
Test Expectations
tofu plan shows the new PrometheusRule resource
tofu apply succeeds
kubectl get prometheusrules -n monitoring shows the new rule
Constraints
terraform/modules/monitoring/main.tf (kubernetes_manifest resource type)
namespace="landscaping-assistant" on all queries
Checklist
Related
project-landscaping-observability: observability project
ldraney/landscaping-assistant #43: parent observability issue
ldraney/landscaping-assistant #91: Sloth SLOs (future auto-generated rules)
Scope review finding (review-17-2026-06-01):
pal-e-platform Terraform, not pal-e-deployments kustomize. Every existing PrometheusRule is in terraform/modules/monitoring/main.tf as kubernetes_manifest resources.
http_requests_total and http_request_duration_seconds_bucket, but yabeda-rails emits rails_requests_total and rails_request_duration.
Needs refinement before this is workable.
Scope Review: NEEDS_REFINEMENT
Review note: review-1293-2026-06-02
Prior flags about "wrong repo" and "wrong metric names" are confirmed. Five issues found:
landscaping-assistant but all file targets are in pal-e-deployments. Needs to be re-created on the correct repo with board item URL updated.
rails_requests_total, rails_request_duration_seconds_bucket). Agent will not know what to write.
story:observability is not defined in the project-landscaping-assistant user-stories table.
arch-k8s-deploy note exists in pal-e-docs.
Scope Review: NEEDS_REFINEMENT (3rd review)
Review note: review-1307-2026-06-04
Both issues flagged in the two prior reviews (2026-06-01, 2026-06-02) remain unresolved; the issue body has not been updated.
5 findings:
Wrong repo (UNFIXED): ### Repo says pal-e-deployments but all existing PrometheusRules are Terraform kubernetes_manifest resources in pal-e-platform at terraform/modules/monitoring/main.tf. Verified against 4 existing alert resources (blackbox, embedding, payment, gmail).
Wrong file targets (UNFIXED): Points to Kustomize overlay paths. Should target terraform/modules/monitoring/main.tf in pal-e-platform.
Missing metric names (UNFIXED): AC says "5xx error rate" and "p95 latency" without specifying PromQL or metric names. The app uses yabeda-rails which emits rails_requests_total (counter) and rails_request_duration_bucket (histogram), NOT the generic http_requests_total / http_request_duration_seconds_bucket an agent would guess.
Wrong test expectations: kustomize build should be tofu plan / tofu validate.
Pyrra overlap: The observability roadmap (docs/observability-roadmap.md) puts PrometheusRules under Pyrra (Phase 6) for auto-generated SLO burn-rate alerting. Unclear if these manual rules are interim or permanent.
All 5 recommendations are detailed in the review note with suggested PromQL expressions and corrected file targets.