Add PrometheusRule alerts for error rate, latency, and availability #17

Open
opened 2026-05-25 03:04:14 +00:00 by ldraney · 3 comments
Owner

Type

Feature

Lineage

Child of ldraney/landscaping-assistant #43 (Observability & DORA metrics stack).
Relates to Sloth SLOs (#91) — manual threshold alerts now, Sloth auto-generates burn-rate alerts later.

Repo

ldraney/pal-e-platform

User Story

As a platform operator
I want alerts when landscaping-assistant error rate spikes, latency degrades, or availability drops
So that I know when things break without watching dashboards

Context

Alertmanager is wired to Telegram but has no application-level alert rules for landscaping-assistant. The platform has 31 rule groups and 4 existing PrometheusRule resources in terraform/modules/monitoring/main.tf (blackbox, embedding, payment pipeline, gmail oauth). This issue adds threshold-based alerts using the verified yabeda-rails metric names. Later, Sloth (#91) will auto-generate multi-window burn-rate alerts from SLO definitions — these manual rules serve as the initial safety net.

File Targets

Files the agent should modify or create:

  • terraform/modules/monitoring/main.tf (pal-e-platform) — add kubernetes_manifest resource for PrometheusRule CRD, matching the existing pattern (blackbox, embedding, etc.)

Files the agent should NOT touch:

  • overlays/landscaping-assistant/ (pal-e-deployments) — PrometheusRules are platform-level, not per-service kustomize overlays

Verified Metric Names (from yabeda-rails)

  • rails_requests_total (counter) — labels: controller, action, status, format, method
  • rails_request_duration_bucket / _sum / _count (histogram) — same labels
  • puma_running, puma_max_threads, puma_backlog (gauges)

PromQL for Alert Rules

# Error rate > 5% for 5 minutes
sum(rate(rails_requests_total{namespace="landscaping-assistant", status=~"5.."}[5m]))
/ clamp_min(sum(rate(rails_requests_total{namespace="landscaping-assistant"}[5m])), 0.001) > 0.05

# P95 latency > 1s for 5 minutes
histogram_quantile(0.95, sum(rate(rails_request_duration_bucket{namespace="landscaping-assistant"}[5m])) by (le)) > 1

# Zero requests for 5 minutes (availability)
sum(rate(rails_requests_total{namespace="landscaping-assistant"}[5m])) == 0
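
The three expressions above could be wrapped in a single PrometheusRule along the following lines. This is a minimal sketch, not the existing pattern copied from main.tf: the resource name, alert names, group name, severity labels, and summary text are all hypothetical, and any metadata labels the Prometheus rule selector requires should be copied from the existing blackbox or embedding rules. The 5-minute hold comes from the `for` field, which is quoted here so HCL never reads it as the start of a `for` expression.

```hcl
# Sketch only. Resource name, alert names, severities, and annotations are
# assumptions; metric names and thresholds come from the issue text above.
resource "kubernetes_manifest" "landscaping_assistant_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "landscaping-assistant-alerts" # hypothetical name
      namespace = "monitoring"
      # Assumption: add the same labels the existing rules carry if the
      # Prometheus ruleSelector filters on them.
    }
    spec = {
      groups = [{
        name = "landscaping-assistant" # hypothetical group name
        rules = [
          {
            alert = "LandscapingAssistantHighErrorRate"
            expr  = <<-EOT
              sum(rate(rails_requests_total{namespace="landscaping-assistant", status=~"5.."}[5m]))
              / clamp_min(sum(rate(rails_requests_total{namespace="landscaping-assistant"}[5m])), 0.001) > 0.05
            EOT
            "for"       = "5m"
            labels      = { severity = "critical" } # assumed label value
            annotations = { summary = "5xx error rate above 5% for 5 minutes" }
          },
          {
            alert = "LandscapingAssistantHighLatency"
            expr  = <<-EOT
              histogram_quantile(0.95, sum(rate(rails_request_duration_bucket{namespace="landscaping-assistant"}[5m])) by (le)) > 1
            EOT
            "for"       = "5m"
            labels      = { severity = "warning" } # assumed label value
            annotations = { summary = "p95 request latency above 1s for 5 minutes" }
          },
          {
            alert = "LandscapingAssistantNoRequests"
            expr  = <<-EOT
              sum(rate(rails_requests_total{namespace="landscaping-assistant"}[5m])) == 0
            EOT
            "for"       = "5m"
            labels      = { severity = "critical" } # assumed label value
            annotations = { summary = "No requests received for 5 minutes" }
          },
        ]
      }]
    }
  }
}
```

One caveat on the availability rule: if the rails_requests_total series disappear entirely (for example, the target stops being scraped), the sum returns no samples, and a comparison against an empty result never fires. If that case matters, an absent()-based rule may be worth adding alongside this one.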

Acceptance Criteria

  • PrometheusRule resource deployed in monitoring namespace
  • Error rate alert fires when 5xx rate exceeds 5% for 5 minutes
  • Latency alert fires when p95 exceeds 1 second for 5 minutes
  • Availability alert fires when no requests are received for 5 minutes
  • Alerts route to Telegram via existing Alertmanager config

Test Expectations

  • tofu plan shows the new PrometheusRule resource
  • tofu apply succeeds
  • kubectl get prometheusrules -n monitoring shows the new rule
  • Prometheus UI shows the rules as active (Alerts tab)

Constraints

  • Follow existing pattern in terraform/modules/monitoring/main.tf (kubernetes_manifest resource type)
  • Use verified metric names above — do not guess
  • Namespace filter: namespace="landscaping-assistant" on all queries

Checklist

  • PR opened (pal-e-platform)
  • Tests pass
  • No unrelated changes

Related

  • project-landscaping-observability — observability project
  • ldraney/landscaping-assistant #43 — parent observability issue
  • ldraney/landscaping-assistant #91 — Sloth SLOs (future auto-generated rules)
Author
Owner

Scope review finding (review-17-2026-06-01):

  1. Wrong repo: Should target pal-e-platform Terraform, not pal-e-deployments kustomize. Every existing PrometheusRule is in terraform/modules/monitoring/main.tf as kubernetes_manifest resources.
  2. Wrong metric names: Alert expressions use http_requests_total and http_request_duration_seconds_bucket, but yabeda-rails emits rails_requests_total and rails_request_duration.

Needs refinement before this is workable.

Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-1293-2026-06-02

Prior flags about "wrong repo" and "wrong metric names" are confirmed. Five issues found:

  • Wrong repo: Issue is filed on landscaping-assistant but all file targets are in pal-e-deployments. Needs to be re-created on the correct repo with board item URL updated.
  • Missing metric names: AC references "5xx error rate" and "p95 latency" without specifying actual PromQL or metric names (rails_requests_total, rails_request_duration_seconds_bucket). Agent will not know what to write.
  • Stale dependency in title: #19 is now closed/done. Only #15 (ServiceMonitor fix) remains as a blocker.
  • Missing story note: story:observability is not defined in the project-landscaping-assistant user-stories table.
  • Missing arch note: No arch-k8s-deploy note exists in pal-e-docs.
Author
Owner

Scope Review: NEEDS_REFINEMENT (3rd review)

Review note: review-1307-2026-06-04

Both issues flagged in the two prior reviews (2026-06-01, 2026-06-02) remain unresolved — the issue body has not been updated.

5 findings:

  1. Wrong repo (UNFIXED): ### Repo says pal-e-deployments but all existing PrometheusRules are Terraform kubernetes_manifest resources in pal-e-platform at terraform/modules/monitoring/main.tf. Verified against 4 existing alert resources (blackbox, embedding, payment, gmail).

  2. Wrong file targets (UNFIXED): Points to Kustomize overlay paths. Should target terraform/modules/monitoring/main.tf in pal-e-platform.

  3. Missing metric names (UNFIXED): AC says "5xx error rate" and "p95 latency" without specifying PromQL or metric names. The app uses yabeda-rails which emits rails_requests_total (counter) and rails_request_duration_bucket (histogram) — NOT the generic http_requests_total / http_request_duration_seconds_bucket an agent would guess.

  4. Wrong test expectations: kustomize build should be tofu plan / tofu validate.

  5. Pyrra overlap: The observability roadmap (docs/observability-roadmap.md) puts PrometheusRules under Pyrra (Phase 6) for auto-generated SLO burn-rate alerting. Unclear if these manual rules are interim or permanent.

All 5 recommendations are detailed in the review note with suggested PromQL expressions and corrected file targets.

Reference
ldraney/landscaping-assistant#17