Add PrometheusRule alerts for error rate, latency, and availability

ldraney commented

2026-05-25 03:04:14 +00:00

Owner

Type

Feature

Lineage

Child of ldraney/landscaping-assistant #43 (Observability & DORA metrics stack).
Relates to Sloth SLOs (#91) — manual threshold alerts now, Sloth auto-generates burn-rate alerts later.

Repo

ldraney/pal-e-platform

User Story

As a platform operator
I want alerts when landscaping-assistant error rate spikes, latency degrades, or availability drops
So that I know when things break without watching dashboards

Context

Alertmanager is wired to Telegram but has no application-level alert rules for landscaping-assistant. The platform has 31 rule groups and 4 existing PrometheusRule resources in terraform/modules/monitoring/main.tf (blackbox, embedding, payment pipeline, gmail oauth). This issue adds threshold-based alerts using the verified yabeda-rails metric names. Later, Sloth (#91) will auto-generate multi-window burn-rate alerts from SLO definitions — these manual rules serve as the initial safety net.

File Targets

Files the agent should modify or create:

terraform/modules/monitoring/main.tf (pal-e-platform) — add kubernetes_manifest resource for PrometheusRule CRD, matching the existing pattern (blackbox, embedding, etc.)

Files the agent should NOT touch:

overlays/landscaping-assistant/ (pal-e-deployments) — PrometheusRules are platform-level, not per-service kustomize overlays

Verified Metric Names (from yabeda-rails)

rails_requests_total (counter) — labels: controller, action, status, format, method
rails_request_duration_bucket / _sum / _count (histogram) — same labels
puma_running, puma_max_threads, puma_backlog (gauges)

PromQL for Alert Rules

# Error rate > 5% for 5 minutes
sum(rate(rails_requests_total{namespace="landscaping-assistant", status=~"5.."}[5m]))
/ clamp_min(sum(rate(rails_requests_total{namespace="landscaping-assistant"}[5m])), 0.001) > 0.05

# P95 latency > 1s for 5 minutes
histogram_quantile(0.95, sum(rate(rails_request_duration_bucket{namespace="landscaping-assistant"}[5m])) by (le)) > 1

# Zero requests for 5 minutes (availability)
sum(rate(rails_requests_total{namespace="landscaping-assistant"}[5m])) == 0
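
As a rough illustration of how the three expressions above could be wrapped in the `kubernetes_manifest` pattern this issue asks for, here is a minimal sketch. The resource name, rule and group names, alert names, severities, and annotations are all hypothetical and not taken from the existing module; they should be aligned with the blackbox/embedding resources already in `terraform/modules/monitoring/main.tf`.

```hcl
# Sketch only: every name, severity, and annotation below is an assumption.
resource "kubernetes_manifest" "landscaping_assistant_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "landscaping-assistant-alerts"
      namespace = "monitoring"
      # If the Prometheus ruleSelector requires labels, copy metadata.labels
      # from one of the existing PrometheusRule resources in this file.
    }
    spec = {
      groups = [{
        name = "landscaping-assistant"
        rules = [
          {
            alert = "LandscapingAssistantHighErrorRate"
            expr  = <<-EOT
              sum(rate(rails_requests_total{namespace="landscaping-assistant", status=~"5.."}[5m]))
              / clamp_min(sum(rate(rails_requests_total{namespace="landscaping-assistant"}[5m])), 0.001) > 0.05
            EOT
            "for"       = "5m"
            labels      = { severity = "critical" }
            annotations = { summary = "5xx error rate above 5% for 5 minutes" }
          },
          {
            alert = "LandscapingAssistantHighLatency"
            expr  = <<-EOT
              histogram_quantile(0.95, sum(rate(rails_request_duration_bucket{namespace="landscaping-assistant"}[5m])) by (le)) > 1
            EOT
            "for"       = "5m"
            labels      = { severity = "warning" }
            annotations = { summary = "p95 request latency above 1s for 5 minutes" }
          },
          {
            alert = "LandscapingAssistantNoRequests"
            expr  = <<-EOT
              sum(rate(rails_requests_total{namespace="landscaping-assistant"}[5m])) == 0
            EOT
            "for"       = "5m"
            labels      = { severity = "critical" }
            annotations = { summary = "No requests observed for 5 minutes" }
          },
        ]
      }]
    }
  }
}
```

The `"for"` key is quoted to avoid any ambiguity with HCL's `for` expression keyword. One caveat about the third rule: `sum(...) == 0` returns no samples when the series is absent altogether (for example, if the scrape target disappears), so as written it only covers the case where the metric is still being scraped but no requests arrive.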

Acceptance Criteria

PrometheusRule resource deployed in monitoring namespace
Error rate alert fires when 5xx rate exceeds 5% for 5 minutes
Latency alert fires when p95 exceeds 1 second for 5 minutes
Availability alert fires when zero requests for 5 minutes
Alerts route to Telegram via existing Alertmanager config

Test Expectations

tofu plan shows the new PrometheusRule resource
tofu apply succeeds
kubectl get prometheusrules -n monitoring shows the new rule
Prometheus UI shows the rules as active (Alerts tab)

Constraints

Follow existing pattern in terraform/modules/monitoring/main.tf (kubernetes_manifest resource type)
Use verified metric names above — do not guess
Namespace filter: namespace="landscaping-assistant" on all queries

Checklist

PR opened (pal-e-platform)
Tests pass
No unrelated changes

Related

project-landscaping-observability — observability project
ldraney/landscaping-assistant #43 — parent observability issue
ldraney/landscaping-assistant #91 — Sloth SLOs (future auto-generated rules)


ldraney referenced this issue

2026-05-29 12:13:02 +00:00

Observability & DORA metrics stack [PARENT] #43

ldraney referenced this issue

2026-05-29 12:27:59 +00:00

Add /metrics endpoint with yabeda-prometheus #47

ldraney commented

2026-06-01 08:58:03 +00:00

Author

Owner

Scope review finding (review-17-2026-06-01):

Wrong repo: Should target pal-e-platform Terraform, not pal-e-deployments kustomize. Every existing PrometheusRule is in terraform/modules/monitoring/main.tf as kubernetes_manifest resources.
Wrong metric names: Alert expressions use http_requests_total and http_request_duration_seconds_bucket, but yabeda-rails emits rails_requests_total and rails_request_duration.

Needs refinement before this is workable.


ldraney commented

2026-06-02 12:29:33 +00:00

Author

Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-1293-2026-06-02

Prior flags about "wrong repo" and "wrong metric names" are confirmed. Five issues found:

Wrong repo: Issue is filed on landscaping-assistant but all file targets are in pal-e-deployments. Needs to be re-created on the correct repo with board item URL updated.
Missing metric names: AC references "5xx error rate" and "p95 latency" without specifying actual PromQL or metric names (rails_requests_total, rails_request_duration_seconds_bucket). Agent will not know what to write.
Stale dependency in title: #19 is now closed/done. Only #15 (ServiceMonitor fix) remains as a blocker.
Missing story note: story:observability is not defined in the project-landscaping-assistant user-stories table.
Missing arch note: No arch-k8s-deploy note exists in pal-e-docs.


ldraney referenced this issue

2026-06-04 04:19:00 +00:00

500 on POST /today: @queued_property_ids nil in create turbo stream #74

ldraney referenced this issue

2026-06-04 05:32:26 +00:00

Add observability roadmap doc with target architecture #84

ldraney commented

2026-06-04 11:56:22 +00:00

Author

Owner

Scope Review: NEEDS_REFINEMENT (3rd review)

Review note: review-1307-2026-06-04

Both issues flagged in the two prior reviews (2026-06-01, 2026-06-02) remain unresolved — the issue body has not been updated.

5 findings:

Wrong repo (UNFIXED): ### Repo says pal-e-deployments but all existing PrometheusRules are Terraform kubernetes_manifest resources in pal-e-platform at terraform/modules/monitoring/main.tf. Verified against 4 existing alert resources (blackbox, embedding, payment, gmail).
Wrong file targets (UNFIXED): Points to Kustomize overlay paths. Should target terraform/modules/monitoring/main.tf in pal-e-platform.
Missing metric names (UNFIXED): AC says "5xx error rate" and "p95 latency" without specifying PromQL or metric names. The app uses yabeda-rails which emits rails_requests_total (counter) and rails_request_duration_bucket (histogram) — NOT the generic http_requests_total / http_request_duration_seconds_bucket an agent would guess.
Wrong test expectations: kustomize build should be tofu plan / tofu validate.
Pyrra overlap: The observability roadmap (docs/observability-roadmap.md) puts PrometheusRules under Pyrra (Phase 6) for auto-generated SLO burn-rate alerting. Unclear if these manual rules are interim or permanent.

All 5 recommendations are detailed in the review note with suggested PromQL expressions and corrected file targets.


ldraney referenced this issue

2026-06-04 12:03:48 +00:00

Deploy Sloth for SLO tracking with error budgets #90

ldraney referenced this issue

2026-06-04 12:12:10 +00:00

Production 500 on POST /today: fix merged but not deployed #94

ldraney referenced this issue

2026-06-04 12:12:28 +00:00

Clean up AlertManager: disable default kube-prometheus-stack rules #95