Observability & DORA metrics stack [PARENT]

ldraney commented

2026-05-29 12:13:02 +00:00

Owner

Type

Feature

Lineage

Standalone — consolidates existing observability/DORA sub-issues (#15, #16, #17, #19, #20, #21) into one tracked parent. #22 closed as duplicate of #16.

Repo

ldraney/landscaping-assistant

User Story

As the operator
I want Prometheus metrics, Grafana dashboards, alerting rules, and uptime monitoring
So that I have production visibility into request performance, error rates, and availability

Context

The app has structured logging (lograge JSON) and a local error subscriber already shipped (#39). The next layer is Prometheus metrics exposure, Grafana visualization, alerting, and blackbox uptime — the standard golden signals + DORA stack.

No /metrics endpoint exists. No ServiceMonitor, PrometheusRule, or Grafana dashboard manifests exist. The monitoring namespace on the cluster is available but this app isn't registered with it.

This is a parent tracking ticket. Individual sub-issues stay open for branch/PR linking. Progress is tracked via the checklist below.

File Targets

Files the agent should modify or create:

Gemfile — add prometheus-client gem
config/routes.rb — mount /metrics endpoint
config/initializers/prometheus.rb — metrics registry setup
k8s/service-monitor.yaml — ServiceMonitor CRD
k8s/prometheus-rule.yaml — PrometheusRule CRD
k8s/grafana-dashboard.json — golden signals dashboard
Blackbox exporter config (in platform-terraform or equivalent)

Files the agent should NOT touch:

config/initializers/error_subscriber.rb — already done
Lograge config in production.rb — already done

Acceptance Criteria

Phase 1 — Metrics foundation

#19: App exposes /metrics endpoint with request rate, error rate, duration histograms
#15: ServiceMonitor CRD deployed, Prometheus successfully scraping metrics

Phase 2 — Visualization & alerting

#16: Grafana golden signals dashboard shows request rate, error rate, latency, saturation
#17: PrometheusRule fires alerts for >5% error rate, >1s p99 latency, pod unavailability
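The #17 thresholds can be illustrated with plain arithmetic. The following is a minimal Ruby sketch, not the actual rule: the real alerts would be PromQL expressions in k8s/prometheus-rule.yaml. All names here (error_rate, estimate_quantile, firing_alerts, the sample numbers) are hypothetical, and the evaluation window is left out because the issue does not specify one. The quantile estimate assumes cumulative histogram buckets and linear interpolation inside a bucket, which is similar in spirit to what Prometheus does for histogram quantiles.

```ruby
# Illustrative only: hypothetical helpers, not part of the app or of prometheus-client.
ERROR_RATE_LIMIT = 0.05   # ">5% error rate" from the acceptance criteria
P99_LIMIT_SECONDS = 1.0   # ">1s p99 latency" from the acceptance criteria

# Fraction of requests that failed within some window.
def error_rate(total:, errors:)
  return 0.0 if total.zero?

  errors.to_f / total
end

# Estimate a quantile from cumulative buckets.
# buckets: [[upper_bound_seconds, cumulative_count], ...] sorted by bound,
# ending with [Float::INFINITY, total_count].
def estimate_quantile(q, buckets)
  total = buckets.last[1]
  return nil if total.zero?

  rank = q * total
  prev_bound = 0.0
  prev_count = 0
  buckets.each do |bound, count|
    if count >= rank
      # Inside the +Inf bucket the highest finite bound is the best estimate.
      return prev_bound if bound.infinite?

      in_bucket = count - prev_count
      return bound if in_bucket.zero?

      # Linear interpolation between the bucket's lower and upper bound.
      return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
    end
    prev_bound = bound
    prev_count = count
  end
  nil
end

# Returns the names of the thresholds that are currently exceeded.
def firing_alerts(total:, errors:, buckets:)
  alerts = []
  alerts << :high_error_rate if error_rate(total: total, errors: errors) > ERROR_RATE_LIMIT
  p99 = estimate_quantile(0.99, buckets)
  alerts << :high_p99_latency if p99 && p99 > P99_LIMIT_SECONDS
  alerts
end

# Made-up sample data: 1000 requests, 62 of them errors.
sample_buckets = [[0.1, 900], [0.5, 980], [1.0, 992], [2.5, 1000], [Float::INFINITY, 1000]]
sample_alerts = firing_alerts(total: 1000, errors: 62, buckets: sample_buckets)
```

With this sample data the error rate is 6.2%, which exceeds the 5% limit, while the estimated p99 falls inside the 0.5s to 1.0s bucket and stays under the 1s limit. Pod unavailability is not modelled here because it depends on cluster-level metrics rather than on request data.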

Phase 3 — Uptime & verification

#21: Blackbox exporter probing the app's health endpoint
#20: DORA metrics (deploy frequency, lead time, MTTR, change failure rate) visible in Grafana
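To make the four DORA metrics in #20 concrete, here is a minimal Ruby sketch of how they could be derived from deploy and incident records. The issue does not say where this data will come from, so the Deploy and Incident structs, their fields, and the sample records are all assumptions made for illustration. In practice the numbers would more likely be produced by Prometheus queries and shown in Grafana.

```ruby
require "time"

# Hypothetical record shapes; the real data source is not specified in this issue.
Deploy = Struct.new(:commit_at, :deployed_at, :failed, keyword_init: true)
Incident = Struct.new(:opened_at, :resolved_at, keyword_init: true)

# Arithmetic mean that tolerates an empty list.
def mean(values)
  return 0.0 if values.empty?

  values.sum / values.size
end

# Computes the four DORA metrics over a window of `window_days` days.
def dora_summary(deploys, incidents, window_days:)
  lead_times = deploys.map { |d| d.deployed_at - d.commit_at }        # seconds
  restore_times = incidents.map { |i| i.resolved_at - i.opened_at }   # seconds
  failed = deploys.count(&:failed)

  {
    deploys_per_day: deploys.size.to_f / window_days,
    mean_lead_time_hours: mean(lead_times) / 3600.0,
    mttr_minutes: mean(restore_times) / 60.0,
    change_failure_rate: deploys.empty? ? 0.0 : failed.to_f / deploys.size
  }
end

t = ->(s) { Time.iso8601(s) }

# Made-up sample week: four deploys, one of which caused an incident.
deploys = [
  Deploy.new(commit_at: t["2026-05-01T09:00:00Z"], deployed_at: t["2026-05-01T11:00:00Z"], failed: false),
  Deploy.new(commit_at: t["2026-05-03T10:00:00Z"], deployed_at: t["2026-05-03T14:00:00Z"], failed: true),
  Deploy.new(commit_at: t["2026-05-05T08:00:00Z"], deployed_at: t["2026-05-05T11:00:00Z"], failed: false),
  Deploy.new(commit_at: t["2026-05-06T12:00:00Z"], deployed_at: t["2026-05-06T15:00:00Z"], failed: false)
]
incidents = [
  Incident.new(opened_at: t["2026-05-03T14:10:00Z"], resolved_at: t["2026-05-03T14:40:00Z"])
]

summary = dora_summary(deploys, incidents, window_days: 7)
```

Lead time is measured here from commit to deploy and MTTR from incident open to resolution. Both definitions are assumptions that #20 would need to confirm.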

Test Expectations

/metrics endpoint returns Prometheus text format with expected metric names
kubectl get servicemonitor -n landscaping-assistant shows the monitor
Prometheus targets page shows landscaping-assistant as UP
Grafana dashboard loads without errors
Run command: curl localhost:3000/metrics (expected to return the Prometheus exposition format)
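The first expectation above ("expected metric names") could be checked with a small script. This is a sketch only: the metric names in the sample payload are placeholders, since the issue does not list the names the app will export, and declared_metrics and missing_metrics are hypothetical helpers. It reads the "# TYPE" lines of the text exposition format and reports which expected metric families are absent.

```ruby
# Sample payload in the Prometheus text exposition format.
# Metric names are placeholders, not the names the app will necessarily export.
SAMPLE_BODY = <<~TEXT
  # HELP http_requests_total Total HTTP requests.
  # TYPE http_requests_total counter
  http_requests_total{code="200"} 1027
  http_requests_total{code="500"} 3
  # TYPE http_request_duration_seconds histogram
  http_request_duration_seconds_bucket{le="0.5"} 980
  http_request_duration_seconds_bucket{le="+Inf"} 1030
  http_request_duration_seconds_sum 212.4
  http_request_duration_seconds_count 1030
TEXT

# Returns { metric_family_name => declared_type } based on "# TYPE" lines.
def declared_metrics(body)
  body.each_line.with_object({}) do |line, found|
    match = line.match(/\A# TYPE (\S+) (\S+)/)
    found[match[1]] = match[2] if match
  end
end

# Returns the expected metric names that the payload does not declare.
def missing_metrics(body, expected)
  expected - declared_metrics(body).keys
end

expected = %w[http_requests_total http_request_duration_seconds]
missing = missing_metrics(SAMPLE_BODY, expected)
```

Against a running app, the body could come from the curl command above or from Net::HTTP.get(URI("http://localhost:3000/metrics")) in place of SAMPLE_BODY. An empty result from missing_metrics would mean every expected metric family is declared.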

Constraints

Use prometheus-client gem (not prometheus_exporter — we want in-process, not a separate collector)
ServiceMonitor must match existing label conventions on the cluster
Grafana dashboard should be provisioned via ConfigMap, not manual import
Sub-issues remain open for individual PR tracking

Checklist

Phase 1 PRs merged (#19, #15)
Phase 2 PRs merged (#16, #17)
Phase 3 PRs merged (#21, #20)
All sub-issues closed
This parent issue closed

Related

project-landscaping-assistant — project this affects
#39 — structured logging (done)
#18 — CI registry URL fix (done)
#22 — duplicate of #16 (closed)
