Observability & DORA metrics stack [PARENT] #43

Open
opened 2026-05-29 12:13:02 +00:00 by ldraney · 0 comments
Owner

Type

Feature

Lineage

Standalone — consolidates existing observability/DORA sub-issues (#15, #16, #17, #19, #20, #21) into one tracked parent. #22 closed as duplicate of #16.

Repo

ldraney/landscaping-assistant

User Story

As the operator
I want Prometheus metrics, Grafana dashboards, alerting rules, and uptime monitoring
So that I have production visibility into request performance, error rates, and availability

Context

The app has structured logging (lograge JSON) and a local error subscriber already shipped (#39). The next layer is Prometheus metrics exposure, Grafana visualization, alerting, and blackbox uptime — the standard golden signals + DORA stack.

No /metrics endpoint exists. No ServiceMonitor, PrometheusRule, or Grafana dashboard manifests exist. The monitoring namespace on the cluster is available but this app isn't registered with it.

This is a parent tracking ticket. Individual sub-issues stay open for branch/PR linking. Progress is tracked via the checklist below.

File Targets

Files the agent should modify or create:

  • Gemfile — add prometheus-client gem
  • config/routes.rb — mount /metrics endpoint
  • config/initializers/prometheus.rb — metrics registry setup
  • k8s/service-monitor.yaml — ServiceMonitor CRD
  • k8s/prometheus-rule.yaml — PrometheusRule CRD
  • k8s/grafana-dashboard.json — golden signals dashboard
  • Blackbox exporter config (in platform-terraform or equivalent)

Files the agent should NOT touch:

  • config/initializers/error_subscriber.rb — already done
  • Lograge config in production.rb — already done

Acceptance Criteria

Phase 1 — Metrics foundation

  • #19: App exposes /metrics endpoint with request rate, error rate, duration histograms
  • #15: ServiceMonitor CRD deployed, Prometheus successfully scraping metrics

Phase 2 — Visualization & alerting

  • #16: Grafana golden signals dashboard shows request rate, error rate, latency, saturation
  • #17: PrometheusRule fires alerts for >5% error rate, >1s p99 latency, pod unavailability

Phase 3 — Uptime & verification

  • #21: Blackbox exporter probing the app's health endpoint
  • #20: DORA metrics (deploy frequency, lead time, MTTR, change failure rate) visible in Grafana

Test Expectations

  • /metrics endpoint returns Prometheus text format with expected metric names
  • kubectl get servicemonitor -n landscaping-assistant shows the monitor
  • Prometheus targets page shows landscaping-assistant as UP
  • Grafana dashboard loads without errors
  • Run command: curl localhost:3000/metrics returns prometheus exposition format

Constraints

  • Use prometheus-client gem (not prometheus_exporter — we want in-process, not a separate collector)
  • ServiceMonitor must match existing label conventions on the cluster
  • Grafana dashboard should be provisioned via ConfigMap, not manual import
  • Sub-issues remain open for individual PR tracking

Checklist

  • Phase 1 PRs merged (#19, #15)
  • Phase 2 PRs merged (#16, #17)
  • Phase 3 PRs merged (#21, #20)
  • All sub-issues closed
  • This parent issue closed
  • project-landscaping-assistant — project this affects
  • #39 — structured logging (done)
  • #18 — CI registry URL fix (done)
  • #22 — duplicate of #16 (closed)
### Type Feature ### Lineage Standalone — consolidates existing observability/DORA sub-issues (#15, #16, #17, #19, #20, #21) into one tracked parent. #22 closed as duplicate of #16. ### Repo `ldraney/landscaping-assistant` ### User Story As the operator I want Prometheus metrics, Grafana dashboards, alerting rules, and uptime monitoring So that I have production visibility into request performance, error rates, and availability ### Context The app has structured logging (lograge JSON) and a local error subscriber already shipped (#39). The next layer is Prometheus metrics exposure, Grafana visualization, alerting, and blackbox uptime — the standard golden signals + DORA stack. No `/metrics` endpoint exists. No ServiceMonitor, PrometheusRule, or Grafana dashboard manifests exist. The monitoring namespace on the cluster is available but this app isn't registered with it. This is a parent tracking ticket. Individual sub-issues stay open for branch/PR linking. Progress is tracked via the checklist below. ### File Targets Files the agent should modify or create: - `Gemfile` — add prometheus-client gem - `config/routes.rb` — mount /metrics endpoint - `config/initializers/prometheus.rb` — metrics registry setup - `k8s/service-monitor.yaml` — ServiceMonitor CRD - `k8s/prometheus-rule.yaml` — PrometheusRule CRD - `k8s/grafana-dashboard.json` — golden signals dashboard - Blackbox exporter config (in platform-terraform or equivalent) Files the agent should NOT touch: - `config/initializers/error_subscriber.rb` — already done - Lograge config in `production.rb` — already done ### Acceptance Criteria **Phase 1 — Metrics foundation** - [ ] #19: App exposes `/metrics` endpoint with request rate, error rate, duration histograms - [ ] #15: ServiceMonitor CRD deployed, Prometheus successfully scraping metrics **Phase 2 — Visualization & alerting** - [ ] #16: Grafana golden signals dashboard shows request rate, error rate, latency, saturation - [ ] #17: PrometheusRule fires alerts for >5% error rate, >1s p99 latency, pod unavailability **Phase 3 — Uptime & verification** - [ ] #21: Blackbox exporter probing the app's health endpoint - [ ] #20: DORA metrics (deploy frequency, lead time, MTTR, change failure rate) visible in Grafana ### Test Expectations - [ ] `/metrics` endpoint returns Prometheus text format with expected metric names - [ ] `kubectl get servicemonitor -n landscaping-assistant` shows the monitor - [ ] Prometheus targets page shows landscaping-assistant as UP - [ ] Grafana dashboard loads without errors - Run command: `curl localhost:3000/metrics` returns prometheus exposition format ### Constraints - Use `prometheus-client` gem (not prometheus_exporter — we want in-process, not a separate collector) - ServiceMonitor must match existing label conventions on the cluster - Grafana dashboard should be provisioned via ConfigMap, not manual import - Sub-issues remain open for individual PR tracking ### Checklist - [ ] Phase 1 PRs merged (#19, #15) - [ ] Phase 2 PRs merged (#16, #17) - [ ] Phase 3 PRs merged (#21, #20) - [ ] All sub-issues closed - [ ] This parent issue closed ### Related - `project-landscaping-assistant` — project this affects - #39 — structured logging (done) - #18 — CI registry URL fix (done) - #22 — duplicate of #16 (closed)
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
ldraney/landscaping-assistant#43
No description provided.