Add observability: client-side errors, structured logs, Grafana dashboard, uptime probe #39

Closed

opened 2026-05-26 15:47:44 +00:00 by ldraney · 0 comments

Type

Feature

Lineage

Standalone — discovered during live debugging of invisible client-side location feature failure (2026-05-26). GPS reverse geocode call was failing silently; all .catch() blocks swallow errors with no logging or reporting.

Repo

ldraney/landscaping-assistant

User Story

As the app operator
I want visibility into client-side JS errors, server exceptions, and app uptime
So that I can diagnose failures instead of flying blind when something breaks in production

Context

The app has zero observability beyond kubectl logs. A transient client-side failure (Nominatim reverse geocode in location_controller.js) was completely invisible — all 6 Stimulus controllers silently swallow errors in .catch(() => { generic message }) blocks. No console logging, no server notification.

Server-side is slightly better (Rails STDOUT → Promtail → Loki), but there's no Grafana dashboard to query those logs and no alerting when the app is unreachable.

The platform already runs a full observability stack in the monitoring namespace — Prometheus, Loki (7d retention), Grafana (grafana.tail5b443a.ts.net), Alertmanager, Blackbox Exporter (14 probes for other services). This app isn't wired into any of it. This is a wiring + instrumentation task, not a deploy-new-infra task.

Key decisions:

DIY client-side error endpoint over GlitchTip/Sentry — zero dependencies, covers 90% of value for this app size
Rails built-in error subscriber over external error trackers — framework-native, zero gems
Lograge for structured logs — single-line JSON makes Loki queries useful
Blackbox Exporter probe over Uptime Kuma — infra already exists, just add a target
Prometheus app metrics (Yabeda) deferred — overkill for a single small app, log-based metrics sufficient for now

File Targets

Files to modify or create:

app/javascript/application.js — add global window.onerror + unhandledrejection handlers that POST to /errors/client
app/javascript/controllers/location_controller.js — add console.error(e) in .catch(), include error type in user message
app/javascript/controllers/add_location_controller.js — same catch fix
app/javascript/controllers/upload_controller.js — same catch fix (and any other controllers with silent catches)
app/controllers/client_errors_controller.rb — new endpoint that logs client errors via Rails.logger.error
config/routes.rb — add POST /errors/client route
config/initializers/error_subscriber.rb — Rails error reporting subscriber (structured error logging)
Gemfile — add lograge
config/environments/production.rb — enable lograge with JSON formatter
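The error subscriber listed for config/initializers/error_subscriber.rb could look roughly like the sketch below. This is an illustration under stated assumptions, not the final implementation: the class name StructuredErrorSubscriber and the JSON field names are hypothetical, and the logger is injected so the class can run in plain Ruby outside Rails. The `report` keyword signature is the one Rails passes to error reporter subscribers.

```ruby
require "json"
require "logger"

# Hypothetical subscriber: the class name and the JSON field names are
# assumptions made for illustration, not taken from the app.
class StructuredErrorSubscriber
  def initialize(logger)
    @logger = logger
  end

  # Rails calls this for every error passed through its error reporter.
  def report(error, handled:, severity:, context:, source: nil)
    payload = {
      event: "server_error",
      error_class: error.class.name,
      message: error.message,
      severity: severity,
      handled: handled,
      source: source,
      context: context
    }
    # One JSON document per line keeps the entry queryable in Loki.
    @logger.error(JSON.generate(payload))
  end
end

# In the initializer, the wiring would be roughly:
#   Rails.error.subscribe(StructuredErrorSubscriber.new(Rails.logger))
```

Injecting the logger keeps the class easy to unit test with an in-memory IO object, and the same subscriber could later be pointed at a different sink without touching the call sites.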

Files NOT in this repo (platform/infra side):

Blackbox Exporter config or Probe CR — add target for https://landscaping-assistant.tail5b443a.ts.net/up
Grafana dashboard JSON — create dashboard for landscaping-assistant namespace
k8s Deployment spec — verify liveness/readiness probes exist for /up

Acceptance Criteria

When a JS error occurs in any Stimulus controller, the error is logged to console.error AND posted to the server endpoint
When an uncaught JS exception or unhandled promise rejection occurs, it appears in Loki logs within 60 seconds
When an unhandled server exception occurs, it is logged with structured context (class, message, severity, source)
When I query Loki for {namespace="landscaping-assistant"}, request logs are single-line JSON with controller, action, status, duration
When the Tailscale Funnel URL is unreachable, Alertmanager fires an alert
When the app pod hangs, Kubernetes restarts it automatically (liveness probe)
A Grafana dashboard exists showing: error log stream, request patterns, client-side errors
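For the single-line JSON criterion, a minimal Lograge setup might look like the fragment below. It assumes the lograge gem has been added to the Gemfile and that nothing else prefixes the log line before it reaches Loki (tagged logging, for example, would need to be checked separately).

```ruby
# config/environments/production.rb (fragment; assumes lograge is in the Gemfile)
Rails.application.configure do
  # Replace Rails' multi-line request logging with one line per request.
  config.lograge.enabled = true

  # Emit that line as JSON so log queries can parse fields out of it.
  config.lograge.formatter = Lograge::Formatters::Json.new
end
```

With Lograge's default fields (method, path, controller, action, status, duration, among others), a query along the lines of `{namespace="landscaping-assistant"} | json | status >= 500` should then isolate failing requests, provided each line arrives in Loki as bare JSON.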

Test Expectations

Unit test: POST /errors/client with valid JSON payload returns 200 and logs the error
Unit test: POST /errors/client with invalid/oversized payload returns 422 (rate-limit/size guard)
Manual test: Trigger a JS error in browser, verify it appears in kubectl logs within seconds
Manual test: Hit Grafana dashboard, confirm log panels populate
Run command: bundle exec rspec spec/controllers/client_errors_controller_spec.rb

Constraints

No external SaaS dependencies — everything runs on the existing cluster
Client error endpoint must be rate-limited or size-capped to prevent abuse (a simple request.body.read(4096) cap is fine; a body truncated at the cap will normally fail JSON validation and be rejected with 422)
No authentication on the client error endpoint (the app has no auth) but validate JSON structure
Lograge does not need its own health check filter: config.silence_healthcheck_path = "/up" already keeps /up requests out of the logs
Grafana dashboard should be provisioned as JSON (checked into a platform repo or configmap), not created ad-hoc in the UI
Follow existing Blackbox Exporter probe pattern used by the other 14 services
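The size cap and JSON structure check from the constraints above can be sketched as a plain Ruby helper. The helper name parse_client_error, the constant names, and the accepted field names are all hypothetical; the real payload shape depends on what the handlers in application.js end up sending.

```ruby
require "json"

# Hypothetical limits and field names, chosen for illustration only.
MAX_CLIENT_ERROR_BYTES = 4096
ALLOWED_KEYS = %w[message source line column stack url].freeze

# Returns a hash containing only the allowed keys, or nil when the body is
# missing, oversized, not valid JSON, or lacks a string "message" field.
def parse_client_error(raw, max_bytes: MAX_CLIENT_ERROR_BYTES)
  return nil if raw.nil? || raw.bytesize > max_bytes

  data = JSON.parse(raw)
  return nil unless data.is_a?(Hash) && data["message"].is_a?(String)

  data.slice(*ALLOWED_KEYS)
rescue JSON::ParserError
  nil
end

# In the controller, the call could look roughly like this (assumption):
#   raw = request.body.read(MAX_CLIENT_ERROR_BYTES + 1)
#   payload = parse_client_error(raw)
#   return head :unprocessable_entity if payload.nil?
```

Reading one byte more than the cap lets the controller tell an oversized body apart from one that exactly fits, instead of silently truncating it. Dropping unknown keys keeps arbitrary client data out of the log line.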

Checklist

PR opened
Tests pass
No unrelated changes
Grafana dashboard accessible
Blackbox probe firing in Prometheus targets

Related

project-landscaping-assistant — project this affects
observability-audit-2026-02-25 — platform observability baseline
plan-2026-02-25-platform-observability — platform observability plan (completed)
