Add observability: client-side errors, structured logs, Grafana dashboard, uptime probe #39

Closed
opened 2026-05-26 15:47:44 +00:00 by ldraney · 0 comments
Owner

Type

Feature

Lineage

Standalone. Discovered during live debugging of an invisible client-side failure in the location feature (2026-05-26). The GPS reverse geocode call was failing silently; all .catch() blocks swallow errors with no logging or reporting.

Repo

ldraney/landscaping-assistant

User Story

As the app operator
I want visibility into client-side JS errors, server exceptions, and app uptime
So that I can diagnose failures instead of flying blind when something breaks in production

Context

The app has zero observability beyond kubectl logs. A transient client-side failure (Nominatim reverse geocode in location_controller.js) was completely invisible — all 6 Stimulus controllers silently swallow errors in .catch(() => { generic message }) blocks. No console logging, no server notification.

Server-side is slightly better (Rails STDOUT → Promtail → Loki), but there's no Grafana dashboard to query those logs and no alerting when the app is unreachable.

The platform already runs a full observability stack in the monitoring namespace — Prometheus, Loki (7d retention), Grafana (grafana.tail5b443a.ts.net), Alertmanager, Blackbox Exporter (14 probes for other services). This app isn't wired into any of it. This is a wiring + instrumentation task, not a deploy-new-infra task.

Key decisions:

  • DIY client-side error endpoint over GlitchTip/Sentry — zero dependencies, covers 90% of value for this app size
  • Rails built-in error subscriber over external error trackers — framework-native, zero gems
  • Lograge for structured logs — single-line JSON makes Loki queries useful
  • Blackbox Exporter probe over Uptime Kuma — infra already exists, just add a target
  • Prometheus app metrics (Yabeda) deferred — overkill for a single small app, log-based metrics sufficient for now

File Targets

Files to modify or create:

  • app/javascript/application.js — add global window.onerror + unhandledrejection handlers that POST to /errors/client
  • app/javascript/controllers/location_controller.js — add console.error(e) in .catch(), include error type in user message
  • app/javascript/controllers/add_location_controller.js — same catch fix
  • app/javascript/controllers/upload_controller.js — same catch fix (and any other controllers with silent catches)
  • app/controllers/client_errors_controller.rb — new endpoint that logs client errors via Rails.logger.error
  • config/routes.rb — add POST /errors/client route
  • config/initializers/error_subscriber.rb — Rails error reporting subscriber (structured error logging)
  • Gemfile — add lograge
  • config/environments/production.rb — enable lograge with JSON formatter
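
A minimal sketch of the global handlers for app/javascript/application.js, under stated assumptions. Only the /errors/client path comes from this issue; the function names, payload fields, and truncation lengths are hypothetical. It uses addEventListener("error", ...) rather than assigning window.onerror, which should behave the same for uncaught script errors. It also assumes the endpoint either skips Rails CSRF verification or that a token header is added here.

```javascript
// Hypothetical helper: normalizes whatever was thrown or rejected into a small object.
function buildErrorPayload(kind, error, pageUrl) {
  const isError = error instanceof Error;
  return {
    kind, // "error" or "unhandledrejection"
    name: isError ? error.name : typeof error,
    message: String(isError ? error.message : error).slice(0, 500),
    stack: isError && error.stack ? String(error.stack).slice(0, 2000) : null,
    url: pageUrl,
  };
}

// Registers both listeners on `target` (normally window) and forwards payloads to `send`.
function installGlobalErrorReporting(target, send) {
  target.addEventListener("error", (event) => {
    send(buildErrorPayload("error", event.error || event.message, target.location.href));
  });
  target.addEventListener("unhandledrejection", (event) => {
    send(buildErrorPayload("unhandledrejection", event.reason, target.location.href));
  });
}

// POSTs to the endpoint named in this issue. Failures are swallowed on purpose so
// that reporting an error can never raise a new unhandled rejection.
// Assumption: no CSRF token is required by the endpoint.
function postClientError(payload) {
  return fetch("/errors/client", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
    keepalive: true, // lets the request complete during page unload
  }).catch(() => {});
}

// Only wire up in a browser; the guard keeps the file loadable elsewhere.
if (typeof window !== "undefined") {
  installGlobalErrorReporting(window, postClientError);
}
```

Passing `send` in as a parameter keeps the handlers testable without a browser or a network.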

Files NOT in this repo (platform/infra side):

  • Blackbox Exporter config or Probe CR — add target for https://landscaping-assistant.tail5b443a.ts.net/up
  • Grafana dashboard JSON — create dashboard for landscaping-assistant namespace
  • k8s Deployment spec — verify liveness/readiness probes exist for /up

Acceptance Criteria

  • When a JS error occurs in any Stimulus controller, the error is logged to console.error AND posted to the server endpoint
  • When an uncaught JS exception or unhandled promise rejection occurs, it appears in Loki logs within 60 seconds
  • When an unhandled server exception occurs, it is logged with structured context (class, message, severity, source)
  • When I query Loki for {namespace="landscaping-assistant"}, request logs are single-line JSON with controller, action, status, duration
  • When the Tailscale Funnel URL is unreachable, Alertmanager fires an alert
  • When the app pod hangs, Kubernetes restarts it automatically (liveness probe)
  • A Grafana dashboard exists showing: error log stream, request patterns, client-side errors
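
The first criterion above (log to console.error and post to the server) could be met with one shared helper called from each controller's .catch(). This is a sketch only: the helper name, the `context` label, and the wording of the user message are assumptions, and `report` stands for whatever function POSTs to /errors/client.

```javascript
// Hypothetical shared helper for Stimulus controller catch blocks.
// Logs locally, forwards to the server, and returns a user-facing message
// that includes the error type instead of a fully generic string.
function handleControllerError(error, context, report, log = console.error) {
  const isError = error instanceof Error;
  const name = isError ? error.name : "Error";
  const message = isError ? error.message : String(error);
  log(`[${context}]`, error); // visible in the browser console
  report({ kind: "controller", context, name, message }); // forwarded to the server
  return `Something went wrong (${name}). Please try again.`;
}

// Usage inside a controller (shape assumed, not copied from the repo):
//   fetch(url)
//     .then((response) => response.json())
//     .catch((e) => {
//       this.statusTarget.textContent =
//         handleControllerError(e, "location#reverseGeocode", postClientError);
//     });
```

Keeping the logic in one function means the six controllers change by one line each rather than each growing its own logging code.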

Test Expectations

  • Unit test: POST /errors/client with valid JSON payload returns 200 and logs the error
  • Unit test: POST /errors/client with invalid/oversized payload returns 422 (rate-limit/size guard)
  • Manual test: Trigger a JS error in browser, verify it appears in kubectl logs within seconds
  • Manual test: Hit Grafana dashboard, confirm log panels populate
  • Run command: bundle exec rspec spec/controllers/client_errors_controller_spec.rb

Constraints

  • No external SaaS dependencies — everything runs on the existing cluster
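  • Client error endpoint must be rate-limited or size-capped to prevent abuse (a simple request.body.read(4096) limit is fine; this truncates rather than rejects, so an oversized JSON body will typically fail to parse and fall into the 422 path)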
  • No authentication on the client error endpoint (the app has no auth) but validate JSON structure
  • Health check requests to /up should stay out of the request logs after Lograge is enabled (Rails already silences them via config.silence_healthcheck_path = "/up")
  • Grafana dashboard should be provisioned as JSON (checked into a platform repo or configmap), not created ad-hoc in the UI
  • Follow existing Blackbox Exporter probe pattern used by the other 14 services

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes
  • Grafana dashboard accessible
  • Blackbox probe firing in Prometheus targets

Related

  • project-landscaping-assistant — project this affects
  • observability-audit-2026-02-25 — platform observability baseline
  • plan-2026-02-25-platform-observability — platform observability plan (completed)