Observability: Grafana dashboard + structured logging + SPA health probes #99

Closed
opened 2026-03-18 15:58:27 +00:00 by forgejo_admin · 1 comment

Lineage

plan-wkq → Phase 15 (Production Port) — discovered scope from SPA validation

Repo

forgejo_admin/basketball-api + forgejo_admin/pal-e-platform (for Grafana/blackbox config)

User Story

As an admin
I want to see API health, auth failures, and error rates in Grafana
So that I can debug issues like "Could not load dashboard data" without SSH-ing into pods

Context

During Phase 15 SPA validation, Lucas hit "Could not load dashboard data" on his phone, while Playwright showed the API working fine. Debugging required manual kubectl logs, curl with tokens, and guesswork. The observability stack exists (Prometheus, Loki, Grafana, blackbox), but there is no app-level dashboard for basketball-api or the westside SPA.

Basketball-api has a ServiceMonitor and /metrics endpoint, but no Grafana dashboard consuming the data. Loki collects pod logs but they're unstructured (plain uvicorn access logs). No SPA-side error tracking exists.

File Targets

basketball-api:

  • src/basketball_api/main.py — add structured JSON logging middleware (request method, path, status, latency, user_id from token)
  • src/basketball_api/routes/health.py — add /health/ready endpoint that checks DB connectivity
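A minimal sketch of the logging middleware target, written as a plain ASGI wrapper so it runs with the standard library only. The class name `JsonAccessLogMiddleware` and the logger name are hypothetical, the field names follow the ones listed for this issue, and `user_id` extraction from the token is left out because the token format is not described here.

```python
import json
import logging
import time

# Hypothetical logger name; the real module may use a different one.
logger = logging.getLogger("basketball_api.access")


class JsonAccessLogMiddleware:
    """Pure-ASGI middleware that logs one JSON object per HTTP request."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Pass through anything that is not an HTTP request (lifespan, websocket).
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()
        status = {"code": 500}  # reported if the app raises before responding

        async def send_wrapper(message):
            # The status code is only visible on the response-start message.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            logger.info(json.dumps({
                "method": scope["method"],
                "path": scope["path"],
                "status_code": status["code"],
                "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            }))
```

With FastAPI this could presumably be registered via `app.add_middleware(JsonAccessLogMiddleware)`. Because it only wraps `send`, it should leave the `/metrics` response body untouched.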

pal-e-platform or pal-e-services:

  • Grafana dashboard JSON (ConfigMap or provisioned) — basketball-api request metrics
  • Blackbox exporter target for westside-dev.tail5b443a.ts.net and westsidekingsandqueens.tail5b443a.ts.net
  • PrometheusRule for auth failure rate alert
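
One possible shape for the blackbox targets is a Prometheus Operator `Probe` resource, generated here as JSON from Python. This is a sketch under assumptions: the repo may wire blackbox targets differently, and the resource name, namespace, `http_2xx` module, `https://` scheme, and prober address are all hypothetical. Only the two hostnames come from this issue.

```python
import json

# Hostnames come from the issue; everything else below is an assumption.
SPA_HOSTS = [
    "westside-dev.tail5b443a.ts.net",
    "westsidekingsandqueens.tail5b443a.ts.net",
]


def build_probe(hosts, prober_url, module="http_2xx", namespace="monitoring"):
    """Build a Probe manifest that asks blackbox exporter to check each host."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "Probe",
        "metadata": {"name": "westside-spa", "namespace": namespace},
        "spec": {
            "interval": "30s",
            "module": module,
            # Address of the blackbox exporter service (hypothetical).
            "prober": {"url": prober_url},
            "targets": {
                "staticConfig": {"static": [f"https://{h}" for h in hosts]},
            },
        },
    }


manifest = build_probe(SPA_HOSTS, "blackbox-exporter.monitoring.svc:9115")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied with `kubectl apply -f -`. Once scraped, each target should expose `probe_success`, which is what the "UP" check in Prometheus would look at.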

Acceptance Criteria

  • Grafana dashboard shows: request rate, latency p50/p95/p99, error rate (4xx/5xx), all by endpoint
  • Grafana dashboard shows: auth failure rate (401s) as a separate panel
  • Basketball-api logs are structured JSON (parseable by Loki)
  • Blackbox exporter probes both SPA hostnames (dev + prod)
  • At least one alert rule: basketball-api 5xx rate > 5% for 5 minutes
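
The 5xx criterion could be expressed as a `PrometheusRule` along these lines, again generated as JSON from Python. This is a sketch, not the rule that will ship: the metric `http_requests_total`, its `status` label, the `job` value, the alert name, and the severity are assumptions that depend on how HTTP instrumentation ends up being exposed.

```python
import json


def build_5xx_rule(job="basketball-api", threshold=0.05, window="5m"):
    """Build a PrometheusRule that fires when the 5xx ratio exceeds `threshold`."""
    # status=~"5.." matches both grouped ("5xx") and exact ("500") label values.
    errors = f'sum(rate(http_requests_total{{job="{job}",status=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{job="{job}"}}[{window}]))'
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "basketball-api-alerts"},
        "spec": {"groups": [{
            "name": "basketball-api",
            "rules": [{
                "alert": "BasketballApiHigh5xxRate",  # hypothetical alert name
                "expr": f"{errors} / {total} > {threshold}",
                "for": "5m",  # condition must hold for 5 minutes before firing
                "labels": {"severity": "warning"},
                "annotations": {
                    "summary": "basketball-api 5xx rate above 5% for 5 minutes",
                },
            }],
        }]},
    }


print(json.dumps(build_5xx_rule(), indent=2))
```

Depending on how kube-prometheus-stack is configured, the rule may also need a metadata label that matches the Prometheus rule selector before it is picked up.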

Test Expectations

  • Grafana dashboard loads with real data
  • Loki query {namespace="basketball-api"} | json | status >= 400 returns structured results
  • Blackbox targets show UP in Prometheus

Constraints

  • Use existing kube-prometheus-stack — no new monitoring tools
  • Dashboard should be provisioned via ConfigMap (not manually created in Grafana UI)
  • Structured logging must not break existing /metrics endpoint
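
To illustrate the ConfigMap constraint, the sketch below builds a dashboard ConfigMap carrying the `grafana_dashboard: "1"` label that the kube-prometheus-stack Grafana sidecar watches by default. The metric names, the `handler` and `status` labels, the namespace, and the panel layout are assumptions. The 401 panel in particular assumes status codes are exported ungrouped (`401`, not `4xx`).

```python
import json


def build_dashboard_configmap(job="basketball-api", namespace="monitoring"):
    """Build a ConfigMap holding a small Grafana dashboard as JSON."""
    sel = f'job="{job}"'
    # (panel title, PromQL) pairs; metric and label names are assumed.
    queries = [
        ("Request rate by endpoint",
         f"sum by (handler) (rate(http_requests_total{{{sel}}}[5m]))"),
        ("Latency p95 by endpoint",
         "histogram_quantile(0.95, sum by (le, handler) "
         f"(rate(http_request_duration_seconds_bucket{{{sel}}}[5m])))"),
        ("Error rate (4xx/5xx) by endpoint",
         f'sum by (handler, status) (rate(http_requests_total{{{sel},status=~"[45].."}}[5m]))'),
        ("Auth failures (401)",
         f'sum(rate(http_requests_total{{{sel},status="401"}}[5m]))'),
    ]
    panels = [
        {
            "id": i,
            "type": "timeseries",
            "title": title,
            # Two panels per row on Grafana's 24-column grid.
            "gridPos": {"h": 8, "w": 12, "x": 12 * ((i - 1) % 2), "y": 8 * ((i - 1) // 2)},
            "targets": [{"refId": "A", "expr": expr}],
        }
        for i, (title, expr) in enumerate(queries, start=1)
    ]
    dashboard = {"uid": "basketball-api", "title": "basketball-api", "panels": panels}
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "basketball-api-dashboard",
            "namespace": namespace,
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {"basketball-api.json": json.dumps(dashboard, indent=2)},
    }


print(json.dumps(build_dashboard_configmap(), indent=2))
```

The p50 and p99 panels would follow the same pattern with a different quantile argument.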

Checklist

  • Structured logging middleware
  • Grafana dashboard provisioned
  • Blackbox probes for SPA
  • Alert rule for 5xx
  • PR opened
  • Tests pass

Related

  • plan-pal-e-platform — platform hardening (observability is a core pillar)
  • sop-incident-response — observability feeds incident detection
  • ServiceMonitor already exists in pal-e-deployments/overlays/basketball-api/
forgejo_admin commented 2026-03-18 16:13:29 +00:00 (Author, Owner)

Progress: Structured Logging (PR #100 merged)

Completed

  • logging_config.py — JSON structured logging middleware (method, path, status_code, latency_ms, client_ip, user_agent, user_id from JWT)
  • /health/ready endpoint — DB connectivity check, 200/503
  • 14 new tests (12 logging, 2 health), 347 total pass
  • QA: approved on second review (blocker: test coverage, fixed)
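
For readers who have not opened PR #100, the 200/503 behaviour of `/health/ready` can be sketched as below. This is not the merged implementation: the `readiness` helper and its `connect` parameter are hypothetical, and `sqlite3` stands in for the real database driver.

```python
import sqlite3


def readiness(connect):
    """Return (http_status, body) for a readiness probe.

    `connect` is any zero-argument callable returning a DB-API connection.
    """
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")  # cheapest round trip to the database
            cur.fetchone()
        finally:
            conn.close()
    except Exception as exc:
        # Any connection or query failure means the pod should not get traffic.
        return 503, {"status": "unavailable", "error": type(exc).__name__}
    return 200, {"status": "ready"}


print(readiness(lambda: sqlite3.connect(":memory:")))
```

A route handler would map the returned status to the HTTP response, so Kubernetes stops routing traffic to the pod while the database is unreachable.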

Remaining

  • Prometheus HTTP instrumentation (prometheus-fastapi-instrumentator) — /metrics only exports basketball_api_up currently
  • Grafana dashboard (needs instrumentation first)
  • Alert rules for 5xx rate
  • Blackbox probe for westside-dev (PR #106 on pal-e-platform, pending CI)