feat: synthetic monitoring + DORA dashboard fixes (Phases 14+15)


forgejo_admin (Owner) commented on 2026-03-14 21:05:48 +00:00

Lineage

plan-pal-e-platform → Phase 14 (Synthetic Monitoring) + Phase 15 (DORA Re-Baseline)

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want automated uptime probes for all Tailscale funnel endpoints and accurate DORA dashboard panels
So that I detect outages proactively (not when a pipeline fails) and can measure DORA metrics with real Prometheus data

Context

The platform has 22 Tailscale funnel endpoints but zero automated uptime checks. If Forgejo goes down, we only find out when a pipeline fails, which is reactive. Blackbox Exporter is the standard Prometheus solution for HTTP probing.

Additionally, the DORA dashboard's Lead Time panels are broken: they query `dora_pr_lead_time_seconds{quantile="0.5"}`, a series that doesn't exist. The exporter produces histogram buckets (`_bucket` suffix), so percentiles have to be computed with `histogram_quantile()`. The repo variable also lists only repos with Woodpecker pipelines (1 repo) instead of all 30 repos with PR data.
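As a rough illustration of the Lead Time fix, the panels could derive percentiles from the bucket series rather than selecting a `quantile` label. The sketch below keeps the PromQL in HCL locals purely for readability; the real expressions live in `terraform/dashboards/dora-dashboard.json`. The local names, the `repo` label, the `$repo` dashboard variable, and the `$__range` window are assumptions, as is the premise that the buckets are cumulative counters (if the exporter recomputes them on every scrape, `rate()` may need to be dropped).

```hcl
locals {
  # Median (p50) PR lead time computed from histogram buckets.
  # Assumption: buckets are cumulative counters with a "repo" label.
  dora_lead_time_p50_expr = <<-EOT
    histogram_quantile(
      0.5,
      sum by (le) (
        rate(dora_pr_lead_time_seconds_bucket{repo=~"$repo"}[$__range])
      )
    )
  EOT

  # Grafana variable query that lists every repo with PR lead-time data,
  # instead of only repos that have Woodpecker pipeline metrics.
  dora_repo_variable_query = "label_values(dora_pr_lead_time_seconds_bucket, repo)"
}
```

Aggregating by `le` before calling `histogram_quantile()` is what lets one panel cover several repos at once.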

File Targets

Files to modify:

- `terraform/main.tf`: add Blackbox Exporter Helm release, PrometheusRule for downtime alerts, uptime dashboard ConfigMap
- `terraform/dashboards/dora-dashboard.json`: fix Lead Time panel queries and repo variable (already done)

Files to create:

- `terraform/dashboards/uptime-dashboard.json`: uptime matrix, latency trends, availability %

Files NOT to touch:

- `terraform/variables.tf`: no new variables needed
- `terraform/dashboards/pal-e-docs-golden-signals.json`: unrelated
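For the new uptime dashboard, a minimal sketch of the ConfigMap that would ship the JSON file to Grafana is shown here. The resource name and ConfigMap name are hypothetical, and the existing dashboards in `main.tf` may use a different resource type or naming scheme, so this should be adapted to the pattern already in place.

```hcl
# Hypothetical resource and ConfigMap names; adapt to the existing pattern.
resource "kubernetes_config_map_v1" "uptime_dashboard" {
  metadata {
    name      = "uptime-dashboard"
    namespace = "monitoring"

    labels = {
      # Label the Grafana sidecar watches for when discovering dashboards.
      grafana_dashboard = "1"
    }
  }

  data = {
    # path.module resolves to the terraform/ directory.
    "uptime-dashboard.json" = file("${path.module}/dashboards/uptime-dashboard.json")
  }
}
```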

Acceptance Criteria

- [ ] Blackbox Exporter pod is running in `monitoring` namespace
- [ ] `probe_success` metric exists in Prometheus for all configured targets
- [ ] PrometheusRule fires alert when `probe_success == 0` for more than 2 minutes
- [ ] Uptime dashboard visible in Grafana with all probed endpoints
- [ ] DORA dashboard Lead Time panels render actual data (`histogram_quantile`)
- [ ] DORA dashboard repo dropdown shows all 30 repos (not just 1)
- [ ] `tofu validate` passes
- [ ] `tofu fmt` clean
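The downtime alert in the criteria above could be expressed roughly as follows. This is a sketch under assumptions: the resource, rule group, and alert names are hypothetical, the `severity` label is illustrative, and the Prometheus Operator may require extra metadata labels (matching its rule selector) before it picks the rule up. `kubernetes_manifest` also needs the PrometheusRule CRD to exist in the cluster at plan time.

```hcl
# Hypothetical names throughout; only the expr and duration come from the issue.
resource "kubernetes_manifest" "blackbox_probe_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"

    metadata = {
      name      = "blackbox-probe-alerts"
      namespace = "monitoring"
    }

    spec = {
      groups = [{
        name = "blackbox-uptime"
        rules = [{
          alert = "EndpointDown"
          expr  = "probe_success == 0"
          # Quoted because "for" is also an HCL keyword.
          "for" = "2m"
          labels = {
            severity = "critical" # assumed severity
          }
          annotations = {
            summary = "Probe failing for {{ $labels.instance }}"
          }
        }]
      }]
    }
  }
}
```

The two-minute hold keeps a single failed scrape from paging anyone.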

Test Expectations

- [ ] `tofu validate` passes
- [ ] `tofu plan` shows expected resource additions (Helm release, PrometheusRule, ConfigMap)
- [ ] No destructive changes in plan output
- Run command: `cd terraform && tofu validate && tofu fmt -check`

Constraints

- Blackbox Exporter in `monitoring` namespace (same as Prometheus)
- Use internal service URLs where possible (avoid Tailscale round-trip for in-cluster services)
- External probes for services that need funnel validation (pal-e-docs, westside-app, etc.)
- Dashboard ConfigMaps must have `grafana_dashboard: "1"` label for sidecar discovery
- Follow existing Helm release pattern in `main.tf`
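To make the constraints above concrete, a Blackbox Exporter release might look something like the sketch below. It assumes the community `prometheus-blackbox-exporter` chart and its `serviceMonitor.targets` values, which should be checked against the chart version actually used. The resource name, target names, and both URLs are hypothetical placeholders, and a real release would presumably pin `version` the way the existing releases in `main.tf` do.

```hcl
# Sketch only: release name, target names, and URLs are placeholders.
resource "helm_release" "blackbox_exporter" {
  name       = "blackbox-exporter"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "prometheus-blackbox-exporter"
  namespace  = "monitoring" # same namespace as Prometheus

  values = [yamlencode({
    serviceMonitor = {
      enabled = true
      targets = [
        {
          # In-cluster service URL: avoids the Tailscale round-trip.
          name = "forgejo-internal"
          url  = "http://forgejo-http.forgejo.svc.cluster.local:3000"
        },
        {
          # External funnel URL: validates the funnel path itself.
          name = "pal-e-docs-funnel"
          url  = "https://pal-e-docs.example.ts.net"
        },
      ]
    }
  })]
}
```

Passing values through `yamlencode` keeps the target list in plain HCL, so adding an endpoint is a one-entry change that shows up clearly in `tofu plan`.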