feat: synthetic monitoring + DORA dashboard fixes (Phases 14+15) #66

Closed
opened 2026-03-14 21:05:48 +00:00 by forgejo_admin · 0 comments

Lineage

plan-pal-e-platform → Phase 14 (Synthetic Monitoring) + Phase 15 (DORA Re-Baseline)

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want automated uptime probes for all Tailscale funnel endpoints and accurate DORA dashboard panels
So that I detect outages proactively (not when a pipeline fails) and can measure DORA metrics with real Prometheus data

Context

The platform has 22 Tailscale funnel endpoints but zero automated uptime checks. If Forgejo goes down, we only find out when a pipeline fails, which is reactive. Blackbox Exporter is the standard Prometheus solution for HTTP probing.

Additionally, the DORA dashboard's Lead Time panels are broken: they query dora_pr_lead_time_seconds{quantile="0.5"}, a series that does not exist. The exporter emits histogram buckets (_bucket suffix), so quantiles must be computed with histogram_quantile(). The repo variable also lists only repos with Woodpecker pipelines (1 repo) instead of all 30 repos with PR data.
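
For reference, a minimal sketch of the query change. It is written as HCL locals only to keep the broken and fixed expressions side by side; the actual queries live in terraform/dashboards/dora-dashboard.json. The `repo` label, the `$repo` dashboard variable, and the `$__range` window are assumptions, not taken from the exporter or the dashboard.

```hcl
# Illustrative only: not intended to be added to main.tf.
locals {
  # Broken: selects a summary-style series that the exporter does not emit.
  lead_time_p50_broken = "dora_pr_lead_time_seconds{quantile=\"0.5\"}"

  # Possible fix: estimate the median from the histogram buckets.
  # Assumed: a `repo` label on the buckets, a Grafana `$repo` variable,
  # and Grafana's `$__range` as the rate window.
  lead_time_p50 = "histogram_quantile(0.5, sum by (le) (rate(dora_pr_lead_time_seconds_bucket{repo=~\"$repo\"}[$__range])))"
}
```

The `sum by (le)` aggregation keeps the `le` label that histogram_quantile() needs while collapsing the other labels. For the repo dropdown, one option would be to source the variable from the same `_bucket` series rather than a Woodpecker metric, assuming the buckets carry a repo label.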

File Targets

Files to modify:

  • terraform/main.tf — add Blackbox Exporter Helm release, PrometheusRule for downtime alerts, uptime dashboard ConfigMap
  • terraform/dashboards/dora-dashboard.json — fix Lead Time panel queries and repo variable (already done)

Files to create:

  • terraform/dashboards/uptime-dashboard.json — uptime matrix, latency trends, availability %

Files NOT to touch:

  • terraform/variables.tf — no new variables needed
  • terraform/dashboards/pal-e-docs-golden-signals.json — unrelated

Acceptance Criteria

  • Blackbox Exporter pod is running in monitoring namespace
  • probe_success metric exists in Prometheus for all configured targets
  • PrometheusRule fires alert when probe_success == 0 for >2 minutes
  • Uptime dashboard visible in Grafana with all probed endpoints
  • DORA dashboard Lead Time panels render actual data (histogram_quantile)
  • DORA dashboard repo dropdown shows all 30 repos (not just 1)
  • tofu validate passes
  • tofu fmt clean
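
The downtime alert criterion above could be sketched roughly as follows. This assumes the PrometheusRule is managed through the kubernetes_manifest resource; the resource name, rule name, alert name, severity label, and annotation text are hypothetical. Depending on how the Prometheus operator selects rules, extra metadata labels may also be needed.

```hcl
# Sketch only: names and labels are placeholders, not the repo's conventions.
resource "kubernetes_manifest" "blackbox_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "blackbox-alerts" # hypothetical
      namespace = "monitoring"
    }
    spec = {
      groups = [{
        name = "blackbox.rules" # hypothetical
        rules = [{
          alert = "EndpointDown" # hypothetical
          expr  = "probe_success == 0"
          # Fire only after the probe has failed continuously for 2 minutes.
          "for" = "2m"
          labels = {
            severity = "critical" # assumed severity label
          }
          annotations = {
            summary = "Probe failed for {{ $labels.instance }}"
          }
        }]
      }]
    }
  }
}
```

The `for` key is quoted to keep it from being read as the start of an HCL for-expression.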

Test Expectations

  • tofu validate passes
  • tofu plan shows expected resource additions (Helm release, PrometheusRule, ConfigMap)
  • No destructive changes in plan output
  • Run command: cd terraform && tofu validate && tofu fmt -check

Constraints

  • Blackbox Exporter in monitoring namespace (same as Prometheus)
  • Use internal service URLs where possible (avoid Tailscale round-trip for in-cluster services)
  • External probes for services that need funnel validation (pal-e-docs, westside-app, etc.)
  • Dashboard ConfigMaps must have grafana_dashboard: "1" label for sidecar discovery
  • Follow existing Helm release pattern in main.tf
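
A rough sketch of how these constraints might translate into main.tf, assuming the prometheus-community chart and its serviceMonitor.targets value (worth checking against the pinned chart version). The resource names, target names, and both URLs are placeholders; the existing Helm release pattern in main.tf takes precedence over anything shown here.

```hcl
# Sketch only: adapt to the existing helm_release pattern in main.tf.
resource "helm_release" "blackbox_exporter" {
  name       = "blackbox-exporter" # hypothetical release name
  namespace  = "monitoring"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "prometheus-blackbox-exporter"

  values = [yamlencode({
    serviceMonitor = {
      enabled = true
      targets = [
        # Internal service URL: avoids the Tailscale round-trip (placeholder host and port).
        { name = "forgejo", url = "http://forgejo-http.forgejo.svc.cluster.local:3000" },
        # External probe: validates the funnel itself (placeholder hostname).
        { name = "pal-e-docs", url = "https://pal-e-docs.example.ts.net" },
      ]
    }
  })]
}

# Dashboard ConfigMap carrying the label the Grafana sidecar looks for.
resource "kubernetes_config_map" "uptime_dashboard" {
  metadata {
    name      = "uptime-dashboard" # hypothetical
    namespace = "monitoring"
    labels = {
      grafana_dashboard = "1"
    }
  }

  data = {
    "uptime-dashboard.json" = file("${path.module}/dashboards/uptime-dashboard.json")
  }
}
```

Keeping the target list inside the Helm values means no new entries in variables.tf, which matches the file targets above.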

Checklist

  • PR opened
  • tofu plan output included in PR
  • No unrelated changes

Related

  • project-pal-e-platform — Platform Hardening project
  • phase-pal-e-platform-14-synthetic-monitoring — synthetic monitoring phase
  • phase-pal-e-platform-15-dora-rebaseline — DORA re-baseline phase
  • dora-framework — axiom document (just updated with real Prometheus data)