feat: synthetic monitoring + DORA dashboard fixes (Phases 14+15) #66
Lineage
plan-pal-e-platform → Phase 14 (Synthetic Monitoring) + Phase 15 (DORA Re-Baseline)
Repo
forgejo_admin/pal-e-platform
User Story
As a platform operator
I want automated uptime probes for all Tailscale funnel endpoints and accurate DORA dashboard panels
So that I detect outages proactively (not when a pipeline fails) and can measure DORA metrics with real Prometheus data
Context
The platform has 22 Tailscale funnel endpoints but zero automated uptime checks. If Forgejo goes down, we find out when a pipeline fails — that's reactive. Blackbox Exporter is the standard Prometheus solution for HTTP probing.
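For orientation, here is a minimal sketch of what one HTTP probe looks like from the outside. Blackbox Exporter exposes a /probe endpoint that takes target and module query parameters and returns metrics, including probe_success, in the Prometheus text format. The exporter address and the funnel hostname below are hypothetical placeholders, and http_2xx is assumed to be the module in use (it is the name in the exporter's default config).

```python
from urllib.parse import urlencode


def probe_url(exporter_base, target, module="http_2xx"):
    """Build the Blackbox Exporter /probe URL for a single target.

    exporter_base and target are placeholders; the module name is an
    assumption taken from the exporter's default configuration.
    """
    query = urlencode({"target": target, "module": module})
    return f"{exporter_base.rstrip('/')}/probe?{query}"


def parse_probe_success(exposition_text):
    """Return True/False from the probe_success sample, or None if absent."""
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        if name == "probe_success":
            return float(value) == 1.0
    return None


if __name__ == "__main__":
    # Hypothetical exporter address and funnel endpoint.
    print(probe_url("http://blackbox:9115", "https://forgejo.example.ts.net"))
    print(parse_probe_success("probe_success 0\nprobe_duration_seconds 0.12\n"))
```

In the real setup Prometheus would perform this request on every scrape interval, so the sketch is only useful for reasoning about or manually spot-checking a single endpoint.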
Additionally, the DORA dashboard has broken Lead Time panels: they query dora_pr_lead_time_seconds{quantile="0.5"}, which doesn't exist. The exporter produces histogram buckets (_bucket suffix), requiring histogram_quantile(). The repo variable also only shows repos with Woodpecker pipelines (1 repo) instead of all 30 repos with PR data.
File Targets
Files to modify:
terraform/main.tf: add Blackbox Exporter Helm release, PrometheusRule for downtime alerts, uptime dashboard ConfigMap
terraform/dashboards/dora-dashboard.json: fix Lead Time panel queries and repo variable (already done)
Files to create:
terraform/dashboards/uptime-dashboard.json: uptime matrix, latency trends, availability %
Files NOT to touch:
terraform/variables.tf: no new variables needed
terraform/dashboards/pal-e-docs-golden-signals.json: unrelated
Acceptance Criteria
probe_success metric exists in Prometheus for all configured targets
probe_success == 0 for >2 minutes
tofu validate passes
tofu fmt clean
Test Expectations
tofu validate passes
tofu plan shows expected resource additions (Helm release, PrometheusRule, ConfigMap)
cd terraform && tofu validate && tofu fmt -check
Constraints
monitoring namespace (same as Prometheus)
grafana_dashboard: "1" label for sidecar discovery
Checklist
tofu plan output included in PR
Related
project-pal-e-platform: Platform Hardening project
phase-pal-e-platform-14-synthetic-monitoring: synthetic monitoring phase
phase-pal-e-platform-15-dora-rebaseline: DORA re-baseline phase
dora-framework: axiom document (just updated with real Prometheus data)
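As background for the Lead Time fix described under Context: a histogram exposes cumulative _bucket counts keyed by an le upper bound rather than a ready-made quantile label, so the median has to be estimated from the buckets. The sketch below is a simplified illustration of the linear interpolation that histogram_quantile() performs, not the dashboard's actual query. The bucket boundaries and counts are made-up illustration data, and it assumes a non-empty, cumulative bucket list ending in +Inf.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes a non-empty list of cumulative counts whose
    last bucket is +Inf, and treats the lowest bucket as starting at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest
                # finite upper bound instead of interpolating.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical lead-time buckets in seconds: 1 hour, 1 day, 1 week, +Inf.
    lead_time_buckets = [(3600, 2), (86400, 6), (604800, 9), (math.inf, 10)]
    print(histogram_quantile(0.5, lead_time_buckets))  # 65700.0
```

With these made-up counts the estimated median is 65700 seconds (about 18 hours). The result is an estimate whose precision depends on how the exporter's bucket boundaries are chosen.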