feat: synthetic monitoring + DORA dashboard fixes (Phases 14+15) #67

Merged
forgejo_admin merged 1 commit from 66-feat-synthetic-monitoring-dora-dashboard into main 2026-03-14 21:27:01 +00:00

Summary

Adds the Blackbox Exporter for synthetic monitoring of all platform and application services, with alerting rules and a Grafana uptime dashboard. Also fixes the DORA dashboard PromQL queries to use the correct histogram_quantile() syntax and the correct source metric for the repo variable dropdown.
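
For quick reference, the lead-time panels move from the old selectors to the standard Prometheus histogram pattern (the same expressions appear in the dashboard diff in the plan output below):

```promql
# Before: treats the histogram metric like a raw value / summary
quantile(0.5, dora_pr_lead_time_seconds) / 3600

# After: aggregate the _bucket series by le, then take the quantile
histogram_quantile(0.5, sum(dora_pr_lead_time_seconds_bucket) by (le)) / 3600

# Per-repo variant keeps the repo label in the grouping
histogram_quantile(0.5, sum(dora_pr_lead_time_seconds_bucket{repo=~"$repo"}) by (le, repo)) / 3600
```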

Changes

  • terraform/main.tf:
    • Added helm_release.blackbox_exporter -- prometheus-blackbox-exporter chart v9.1.0 with a ServiceMonitor targeting 13 endpoints (8 internal platform services, 5 external app endpoints via Tailscale funnels) at a 60s probe interval
    • Added kubernetes_manifest.blackbox_alerts -- PrometheusRule with EndpointDown (critical, 2m) and EndpointSlowResponse (warning, 5m) alert rules (sketched after this list)
    • Added kubernetes_config_map_v1.uptime_dashboard -- Grafana dashboard ConfigMap for service uptime
  • terraform/dashboards/dora-dashboard.json:
    • Fixed lead time p50 overview stat: quantile() -> histogram_quantile() with _bucket suffix
    • Fixed lead time p50/p95 per-repo timeseries: summary quantile selectors -> histogram_quantile() with proper by (le, repo) grouping
    • Fixed repo variable dropdown: dora_deployments_total -> dora_pr_merges_total (deployments metric doesn't exist yet)
  • terraform/dashboards/uptime-dashboard.json (new):
    • Row 1: Overview stats (Services Up, Services Down, Avg Response Time, 24h Uptime %)
    • Row 2: Uptime matrix showing UP/DOWN per target
    • Row 3: Response latency timeseries per target
    • Row 4: Probe success history (stacked bars)
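
The PrometheusRule body itself is not visible in the plan excerpt below, so the expressions here are only an illustrative sketch: the alert names, severities, and for: durations come from this PR, while the exact expressions and the latency threshold are assumptions.

```promql
# EndpointDown (severity: critical, for: 2m) -- probe_success is the Blackbox
# Exporter success gauge (1 = probe succeeded, 0 = failed); expression assumed
probe_success == 0

# EndpointSlowResponse (severity: warning, for: 5m) -- probe_duration_seconds is
# the probe round-trip time; the 2s threshold is a hypothetical value
probe_duration_seconds > 2
```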

Test Plan

  • tofu fmt -recursive -- passed (no formatting issues)
  • tofu validate -- passed ("The configuration is valid")
  • tofu plan -- state lock held by a concurrent apply; plan deferred to post-merge (output posted below). Expect 3 new resources: helm_release.blackbox_exporter, kubernetes_manifest.blackbox_alerts, kubernetes_config_map_v1.uptime_dashboard
  • After apply: verify Blackbox Exporter pod running in monitoring namespace
  • After apply: verify uptime dashboard appears in Grafana (spot-check queries below)
  • After apply: verify DORA dashboard repo dropdown populates from dora_pr_merges_total
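
Once the exporter is scraping, the overview panels can be spot-checked directly in Prometheus. The first two queries mirror the Services Up / Services Down stats in the new uptime dashboard (visible in the plan output below); the last two are assumptions about what the Avg Response Time and 24h Uptime % panels likely compute, since those panel definitions are truncated in the excerpt.

```promql
count(probe_success == 1)                     # Services Up
count(probe_success == 0)                     # Services Down
avg(probe_duration_seconds)                   # Avg Response Time (assumed expression)
avg(avg_over_time(probe_success[24h])) * 100  # 24h Uptime % (assumed expression)
```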

Review Checklist

  • Passed automated review-fix loop
  • No secrets committed
  • No unnecessary file changes
  • Commit messages are descriptive

Related

  • Closes #66
  • Plan: plan-pal-e-platform (Phases 14+15)
feat: synthetic monitoring + DORA dashboard fixes (Phases 14+15)
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
ci/woodpecker/pull_request_closed/woodpecker Pipeline was successful
0c76c65c40
- Add Blackbox Exporter helm release with 13 probe targets (8 platform, 5 app)
- Add PrometheusRule for EndpointDown (critical, 2m) and EndpointSlowResponse (warning, 5m) alerts
- Add Service Uptime & Availability dashboard with overview stats, uptime matrix, response latency, and probe history
- Fix DORA dashboard: use histogram_quantile() for lead time queries (was using wrong quantile/summary syntax)
- Fix DORA dashboard: use dora_pr_merges_total for repo variable (was using dora_deployments_total which doesn't exist yet)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tofu Plan Output

tailscale_acl.this: Refreshing state... [id=acl]
helm_release.nvidia_device_plugin: Refreshing state... [id=nvidia-device-plugin]
kubernetes_namespace_v1.postgres: Refreshing state... [id=postgres]
kubernetes_namespace_v1.ollama: Refreshing state... [id=ollama]
kubernetes_namespace_v1.minio: Refreshing state... [id=minio]
kubernetes_namespace_v1.monitoring: Refreshing state... [id=monitoring]
kubernetes_namespace_v1.keycloak: Refreshing state... [id=keycloak]
kubernetes_namespace_v1.harbor: Refreshing state... [id=harbor]
kubernetes_namespace_v1.tailscale: Refreshing state... [id=tailscale]
kubernetes_namespace_v1.forgejo: Refreshing state... [id=forgejo]
data.kubernetes_namespace_v1.tofu_state: Reading...
data.kubernetes_namespace_v1.pal_e_docs: Reading...
kubernetes_namespace_v1.woodpecker: Refreshing state... [id=woodpecker]
kubernetes_namespace_v1.cnpg_system: Refreshing state... [id=cnpg-system]
kubernetes_secret_v1.dora_exporter: Refreshing state... [id=monitoring/dora-exporter]
data.kubernetes_namespace_v1.tofu_state: Read complete after 0s [id=tofu-state]
data.kubernetes_namespace_v1.pal_e_docs: Read complete after 0s [id=pal-e-docs]
kubernetes_service_v1.dora_exporter: Refreshing state... [id=monitoring/dora-exporter]
helm_release.kube_prometheus_stack: Refreshing state... [id=kube-prometheus-stack]
kubernetes_service_v1.keycloak: Refreshing state... [id=keycloak/keycloak]
kubernetes_secret_v1.keycloak_admin: Refreshing state... [id=keycloak/keycloak-admin]
helm_release.loki_stack: Refreshing state... [id=loki-stack]
kubernetes_persistent_volume_claim_v1.keycloak_data: Refreshing state... [id=keycloak/keycloak-data]
helm_release.forgejo: Refreshing state... [id=forgejo]
kubernetes_role_v1.tf_backup: Refreshing state... [id=tofu-state/tf-state-backup]
helm_release.tailscale_operator: Refreshing state... [id=tailscale-operator]
kubernetes_secret_v1.paledocs_db_url: Refreshing state... [id=pal-e-docs/paledocs-db-url]
kubernetes_service_account_v1.tf_backup: Refreshing state... [id=tofu-state/tf-state-backup]
kubernetes_secret_v1.woodpecker_db_credentials: Refreshing state... [id=woodpecker/woodpecker-db-credentials]
helm_release.cnpg: Refreshing state... [id=cnpg]
kubernetes_role_binding_v1.tf_backup: Refreshing state... [id=tofu-state/tf-state-backup]
kubernetes_deployment_v1.keycloak: Refreshing state... [id=keycloak/keycloak]
helm_release.ollama: Refreshing state... [id=ollama]
kubernetes_config_map_v1.dora_dashboard: Refreshing state... [id=monitoring/dora-dashboard]
helm_release.minio: Refreshing state... [id=minio]
helm_release.harbor: Refreshing state... [id=harbor]
kubernetes_deployment_v1.dora_exporter: Refreshing state... [id=monitoring/dora-exporter]
kubernetes_config_map_v1.pal_e_docs_dashboard: Refreshing state... [id=monitoring/pal-e-docs-dashboard]
kubernetes_manifest.dora_exporter_service_monitor: Refreshing state...
kubernetes_config_map_v1.grafana_loki_datasource: Refreshing state... [id=monitoring/grafana-loki-datasource]
kubernetes_ingress_v1.grafana_funnel: Refreshing state... [id=monitoring/grafana-funnel]
kubernetes_ingress_v1.keycloak_funnel: Refreshing state... [id=keycloak/keycloak-funnel]
kubernetes_ingress_v1.forgejo_funnel: Refreshing state... [id=forgejo/forgejo-funnel]
kubernetes_ingress_v1.alertmanager_funnel: Refreshing state... [id=monitoring/alertmanager-funnel]
minio_iam_policy.cnpg_wal: Refreshing state... [id=cnpg-wal]
minio_s3_bucket.postgres_wal: Refreshing state... [id=postgres-wal]
minio_s3_bucket.tf_state_backups: Refreshing state... [id=tf-state-backups]
minio_iam_user.cnpg: Refreshing state... [id=cnpg]
minio_s3_bucket.assets: Refreshing state... [id=assets]
minio_iam_policy.tf_backup: Refreshing state... [id=tf-backup]
minio_iam_user.tf_backup: Refreshing state... [id=tf-backup]
kubernetes_ingress_v1.minio_api_funnel: Refreshing state... [id=minio/minio-api-funnel]
kubernetes_ingress_v1.minio_funnel: Refreshing state... [id=minio/minio-funnel]
minio_iam_user_policy_attachment.cnpg: Refreshing state... [id=cnpg-20260302210642491000000001]
minio_iam_user_policy_attachment.tf_backup: Refreshing state... [id=tf-backup-20260314163610110100000001]
kubernetes_secret_v1.tf_backup_s3_creds: Refreshing state... [id=tofu-state/tf-backup-s3-creds]
kubernetes_secret_v1.woodpecker_cnpg_s3_creds: Refreshing state... [id=woodpecker/cnpg-s3-creds]
kubernetes_secret_v1.cnpg_s3_creds: Refreshing state... [id=postgres/cnpg-s3-creds]
kubernetes_cron_job_v1.tf_state_backup: Refreshing state... [id=tofu-state/tf-state-backup]
kubernetes_cron_job_v1.cnpg_backup_verify: Refreshing state... [id=postgres/cnpg-backup-verify]
kubernetes_manifest.woodpecker_postgres: Refreshing state...
kubernetes_ingress_v1.harbor_funnel: Refreshing state... [id=harbor/harbor-funnel]
helm_release.woodpecker: Refreshing state... [id=woodpecker]
kubernetes_ingress_v1.woodpecker_funnel: Refreshing state... [id=woodpecker/woodpecker-funnel]

OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place

OpenTofu will perform the following actions:

  # helm_release.blackbox_exporter will be created
  + resource "helm_release" "blackbox_exporter" {
      + atomic                     = false
      + chart                      = "prometheus-blackbox-exporter"
      + cleanup_on_fail            = false
      + create_namespace           = false
      + dependency_update          = false
      + disable_crd_hooks          = false
      + disable_openapi_validation = false
      + disable_webhooks           = false
      + force_update               = false
      + id                         = (known after apply)
      + lint                       = false
      + manifest                   = (known after apply)
      + max_history                = 0
      + metadata                   = (known after apply)
      + name                       = "blackbox-exporter"
      + namespace                  = "monitoring"
      + pass_credentials           = false
      + recreate_pods              = false
      + render_subchart_notes      = true
      + replace                    = false
      + repository                 = "https://prometheus-community.github.io/helm-charts"
      + reset_values               = false
      + reuse_values               = false
      + skip_crds                  = false
      + status                     = "deployed"
      + timeout                    = 300
      + values                     = [
          + <<-EOT
                "resources":
                  "limits":
                    "memory": "64Mi"
                  "requests":
                    "cpu": "10m"
                    "memory": "32Mi"
                "serviceMonitor":
                  "defaults":
                    "interval": "60s"
                  "enabled": true
                  "targets":
                  - "labels":
                      "service": "forgejo"
                      "tier": "platform"
                    "name": "forgejo"
                    "url": "http://forgejo-http.forgejo.svc.cluster.local:3000"
                  - "labels":
                      "service": "woodpecker"
                      "tier": "platform"
                    "name": "woodpecker"
                    "url": "http://woodpecker-server.woodpecker.svc.cluster.local:80"
                  - "labels":
                      "service": "grafana"
                      "tier": "platform"
                    "name": "grafana"
                    "url": "http://kube-prometheus-stack-grafana.monitoring.svc.cluster.local:80"
                  - "labels":
                      "service": "alertmanager"
                      "tier": "platform"
                    "name": "alertmanager"
                    "url": "http://kube-prometheus-stack-alertmanager.monitoring.svc.cluster.local:9093"
                  - "labels":
                      "service": "harbor"
                      "tier": "platform"
                    "name": "harbor"
                    "url": "http://harbor-core.harbor.svc.cluster.local:80/api/v2.0/health"
                  - "labels":
                      "service": "argocd"
                      "tier": "platform"
                    "name": "argocd"
                    "url": "http://argocd-server.argocd.svc.cluster.local:80"
                  - "labels":
                      "service": "keycloak"
                      "tier": "platform"
                    "name": "keycloak"
                    "url": "http://keycloak.keycloak.svc.cluster.local:8080/health/ready"
                  - "labels":
                      "service": "minio"
                      "tier": "platform"
                    "name": "minio-api"
                    "url": "http://minio.minio.svc.cluster.local:9000/minio/health/live"
                  - "labels":
                      "service": "pal-e-docs"
                      "tier": "app"
                    "name": "pal-e-docs"
                    "url": "https://pal-e-docs.tail5b443a.ts.net/api/health"
                  - "labels":
                      "service": "pal-e-app"
                      "tier": "app"
                    "name": "pal-e-app"
                    "url": "https://pal-e-app.tail5b443a.ts.net"
                  - "labels":
                      "service": "basketball-api"
                      "tier": "app"
                    "name": "basketball-api"
                    "url": "https://basketball-api.tail5b443a.ts.net/api/health"
                  - "labels":
                      "service": "westside-app"
                      "tier": "app"
                    "name": "westside-app"
                    "url": "https://westsidekingsandqueens.tail5b443a.ts.net"
                  - "labels":
                      "service": "platform-validation"
                      "tier": "app"
                    "name": "platform-validation"
                    "url": "https://platform-validation.tail5b443a.ts.net"
            EOT,
        ]
      + verify                     = false
      + version                    = "9.1.0"
      + wait                       = true
      + wait_for_jobs              = false
    }

  # helm_release.kube_prometheus_stack will be updated in-place
  ~ resource "helm_release" "kube_prometheus_stack" {
        id                         = "kube-prometheus-stack"
      ~ metadata                   = [
          - {
              - app_version    = "v0.89.0"
              - chart          = "kube-prometheus-stack"
              - first_deployed = 1771560679
              - last_deployed  = 1773513092
              - name           = "kube-prometheus-stack"
              - namespace      = "monitoring"
              - notes          = <<-EOT
                    kube-prometheus-stack has been installed. Check its status by running:
                      kubectl --namespace monitoring get pods -l "release=kube-prometheus-stack"
                    
                    Get Grafana 'admin' user password by running:
                    
                      kubectl --namespace monitoring get secrets kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
                    
                    Access Grafana local instance:
                    
                      export POD_NAME=$(kubectl --namespace monitoring get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prometheus-stack" -oname)
                      kubectl --namespace monitoring port-forward $POD_NAME 3000
                    
                    Get your grafana admin user password by running:
                    
                      kubectl get secret --namespace monitoring -l app.kubernetes.io/component=admin-secret -o jsonpath="{.items[0].data.admin-password}" | base64 --decode ; echo
                    
                    
                    Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
                    
                    1. Get your 'admin' user password by running:
                    
                       kubectl get secret --namespace monitoring kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
                    
                    
                    2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:
                    
                       kube-prometheus-stack-grafana.monitoring.svc.cluster.local
                    
                       Get the Grafana URL to visit by running these commands in the same shell:
                         export POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prometheus-stack" -o jsonpath="{.items[0].metadata.name}")
                         kubectl --namespace monitoring port-forward $POD_NAME 3000
                    
                    3. Login with the password from step 1 and the username: admin
                    
                    1. Get the application URL by running these commands:
                      export POD_NAME=$(kubectl get pods --namespace monitoring -l "app.kubernetes.io/name=prometheus-node-exporter,app.kubernetes.io/instance=kube-prometheus-stack" -o jsonpath="{.items[0].metadata.name}")
                      echo "Visit http://127.0.0.1:9100 to use your application"
                      kubectl port-forward --namespace monitoring $POD_NAME 9100
                    kube-state-metrics is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
                    The exposed metrics can be found here:
                    https://github.com/kubernetes/kube-state-metrics/blob/master/docs/README.md#exposed-metrics
                    
                    The metrics are exported on the HTTP endpoint /metrics on the listening port.
                    In your case, kube-prometheus-stack-kube-state-metrics.monitoring.svc.cluster.local:8080/metrics
                    
                    They are served either as plaintext or protobuf depending on the Accept header.
                    They are designed to be consumed either by Prometheus itself or by a scraper that is compatible with scraping a Prometheus client endpoint.
                EOT
              - revision       = 13
              - values         = jsonencode(
                    {
                      - additionalPrometheusRules = [
                          - {
                              - groups = [
                                  - {
                                      - name  = "pod-health"
                                      - rules = [
                                          - {
                                              - alert       = "PodRestartStorm"
                                              - annotations = {
                                                  - description = "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes."
                                                  - summary     = "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting frequently"
                                                }
                                              - expr        = "increase(kube_pod_container_status_restarts_total[15m]) > 3"
                                              - for         = "0m"
                                              - labels      = {
                                                  - severity = "warning"
                                                }
                                            },
                                          - {
                                              - alert       = "OOMKilled"
                                              - annotations = {
                                                  - description = "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled."
                                                  - summary     = "Pod {{ $labels.namespace }}/{{ $labels.pod }} OOMKilled"
                                                }
                                              - expr        = "kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"} > 0"
                                              - for         = "0m"
                                              - labels      = {
                                                  - severity = "critical"
                                                }
                                            },
                                        ]
                                    },
                                  - {
                                      - name  = "node-health"
                                      - rules = [
                                          - {
                                              - alert       = "DiskPressure"
                                              - annotations = {
                                                  - description = "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has only {{ $value | printf \"%.1f\" }}% space remaining."
                                                  - summary     = "Disk pressure on {{ $labels.instance }}"
                                                }
                                              - expr        = "(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15"
                                              - for         = "5m"
                                              - labels      = {
                                                  - severity = "critical"
                                                }
                                            },
                                        ]
                                    },
                                  - {
                                      - name  = "target-health"
                                      - rules = [
                                          - {
                                              - alert       = "TargetDown"
                                              - annotations = {
                                                  - description = "Target {{ $labels.job }}/{{ $labels.instance }} has been down for more than 5 minutes."
                                                  - summary     = "Target {{ $labels.instance }} is down"
                                                }
                                              - expr        = "up == 0"
                                              - for         = "5m"
                                              - labels      = {
                                                  - severity = "warning"
                                                }
                                            },
                                        ]
                                    },
                                ]
                              - name   = "platform-alerts"
                            },
                        ]
                      - alertmanager              = {
                          - alertmanagerSpec = {
                              - resources = {
                                  - limits   = {
                                      - memory = "128Mi"
                                    }
                                  - requests = {
                                      - cpu    = "10m"
                                      - memory = "64Mi"
                                    }
                                }
                              - storage   = {
                                  - volumeClaimTemplate = {
                                      - spec = {
                                          - accessModes      = [
                                              - "ReadWriteOnce",
                                            ]
                                          - resources        = {
                                              - requests = {
                                                  - storage = "1Gi"
                                                }
                                            }
                                          - storageClassName = "local-path"
                                        }
                                    }
                                }
                            }
                          - config           = {
                              - global    = {
                                  - resolve_timeout = "5m"
                                }
                              - receivers = [
                                  - {
                                      - name = "default"
                                    },
                                  - {
                                      - name             = "telegram"
                                      - telegram_configs = [
                                          - {
                                              - bot_token     = "(redacted)"
                                              - chat_id       = (redacted)
                                              - parse_mode    = "HTML"
                                              - send_resolved = true
                                            },
                                        ]
                                    },
                                ]
                              - route     = {
                                  - group_by        = [
                                      - "alertname",
                                      - "namespace",
                                    ]
                                  - group_interval  = "5m"
                                  - group_wait      = "30s"
                                  - receiver        = "telegram"
                                  - repeat_interval = "12h"
                                  - routes          = []
                                }
                            }
                        }
                      - grafana                   = {
                          - adminPassword = "(sensitive value)"
                          - persistence   = {
                              - enabled          = true
                              - size             = "2Gi"
                              - storageClassName = "local-path"
                            }
                          - resources     = {
                              - limits   = {
                                  - memory = "256Mi"
                                }
                              - requests = {
                                  - cpu    = "50m"
                                  - memory = "128Mi"
                                }
                            }
                          - sidecar       = {
                              - dashboards  = {
                                  - enabled         = true
                                  - searchNamespace = "ALL"
                                }
                              - datasources = {
                                  - enabled         = true
                                  - searchNamespace = "ALL"
                                }
                            }
                        }
                      - kube-state-metrics        = {
                          - resources = {
                              - limits   = {
                                  - memory = "128Mi"
                                }
                              - requests = {
                                  - cpu    = "10m"
                                  - memory = "32Mi"
                                }
                            }
                        }
                      - kubeControllerManager     = {
                          - enabled = false
                        }
                      - kubeEtcd                  = {
                          - enabled = false
                        }
                      - kubeProxy                 = {
                          - enabled = false
                        }
                      - kubeScheduler             = {
                          - enabled = false
                        }
                      - nodeExporter              = {
                          - resources = {
                              - limits   = {
                                  - memory = "64Mi"
                                }
                              - requests = {
                                  - cpu    = "20m"
                                  - memory = "32Mi"
                                }
                            }
                        }
                      - prometheus                = {
                          - prometheusSpec = {
                              - podMonitorSelectorNilUsesHelmValues     = false
                              - resources                               = {
                                  - limits   = {
                                      - memory = "1Gi"
                                    }
                                  - requests = {
                                      - cpu    = "200m"
                                      - memory = "512Mi"
                                    }
                                }
                              - retention                               = "15d"
                              - retentionSize                           = "10GB"
                              - ruleSelectorNilUsesHelmValues           = false
                              - serviceMonitorSelectorNilUsesHelmValues = false
                              - storageSpec                             = {
                                  - volumeClaimTemplate = {
                                      - spec = {
                                          - accessModes      = [
                                              - "ReadWriteOnce",
                                            ]
                                          - resources        = {
                                              - requests = {
                                                  - storage = "15Gi"
                                                }
                                            }
                                          - storageClassName = "local-path"
                                        }
                                    }
                                }
                            }
                        }
                    }
                )
              - version        = "82.0.0"
            },
        ] -> (known after apply)
        name                       = "kube-prometheus-stack"
      ~ values                     = [
          - (sensitive value),
          + <<-EOT
                "additionalPrometheusRules":
                - "groups":
                  - "name": "pod-health"
                    "rules":
                    - "alert": "PodRestartStorm"
                      "annotations":
                        "description": "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted
                          {{ $value }} times in the last 15 minutes."
                        "summary": "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting frequently"
                      "expr": "increase(kube_pod_container_status_restarts_total[15m]) > 3"
                      "for": "0m"
                      "labels":
                        "severity": "warning"
                    - "alert": "OOMKilled"
                      "annotations":
                        "description": "Container {{ $labels.container }} in pod {{ $labels.namespace
                          }}/{{ $labels.pod }} was OOMKilled."
                        "summary": "Pod {{ $labels.namespace }}/{{ $labels.pod }} OOMKilled"
                      "expr": "kube_pod_container_status_last_terminated_reason{reason=\"OOMKilled\"}
                        > 0"
                      "for": "0m"
                      "labels":
                        "severity": "critical"
                  - "name": "node-health"
                    "rules":
                    - "alert": "DiskPressure"
                      "annotations":
                        "description": "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance
                          }} has only {{ $value | printf \"%.1f\" }}% space remaining."
                        "summary": "Disk pressure on {{ $labels.instance }}"
                      "expr": "(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 <
                        15"
                      "for": "5m"
                      "labels":
                        "severity": "critical"
                  - "name": "target-health"
                    "rules":
                    - "alert": "TargetDown"
                      "annotations":
                        "description": "Target {{ $labels.job }}/{{ $labels.instance }} has been down
                          for more than 5 minutes."
                        "summary": "Target {{ $labels.instance }} is down"
                      "expr": "up == 0"
                      "for": "5m"
                      "labels":
                        "severity": "warning"
                  "name": "platform-alerts"
                "alertmanager":
                  "alertmanagerSpec":
                    "resources":
                      "limits":
                        "memory": "128Mi"
                      "requests":
                        "cpu": "10m"
                        "memory": "64Mi"
                    "storage":
                      "volumeClaimTemplate":
                        "spec":
                          "accessModes":
                          - "ReadWriteOnce"
                          "resources":
                            "requests":
                              "storage": "1Gi"
                          "storageClassName": "local-path"
                  "config":
                    "global":
                      "resolve_timeout": "5m"
                    "receivers":
                    - "name": "default"
                    - "name": "telegram"
                      "telegram_configs":
                      - "parse_mode": "HTML"
                        "send_resolved": true
                    - "name": "slack"
                      "slack_configs":
                      - "channel": "#alerts"
                        "send_resolved": true
                        "text": |-
                          {{ range .Alerts }}*{{ .Annotations.summary }}*
                          {{ .Annotations.description }}
                          {{ end }}
                        "title": "{{ .GroupLabels.alertname }}"
                    "route":
                      "group_by":
                      - "alertname"
                      - "namespace"
                      "group_interval": "5m"
                      "group_wait": "30s"
                      "receiver": "telegram"
                      "repeat_interval": "12h"
                      "routes":
                      - "matchers":
                        - "severity=~\"critical|warning\""
                        "receiver": "slack"
                "grafana":
                  "persistence":
                    "enabled": true
                    "size": "2Gi"
                    "storageClassName": "local-path"
                  "resources":
                    "limits":
                      "memory": "256Mi"
                    "requests":
                      "cpu": "50m"
                      "memory": "128Mi"
                  "sidecar":
                    "dashboards":
                      "enabled": true
                      "searchNamespace": "ALL"
                    "datasources":
                      "enabled": true
                      "searchNamespace": "ALL"
                "kube-state-metrics":
                  "resources":
                    "limits":
                      "memory": "128Mi"
                    "requests":
                      "cpu": "10m"
                      "memory": "32Mi"
                "kubeControllerManager":
                  "enabled": false
                "kubeEtcd":
                  "enabled": false
                "kubeProxy":
                  "enabled": false
                "kubeScheduler":
                  "enabled": false
                "nodeExporter":
                  "resources":
                    "limits":
                      "memory": "64Mi"
                    "requests":
                      "cpu": "20m"
                      "memory": "32Mi"
                "prometheus":
                  "prometheusSpec":
                    "podMonitorSelectorNilUsesHelmValues": false
                    "resources":
                      "limits":
                        "memory": "1Gi"
                      "requests":
                        "cpu": "200m"
                        "memory": "512Mi"
                    "retention": "15d"
                    "retentionSize": "10GB"
                    "ruleSelectorNilUsesHelmValues": false
                    "serviceMonitorSelectorNilUsesHelmValues": false
                    "storageSpec":
                      "volumeClaimTemplate":
                        "spec":
                          "accessModes":
                          - "ReadWriteOnce"
                          "resources":
                            "requests":
                              "storage": "15Gi"
                          "storageClassName": "local-path"
            EOT,
        ]
        # (27 unchanged attributes hidden)

      + set_sensitive {
          # At least one attribute in this block is (or was) sensitive,
          # so its contents will not be displayed.
        }

        # (3 unchanged blocks hidden)
    }

  # kubernetes_config_map_v1.dora_dashboard will be updated in-place
  ~ resource "kubernetes_config_map_v1" "dora_dashboard" {
      ~ data        = {
          ~ "dora-dashboard.json" = jsonencode(
              ~ {
                    id                   = null
                  ~ panels               = [
                        # (1 unchanged element hidden)
                        {
                            datasource  = {
                                type = "prometheus"
                                uid  = "${DS_PROMETHEUS}"
                            }
                            fieldConfig = {
                                defaults  = {
                                    color      = {
                                        mode = "thresholds"
                                    }
                                    mappings   = []
                                    thresholds = {
                                        mode  = "absolute"
                                        steps = [
                                            {
                                                color = "red"
                                                value = null
                                            },
                                            {
                                                color = "orange"
                                                value = 0.14
                                            },
                                            {
                                                color = "yellow"
                                                value = 0.5
                                            },
                                            {
                                                color = "green"
                                                value = 1
                                            },
                                        ]
                                    }
                                    unit       = "short"
                                }
                                overrides = []
                            }
                            gridPos     = {
                                h = 4
                                w = 6
                                x = 0
                                y = 1
                            }
                            id          = 2
                            options     = {
                                colorMode     = "background"
                                graphMode     = "none"
                                justifyMode   = "auto"
                                orientation   = "auto"
                                reduceOptions = {
                                    calcs  = [
                                        "lastNotNull",
                                    ]
                                    fields = ""
                                    values = false
                                }
                                textMode      = "auto"
                                wideLayout    = true
                            }
                            targets     = [
                                {
                                    datasource   = {
                                        type = "prometheus"
                                        uid  = "${DS_PROMETHEUS}"
                                    }
                                    expr         = "sum(rate(dora_deployments_total{status=\"success\"}[1d])) * 86400"
                                    legendFormat = "deploys/day"
                                    refId        = "A"
                                },
                            ]
                            title       = "Deploys / Day (All Repos)"
                            type        = "stat"
                        },
                      ~ {
                            id          = 3
                          ~ targets     = [
                              ~ {
                                  ~ expr         = "quantile(0.5, dora_pr_lead_time_seconds) / 3600" -> "histogram_quantile(0.5, sum(dora_pr_lead_time_seconds_bucket) by (le)) / 3600"
                                    # (3 unchanged attributes hidden)
                                },
                            ]
                            # (6 unchanged attributes hidden)
                        },
                        {
                            datasource  = {
                                type = "prometheus"
                                uid  = "${DS_PROMETHEUS}"
                            }
                            fieldConfig = {
                                defaults  = {
                                    color      = {
                                        mode = "thresholds"
                                    }
                                    mappings   = []
                                    max        = 100
                                    min        = 0
                                    thresholds = {
                                        mode  = "absolute"
                                        steps = [
                                            {
                                                color = "green"
                                                value = null
                                            },
                                            {
                                                color = "yellow"
                                                value = 5
                                            },
                                            {
                                                color = "orange"
                                                value = 15
                                            },
                                            {
                                                color = "red"
                                                value = 30
                                            },
                                        ]
                                    }
                                    unit       = "percent"
                                }
                                overrides = []
                            }
                            gridPos     = {
                                h = 4
                                w = 6
                                x = 12
                                y = 1
                            }
                            id          = 4
                            options     = {
                                colorMode     = "background"
                                graphMode     = "none"
                                justifyMode   = "auto"
                                orientation   = "auto"
                                reduceOptions = {
                                    calcs  = [
                                        "lastNotNull",
                                    ]
                                    fields = ""
                                    values = false
                                }
                                textMode      = "auto"
                                wideLayout    = true
                            }
                            targets     = [
                                {
                                    datasource   = {
                                        type = "prometheus"
                                        uid  = "${DS_PROMETHEUS}"
                                    }
                                    expr         = "(sum(dora_deployments_total{status=\"failure\"}) / clamp_min(sum(dora_deployments_total), 1)) * 100"
                                    legendFormat = "CFR %"
                                    refId        = "A"
                                },
                            ]
                            title       = "Change Failure Rate (All Repos)"
                            type        = "stat"
                        },
                        # (3 unchanged elements hidden)
                        {
                            collapsed = false
                            gridPos   = {
                                h = 1
                                w = 24
                                x = 0
                                y = 14
                            }
                            id        = 8
                            title     = "Lead Time for Changes"
                            type      = "row"
                        },
                      ~ {
                            id          = 9
                          ~ targets     = [
                              ~ {
                                  ~ expr         = "dora_pr_lead_time_seconds{quantile=\"0.5\", repo=~\"$repo\"} / 3600" -> "histogram_quantile(0.5, sum(dora_pr_lead_time_seconds_bucket{repo=~\"$repo\"}) by (le, repo)) / 3600"
                                    # (3 unchanged attributes hidden)
                                },
                              ~ {
                                  ~ expr         = "dora_pr_lead_time_seconds{quantile=\"0.95\", repo=~\"$repo\"} / 3600" -> "histogram_quantile(0.95, sum(dora_pr_lead_time_seconds_bucket{repo=~\"$repo\"}) by (le, repo)) / 3600"
                                    # (3 unchanged attributes hidden)
                                },
                            ]
                            # (6 unchanged attributes hidden)
                        },
                        {
                            collapsed = false
                            gridPos   = {
                                h = 1
                                w = 24
                                x = 0
                                y = 23
                            }
                            id        = 10
                            title     = "Change Failure Rate"
                            type      = "row"
                        },
                        # (3 unchanged elements hidden)
                    ]
                    tags                 = [
                        "dora",
                        "platform",
                        "metrics",
                    ]
                  ~ templating           = {
                      ~ list = [
                            {
                                current     = {}
                                hide        = 0
                                includeAll  = false
                                label       = "Prometheus"
                                multi       = false
                                name        = "DS_PROMETHEUS"
                                options     = []
                                query       = "prometheus"
                                queryValue  = ""
                                refresh     = 1
                                regex       = ""
                                skipUrlSync = false
                                type        = "datasource"
                            },
                          ~ {
                              ~ definition  = "label_values(dora_deployments_total, repo)" -> "label_values(dora_pr_merges_total, repo)"
                                name        = "repo"
                              ~ query       = {
                                  ~ query   = "label_values(dora_deployments_total, repo)" -> "label_values(dora_pr_merges_total, repo)"
                                    # (1 unchanged attribute hidden)
                                }
                                # (12 unchanged attributes hidden)
                            },
                        ]
                    }
                    # (12 unchanged attributes hidden)
                }
            )
        }
        id          = "monitoring/dora-dashboard"
        # (2 unchanged attributes hidden)

        # (1 unchanged block hidden)
    }

  # kubernetes_config_map_v1.uptime_dashboard will be created
  + resource "kubernetes_config_map_v1" "uptime_dashboard" {
      + data = {
          + "uptime-dashboard.json" = jsonencode(
                {
                  + annotations          = {
                      + list = [
                          + {
                              + builtIn    = 1
                              + datasource = {
                                  + type = "grafana"
                                  + uid  = "-- Grafana --"
                                }
                              + enable     = true
                              + hide       = true
                              + iconColor  = "rgba(0, 211, 255, 1)"
                              + name       = "Annotations & Alerts"
                              + type       = "dashboard"
                            },
                        ]
                    }
                  + editable             = true
                  + fiscalYearStartMonth = 0
                  + graphTooltip         = 1
                  + id                   = null
                  + links                = []
                  + panels               = [
                      + {
                          + collapsed = false
                          + gridPos   = {
                              + h = 1
                              + w = 24
                              + x = 0
                              + y = 0
                            }
                          + id        = 1
                          + title     = "Overview Stats"
                          + type      = "row"
                        },
                      + {
                          + datasource  = {
                              + type = "prometheus"
                              + uid  = "${DS_PROMETHEUS}"
                            }
                          + fieldConfig = {
                              + defaults  = {
                                  + color      = {
                                      + mode = "thresholds"
                                    }
                                  + mappings   = []
                                  + thresholds = {
                                      + mode  = "absolute"
                                      + steps = [
                                          + {
                                              + color = "red"
                                              + value = null
                                            },
                                          + {
                                              + color = "green"
                                              + value = 1
                                            },
                                        ]
                                    }
                                  + unit       = "short"
                                }
                              + overrides = []
                            }
                          + gridPos     = {
                              + h = 4
                              + w = 6
                              + x = 0
                              + y = 1
                            }
                          + id          = 2
                          + options     = {
                              + colorMode     = "background"
                              + graphMode     = "none"
                              + justifyMode   = "auto"
                              + orientation   = "auto"
                              + reduceOptions = {
                                  + calcs  = [
                                      + "lastNotNull",
                                    ]
                                  + fields = ""
                                  + values = false
                                }
                              + textMode      = "auto"
                              + wideLayout    = true
                            }
                          + targets     = [
                              + {
                                  + datasource   = {
                                      + type = "prometheus"
                                      + uid  = "${DS_PROMETHEUS}"
                                    }
                                  + expr         = "count(probe_success == 1)"
                                  + legendFormat = "up"
                                  + refId        = "A"
                                },
                            ]
                          + title       = "Services Up"
                          + type        = "stat"
                        },
                      + {
                          + datasource  = {
                              + type = "prometheus"
                              + uid  = "${DS_PROMETHEUS}"
                            }
                          + fieldConfig = {
                              + defaults  = {
                                  + color      = {
                                      + mode = "thresholds"
                                    }
                                  + mappings   = []
                                  + thresholds = {
                                      + mode  = "absolute"
                                      + steps = [
                                          + {
                                              + color = "green"
                                              + value = null
                                            },
                                          + {
                                              + color = "red"
                                              + value = 1
                                            },
                                        ]
                                    }
                                  + unit       = "short"
                                }
                              + overrides = []
                            }
                          + gridPos     = {
                              + h = 4
                              + w = 6
                              + x = 6
                              + y = 1
                            }
                          + id          = 3
                          + options     = {
                              + colorMode     = "background"
                              + graphMode     = "none"
                              + justifyMode   = "auto"
                              + orientation   = "auto"
                              + reduceOptions = {
                                  + calcs  = [
                                      + "lastNotNull",
                                    ]
                                  + fields = ""
                                  + values = false
                                }
                              + textMode      = "auto"
                              + wideLayout    = true
                            }
                          + targets     = [
                              + {
                                  + datasource   = {
                                      + type = "prometheus"
                                      + uid  = "${DS_PROMETHEUS}"
                                    }
                                  + expr         = "count(probe_success == 0)"
                                  + legendFormat = "down"
                                  + refId        = "A"
                                },
                            ]
                          + title       = "Services Down"
                          + type        = "stat"
                        },
                      + {
                          + datasource  = {
                              + type = "prometheus"
                              + uid  = "${DS_PROMETHEUS}"
                            }
                          + fieldConfig = {
                              + defaults  = {
                                  + color      = {
                                      + mode = "thresholds"
                                    }
                                  + mappings   = []
                                  + thresholds = {
                                      + mode  = "absolute"
                                      + steps = [
                                          + {
                                              + color = "green"
                                              + value = null
                                            },
                                          + {
                                              + color = "yellow"
                                              + value = 1
                                            },
                                          + {
                                              + color = "orange"
                                              + value = 3
                                            },
                                          + {
                                              + color = "red"
                                              + value = 5
                                            },
                                        ]
                                    }
                                  + unit       = "s"
                                }
                              + overrides = []
                            }
                          + gridPos     = {
                              + h = 4
                              + w = 6
                              + x = 12
                              + y = 1
                            }
                          + id          = 4
                          + options     = {
                              + colorMode     = "background"
                              + graphMode     = "none"
                              + justifyMode   = "auto"
                              + orientation   = "auto"
                              + reduceOptions = {
                                  + calcs  = [
                                      + "lastNotNull",
                                    ]
                                  + fields = ""
                                  + values = false
                                }
                              + textMode      = "auto"
                              + wideLayout    = true
                            }
                          + targets     = [
                              + {
                                  + datasource   = {
                                      + type = "prometheus"
                                      + uid  = "${DS_PROMETHEUS}"
                                    }
                                  + expr         = "avg(probe_duration_seconds)"
                                  + legendFormat = "avg"
                                  + refId        = "A"
                                },
                            ]
                          + title       = "Average Response Time"
                          + type        = "stat"
                        },
                      + {
                          + datasource  = {
                              + type = "prometheus"
                              + uid  = "${DS_PROMETHEUS}"
                            }
                          + fieldConfig = {
                              + defaults  = {
                                  + color      = {
                                      + mode = "thresholds"
                                    }
                                  + mappings   = []
                                  + max        = 100
                                  + min        = 0
                                  + thresholds = {
                                      + mode  = "absolute"
                                      + steps = [
                                          + {
                                              + color = "red"
                                              + value = null
                                            },
                                          + {
                                              + color = "yellow"
                                              + value = 99
                                            },
                                          + {
                                              + color = "green"
                                              + value = 99.9
                                            },
                                        ]
                                    }
                                  + unit       = "percent"
                                }
                              + overrides = []
                            }
                          + gridPos     = {
                              + h = 4
                              + w = 6
                              + x = 18
                              + y = 1
                            }
                          + id          = 5
                          + options     = {
                              + colorMode     = "background"
                              + graphMode     = "none"
                              + justifyMode   = "auto"
                              + orientation   = "auto"
                              + reduceOptions = {
                                  + calcs  = [
                                      + "lastNotNull",
                                    ]
                 ...(truncated)
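The plan output above is truncated, but the three new resources it introduces can be spot-checked after apply. Below is a minimal sketch of those checks; the pod label selector and the Prometheus service name are assumptions based on the usual chart defaults for `prometheus-blackbox-exporter` and `kube-prometheus-stack`, so adjust them if the releases use different names.

```sh
# Sketch of post-apply checks (names below are assumptions, not taken from the plan).

# 1. Blackbox Exporter pod is running in the monitoring namespace
#    (label assumes the chart's default app.kubernetes.io/name value)
kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus-blackbox-exporter

# 2. Prometheus is scraping the probes: port-forward the Prometheus service
#    (service name assumes the kube-prometheus-stack default) and run the same
#    query the "Services Up" stat panel uses
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090 &
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(probe_success == 1)'
# The result should match the number of targets configured in the ServiceMonitor.

# 3. A generic 24h uptime percentage per probe target (the dashboard's own
#    expression is cut off in the truncated plan above, so this is only an
#    illustration of the usual pattern):
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time(probe_success[24h]) * 100'
```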
forgejo_admin deleted branch 66-feat-synthetic-monitoring-dora-dashboard 2026-03-14 21:27:01 +00:00