Add landscaping-assistant to blackbox uptime monitoring #21

Closed
opened 2026-05-25 03:05:10 +00:00 by ldraney · 1 comment
Owner

Type

Feature

Lineage

Standalone -- discovered during DORA metrics gap analysis. MTTR calculation relies on uptime probes to detect failures, but landscaping-assistant is not in the blackbox exporter target list.

Repo

ldraney/pal-e-platform

User Story

As a platform operator
I want landscaping-assistant included in blackbox uptime monitoring
So that deployment failures are detected and MTTR can be measured accurately

Context

The platform uses prometheus-blackbox-exporter for synthetic uptime monitoring of all services. Currently monitored apps include basketball-api, westside-app, platform-validation, and playme2k -- but landscaping-assistant is missing from the target list in pal-e-platform/terraform/modules/monitoring/main.tf. Without uptime probes, the DORA MTTR metric has no failure signal for this app. The blackbox exporter fires EndpointDown alerts when probe_success == 0 for 5 minutes, which feeds the change failure rate calculation.

File Targets

Files the agent should modify or create:

  • terraform/modules/monitoring/main.tf -- add landscaping-assistant to serviceMonitor.targets in the blackbox_exporter helm release

Files the agent should NOT touch:

  • terraform/dashboards/dora-dashboard.json -- dashboard is repo-agnostic
  • Any landscaping-assistant app code

Acceptance Criteria

  • landscaping-assistant appears in blackbox exporter targets with correct internal cluster URL
  • probe_success{target=~".*landscaping.*"} metric appears in Prometheus after terraform apply
  • EndpointDown alert would fire if the app goes down
  • Uptime dashboard shows landscaping-assistant alongside other apps

Test Expectations

  • Manual: run terraform plan and verify only the blackbox exporter resource changes
  • Manual: after apply, query probe_success{target=~".*landscaping.*"} in Prometheus
  • Run command: terraform plan -target=helm_release.blackbox_exporter

Constraints

  • Follow existing target pattern: use internal cluster URL (e.g. http://landscaping-assistant.landscaping-assistant.svc.cluster.local:3000)
  • Use tier = "app" label to match other application services
  • Verify the correct namespace and service name from k8s before adding

Checklist

  • PR opened against pal-e-platform
  • terraform plan shows expected diff
  • No unrelated changes
  • project-landscaping-assistant -- project this affects
  • Issue #15 -- ServiceMonitor label mismatch (related but separate)
### Type Feature ### Lineage Standalone -- discovered during DORA metrics gap analysis. MTTR calculation relies on uptime probes to detect failures, but landscaping-assistant is not in the blackbox exporter target list. ### Repo `ldraney/pal-e-platform` ### User Story As a platform operator I want landscaping-assistant included in blackbox uptime monitoring So that deployment failures are detected and MTTR can be measured accurately ### Context The platform uses prometheus-blackbox-exporter for synthetic uptime monitoring of all services. Currently monitored apps include basketball-api, westside-app, platform-validation, and playme2k -- but landscaping-assistant is missing from the target list in `pal-e-platform/terraform/modules/monitoring/main.tf`. Without uptime probes, the DORA MTTR metric has no failure signal for this app. The blackbox exporter fires `EndpointDown` alerts when `probe_success == 0` for 5 minutes, which feeds the change failure rate calculation. ### File Targets Files the agent should modify or create: - `terraform/modules/monitoring/main.tf` -- add landscaping-assistant to `serviceMonitor.targets` in the blackbox_exporter helm release Files the agent should NOT touch: - `terraform/dashboards/dora-dashboard.json` -- dashboard is repo-agnostic - Any landscaping-assistant app code ### Acceptance Criteria - [ ] landscaping-assistant appears in blackbox exporter targets with correct internal cluster URL - [ ] `probe_success{target=~".*landscaping.*"}` metric appears in Prometheus after terraform apply - [ ] EndpointDown alert would fire if the app goes down - [ ] Uptime dashboard shows landscaping-assistant alongside other apps ### Test Expectations - [ ] Manual: run `terraform plan` and verify only the blackbox exporter resource changes - [ ] Manual: after apply, query `probe_success{target=~".*landscaping.*"}` in Prometheus - Run command: `terraform plan -target=helm_release.blackbox_exporter` ### Constraints - Follow existing target pattern: use internal cluster URL (e.g. `http://landscaping-assistant.landscaping-assistant.svc.cluster.local:3000`) - Use `tier = "app"` label to match other application services - Verify the correct namespace and service name from k8s before adding ### Checklist - [ ] PR opened against pal-e-platform - [ ] terraform plan shows expected diff - [ ] No unrelated changes ### Related - `project-landscaping-assistant` -- project this affects - Issue #15 -- ServiceMonitor label mismatch (related but separate)
Author
Owner

Scope Addition: Post-Deploy Smoke Test

Per testing strategy audit (docs/testing-strategy.md), this issue should also cover a post-deploy smoke test — a one-time validation that runs after each ArgoCD sync to verify the deployment is healthy beyond just the k8s readiness probe.

Additional acceptance criteria:

  • After ArgoCD syncs a new image, hit the production URL and verify key pages return 200 with expected content
  • GET / (root) returns 200 with queue content
  • GET /up returns 200
  • GET /uploads returns 200

Implementation consideration: ArgoCD sync is async after Harbor push. Options:

  1. Polling step in Woodpecker that waits for the new SHA to appear in the running pod
  2. ArgoCD webhook triggers a separate Woodpecker pipeline
  3. Woodpecker cron pipeline that checks production health periodically

This overlaps with but is distinct from uptime monitoring — uptime monitors catch ongoing failures, the smoke test catches deploy-specific regressions immediately.

### Scope Addition: Post-Deploy Smoke Test Per testing strategy audit (docs/testing-strategy.md), this issue should also cover a **post-deploy smoke test** — a one-time validation that runs after each ArgoCD sync to verify the deployment is healthy beyond just the k8s readiness probe. **Additional acceptance criteria:** - [ ] After ArgoCD syncs a new image, hit the production URL and verify key pages return 200 with expected content - [ ] `GET /` (root) returns 200 with queue content - [ ] `GET /up` returns 200 - [ ] `GET /uploads` returns 200 **Implementation consideration:** ArgoCD sync is async after Harbor push. Options: 1. Polling step in Woodpecker that waits for the new SHA to appear in the running pod 2. ArgoCD webhook triggers a separate Woodpecker pipeline 3. Woodpecker cron pipeline that checks production health periodically This overlaps with but is distinct from uptime monitoring — uptime monitors catch _ongoing_ failures, the smoke test catches _deploy-specific_ regressions immediately.
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
ldraney/landscaping-assistant#21
No description provided.