Add landscaping-assistant to blackbox uptime monitoring

ldraney commented

2026-05-25 03:05:10 +00:00

Owner

Type

Feature

Lineage

Standalone -- discovered during DORA metrics gap analysis. MTTR calculation relies on uptime probes to detect failures, but landscaping-assistant is not in the blackbox exporter target list.

Repo

ldraney/pal-e-platform

User Story

As a platform operator
I want landscaping-assistant included in blackbox uptime monitoring
So that deployment failures are detected and MTTR can be measured accurately

Context

The platform uses prometheus-blackbox-exporter for synthetic uptime monitoring of all services. Currently monitored apps include basketball-api, westside-app, platform-validation, and playme2k -- but landscaping-assistant is missing from the target list in pal-e-platform/terraform/modules/monitoring/main.tf. Without uptime probes, the DORA MTTR metric has no failure signal for this app. The blackbox exporter fires EndpointDown alerts when probe_success == 0 for 5 minutes, which feeds the change failure rate calculation.

File Targets

Files the agent should modify or create:

terraform/modules/monitoring/main.tf -- add landscaping-assistant to serviceMonitor.targets in the blackbox_exporter helm release

Files the agent should NOT touch:

terraform/dashboards/dora-dashboard.json -- dashboard is repo-agnostic
Any landscaping-assistant app code

Acceptance Criteria

landscaping-assistant appears in blackbox exporter targets with correct internal cluster URL
probe_success{target=~".*landscaping.*"} metric appears in Prometheus after terraform apply
EndpointDown alert would fire if the app goes down
Uptime dashboard shows landscaping-assistant alongside other apps

Test Expectations

Manual: run terraform plan and verify only the blackbox exporter resource changes
Manual: after apply, query probe_success{target=~".*landscaping.*"} in Prometheus
Run command: terraform plan -target=helm_release.blackbox_exporter

Constraints

Follow existing target pattern: use internal cluster URL (e.g. http://landscaping-assistant.landscaping-assistant.svc.cluster.local:3000)
Use tier = "app" label to match other application services
Verify the correct namespace and service name from k8s before adding

Checklist

PR opened against pal-e-platform
terraform plan shows expected diff
No unrelated changes

project-landscaping-assistant -- project this affects
Issue #15 -- ServiceMonitor label mismatch (related but separate)

### Type Feature ### Lineage Standalone -- discovered during DORA metrics gap analysis. MTTR calculation relies on uptime probes to detect failures, but landscaping-assistant is not in the blackbox exporter target list. ### Repo `ldraney/pal-e-platform` ### User Story As a platform operator I want landscaping-assistant included in blackbox uptime monitoring So that deployment failures are detected and MTTR can be measured accurately ### Context The platform uses prometheus-blackbox-exporter for synthetic uptime monitoring of all services. Currently monitored apps include basketball-api, westside-app, platform-validation, and playme2k -- but landscaping-assistant is missing from the target list in `pal-e-platform/terraform/modules/monitoring/main.tf`. Without uptime probes, the DORA MTTR metric has no failure signal for this app. The blackbox exporter fires `EndpointDown` alerts when `probe_success == 0` for 5 minutes, which feeds the change failure rate calculation. ### File Targets Files the agent should modify or create: - `terraform/modules/monitoring/main.tf` -- add landscaping-assistant to `serviceMonitor.targets` in the blackbox_exporter helm release Files the agent should NOT touch: - `terraform/dashboards/dora-dashboard.json` -- dashboard is repo-agnostic - Any landscaping-assistant app code ### Acceptance Criteria - [ ] landscaping-assistant appears in blackbox exporter targets with correct internal cluster URL - [ ] `probe_success{target=~".*landscaping.*"}` metric appears in Prometheus after terraform apply - [ ] EndpointDown alert would fire if the app goes down - [ ] Uptime dashboard shows landscaping-assistant alongside other apps ### Test Expectations - [ ] Manual: run `terraform plan` and verify only the blackbox exporter resource changes - [ ] Manual: after apply, query `probe_success{target=~".*landscaping.*"}` in Prometheus - Run command: `terraform plan -target=helm_release.blackbox_exporter` ### Constraints - Follow existing target pattern: use internal cluster URL (e.g. `http://landscaping-assistant.landscaping-assistant.svc.cluster.local:3000`) - Use `tier = "app"` label to match other application services - Verify the correct namespace and service name from k8s before adding ### Checklist - [ ] PR opened against pal-e-platform - [ ] terraform plan shows expected diff - [ ] No unrelated changes ### Related - `project-landscaping-assistant` -- project this affects - Issue #15 -- ServiceMonitor label mismatch (related but separate)

ldraney referenced this issue

2026-05-25 03:05:31 +00:00

[DUPLICATE of #16] Create golden signals Grafana dashboard #22

ldraney referenced this issue

2026-05-29 12:13:02 +00:00

Observability & DORA metrics stack [PARENT] #43

ldraney commented

2026-05-29 23:36:51 +00:00

Author

Owner

Scope Addition: Post-Deploy Smoke Test

Per testing strategy audit (docs/testing-strategy.md), this issue should also cover a post-deploy smoke test — a one-time validation that runs after each ArgoCD sync to verify the deployment is healthy beyond just the k8s readiness probe.

Additional acceptance criteria:

After ArgoCD syncs a new image, hit the production URL and verify key pages return 200 with expected content
GET / (root) returns 200 with queue content
GET /up returns 200
GET /uploads returns 200

Implementation consideration: ArgoCD sync is async after Harbor push. Options:

Polling step in Woodpecker that waits for the new SHA to appear in the running pod
ArgoCD webhook triggers a separate Woodpecker pipeline
Woodpecker cron pipeline that checks production health periodically

This overlaps with but is distinct from uptime monitoring — uptime monitors catch ongoing failures, the smoke test catches deploy-specific regressions immediately.

### Scope Addition: Post-Deploy Smoke Test Per testing strategy audit (docs/testing-strategy.md), this issue should also cover a **post-deploy smoke test** — a one-time validation that runs after each ArgoCD sync to verify the deployment is healthy beyond just the k8s readiness probe. **Additional acceptance criteria:** - [ ] After ArgoCD syncs a new image, hit the production URL and verify key pages return 200 with expected content - [ ] `GET /` (root) returns 200 with queue content - [ ] `GET /up` returns 200 - [ ] `GET /uploads` returns 200 **Implementation consideration:** ArgoCD sync is async after Harbor push. Options: 1. Polling step in Woodpecker that waits for the new SHA to appear in the running pod 2. ArgoCD webhook triggers a separate Woodpecker pipeline 3. Woodpecker cron pipeline that checks production health periodically This overlaps with but is distinct from uptime monitoring — uptime monitors catch _ongoing_ failures, the smoke test catches _deploy-specific_ regressions immediately.

ldraney referenced this issue

2026-06-01 07:18:02 +00:00

Add system test infrastructure and critical path browser specs #54

ldraney referenced this issue from ldraney/pal-e-platform

2026-06-03 03:52:44 +00:00

Add landscaping-assistant to blackbox uptime monitoring #399

ldraney referenced this issue from a pull request from ldraney/pal-e-platform that will close it,

2026-06-03 03:53:52 +00:00

Add landscaping-assistant to blackbox uptime monitoring #400

ldraney closed this issue

2026-06-03 04:55:14 +00:00

ldraney referenced this issue