Add blackbox probes for westside-contracts, westside-email, westside-ai-assistant #324

Open
opened 2026-05-02 14:51:02 +00:00 by forgejo_admin · 0 comments

Type

Feature

Lineage

Standalone — discovered 2026-05-01 during alert-state audit. No parent issue.

Repo

forgejo_admin/pal-e-platform

User Story

As an oncall engineer, I want a single Grafana glance to tell me whether the westside platform is up, so that triage doesn't require knowing which of six namespaces holds the failing service.

Context

Today only westside-app and basketball-api have blackbox probes. The other three westside services — westside-contracts, westside-email, westside-ai-assistant — have no probes and no ServiceMonitor. A contract-signing or email-blast outage would only surface as user complaints because nothing in Prometheus knows the service exists.

Verified state:

$ kubectl get servicemonitor -A | grep westside
basketball-api/basketball-api    (✓ scraped)
monitoring/...basketball-api     (✓ probed)
monitoring/...westside-app       (✓ probed)
# no probes/scrapes for: westside-contracts, westside-email, westside-ai-assistant

The westside-ai-assistant namespace has a healthy pod (westside-ai-assistant-7999594d89-fml4n) and a permanently-broken pod (westside-ai-assistant-8586c7c767-7xv6c in ImagePullBackOff for 27d). The probe must target the service, not a specific pod, so it follows the healthy endpoint.
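One way to see why a service-level target follows the healthy pod is to list the pods the Service currently routes to. The helper below is a minimal sketch, not part of the repo: the function name `ready_pods` is hypothetical, and it assumes the Service is named after its namespace (`westside-ai-assistant`), which this issue does not confirm.

```shell
#!/bin/sh
# ready_pods NS SVC   (hypothetical helper name)
# Prints the names of the pods currently listed as ready endpoints of
# the Service. Pods that fail readiness (for example one stuck in
# ImagePullBackOff) are recorded under notReadyAddresses instead and
# are therefore not printed.
ready_pods() {
  kubectl get endpoints -n "$1" "$2" \
    -o jsonpath='{.subsets[*].addresses[*].targetRef.name}'
}

# Example call (assumes the Service shares the namespace's name):
#   ready_pods westside-ai-assistant westside-ai-assistant
```

If the assumption holds, only the healthy pod should be listed, so a probe aimed at the Service DNS name never lands on the broken replica.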

File Targets

Files to modify:

  • terraform/modules/monitoring/main.tf: the targets list under the blackbox-exporter helm values block (~line 405–430). Add three new entries with consistent labels (tier: app).

Files NOT to touch:

  • existing probe entries (don't reorder or relabel)
  • terraform/dashboards/* — dashboard updates are a separate ticket

Acceptance Criteria

  • Probe target westside-contracts exists with cluster-internal URL and labels service=westside-contracts, tier=app
  • Probe target westside-email exists with same shape
  • Probe target westside-ai-assistant exists with same shape (targets the service, not the broken pod)
  • All three return probe_success=1 after deploy
  • No new alert rules — EndpointDown covers them automatically once probes exist

Test Expectations

  • tofu validate passes
  • tofu plan -lock=false shows only the expected three new probe configurations
  • After deploy: kubectl exec -n monitoring prometheus-... -- wget -qO- 'http://localhost:9090/api/v1/query?query=probe_success{target=~"westside-(contracts|email|ai-assistant)"}' returns three results, all =1
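The "three results, all =1" check can be scripted rather than read by eye. The function below is a rough sketch under stated assumptions: `check_probes` is a hypothetical name, and the pattern matching assumes the compact JSON that the Prometheus query API normally returns (no whitespace inside the `"value":[<timestamp>,"1"]` array). A JSON-aware tool would be more robust if one is available.

```shell
#!/bin/sh
# check_probes EXPECTED_COUNT   (hypothetical helper name)
# Reads a Prometheus instant-query response on stdin. Succeeds only if
# the response holds exactly EXPECTED_COUNT series and every one of
# them carries the sample value "1".
check_probes() {
  expected="$1"
  body="$(cat)"
  # Count every returned sample, then only the samples whose value is "1".
  total="$(printf '%s' "$body" | grep -o '"value":\[' | wc -l | tr -d ' ')"
  up="$(printf '%s' "$body" | grep -oE '"value":\[[0-9.]+,"1"\]' | wc -l | tr -d ' ')"
  echo "series=$total up=$up expected=$expected"
  [ "$total" -eq "$expected" ] && [ "$up" -eq "$expected" ]
}
```

To use it, pipe the output of the `kubectl exec ... wget` command above into `check_probes 3`; a non-zero exit status means a probe is missing or reporting 0.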

Constraints

  • Match the existing target entry style in main.tf (same labels, same URL pattern)
  • Use cluster-internal URLs (http://<svc>.<ns>.svc.cluster.local:<port>/<health-path>); avoid Tailscale funnel hostnames to dodge TLS hairpin
  • Health-check paths: pick the same path the service's existing readiness probe uses (consult deployment manifest)
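The URL and health-path constraints can be sketched as two small helpers. Both function names are hypothetical. `readiness_path` assumes the Deployment defines an httpGet readiness probe on its first container; it prints nothing for tcpSocket or exec probes, or when no probe is set.

```shell
#!/bin/sh
# probe_url SVC NS PORT HEALTH_PATH   (hypothetical helper name)
# Prints the cluster-internal URL in the shape required above:
#   http://<svc>.<ns>.svc.cluster.local:<port>/<health-path>
# A leading slash on HEALTH_PATH is stripped so it is not doubled.
probe_url() {
  printf 'http://%s.%s.svc.cluster.local:%s/%s\n' "$1" "$2" "$3" "${4#/}"
}

# readiness_path NS DEPLOYMENT   (hypothetical helper name)
# Prints the HTTP path used by the first container's readiness probe.
# Assumption: the probe is of type httpGet.
readiness_path() {
  kubectl get deployment -n "$1" "$2" \
    -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.httpGet.path}'
}

# Example call. The port 8080 is a placeholder, and the Deployment is
# assumed to share the namespace's name; check both against the manifest:
#   probe_url westside-email westside-email 8080 \
#     "$(readiness_path westside-email westside-email)"
```

Deriving the path from the live Deployment keeps the probe aligned with whatever the readiness probe already checks, which is the intent of the last constraint.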

Checklist

  • PR opened
  • tofu validate + fmt clean
  • No unrelated changes

Related

  • pal-e-platform — project
  • alert-report-2026-05-01 — alert snapshot