Add blackbox probes for westside-contracts, westside-email, westside-ai-assistant #324
Reference
ldraney/pal-e-platform#324
Type
Feature
Lineage
Standalone — discovered 2026-05-01 during alert-state audit. No parent issue.
Repo
forgejo_admin/pal-e-platform
User Story
As an oncall engineer, I want a single Grafana glance to tell me whether the westside platform is up, so that triage doesn't require knowing which of six namespaces holds the failing service.
Context
Today only westside-app and basketball-api have blackbox probes. The other three westside services (westside-contracts, westside-email, westside-ai-assistant) have no probes and no ServiceMonitor. A contract-signing or email-blast outage would only surface as user complaints because nothing in Prometheus knows the service exists.
Verified state:
The westside-ai-assistant namespace has a healthy pod (westside-ai-assistant-7999594d89-fml4n) and a permanently-broken pod (westside-ai-assistant-8586c7c767-7xv6c, in ImagePullBackOff for 27d). The probe must target the service, not a specific pod, so it follows the healthy endpoint.
File Targets
Files to modify:
terraform/modules/monitoring/main.tf: targets list under the blackbox-exporter helm values block (~line 405–430). Add three new entries with consistent labels (tier: app).
Files NOT to touch:
terraform/dashboards/*: dashboard updates are a separate ticket
Acceptance Criteria
A probe target for westside-contracts exists with cluster-internal URL and labels service=westside-contracts, tier=app
A probe target for westside-email exists with same shape
A probe target for westside-ai-assistant exists with same shape (targets the service, not the broken pod)
All three probes report probe_success=1 after deploy
EndpointDown covers them automatically once probes exist
Test Expectations
tofu validate passes
tofu plan -lock=false shows only the expected three new probe configurations
kubectl exec -n monitoring prometheus-... -- wget -qO- 'http://localhost:9090/api/v1/query?query=probe_success{target=~"westside-(contracts|email|ai-assistant)"}' returns three results, all =1
Constraints
Follow the existing probe entries in main.tf (same labels, same URL pattern)
Use cluster-internal URLs (http://<svc>.<ns>.svc.cluster.local:<port>/<health-path>); avoid Tailscale funnel hostnames to dodge TLS hairpin
Checklist
Related
pal-e-platform (project)
alert-report-2026-05-01 (alert snapshot)
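
Illustrative sketch for the File Targets and Constraints sections: one possible shape for the three new entries, assuming the existing targets list in main.tf uses the name / url / labels fields of the prometheus-blackbox-exporter chart's serviceMonitor.targets. The local name westside_probe_targets is hypothetical, and <ns>, <port> and <health-path> are unfilled placeholders taken from the URL pattern above, not real values; the actual entries should mirror whatever the existing westside-app and basketball-api targets look like.

```hcl
locals {
  # Hypothetical local: builds one target object per unprobed westside service.
  westside_probe_targets = [
    for svc in ["westside-contracts", "westside-email", "westside-ai-assistant"] : {
      name = svc

      # Placeholder URL: replace <ns>, <port> and <health-path> per service.
      # It points at the Service DNS name, never at an individual pod.
      url = "http://${svc}.<ns>.svc.cluster.local:<port>/<health-path>"

      labels = {
        service = svc
        tier    = "app"
      }
    }
  ]
}
```

How this list is merged into the helm values (inline entries, concat with the existing list, or something else) depends on how main.tf is structured today, so treat the above as a starting point rather than a drop-in change.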