Bug: Blackbox probe TLS failure on pal-e-app Tailscale funnel #168
Type
Bug
Lineage
standalone — discovered during AlertManager triage 2026-03-26
Repo
forgejo_admin/pal-e-platform
What Broke
Blackbox exporter inside the cluster gets a TLS handshake failure (unexpected eof while reading) when probing https://pal-e-app.tail5b443a.ts.net. The app itself returns 200 on the internal service endpoint (http://pal-e-app.pal-e-app.svc:3000). The EndpointDown critical alert has been firing since 2026-03-21 (5 days).
Root cause is likely hairpin routing: the blackbox exporter pod inside the cluster tries to reach the external Tailscale funnel URL, which doesn't route back correctly through the Tailscale proxy.
Repro Steps
EndpointDown alert firing for pal-e-app
curl https://pal-e-app.tail5b443a.ts.net → TLS error
curl http://pal-e-app.pal-e-app.svc:3000 → 200 OK
Expected Behavior
Blackbox probe succeeds.
EndpointDown alert does not fire when the service is healthy.
Environment
EndpointDown (critical), firing since 2026-03-21
Acceptance Criteria
EndpointDown alert clears in AlertManager
Related
project-pal-e-platform - project
story:superuser-observe - user story
arch:blackbox-exporter, arch:tailscale-funnel - architecture components
Scope Review: NEEDS_REFINEMENT
Review note: review-385-2026-03-26
Root cause confirmed: the pal-e-app probe at terraform/main.tf:455 uses the external funnel URL, which fails TLS hairpin from inside the cluster. Fix is to switch to the internal http://pal-e-app.pal-e-app.svc:3000 (same pattern as the Keycloak fix in #117).
Issues to address before moving to next_up:
terraform/main.tf lines 454-457
tofu plan -lock=false + Prometheus probe_success query
4213fde
Scope Correction (post-review)
Per review review-385-2026-03-26, expanding scope and adding file targets.
Expanded Scope
Batch all 4 hairpin probes, not just pal-e-app. Four blackbox probes use external Tailscale funnel URLs that fail TLS from inside the cluster (hairpin routing). Other probes already use internal URLs and work fine.
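The split between failing and working probes can be sketched as a small classifier. This is a minimal illustration, not part of the repository: the function name is_funnel_target is hypothetical, and the rule that a hostname ending in .ts.net marks an external Tailscale funnel URL is an assumption drawn from the URLs quoted in this issue.

```python
from urllib.parse import urlparse


def is_funnel_target(url: str) -> bool:
    """Return True if the probe URL points at an external Tailscale funnel host.

    Assumption: funnel hostnames end in ".ts.net", as in the URLs quoted
    in this issue. Internal service URLs use ".svc" style hostnames.
    """
    host = urlparse(url).hostname or ""
    return host.endswith(".ts.net")


# Probe URLs quoted in this issue (two external funnel, one internal service).
targets = [
    "https://pal-e-app.tail5b443a.ts.net",
    "https://pal-e-docs.tail5b443a.ts.net/healthz",
    "http://pal-e-app.pal-e-app.svc:3000",
]

# Targets that would hit the hairpin path when probed from inside the cluster.
hairpin_candidates = [t for t in targets if is_funnel_target(t)]
print(hairpin_candidates)
```

A check like this could be run over the blackbox target list to confirm that no funnel URLs remain after the change; it says nothing about whether the internal URL itself is correct.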
File Targets
All in pal-e-platform/terraform/main.tf within helm_release.blackbox_exporter values:
pal-e-docs → change from https://pal-e-docs.tail5b443a.ts.net/healthz to internal service URL
pal-e-app → change from https://pal-e-app.tail5b443a.ts.net to http://pal-e-app.pal-e-app.svc:3000
westside-app → change from https://westsidekingsandqueens.tail5b443a.ts.net to internal service URL
westside-dev → change from https://westside-dev.tail5b443a.ts.net to internal service URL
Files NOT to touch: Other probe targets already using internal URLs (basketball-api, platform-validation, forgejo, woodpecker, etc.)
Precedent
Issue #117 / commit 4213fde: same fix for the Keycloak probe. Switch from external funnel URL to internal service URL.
Acceptance Criteria (updated)
EndpointDown alert clears for pal-e-app
tofu plan shows only the 4 probe URL changes
Scope Review: NEEDS_REFINEMENT
Review note: review-385-2026-03-26b
File targets verified: all 4 external funnel probe URLs confirmed at lines 451, 456, 466, 471 of terraform/main.tf. Precedent commit 4213fde confirmed. No NetworkPolicy blockers. No board dependencies.
Issues to address before next_up:
westside-app and westside-dev services live in namespace westsidekingsandqueens, not their own namespace. Agent would guess wrong. Verified URLs:
http://pal-e-docs.pal-e-docs.svc.cluster.local:8000/healthz
http://pal-e-app.pal-e-app.svc.cluster.local:3000 (already correct in scope)
http://westside-app.westsidekingsandqueens.svc.cluster.local:3000
http://westside-dev.westsidekingsandqueens.svc.cluster.local:80
tofu plan -lock=false + PromQL probe_success{job="blackbox"} verification
Refinement: exact internal URLs (namespace trap)
Per review review-385-2026-03-26b:
Critical: Namespace Mismatch
Both westside services live in the westsidekingsandqueens namespace, NOT in westside-app/westside-dev namespaces. An agent would construct wrong URLs without this.
Exact Internal URLs (verified by review agent)
pal-e-docs → http://pal-e-docs.pal-e-docs.svc.cluster.local:8000/healthz
pal-e-app → http://pal-e-app.pal-e-app.svc.cluster.local:3000
westside-app → http://westside-app.westsidekingsandqueens.svc.cluster.local:3000
westside-dev → http://westside-dev.westsidekingsandqueens.svc.cluster.local:80
Test Expectations (added)
tofu plan -lock=false shows only the 4 URL changes
probe_success == 1 (verify via Prometheus query or AlertManager clearing)
Scope Review: READY
Review note: review-385-2026-03-26c
Re-review after 3 scope corrections. All previous NEEDS_REFINEMENT issues resolved:
westsidekingsandqueens namespace for both westside services
tofu plan -lock=false + PromQL verification
terraform/main.tf (scope comments say ~451/~456/~466/~471, off by ~30 lines, but probe names make targets unambiguous)
One minor gap: Checklist section still missing (agents follow standard PR workflow regardless). One suggestion for implementing agent: update the code comment at line 481 ("external URLs") to reflect the new internal URL strategy.
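The PromQL verification step could be scripted along these lines. This is a sketch under assumptions, not the project's tooling: the helper name failing_targets is hypothetical, the sample payload is hand-written for illustration rather than real query output, and it assumes the instance label carries the probed URL, which depends on how the blackbox job is relabeled. The input is the JSON body of a Prometheus instant query (/api/v1/query) for probe_success.

```python
import json


def failing_targets(response_text: str) -> list[str]:
    """Return instance labels whose probe_success sample is not 1.

    Expects the JSON body of a Prometheus instant query that returned
    a vector result.
    """
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    failing = []
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # sample value arrives as a string
        if float(value) != 1.0:
            failing.append(sample["metric"].get("instance", "<unknown>"))
    return failing


# Hand-written sample payload (illustrative only, not real query output).
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "http://pal-e-app.pal-e-app.svc.cluster.local:3000"},
             "value": [1774500000, "1"]},
            {"metric": {"instance": "http://westside-dev.westsidekingsandqueens.svc.cluster.local:80"},
             "value": [1774500000, "0"]},
        ],
    },
})

print(failing_targets(sample))
```

An empty list would correspond to the probe_success == 1 expectation for every returned target; any remaining entry names a probe that still needs attention.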