Bug: Blackbox probe TLS failure on pal-e-app Tailscale funnel #168

Closed
opened 2026-03-26 15:22:27 +00:00 by forgejo_admin · 5 comments

Type

Bug

Lineage

standalone — discovered during AlertManager triage 2026-03-26

Repo

forgejo_admin/pal-e-platform

What Broke

Blackbox exporter inside the cluster gets TLS handshake failure (unexpected eof while reading) when probing https://pal-e-app.tail5b443a.ts.net. The app itself returns 200 on the internal service endpoint (http://pal-e-app.pal-e-app.svc:3000). The EndpointDown critical alert has been firing since 2026-03-21 (5 days).

Root cause is likely hairpin routing — blackbox exporter pod inside the cluster tries to reach the external Tailscale funnel URL, which doesn't route back correctly through the Tailscale proxy.

Repro Steps

  1. Check AlertManager: EndpointDown alert firing for pal-e-app
  2. From inside cluster: curl https://pal-e-app.tail5b443a.ts.net → TLS error
  3. From inside cluster: curl http://pal-e-app.pal-e-app.svc:3000 → 200 OK
  4. Observe: external funnel URL fails TLS from inside the cluster

Expected Behavior

Blackbox probe succeeds. EndpointDown alert does not fire when the service is healthy.

Environment

  • Cluster/namespace: monitoring (blackbox-exporter), pal-e-app (target)
  • Related alerts: EndpointDown (critical), firing since 2026-03-21

Acceptance Criteria

  • Blackbox probe succeeds for pal-e-app
  • EndpointDown alert clears in AlertManager
  • No regression on other blackbox probes (13 total)

Related

  • project-pal-e-platform — project
  • story:superuser-observe — user story
  • arch:blackbox-exporter, arch:tailscale-funnel — architecture components
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-385-2026-03-26

Root cause confirmed: pal-e-app probe at terraform/main.tf:455 uses external funnel URL, which fails TLS hairpin from inside the cluster. Fix is to switch to internal http://pal-e-app.pal-e-app.svc:3000 (same pattern as Keycloak fix in #117).

Issues to address before moving to next_up:

  • Missing File Targets section — fix is terraform/main.tf lines 454-457
  • Missing Test Expectations: tofu plan -lock=false + Prometheus probe_success query
  • Missing Constraints — follow precedent from #117 / commit 4213fde
  • Missing Checklist — standard PR/test checklist
  • Blast radius: 3 other probes use external funnel URLs with same hairpin risk (pal-e-docs, westside-app, westside-dev) — scope decision needed
Author
Owner

Scope Correction (post-review)

Per review review-385-2026-03-26, expanding scope and adding file targets.

Expanded Scope

Batch all 4 hairpin probes, not just pal-e-app. Four blackbox probes use external Tailscale funnel URLs that fail TLS from inside the cluster (hairpin routing). Other probes already use internal URLs and work fine.

File Targets

All in pal-e-platform/terraform/main.tf within helm_release.blackbox_exporter values:

  • Line ~451: pal-e-docs → change from https://pal-e-docs.tail5b443a.ts.net/healthz to internal service URL
  • Line ~456: pal-e-app → change from https://pal-e-app.tail5b443a.ts.net to http://pal-e-app.pal-e-app.svc:3000
  • Line ~466: westside-app → change from https://westsidekingsandqueens.tail5b443a.ts.net to internal service URL
  • Line ~471: westside-dev → change from https://westside-dev.tail5b443a.ts.net to internal service URL

Files NOT to touch: Other probe targets already using internal URLs (basketball-api, platform-validation, forgejo, woodpecker, etc.)

Precedent

Issue #117 / commit 4213fde — same fix for Keycloak probe. Switch from external funnel URL to internal service URL.

Acceptance Criteria (updated)

  • All 4 probes switched to internal service URLs
  • EndpointDown alert clears for pal-e-app
  • All 14 blackbox probes succeed (no regression)
  • tofu plan shows only the 4 probe URL changes
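
The "tofu plan shows only the 4 probe URL changes" criterion can be partly automated. The sketch below is a minimal, unofficial helper that reads the JSON form of a saved plan (as produced by `tofu show -json <planfile>`) and reports which resources would change. It assumes the probe URLs live in the values of a root-module resource addressed as `helm_release.blackbox_exporter`; the function names are hypothetical. It only confirms that no other resource is touched, so the four URL edits inside the Helm values still need to be read in the plan diff by hand.

```python
import json


def load_plan(path):
    """Load a plan that was exported with `tofu show -json <planfile>`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def changed_addresses(plan):
    """Return sorted addresses of resources whose planned action is not a no-op."""
    return sorted(
        change["address"]
        for change in plan.get("resource_changes", [])
        if change["change"]["actions"] != ["no-op"]
    )


def only_expected_changes(plan, expected=("helm_release.blackbox_exporter",)):
    """True when the plan changes exactly the expected resources.

    The default address is an assumption based on the file targets in this
    issue; adjust it if the release is declared inside a module.
    """
    return changed_addresses(plan) == sorted(expected)
```

A plan that also modifies any other resource makes `only_expected_changes` return False, which is the signal to stop and inspect before applying.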
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-385-2026-03-26b

File targets verified — all 4 external funnel probe URLs confirmed at lines 451, 456, 466, 471 of terraform/main.tf. Precedent commit 4213fde confirmed. No NetworkPolicy blockers. No board dependencies.

Issues to address before next_up:

  • Exact internal URLs missing for 3 of 4 probes. Scope correction says "change to internal service URL" without specifying. Critical trap: westside-app and westside-dev services live in namespace westsidekingsandqueens, not their own namespace. Agent would guess wrong. Verified URLs:
    • http://pal-e-docs.pal-e-docs.svc.cluster.local:8000/healthz
    • http://pal-e-app.pal-e-app.svc.cluster.local:3000 (already correct in scope)
    • http://westside-app.westsidekingsandqueens.svc.cluster.local:3000
    • http://westside-dev.westsidekingsandqueens.svc.cluster.local:80
  • Test Expectations section missing — add tofu plan -lock=false + PromQL probe_success{job="blackbox"} verification
  • Checklist section missing — standard PR/test checklist
Author
Owner

Refinement: exact internal URLs (namespace trap)

Per review review-385-2026-03-26b:

Critical: Namespace Mismatch

Both westside services live in westsidekingsandqueens namespace, NOT in westside-app/westside-dev namespaces. An agent would construct wrong URLs without this.

Exact Internal URLs (verified by review agent)

  • Line ~451: pal-e-docs → http://pal-e-docs.pal-e-docs.svc.cluster.local:8000/healthz
  • Line ~456: pal-e-app → http://pal-e-app.pal-e-app.svc.cluster.local:3000
  • Line ~466: westside-app → http://westside-app.westsidekingsandqueens.svc.cluster.local:3000
  • Line ~471: westside-dev → http://westside-dev.westsidekingsandqueens.svc.cluster.local:80
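
To make the namespace trap concrete, here is a small illustrative sketch of how the four URLs above decompose into service, namespace, port, and path. The helper name and the table layout are hypothetical; the values are copied from the verified list. The point is that the namespace is an explicit input and cannot be derived from the probe name.

```python
def internal_service_url(service, namespace, port, path=""):
    """Build the cluster-internal HTTP URL for a Kubernetes Service."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}{path}"


# probe name -> (service, namespace, port, path), taken from the verified URLs.
# Both westside services share the westsidekingsandqueens namespace.
PROBE_TARGETS = {
    "pal-e-docs": ("pal-e-docs", "pal-e-docs", 8000, "/healthz"),
    "pal-e-app": ("pal-e-app", "pal-e-app", 3000, ""),
    "westside-app": ("westside-app", "westsidekingsandqueens", 3000, ""),
    "westside-dev": ("westside-dev", "westsidekingsandqueens", 80, ""),
}

INTERNAL_URLS = {
    name: internal_service_url(*target) for name, target in PROBE_TARGETS.items()
}

if __name__ == "__main__":
    for name, url in INTERNAL_URLS.items():
        print(f"{name}: {url}")
```

Guessing the namespace from the probe name would yield `westside-app.westside-app.svc.cluster.local`, which is exactly the wrong URL this refinement warns about.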

Test Expectations (added)

  • tofu plan -lock=false shows only the 4 URL changes
  • After apply: all 14 blackbox probes return probe_success == 1 (verify via Prometheus query or AlertManager clearing)
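
The post-apply check could be scripted along these lines. This is a sketch only: it assumes the probes are scraped under `job="blackbox"` (the label named in the review) and that Prometheus is reachable at a base URL you supply, which this issue does not specify. It uses the standard Prometheus instant-query endpoint `/api/v1/query`; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url, promql):
    """Run an instant query against the Prometheus HTTP API and return the JSON body."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def failing_probes(response):
    """Return sorted instance labels whose probe_success sample is not 1."""
    if response.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {response}")
    failing = []
    for sample in response["data"]["result"]:
        _timestamp, value = sample["value"]  # value arrives as a string
        if float(value) != 1.0:
            failing.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(failing)


def probe_count(response):
    """Number of series returned, to compare against the expected 14 probes."""
    return len(response["data"]["result"])
```

With a reachable Prometheus, `failing_probes(query_prometheus(base_url, 'probe_success{job="blackbox"}'))` should return an empty list and `probe_count` should report 14 once the fix is applied.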
Author
Owner

Scope Review: READY

Review note: review-385-2026-03-26c

Re-review after 3 scope corrections. All previous NEEDS_REFINEMENT issues resolved:

  • Namespace trap addressed: Comment #4 correctly documents westsidekingsandqueens namespace for both westside services
  • Exact internal URLs provided and independently verified against pal-e-deployments kustomization overlays (ports, namespaces, service names all confirmed)
  • Test Expectations added: tofu plan -lock=false + PromQL verification
  • File targets verified: All 4 external-URL probes confirmed at lines 483, 488, 498, 503 of terraform/main.tf (scope comments say ~451/~456/~466/~471 — off by ~30 lines, but probe names make targets unambiguous)

One minor gap: Checklist section still missing (agents follow standard PR workflow regardless). One suggestion for implementing agent: update code comment at line 481 ("external URLs") to reflect new internal URL strategy.
