fix: switch 4 blackbox probes to internal service URLs #178

Merged
forgejo_admin merged 1 commit from 168-fix-blackbox-hairpin-probes into main 2026-03-26 23:12:14 +00:00

Summary

Blackbox exporter pods inside the cluster get TLS handshake failures (unexpected eof while reading) when probing external Tailscale funnel URLs due to hairpin routing. This switches all 4 affected probes to internal k8s service URLs, eliminating the TLS hairpin path entirely. Same fix pattern as Keycloak probe in #117 / commit 4213fde.

Changes

  • terraform/main.tf — switched 4 blackbox probe targets from external funnel URLs to internal service URLs:
    • pal-e-docs: https://pal-e-docs.tail5b443a.ts.net/healthz -> http://pal-e-docs.pal-e-docs.svc.cluster.local:8000/healthz
    • pal-e-app: https://pal-e-app.tail5b443a.ts.net -> http://pal-e-app.pal-e-app.svc.cluster.local:3000
    • westside-app: https://westsidekingsandqueens.tail5b443a.ts.net -> http://westside-app.westsidekingsandqueens.svc.cluster.local:3000
    • westside-dev: https://westside-dev.tail5b443a.ts.net -> http://westside-dev.westsidekingsandqueens.svc.cluster.local:80
  • Updated code comment from "external URLs" to "internal URLs — avoids TLS hairpin through Tailscale funnel"

tofu plan Output

  # helm_release.blackbox_exporter will be updated in-place
      - "url": "https://pal-e-docs.tail5b443a.ts.net/healthz"
      + "url": "http://pal-e-docs.pal-e-docs.svc.cluster.local:8000/healthz"

      - "url": "https://pal-e-app.tail5b443a.ts.net"
      + "url": "http://pal-e-app.pal-e-app.svc.cluster.local:3000"

      - "url": "https://westsidekingsandqueens.tail5b443a.ts.net"
      + "url": "http://westside-app.westsidekingsandqueens.svc.cluster.local:3000"

      - "url": "https://westside-dev.tail5b443a.ts.net"
      + "url": "http://westside-dev.westsidekingsandqueens.svc.cluster.local:80"

Plan: 0 to add, 1 to change (blackbox_exporter), 0 to destroy. Two other unrelated changes in plan (woodpecker pending-upgrade status, harbor netpol argocd ingress) are pre-existing drift, not introduced by this PR.

Test Plan

  • [x] tofu fmt clean (no changes)
  • [x] tofu plan -lock=false shows only the 4 probe URL changes on helm_release.blackbox_exporter
  • [ ] After apply: all 14 blackbox probes return probe_success == 1 via PromQL
  • [ ] EndpointDown alert for pal-e-app clears in AlertManager
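
The post-apply check above (every probe reporting probe_success == 1) could be scripted roughly as follows. This is a minimal sketch, not the method used in this PR: it assumes the standard Prometheus HTTP API instant-query response shape and assumes the probe target is carried in the `instance` label, which depends on the scrape relabeling in use. The sample payload is hand-written for illustration and is not real output from this cluster.

```python
def failing_probes(response: dict) -> list[str]:
    """Return the instance labels whose probe_success sample is not 1.

    `response` is the decoded JSON body of a Prometheus instant query
    for `probe_success` (GET /api/v1/query?query=probe_success).
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")

    failing = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # value arrives as a string, e.g. "1"
        if float(value) != 1.0:
            # Assumption: the probed URL is stored in the "instance" label.
            failing.append(series["metric"].get("instance", "<unknown>"))
    return sorted(failing)


# Hypothetical sample response, for illustration only.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"instance": "http://pal-e-docs.pal-e-docs.svc.cluster.local:8000/healthz"},
                "value": [1774566734.0, "1"],
            },
            {
                "metric": {"instance": "http://pal-e-app.pal-e-app.svc.cluster.local:3000"},
                "value": [1774566734.0, "0"],
            },
        ],
    },
}

print(failing_probes(SAMPLE_RESPONSE))
```

An empty list would correspond to the third Test Plan item being satisfied; any entries name the probes still failing.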

Review Checklist

  • [x] Passed automated review-fix loop
  • [x] No secrets committed
  • [x] No unnecessary file changes
  • [x] Commit messages are descriptive

Related

  • Closes #168
  • Precedent: #117 / commit 4213fde (Keycloak probe hairpin fix)
fix: switch 4 blackbox probes from external funnel URLs to internal service URLs
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
e651974694
Blackbox exporter inside the cluster gets TLS handshake failures when
probing external Tailscale funnel URLs (hairpin routing). Switch pal-e-docs,
pal-e-app, westside-app, and westside-dev probes to internal k8s service
URLs, matching the precedent from Keycloak fix in #117 / 4213fde.

Closes #168

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Author
Owner

Self-Review: APPROVED

Diff: 5 lines changed in 1 file (terraform/main.tf)

Verification

  • All 4 internal URLs match exactly what was specified in issue #168 comment #4 (the refinement with verified namespace-correct URLs)
  • Namespace trap handled correctly: both westside services use westsidekingsandqueens namespace, not westside-app/westside-dev
  • tofu fmt clean -- no formatting changes needed
  • tofu plan -lock=false confirms only helm_release.blackbox_exporter is affected by this change (4 URL swaps)
  • No secrets committed, no unrelated file changes
  • Untouched probes (forgejo, woodpecker, grafana, alertmanager, harbor, argocd, keycloak, minio-api, basketball-api, platform-validation) all already use internal URLs -- no regression risk

No issues found. Ready for merge.

Author
Owner

PR #178 Review

DOMAIN REVIEW

Tech stack: OpenTofu / Helm / k8s (Terraform HCL managing blackbox exporter probe targets via Helm values).

Internal URL correctness -- All 4 URLs verified against pal-e-deployments kustomize overlays and base service definitions:

| Probe | Service Name | Namespace | Port | Verified Source |
|-------|--------------|-----------|------|-----------------|
| pal-e-docs | pal-e-docs | pal-e-docs | 8000 | Base service default (not overridden in kustomization) |
| pal-e-app | pal-e-app | pal-e-app | 3000 | Kustomize port override in overlays/pal-e-app/prod/kustomization.yaml |
| westside-app | westside-app | westsidekingsandqueens | 3000 | Kustomize port override in overlays/westsidekingsandqueens/prod/kustomization.yaml |
| westside-dev | westside-dev | westsidekingsandqueens | 80 | Explicit in overlays/westsidekingsandqueens/dev/service.yaml |

All 4 internal URLs are correctly formed following the http://<service>.<namespace>.svc.cluster.local:<port> pattern.
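
As a quick illustration of that pattern, the sketch below rebuilds the four probe URLs from the service name, namespace, and port values listed in this review. The helper name `internal_service_url` and the `PROBES` mapping are hypothetical and do not come from the repository; only the resulting URLs appear in the PR.

```python
def internal_service_url(service: str, namespace: str, port: int, path: str = "") -> str:
    """Build http://<service>.<namespace>.svc.cluster.local:<port><path>."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}{path}"


# probe name -> (service, namespace, port, path), as verified in this review.
PROBES = {
    "pal-e-docs": ("pal-e-docs", "pal-e-docs", 8000, "/healthz"),
    "pal-e-app": ("pal-e-app", "pal-e-app", 3000, ""),
    # Both westside services live in the westsidekingsandqueens namespace.
    "westside-app": ("westside-app", "westsidekingsandqueens", 3000, ""),
    "westside-dev": ("westside-dev", "westsidekingsandqueens", 80, ""),
}

urls = {name: internal_service_url(*parts) for name, parts in PROBES.items()}

for name, url in urls.items():
    print(f"{name}: {url}")
```

Keeping the namespace as an explicit argument makes the westside case visible: the namespace differs from the service name, which is the easy mistake to make here.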

Blackbox module compatibility -- The helm chart uses the default http_2xx module for all targets. Switching from https:// external to http:// internal requires no module config change. All existing platform probes (forgejo, woodpecker, grafana, alertmanager, harbor, argocd, keycloak, minio) already use this same http:// internal pattern.

Comment accuracy -- The updated comment ("internal URLs -- avoids TLS hairpin through Tailscale funnel") is now accurate. The previous comment ("external URLs -- validates full funnel path") was already partially wrong since basketball-api and platform-validation were already on internal URLs within the same block.

Precedent -- PR references #117 / commit 4213fde (Keycloak hairpin fix) as the established pattern. Consistent approach.

No state-breaking changes -- tofu plan output confirms 0 add, 1 in-place update (blackbox_exporter helm release), 0 destroy. Safe change.

BLOCKERS

None.

This is a pure configuration change (4 URL string swaps in Helm values). No new functionality requiring tests. No user input. No secrets. No auth logic.

NITS

  1. pal-e-docs healthz path: The probe keeps /healthz in the internal URL (http://pal-e-docs.pal-e-docs.svc.cluster.local:8000/healthz), which is correct -- pal-e-docs exposes a health endpoint there. The other probes hit root or no path, which is fine for static SvelteKit apps. No action needed, just noting the intentional asymmetry.

  2. Minor: funnel-path monitoring gap: With all probes on internal URLs, nothing synthetically monitors the Tailscale funnel TLS path itself. This is a known and accepted tradeoff (the funnel path was causing false-positive EndpointDown alerts due to hairpin failures). If external endpoint monitoring is wanted later, a separate probe set whose alert rule uses a longer "for" duration could be added. Not blocking -- just noting for future consideration.

SOP COMPLIANCE

  • [x] Branch named after issue: 168-fix-blackbox-hairpin-probes references issue #168
  • [x] PR body follows template: Summary, Changes, tofu plan Output, Test Plan, Review Checklist, Related sections all present
  • [x] Related references parent issue: "Closes #168" and cites precedent #117
  • [x] tofu plan output included (PR convention for Terraform changes)
  • [x] tofu fmt confirmed clean (Test Plan checkbox)
  • [x] -lock=false used in plan (per memory: feedback_tofu_lock_false.md)
  • [x] No secrets committed (no credentials, tokens, or .env files in diff)
  • [x] No unnecessary file changes (1 file, 5 additions, 5 deletions -- all scoped to the bug fix)
  • [x] Commit messages are descriptive

PROCESS OBSERVATIONS

  • MTTR impact: This fix directly improves Mean Time To Recovery by eliminating false-positive EndpointDown alerts. When the alert fires now, it means the service is actually down, not that the Tailscale funnel hairpin failed. Signal-to-noise improvement for on-call response.
  • Change failure risk: Very low. In-place Helm values update, no resource creation or destruction. Rollback is a single tofu apply reverting the URLs.
  • Pattern consolidation: After this PR, all 14 blackbox probes use internal svc.cluster.local URLs. Zero external funnel probes remain. Consistent, predictable configuration.
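
The "zero external funnel probes remain" claim could be guarded with a small check like the one below. It is a sketch under stated assumptions: the function name `external_targets` is hypothetical, and the target lists are just the four before/after URLs from this PR rather than the full set of 14 probes read from terraform/main.tf.

```python
from urllib.parse import urlsplit

CLUSTER_SUFFIX = ".svc.cluster.local"


def external_targets(urls: list[str]) -> list[str]:
    """Return the probe URLs whose host is not an in-cluster service name."""
    flagged = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if not host.endswith(CLUSTER_SUFFIX):
            flagged.append(url)
    return flagged


# The four targets before and after this PR.
BEFORE = [
    "https://pal-e-docs.tail5b443a.ts.net/healthz",
    "https://pal-e-app.tail5b443a.ts.net",
    "https://westsidekingsandqueens.tail5b443a.ts.net",
    "https://westside-dev.tail5b443a.ts.net",
]
AFTER = [
    "http://pal-e-docs.pal-e-docs.svc.cluster.local:8000/healthz",
    "http://pal-e-app.pal-e-app.svc.cluster.local:3000",
    "http://westside-app.westsidekingsandqueens.svc.cluster.local:3000",
    "http://westside-dev.westsidekingsandqueens.svc.cluster.local:80",
]

print(external_targets(BEFORE))  # all four funnel URLs are flagged
print(external_targets(AFTER))   # nothing is flagged
```

If a dedicated funnel probe set is added later, such a check would need an explicit allow-list rather than a blanket suffix rule.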

VERDICT: APPROVED

forgejo_admin force-pushed 168-fix-blackbox-hairpin-probes from e651974694 to 9eb107cc68 2026-03-26 23:00:22 +00:00
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
ci/woodpecker/pr/woodpecker Pipeline was successful
ci/woodpecker/pull_request_closed/woodpecker Pipeline was successful
forgejo_admin deleted branch 168-fix-blackbox-hairpin-probes 2026-03-26 23:12:14 +00:00