Feature: Split-horizon DNS — prevent intra-cluster TLS hairpin through DERP relays #138

Closed
opened 2026-03-21 19:55:33 +00:00 by forgejo_admin · 0 comments

Type

Feature

Lineage

plan-pal-e-platform — discovered during CI pipeline investigation (#133)

Repo

forgejo_admin/pal-e-platform

User Story

As the platform operator
I want intra-cluster traffic to Tailscale funnel hostnames to stay inside the cluster
So that server-to-server communication (OAuth, API calls, image pushes) is reliable and doesn't hairpin through the public internet

Context

All Tailscale funnel hostnames (e.g., forgejo.tail5b443a.ts.net) resolve to public DERP relay IPs (208.111.35.209, 208.111.34.11), even from inside the cluster. As a result, intra-cluster HTTPS traffic hairpins through the public internet, with a ~66% TLS failure rate.

Proven impact:

  • Woodpecker OAuth token refresh fails (Post "https://forgejo.tail5b443a.ts.net/login/oauth/access_token": EOF) → dead token → can't read .woodpecker.yaml from private repos → "no steps" on PR events
  • CI clone steps failed before internal URL override (#133)
  • Potentially affects Harbor pushes, Keycloak auth, and any cross-service communication via funnel hostnames

Root cause proven via DNS test from inside cluster: nslookup forgejo.tail5b443a.ts.net returns 208.111.35.209 (public DERP IP), not the internal ClusterIP 10.43.106.198.
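
The DNS finding above can be turned into a small pass/fail check. This is a minimal sketch: the function names are hypothetical, and the 10.43.0.0/16 service range is an assumption inferred from the ClusterIP 10.43.106.198 quoted above, so adjust it if the cluster uses a different service CIDR.

```shell
#!/bin/sh
# Classify an IPv4 address as inside or outside the cluster service range.
# Assumption: ClusterIPs live in 10.43.0.0/16 (inferred from 10.43.106.198).
is_cluster_ip() {
    case "$1" in
        10.43.*.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a one-line verdict for a hostname and the address it resolved to.
check_resolution() {
    # $1 = hostname, $2 = resolved address
    if is_cluster_ip "$2"; then
        echo "$1 -> $2 (internal)"
    else
        echo "$1 -> $2 (external, traffic will leave the cluster)"
    fi
}

# The two addresses reported in this issue:
check_resolution forgejo.tail5b443a.ts.net 208.111.35.209
check_resolution forgejo.tail5b443a.ts.net 10.43.106.198
```

The address passed in would be whatever nslookup prints when run from a cluster pod; parsing nslookup output is left out because its format varies between implementations.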

File Targets

Files to modify:

  • terraform/main.tf — add Forgejo internal HTTPS service, cert secret, Helm values for TLS
  • terraform/main.tf — CoreDNS ConfigMap customization resource
  • terraform/main.tf — CronJob for cert renewal via tailscale cert

Files NOT to touch:

  • .woodpecker.yaml — clone fix is separate (PR #134)
  • terraform/network-policies.tf — NetworkPolicies are correct, not the issue
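
For the CoreDNS customization listed above, one possible shape is a dedicated server block that answers only the funnel hostname from a static hosts entry. The helper below is a hedged sketch that prints such a block; the function name is hypothetical, and how the block gets loaded (for example through a custom ConfigMap that the main Corefile imports) depends on the cluster distribution and is not specified in this issue.

```shell
#!/bin/sh
# Print a CoreDNS server block that answers one hostname with a fixed
# internal IP. Scoping the block to a single zone leaves all other DNS
# resolution in the default server block untouched.
coredns_override() {
    # $1 = hostname, $2 = internal ClusterIP
    cat <<EOF
$1:53 {
    errors
    hosts {
        $2 $1
    }
}
EOF
}

# Values taken from this issue:
coredns_override forgejo.tail5b443a.ts.net 10.43.106.198
```

Generating the block from a hostname and IP pair keeps the pattern repeatable for Harbor, Keycloak, and later funnels. A hard-coded ClusterIP goes stale if the Service is recreated, so the Terraform resource would ideally read the IP from the Service itself.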

Acceptance Criteria

  • From a cluster pod, nslookup forgejo.tail5b443a.ts.net returns internal ClusterIP
  • Woodpecker OAuth token refresh succeeds reliably (server logs show no TLS EOF)
  • PR events work for private repos (Woodpecker can read config from Forgejo API)
  • External browser access still works through Tailscale funnel (no regression)
  • Pattern documented and replicable for Harbor, Keycloak, other funnels

Test Expectations

  • DNS test: kubectl run dns-test --rm -it --image=alpine -- nslookup forgejo.tail5b443a.ts.net returns internal IP
  • TLS test: kubectl run tls-test --rm -it --image=alpine/curl -- curl -sI https://forgejo.tail5b443a.ts.net succeeds 5/5
  • OAuth test: Woodpecker server logs show successful token refresh
  • Pipeline test: PR event on pal-e-platform creates steps (not "no steps" error)
  • External test: Browser can still access https://forgejo.tail5b443a.ts.net
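
The "succeeds 5/5" expectation in the TLS test can be scripted rather than counted by hand. A minimal sketch, assuming a shell inside a cluster pod that has curl available; the helper name count_successes is hypothetical.

```shell
#!/bin/sh
# Run a command N times and report how many attempts exited successfully.
count_successes() {
    attempts=$1
    shift
    ok=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
        i=$((i + 1))
    done
    echo "$ok/$attempts"
}

# Intended use from inside a cluster pod:
#   count_successes 5 curl -sI --max-time 10 https://forgejo.tail5b443a.ts.net
# Demonstration with a command that always succeeds:
count_successes 5 true
```

curl exits non-zero when the TLS connection fails, so any result below 5/5 indicates the hairpin problem is still present.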

Constraints

  • Cert must match forgejo.tail5b443a.ts.net hostname for TLS verification
  • tailscale cert generates Let's Encrypt certs for Tailscale hostnames — use this
  • Cert renewal needs automation (CronJob or similar)
  • CoreDNS customization must not break other DNS resolution
  • Same pattern must be extensible to Harbor, Keycloak, and future funnels
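
The renewal constraint could be met by a CronJob that runs a script along these lines. This is a sketch under stated assumptions, not the implemented job: the secret name, namespace, and working directory are hypothetical, and the container is assumed to have the tailscale and kubectl CLIs plus access to a running tailscaled. It defaults to dry-run mode and only prints the commands.

```shell
#!/bin/sh
# Sketch of a cert renewal step. All defaults below except DOMAIN are
# hypothetical placeholders, not values from the issue.
DOMAIN=${DOMAIN:-forgejo.tail5b443a.ts.net}
SECRET=${SECRET:-forgejo-internal-tls}
NAMESPACE=${NAMESPACE:-forgejo}
WORKDIR=${WORKDIR:-/tmp/certs}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = execute them

# Execute a command, or just print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

renew_cert() {
    run mkdir -p "$WORKDIR"
    # Ask tailscaled for a Let's Encrypt cert matching the funnel hostname.
    run tailscale cert --cert-file "$WORKDIR/tls.crt" \
        --key-file "$WORKDIR/tls.key" "$DOMAIN"
    # Create or update the TLS secret without deleting it first.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ kubectl create secret tls $SECRET ... | kubectl apply -f -"
    else
        kubectl -n "$NAMESPACE" create secret tls "$SECRET" \
            --cert "$WORKDIR/tls.crt" --key "$WORKDIR/tls.key" \
            --dry-run=client -o yaml | kubectl apply -f -
    fi
}

renew_cert
```

Piping the client-side dry-run output into kubectl apply updates the secret in place, so there is no window in which the secret is missing. Whether Forgejo picks up the renewed cert without a restart is not covered here.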

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • #133 — CI clone fix (workaround: internal URL in .woodpecker.yaml)
  • #127 — kube-router ipset sync (separate issue, not root cause here)
  • plan-pal-e-platform — should become new phase: Split-Horizon DNS
forgejo_admin 2026-03-21 20:00:23 +00:00
Reference
forgejo_admin/pal-e-platform#138