bug: Woodpecker CI clone fails with TLS EOF on Tailscale funnel #107

Closed
opened 2026-03-18 16:06:10 +00:00 by forgejo_admin · 1 comment

Lineage

plan-pal-e-platform — platform reliability

Repo

forgejo_admin/pal-e-platform

User Story

As a developer
I want CI pipelines to clone reliably
So that merges deploy without manual retries

Context

During the westside Phase 15 session (2026-03-18), every Woodpecker CI pipeline required 2-3 retries due to TLS clone failures. The error is consistent:

fatal: unable to access 'https://forgejo.../repo.git/': TLS connect error: error:0A000126:SSL routines::unexpected eof while reading

Affected repos: basketball-api (pipelines #22-28), pal-e-platform (pipeline #108). The clone step fetches via HTTPS through the Tailscale funnel. Some runs clone successfully on retry, suggesting an intermittent TLS termination issue — not a permanent misconfiguration.

Observations

  • Failures happen on git fetch --filter=tree:0 (partial clone) — the promisor remote fetch fails mid-stream
  • Retries sometimes succeed immediately
  • Both push and pull_request events affected
  • Pod restarts (Forgejo has 5 restarts in 26d, Woodpecker agent has 3 in 36h) may indicate underlying instability

Possible Causes

  1. Tailscale funnel TLS termination dropping long-running connections
  2. Forgejo pod memory pressure causing connection resets
  3. Woodpecker clone plugin using --filter=tree:0 (partial clone) which requires multiple round-trips
  4. nftables/kube-router dropping established connections

File Targets

  • Investigate Forgejo pod resource limits
  • Check if Woodpecker can be configured to use cluster-internal URL instead of funnel
  • Consider adding retry logic to CI clone step
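
The retry idea in the last item could be sketched as a small POSIX shell wrapper. This is a minimal sketch, not the Woodpecker clone plugin's own behavior: the function name `retry` and its argument order are hypothetical, and it assumes the clone runs in a step where a shell is available.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or
# MAX_ATTEMPTS is reached, sleeping DELAY_SECONDS between attempts.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Run the command exactly as given; success ends the loop.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt $attempt failed, sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example call (assumes a remote named origin):
#   retry 3 5 git fetch --filter=tree:0 origin
```

A wrapper like this only hides the symptom. It would reduce manual retries, but it does not distinguish between the possible causes listed above.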

Acceptance Criteria

  • CI pipelines clone successfully on first attempt (>95% success rate)

Test Expectations

  • Run 5 consecutive pipeline triggers — all should clone successfully
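
To put a number on first-attempt success outside of CI, a local probe along these lines may help. It is a sketch under assumptions: `first_attempt_rate` is a hypothetical helper, and a local loop is only a proxy for real pipeline triggers. Note that with 5 runs a single failure already means 80%, so a larger sample is needed to say anything about a 95% threshold.

```shell
# first_attempt_rate RUNS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND RUNS times with no retries and
# prints "<successes>/<runs> <percent>%" (integer percent, rounded down).
first_attempt_rate() {
  runs=$1
  shift
  ok=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard command output; only the exit status matters here.
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    fi
    i=$((i + 1))
  done
  echo "$ok/$runs $((ok * 100 / runs))%"
}

# Example call (assumes an existing checkout with a remote named origin):
#   first_attempt_rate 20 git fetch --filter=tree:0 origin
```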

Constraints

  • Don't change Forgejo's external URL — other services depend on the funnel hostname

Checklist

  • Root cause identified
  • Fix implemented
  • Verified over multiple pipeline runs

Related

  • plan-pal-e-platform — platform hardening
  • sop-ci-pipeline-recovery — current workaround is manual retry
Author
Owner

Session Findings (2026-03-18)

The TLS fix applied in the previous session only helps server API calls — the git clone step uses a different TLS code path and still fails. Clone succeeds for pal-e-platform but fails for basketball-api and likely other repos.

Why pal-e-platform works but others don't

Needs investigation. One possible difference: pal-e-platform uses the forgejo remote name, while other repos use origin. It could also be a cert chain issue specific to certain Tailscale funnel paths.

Options (from previous session)

  1. Custom clone step — Add a custom clone step in each repo's .woodpecker.yaml that uses the internal Forgejo URL (http://forgejo-http.forgejo.svc.cluster.local:80)
  2. Change Forgejo ROOT_URL — Switch to internal URL + reverse proxy for external access
  3. Debug the funnel TLS — Why does pal-e-platform clone succeed but basketball-api doesn't?
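
For Option 1, the core of a custom clone step is swapping the funnel base URL for the internal one before cloning. The sketch below shows only that rewrite. The internal base comes from the option text above; the function name `to_internal_url` is hypothetical, and the external base is elided in this issue, so it is passed in as an argument rather than guessed.

```shell
# Internal Forgejo service URL, as given in Option 1.
INTERNAL_BASE="http://forgejo-http.forgejo.svc.cluster.local:80"

# to_internal_url CLONE_URL EXTERNAL_BASE
# Hypothetical helper: if CLONE_URL starts with EXTERNAL_BASE, print the
# same repo path under INTERNAL_BASE; otherwise print CLONE_URL unchanged.
to_internal_url() {
  case "$1" in
    "$2"/*) printf '%s/%s\n' "$INTERNAL_BASE" "${1#"$2"/}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}

# Example call (forgejo.example.test is a placeholder, not the real funnel host):
#   git clone "$(to_internal_url \
#     https://forgejo.example.test/forgejo_admin/pal-e-platform.git \
#     https://forgejo.example.test)"
```

Git's own `url.<base>.insteadOf` configuration performs the same kind of prefix rewrite and may be a simpler fit if the clone step can set git config. Either way, Forgejo's external URL stays unchanged, which respects the constraint stated in the issue.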

Impact

This blocks:

  • Issue #113 (Terraform state drift — 3 alerts)
  • basketball-api PR #100 (structured logging, merged but not deployed)
  • ALL future CI builds for non-platform repos

Recommended approach

Option 1 (custom clone step) is the lowest-risk immediate fix. Option 3 (debug funnel TLS) is the proper root cause investigation. Both can be parallelized.
