ldraney/pal-e-platform

Bug: Woodpecker CI update-kustomize-tag fails — can't reach forgejo-http service #230

Open

opened 2026-03-28 19:17:24 +00:00 by forgejo_admin · 1 comment

forgejo_admin commented

2026-03-28 19:17:24 +00:00

Contributor

Type

Bug

Lineage

Possibly related to existing "Harbor connectivity timeout from Woodpecker CI agent" board item — same class of failure (CI agent can't reach in-cluster services).

Repo

forgejo_admin/pal-e-platform

What Broke

The update-kustomize-tag step in Woodpecker CI pipelines fails with connection refused when trying to download the shared script from Forgejo's internal service:

wget -O /tmp/update-kustomize-tag.sh --header="Authorization: token " \
  "http://forgejo-http.forgejo.svc.cluster.local:80/forgejo_admin/pal-e-platform/raw/branch/main/scripts/update-kustomize-tag.sh"
wget: can't connect to remote host (10.43.106.198): Connection refused

This blocks all deployments — images build and push to Harbor successfully, but the kustomize tag never gets updated so ArgoCD never syncs.

Observed on:

forgejo_admin/westside-app pipeline #147 (push to main after PR #146 merge)
forgejo_admin/westside-app pipeline #146 (push to main after PR #145 merge) — all steps after clone skipped
forgejo_admin/basketball-api pipeline #214 (push to main after PR #213 merge) — kustomize step skipped

Repro Steps

1. Merge any PR to main on westside-app or basketball-api
2. Woodpecker triggers pipeline on main branch
3. All steps succeed until update-kustomize-tag
4. Step fails with Connection refused to forgejo-http.forgejo.svc.cluster.local:80

Expected Behavior

The update-kustomize-tag step downloads the shared script from Forgejo, updates the image tag in pal-e-deployments, and commits — triggering ArgoCD sync.
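
The real update-kustomize-tag.sh is not shown in this issue. As a rough sketch of the "updates the image tag" part only, assuming the deployments repo keeps a kustomization.yaml whose image entries carry a `newTag:` field (the function name `set_new_tag` and the file layout are hypothetical):

```shell
# Hypothetical sketch; the real update-kustomize-tag.sh is not shown in this issue.
# Rewrites every "newTag:" line in a kustomization.yaml to the given tag.
# Usage: set_new_tag <kustomization-file> <new-tag>
set_new_tag() {
  file="$1"
  tag="$2"
  # Keep the original indentation and key, replace only the value.
  # Assumes the tag contains no "|" or "&" (true for commit SHAs).
  sed "s|^\([[:space:]]*newTag:\).*|\1 ${tag}|" "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}
```

A step could call it as `set_new_tag path/to/kustomization.yaml "$CI_COMMIT_SHA"` before committing, where the path is a placeholder. A file with several image entries would need a more targeted match than this sketch provides.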

Environment

Cluster/namespace: prod / woodpecker namespace
Service version: current Woodpecker agent
Target service: forgejo-http.forgejo.svc.cluster.local:80 (ClusterIP 10.43.106.198)
Related alerts: none detected

Acceptance Criteria

update-kustomize-tag step succeeds in CI pipelines
Merges to main trigger full build→push→deploy cycle
No regression in other CI steps

Related

project-pal-e-platform — platform infrastructure
Existing board item: "Bug: Harbor connectivity timeout from Woodpecker CI agent" — possibly same root cause (CI agent networking)

forgejo_admin commented

2026-03-28 19:22:58 +00:00

Author

Contributor

Investigation Report

Summary

The update-kustomize-tag "Connection refused" failure was transient and was not caused by a persistent networking or configuration problem. Retrying the failed pipeline (#147) produced pipeline #148, in which every step passed, including update-kustomize-tag.

Findings

1. Service is healthy and reachable

forgejo-http ClusterIP service exists at 10.43.106.198:80 with a live endpoint (10.42.0.153:80)
Forgejo pod forgejo-646b68f9d4-hpc42 is Running (1/1), 0 restarts, started 2026-03-21T15:10:37Z
Live connectivity test from a pod in the woodpecker namespace to forgejo-http.forgejo.svc.cluster.local:80 succeeded immediately
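
The checks above can be repeated with something like the following sketch. The function name, the `KUBECTL` override variable, the test pod name `forgejo-conn-test`, and the busybox image are assumptions for illustration, not taken from the investigation itself.

```shell
# Sketch of the service health checks. KUBECTL can be overridden
# (for example KUBECTL=echo) to preview the commands without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

check_forgejo_service() {
  # Does the Service exist, and is there at least one endpoint behind it?
  $KUBECTL get service forgejo-http -n forgejo
  $KUBECTL get endpoints forgejo-http -n forgejo
  # Is the Forgejo pod Running, and has it restarted?
  $KUBECTL get pods -n forgejo
  # One-off connectivity test from the woodpecker namespace.
  # The pod name and image are assumptions.
  $KUBECTL run forgejo-conn-test -n woodpecker --rm -i --restart=Never \
    --image=busybox -- \
    wget -q -O /dev/null "http://forgejo-http.forgejo.svc.cluster.local:80/"
}
```

Running `check_forgejo_service` during a failure window is more informative than running it afterwards. Even a non-2xx HTTP status from the last command would show that the TCP connection itself works.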

2. Network policies are correctly configured

Forgejo namespace has default-deny-ingress with explicit allow rules for: tailscale, woodpecker, monitoring, argocd
Woodpecker pipeline pods run in the woodpecker namespace (WOODPECKER_BACKEND_K8S_NAMESPACE=woodpecker), so they match the allow rule
Namespace labels are correct: kubernetes.io/metadata.name=woodpecker matches the policy selector
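
A minimal way to re-check the policy side, assuming kubectl access to the cluster (the function name and the `KUBECTL` override variable are hypothetical):

```shell
# Sketch of the NetworkPolicy checks. KUBECTL can be overridden
# (for example KUBECTL=echo) to preview the commands without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

check_forgejo_netpol() {
  # List the policies in the forgejo namespace
  # (default-deny-ingress plus the explicit allow rules).
  $KUBECTL get networkpolicy -n forgejo
  # Show the labels on the woodpecker namespace; the allow rule is
  # expected to select kubernetes.io/metadata.name=woodpecker.
  $KUBECTL get namespace woodpecker --show-labels
}
```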

3. Failure pattern suggests brief Forgejo unavailability during heavy load

basketball-api #214: the clone step connected to forgejo-http successfully (git fetch worked), but update-kustomize-tag was skipped (likely a knock-on effect of a brief Forgejo hiccup later in the run)
westside-app #147: clone and build succeeded, but update-kustomize-tag wget got "Connection refused" -- Forgejo was momentarily unreachable during that specific step
westside-app #143 (earlier): all steps including update-kustomize-tag succeeded
basketball-api #215 (later): all steps including update-kustomize-tag succeeded
The Woodpecker server itself restarted at 2026-03-28T18:07:28Z, close to the failure window. Heavy agent activity (80+ agents mentioned in session notes for 2026-03-28) likely created load spikes.

4. Not related to issue #226 (auth/404 problem)

Issue #226 was a different bug (missing auth header causing 404 on private repo). That was fixed by adding --header="Authorization: token ${FORGEJO_TOKEN}" to wget.
This issue #230 is "Connection refused" at the TCP level -- the connection never reached Forgejo's HTTP layer
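
To make the distinction between the two failure modes concrete, here is a small sketch that sorts a wget error line into TCP-level or HTTP-level. The TCP pattern comes from the log in this issue. The HTTP pattern assumes BusyBox wget's "server returned error" wording, and the function name is hypothetical.

```shell
# Classify a wget error line as a TCP-level or HTTP-level failure.
# The "server returned error" pattern is an assumption about BusyBox wget output.
classify_wget_error() {
  case "$1" in
    *"can't connect to remote host"*|*"Connection refused"*)
      # The connection never reached the HTTP layer (this issue, #230).
      echo "tcp" ;;
    *"server returned error"*)
      # The server answered with an error status (the #226 class of problem).
      echo "http" ;;
    *)
      echo "unknown" ;;
  esac
}
```

For example, passing the error line from pipeline #147 to `classify_wget_error` prints `tcp`.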

5. Node resources are fine

archbox: 23% CPU, 20% memory
Forgejo pod: 227Mi / 2Gi memory, no CPU limit

Root Cause (Most Likely)

Transient service unavailability during a period of heavy platform activity (the session notes mention 16+ PRs, 80+ agents, and a Woodpecker server restart on 2026-03-28). The Forgejo pod itself didn't restart, but a brief TCP connection refusal can occur when:

The Forgejo process is under heavy git/API load and cannot accept new connections
kube-proxy iptables rules are briefly stale during high pod churn on the node

Resolution

Retried westside-app #147 -> pipeline #148 succeeded (all steps including update-kustomize-tag)
No code changes needed -- the pipeline config and networking are correct

Recommendation: Add retry logic to wget

To make the pipeline resilient to transient failures, add a retry loop to the wget command in the update-kustomize-tag step. The script itself (update-kustomize-tag.sh) already has retry logic for git push, but the initial wget to download the script has no retries.

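Option A -- Wrap wget in a shell retry loop in the .woodpecker.yaml step:

commands:
  - |
    # Literal block with backslash continuations so wget and its arguments stay one command.
    # Try the download up to 3 times, waiting 5 seconds after each failure.
    ok=0
    for i in 1 2 3; do
      if wget -O /tmp/update-kustomize-tag.sh \
        --header="Authorization: token ${FORGEJO_TOKEN}" \
        "http://forgejo-http.forgejo.svc.cluster.local:80/forgejo_admin/pal-e-platform/raw/branch/main/scripts/update-kustomize-tag.sh"; then
        ok=1; break
      fi
      sleep 5
    done
    # Fail the step instead of running a missing or partial script.
    [ "$ok" = 1 ] || exit 1
  - sh /tmp/update-kustomize-tag.sh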

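Option B -- Use GNU wget's built-in retry (-t 3 --waitretry=5 --retry-connrefused):

commands:
  - |
    # GNU wget only. By default it treats "Connection refused" as fatal,
    # so --retry-connrefused is needed for -t to retry this failure.
    wget -t 3 --waitretry=5 --retry-connrefused \
      -O /tmp/update-kustomize-tag.sh \
      --header="Authorization: token ${FORGEJO_TOKEN}" \
      "http://forgejo-http.forgejo.svc.cluster.local:80/forgejo_admin/pal-e-platform/raw/branch/main/scripts/update-kustomize-tag.sh"
  - sh /tmp/update-kustomize-tag.sh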

Note: BusyBox wget (used in alpine/git) does not support -t or --waitretry. Option A (shell loop) is the correct approach for this image.
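
If several steps need the same behavior, the shell loop could be factored into a small helper. The following is a sketch in POSIX sh, which should also run under BusyBox ash. The function name `retry` and its argument order are assumptions, not part of the existing pipeline.

```shell
# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
# Returns 0 on success, 1 if every attempt failed.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $n/$attempts failed: $*" >&2
    # Only sleep if another attempt is coming.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}
```

A step could then run `retry 3 5 wget -O /tmp/update-kustomize-tag.sh --header="Authorization: token ${FORGEJO_TOKEN}" "$SCRIPT_URL"`, where `SCRIPT_URL` is a hypothetical variable holding the raw script URL. Because the helper returns non-zero after the last failed attempt, the step fails visibly instead of going on to run a missing script.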

Not Related to Harbor Connectivity Issue

The Harbor connectivity board item is a separate class of failure. This issue was specifically about reaching Forgejo's HTTP service, which already has correct NetworkPolicy rules in place. The Harbor issue may involve a missing NetworkPolicy allow rule from the woodpecker namespace to the harbor namespace (separate investigation needed).
