Bug: Woodpecker CI update-kustomize-tag fails — can't reach forgejo-http service #230
Reference
forgejo_admin/pal-e-platform#230
Type
Bug
Lineage
Possibly related to existing "Harbor connectivity timeout from Woodpecker CI agent" board item — same class of failure (CI agent can't reach in-cluster services).
Repo
forgejo_admin/pal-e-platform
What Broke
The update-kustomize-tag step in Woodpecker CI pipelines fails with "Connection refused" when trying to download the shared script from Forgejo's internal service (forgejo-http.forgejo.svc.cluster.local:80).
This blocks all deployments: images build and push to Harbor successfully, but the kustomize tag never gets updated, so ArgoCD never syncs.
Observed on:
- forgejo_admin/westside-app pipeline #147 (push to main after PR #146 merge)
- forgejo_admin/westside-app pipeline #146 (push to main after PR #145 merge): all steps after clone skipped
- forgejo_admin/basketball-api pipeline #214 (push to main after PR #213 merge): kustomize step skipped
Repro Steps
- update-kustomize-tag
- Connection refused to forgejo-http.forgejo.svc.cluster.local:80
Expected Behavior
The update-kustomize-tag step downloads the shared script from Forgejo, updates the image tag in pal-e-deployments, and commits, triggering ArgoCD sync.
Environment
- forgejo-http.forgejo.svc.cluster.local:80 (ClusterIP 10.43.106.198)
Acceptance Criteria
- update-kustomize-tag step succeeds in CI pipelines
Related
- project-pal-e-platform: platform infrastructure
Investigation Report
Summary
The update-kustomize-tag "Connection refused" failure was transient -- not caused by a persistent networking or configuration problem. The failed pipeline (#147) was retried and succeeded as pipeline #148 with all steps passing, including update-kustomize-tag.
Findings
1. Service is healthy and reachable
- forgejo-http ClusterIP service exists at 10.43.106.198:80 with a live endpoint (10.42.0.153:80)
- forgejo-646b68f9d4-hpc42 is Running (1/1), 0 restarts, started 2026-03-21T15:10:37Z
- Connection from the woodpecker namespace to forgejo-http.forgejo.svc.cluster.local:80 succeeded immediately
2. Network policies are correctly configured
- default-deny-ingress with explicit allow rules for: tailscale, woodpecker, monitoring, argocd
- Pipeline pods run in the woodpecker namespace (WOODPECKER_BACKEND_K8S_NAMESPACE=woodpecker), so they match the allow rule
- kubernetes.io/metadata.name=woodpecker matches the policy selector
3. Failure pattern suggests brief Forgejo unavailability during heavy load
- update-kustomize-tag was skipped (likely because it ran during a brief Forgejo hiccup at a later point)
- update-kustomize-tag wget got "Connection refused" -- Forgejo was momentarily unreachable during that specific step
- update-kustomize-tag succeeded
- update-kustomize-tag succeeded
- 2026-03-28T18:07:28Z, close to the failure window. Heavy agent activity (80+ agents mentioned in session notes for 2026-03-28) likely created load spikes
4. Not related to issue #226 (auth/404 problem)
- --header="Authorization: token ${FORGEJO_TOKEN}" to wget.
5. Node resources are fine
Root Cause (Most Likely)
Transient service unavailability during a period of heavy platform activity (the session notes mention 16+ PRs, 80+ agents, and Woodpecker server restart on 2026-03-28). The Forgejo pod itself didn't restart, but brief TCP connection refusal can occur when:
Resolution
- update-kustomize-tag)
Recommendation: Add retry logic to wget
To make the pipeline resilient to transient failures, add a retry loop to the wget command in the update-kustomize-tag step. The script itself (update-kustomize-tag.sh) already has retry logic for git push, but the initial wget to download the script has no retries.
Option A -- Add a shell retry loop around wget in the .woodpecker.yaml step.
Option B -- Use wget's built-in retry (-t 3 --waitretry=5).
Note: BusyBox wget (used in alpine/git) does not support -t or --waitretry. Option A (shell loop) is the correct approach for this image.
Not Related to Harbor Connectivity Issue
The Harbor connectivity board item is a separate class of failure. This was specifically about reaching Forgejo's HTTP service, which has correct NetworkPolicy rules already in place. The Harbor issue may involve a missing NetworkPolicy allow rule from woodpecker to harbor namespace (separate investigation needed).
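Returning to the retry recommendation, the shell-loop approach (Option A) could look roughly like the sketch below. It is a minimal illustration in POSIX sh, not the pipeline's actual step: the `retry` helper name, the `SCRIPT_URL` placeholder, and the output path are assumptions, and the 3 attempts with a 5 second delay simply mirror the numbers quoted for Option B. The real location of the shared script is not given in this issue.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS attempts have been made.
# (Hypothetical helper for illustration; not part of update-kustomize-tag.sh.)
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Success: stop retrying and report success to the caller.
        if "$@"; then
            return 0
        fi
        # Out of attempts: report failure so the CI step fails visibly.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "retry: attempt $attempt failed, sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Hypothetical usage in the update-kustomize-tag step. SCRIPT_URL is a
# placeholder for the shared script's URL on forgejo-http, which this
# issue does not spell out:
#
# retry 3 5 wget -q --header="Authorization: token ${FORGEJO_TOKEN}" \
#     -O /tmp/update-kustomize-tag.sh "$SCRIPT_URL"
```

Because the loop only relies on the command's exit status, it avoids any dependence on wget retry flags in the alpine/git image, and the step still fails with a non-zero status if Forgejo stays unreachable for every attempt.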