Bug: Harbor unreachable from CI pods — investigate and fix connectivity #135
Reference
forgejo_admin/pal-e-platform#135
Type
Bug
Lineage
`plan-pal-e-platform`: standalone, discovered during operations/monitoring
Repo
`forgejo_admin/pal-e-platform` (NetworkPolicy + Harbor config), plus service repos with `.woodpecker.yaml` push steps
User Story
As a CI pipeline, I want to reach Harbor from build pods, so that image push/pull works reliably.
What Broke
Harbor container registry is unreachable from Woodpecker CI pods. Service repo pipelines that push images to Harbor fail with:
(from westside-app pipeline #73; pipeline #74 succeeded on the same commit, confirming intermittent failure)
Current state of Harbor URLs across repos:

| Harbor URL | `insecure` |
| --- | --- |
| `harbor.harbor.svc.cluster.local` | `true` |
| `harbor.harbor.svc.cluster.local` | `true` |
| `harbor.harbor.svc.cluster.local` | `true` |
| `harbor.harbor.svc.cluster.local` | `true` |
| `harbor.tail5b443a.ts.net` | |
| `harbor.tail5b443a.ts.net` | |
| `harbor.tail5b443a.ts.net` | |

NetworkPolicy (`network-policies.tf:89`): woodpecker -> harbor is explicitly allowed in Terraform, BUT the Harbor namespace currently has NO NetworkPolicy deployed (`kubectl get networkpolicies -n harbor` returns empty). This is Terraform state drift: the policy is defined but not applied. When `tofu apply` eventually runs, the policy will activate. This is a latent state drift, not a policy blocking issue.
Repro Steps
Expected Behavior
CI pods should push images to Harbor via the internal cluster URL (`harbor.harbor.svc.cluster.local`) without TLS issues.
Environment
File Targets
- `terraform/network-policies.tf`: Harbor netpol block (lines 75-95), woodpecker-to-harbor ingress rule
- `mcd-tracker-api/.woodpecker.yaml`: migrate Harbor URL from external to internal (child issue needed)
- `mcd-tracker-app/.woodpecker.yaml`: migrate Harbor URL from external to internal (child issue needed)
- `minio-api/.woodpecker.yaml`: migrate Harbor URL from external to internal (child issue needed)
Acceptance Criteria
- `harbor.harbor.svc.cluster.local` + `insecure: true` pattern
Test Expectations
- `kubectl get networkpolicies -n harbor` shows the policy after `tofu apply`
Constraints
Checklist
Related
Update from parallel session: The problem is wider than Harbor. CI pods can't resolve DNS at all: `pip install` fails with "Temporary failure in name resolution." This is a complete network failure in woodpecker agent pods, not just Harbor connectivity.
Root cause is almost certainly kube-router ipset sync (#127). The `sleep 2` workaround in the clone step only helps that one container. Later step containers (build, pip install) may hit the same NetworkPolicy race.
This likely needs:
- `sleep 2` at the start of EVERY step (ugly but works), OR
Upgrading priority: this blocks ALL CI across ALL service repos.
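As a bounded alternative to a fixed `sleep 2`, each step could retry a cheap network probe until the pod's networking is ready. The sketch below is only an illustration: the `retry` helper is hypothetical, and using `getent hosts` as the probe assumes that tool exists in the step image. It works around the race rather than fixing the kube-router behaviour tracked in #127.

```shell
# retry <attempts> <delay-seconds> <command...>
# Hypothetical helper: runs the command until it succeeds or attempts run out.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example use at the top of a step (host taken from this issue; the probe
# command is an assumption about what the step image provides):
#   retry 10 1 getent hosts harbor.harbor.svc.cluster.local
```

Unlike an unconditional sleep, this returns as soon as the probe succeeds and fails the step with a clear message if networking never comes up.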
Scope Review: NEEDS_REFINEMENT
Review note: `review-254-2026-03-22`
Pipeline logs confirm failure is intermittent (westside-app #73 failed, #74 succeeded on same commit), pointing to kube-router ipset stale (#127) as probable root cause rather than a config issue.
Issues found:
- `dial tcp 10.43.131.178:443: i/o timeout` pattern is the ipset signature. URL migration alone may not fix this.
- `harbor.tail5b443a.ts.net` without `insecure` flag, same pattern as mcd-tracker repos.
- `kubectl get networkpolicies -n harbor` returns empty despite being defined in Terraform. Ticket assumes it is active.
Root Cause Analysis (2026-03-24)
Root cause identified: No split-horizon DNS or internal URL convention for Harbor.
The platform pipeline uses internal URLs everywhere:
- `forgejo-http.forgejo.svc.cluster.local:80` for git clone/API
- `minio.minio.svc.cluster.local:9000` for MinIO state
But service repo pipelines push images to `harbor.tail5b443a.ts.net` (external URL). Inside the cluster, this routes through Tailscale DERP relay, which is unreliable and slow.
The NetworkPolicy already allows `woodpecker → harbor` ingress (network-policies.tf line 89). This is a DNS/routing issue, not a firewall issue.
Fix Options
1. CoreDNS rewrite rule: add `rewrite name harbor.tail5b443a.ts.net harbor-core.harbor.svc.cluster.local` to CoreDNS config. One-time platform fix but k3s manages CoreDNS and may overwrite.
2. Service repo convention: update each repo's `.woodpecker.yaml` to use `harbor-core.harbor.svc.cluster.local` as the registry URL.
3. Woodpecker global env: set a `HARBOR_INTERNAL` env var in Woodpecker agent config, reference it in service repo pipelines.
Recommendation: Option 3 + 2. Centralize the URL, then migrate repos.
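A rough sketch of how Option 3 + 2 might look from a shell, assuming `HARBOR_INTERNAL` is visible in the step's environment. The function names and the fallback to the external host are assumptions for illustration. How Woodpecker injects the variable, and whether a given push step also needs `insecure: true`, is not covered here.

```shell
# Hypothetical helpers; names are illustrative, not taken from any repo in this issue.

# Print the registry host a step should use: HARBOR_INTERNAL if set and
# non-empty, otherwise the external Tailscale host (assumed fallback).
harbor_registry() {
  printf '%s\n' "${HARBOR_INTERNAL:-harbor.tail5b443a.ts.net}"
}

# Print <file> with every occurrence of the external Harbor host replaced by
# <internal-host>. Writes to stdout so the change can be reviewed before
# anything is overwritten.
migrate_harbor_url() {
  file=$1
  internal=$2
  # Dots are escaped so sed matches the hostname literally.
  external='harbor\.tail5b443a\.ts\.net'
  sed "s/${external}/${internal}/g" "$file"
}

# Example review of a migration in one service repo:
#   migrate_harbor_url .woodpecker.yaml harbor-core.harbor.svc.cluster.local \
#     | diff .woodpecker.yaml -
```

Keeping the rewrite on stdout makes it easy to diff each repo's pipeline file before committing the change in the corresponding child issue.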