Bug: Harbor unreachable from CI pods — investigate and fix connectivity #135

Closed
opened 2026-03-21 17:39:18 +00:00 by forgejo_admin · 3 comments

Type

Bug

Lineage

plan-pal-e-platform — standalone, discovered during operations/monitoring

Repo

forgejo_admin/pal-e-platform (NetworkPolicy + Harbor config), plus service repos with .woodpecker.yaml push steps

User Story

As a CI pipeline, I want to reach Harbor from build pods, so that image push/pull works reliably.

What Broke

Harbor container registry is unreachable from Woodpecker CI pods. Service repo pipelines that push images to Harbor fail with:

dial tcp 10.43.131.178:443: i/o timeout

(from westside-app pipeline #73; pipeline #74 succeeded on the same commit, confirming intermittent failure)

Current state of Harbor URLs across repos:

| Repo | Registry URL | `insecure` | Status |
|------|--------------|------------|--------|
| westside-app | harbor.harbor.svc.cluster.local | true | Internal, should work |
| basketball-api | harbor.harbor.svc.cluster.local | true | Internal, should work |
| pal-e-docs | harbor.harbor.svc.cluster.local | true | Internal, should work |
| pal-e-app | harbor.harbor.svc.cluster.local | true | Internal, should work |
| mcd-tracker-api | harbor.tail5b443a.ts.net | NOT SET | External TLS — same failure pattern as Forgejo clone |
| mcd-tracker-app | harbor.tail5b443a.ts.net | NOT SET | External TLS — same failure pattern as Forgejo clone |
| minio-api | harbor.tail5b443a.ts.net | NOT SET | External TLS — same failure pattern as mcd-tracker repos |

NetworkPolicy (network-policies.tf:89): woodpecker -> harbor ingress is explicitly allowed in Terraform, BUT the Harbor namespace currently has NO NetworkPolicy deployed (kubectl get networkpolicies -n harbor returns empty). This is Terraform state drift: the policy is defined but not applied, so it is not blocking traffic today. When tofu apply eventually runs, the policy will activate — a latent drift issue, not an active policy block.
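For orientation, the allow rule described above would look roughly like this in the Terraform Kubernetes provider. This is a hedged sketch only — the resource name, namespace labels, and port are assumptions, not copied from network-policies.tf:

```hcl
# Hypothetical sketch of the woodpecker -> harbor ingress rule.
# Resource name, label selector, and port are assumptions.
resource "kubernetes_network_policy" "harbor_allow_woodpecker" {
  metadata {
    name      = "allow-woodpecker-to-harbor"
    namespace = "harbor"
  }

  spec {
    pod_selector {}            # applies to all pods in the harbor namespace
    policy_types = ["Ingress"]

    ingress {
      from {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "woodpecker"
          }
        }
      }
      ports {
        port     = 443
        protocol = "TCP"
      }
    }
  }
}
```

Comparing `tofu plan` output against `kubectl get networkpolicies -n harbor` is what exposes the drift: the plan shows the resource, the cluster does not.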

Repro Steps

  1. Push to a service repo (e.g., westside-app) to trigger CI
  2. Clone step succeeds (internal Forgejo URL)
  3. Build-and-push step fails — cannot reach Harbor

Expected Behavior

CI pods should push images to Harbor via the internal cluster URL (harbor.harbor.svc.cluster.local) without TLS issues.

Environment

  • Cluster/namespace: prod / woodpecker -> harbor
  • Related alerts: pipeline failures on service repos
  • NetworkPolicy: defined in Terraform but NOT deployed in cluster (state drift)
  • Previous related issue: #110 (westside-app Harbor auth — marked done)

File Targets

  • terraform/network-policies.tf — Harbor netpol block (lines 75-95), woodpecker-to-harbor ingress rule
  • mcd-tracker-api/.woodpecker.yaml — migrate Harbor URL from external to internal (child issue needed)
  • mcd-tracker-app/.woodpecker.yaml — migrate Harbor URL from external to internal (child issue needed)
  • minio-api/.woodpecker.yaml — migrate Harbor URL from external to internal (child issue needed)

Multi-repo child issues needed: each of the three repos above requires an independent PR to update its .woodpecker.yaml.
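The migration in each child PR would be a small `.woodpecker.yaml` change along these lines. A sketch only — the plugin image, step name, and repo path are assumptions; only the registry URL and `insecure: true` pattern come from this ticket:

```yaml
# Hypothetical build-and-push step after migration to the internal URL.
steps:
  build-and-push:
    image: woodpeckerci/plugin-docker-buildx   # plugin choice is an assumption
    settings:
      registry: harbor.harbor.svc.cluster.local
      repo: harbor.harbor.svc.cluster.local/library/mcd-tracker-api
      insecure: true   # internal service has no public TLS cert
      tags: latest
```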

Acceptance Criteria

  • Identify root cause (auth vs network vs kube-router ipset vs service resolution)
  • Service repo pipelines can push images to Harbor reliably
  • mcd-tracker repos and minio-api migrated from external to internal Harbor URL
  • All repos use consistent harbor.harbor.svc.cluster.local + insecure: true pattern

Test Expectations

  • Verify kubectl get networkpolicies -n harbor shows the policy after tofu apply
  • Trigger a pipeline on each affected repo and verify the build-and-push step succeeds
  • Confirm at least 3 consecutive successful pushes to rule out intermittent kube-router failures

Constraints

  • BLOCKER: #127 (kube-router ipset sync stale) must be addressed first. The intermittent failure pattern (pipeline #73 failed, #74 succeeded on the same commit) is the exact signature of kube-router not adding short-lived pod IPs to ipsets. If #127 is the root cause, migrating URLs alone will not fix the problem. URL migration is necessary but not sufficient.
  • If the root cause is kube-router ipset stale, ALL CI steps connecting to ANY NetworkPolicy-protected namespace are affected — not just Harbor.

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • #127 — kube-router ipset sync stale (BLOCKER) — intermittent failure pattern points to this as root cause
  • #133 — same class of bug (external URL unreliable from within cluster)
  • #110 — previous Harbor auth fix (westside-app, marked done)
  • #138 — split-horizon DNS (done, but only fixes host-level DNS, not CoreDNS inside cluster)
  • Repos affected: westside-app, basketball-api, pal-e-docs, pal-e-app, mcd-tracker-api, mcd-tracker-app, minio-api
Author
Owner

Update from parallel session: The problem is wider than Harbor. CI pods can't resolve DNS at all — pip install fails with "Temporary failure in name resolution." This is a complete network failure in woodpecker agent pods, not just Harbor connectivity.

Root cause is almost certainly kube-router ipset sync (#127). The sleep 2 workaround in the clone step only helps that one container. Later step containers (build, pip install) may hit the same NetworkPolicy race.

This likely needs:

  • sleep 2 at the start of EVERY step (ugly but works), OR
  • A kube-router fix (ipset pre-population or sync delay), OR
  • NetworkPolicy egress rules that are more permissive for woodpecker namespace

Upgrading priority — this blocks ALL CI across ALL service repos.
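The first option above (sleep in every step) would look like this in a service repo pipeline. This is an illustration of the workaround only — step names and images are placeholders:

```yaml
# Hypothetical step showing the per-step workaround: a short sleep so
# kube-router has time to add the step container's IP to its ipsets
# before the step makes any network call.
steps:
  build:
    image: python:3.12
    commands:
      - sleep 2   # wait for kube-router ipset sync (workaround for #127)
      - pip install -r requirements.txt
```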

Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-254-2026-03-22

Pipeline logs confirm failure is intermittent (westside-app #73 failed, #74 succeeded on same commit), pointing to kube-router ipset stale (#127) as probable root cause rather than a config issue.

Issues found:

  • #127 (kube-router ipset sync stale) is a potential blocker — not documented as a dependency. The intermittent dial tcp 10.43.131.178:443: i/o timeout pattern is the ipset signature. URL migration alone may not fix this.
  • minio-api missing from scope — also uses harbor.tail5b443a.ts.net without insecure flag, same pattern as mcd-tracker repos.
  • Harbor NetworkPolicy is NOT deployed — kubectl get networkpolicies -n harbor returns empty despite being defined in Terraform. Ticket assumes it is active.
  • "Exact error message TBD" — error IS available in pipeline #73 logs. Should be added to ticket body.
  • Missing template sections: User Story, File Targets (structured), Test Expectations, Constraints, Checklist.
  • Multi-repo strategy unclear — fix touches 3+ repos but no guidance on whether to create child issues per repo.
Author
Owner

Root Cause Analysis (2026-03-24)

Root cause identified: No split-horizon DNS or internal URL convention for Harbor.

The platform pipeline uses internal URLs everywhere:

  • forgejo-http.forgejo.svc.cluster.local:80 for git clone/API
  • minio.minio.svc.cluster.local:9000 for MinIO state

But service repo pipelines push images to harbor.tail5b443a.ts.net (external URL). Inside the cluster, this routes through the Tailscale DERP relay — unreliable and slow.

The NetworkPolicy already allows woodpecker → harbor ingress (network-policies.tf line 89). This is a DNS/routing issue, not a firewall issue.

Fix Options

  1. CoreDNS rewrite rule — add rewrite name harbor.tail5b443a.ts.net harbor-core.harbor.svc.cluster.local to CoreDNS config. One-time platform fix but k3s manages CoreDNS and may overwrite.

  2. Service repo convention — update each repo's .woodpecker.yaml to use harbor-core.harbor.svc.cluster.local as the registry URL.

  3. Woodpecker global env — set a HARBOR_INTERNAL env var in Woodpecker agent config, reference it in service repo pipelines.

Recommendation: Option 3 + 2. Centralize the URL, then migrate repos.
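For reference, option 1's rewrite rule would sit in the cluster Corefile roughly as follows. The rewrite line itself is from this analysis; the surrounding stanzas are standard k3s CoreDNS defaults and may differ in this cluster:

```
.:53 {
    errors
    health
    # Rewrite the external Harbor hostname to the in-cluster service
    # before the kubernetes plugin resolves it (option 1 above).
    rewrite name harbor.tail5b443a.ts.net harbor-core.harbor.svc.cluster.local
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
}
```

As noted, k3s manages this ConfigMap and may overwrite it, which is why options 2 and 3 are preferred.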

forgejo_admin 2026-03-24 20:56:21 +00:00