Bug: Harbor unreachable from CI pods — investigate and fix connectivity #135

Closed
opened 2026-03-21 17:39:18 +00:00 by forgejo_admin · 3 comments

Type

Bug

Lineage

plan-pal-e-platform — standalone, discovered during operations/monitoring

Repo

forgejo_admin/pal-e-platform (NetworkPolicy + Harbor config), plus service repos with .woodpecker.yaml push steps

User Story

As a CI pipeline, I want to reach Harbor from build pods, so that image push/pull works reliably.

What Broke

Harbor container registry is unreachable from Woodpecker CI pods. Service repo pipelines that push images to Harbor fail with:

dial tcp 10.43.131.178:443: i/o timeout

(from westside-app pipeline #73; pipeline #74 succeeded on the same commit, confirming intermittent failure)

Current state of Harbor URLs across repos:

| Repo | Registry URL | `insecure` | Status |
|------|--------------|------------|--------|
| westside-app | harbor.harbor.svc.cluster.local | true | Internal, should work |
| basketball-api | harbor.harbor.svc.cluster.local | true | Internal, should work |
| pal-e-docs | harbor.harbor.svc.cluster.local | true | Internal, should work |
| pal-e-app | harbor.harbor.svc.cluster.local | true | Internal, should work |
| mcd-tracker-api | harbor.tail5b443a.ts.net | NOT SET | External TLS — same failure pattern as Forgejo clone |
| mcd-tracker-app | harbor.tail5b443a.ts.net | NOT SET | External TLS — same failure pattern as Forgejo clone |
| minio-api | harbor.tail5b443a.ts.net | NOT SET | External TLS — same failure pattern as mcd-tracker repos |

NetworkPolicy (network-policies.tf:89): woodpecker -> harbor ingress is explicitly allowed in Terraform, BUT the Harbor namespace currently has NO NetworkPolicy deployed (kubectl get networkpolicies -n harbor returns empty). This is Terraform state drift: the policy is defined but not applied, so it is not blocking traffic today. When tofu apply eventually runs, the policy will activate — a latent drift issue, not an active policy block.
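For orientation, the allow rule described above would look roughly like this in the Terraform Kubernetes provider. This is a hedged sketch only — the resource name, namespace labels, and port are assumptions, not copied from network-policies.tf:

```hcl
# Hypothetical sketch of the woodpecker -> harbor ingress rule.
# Resource name, label selector, and port are assumptions.
resource "kubernetes_network_policy" "harbor_allow_woodpecker" {
  metadata {
    name      = "allow-woodpecker-to-harbor"
    namespace = "harbor"
  }

  spec {
    pod_selector {}            # applies to all pods in the harbor namespace
    policy_types = ["Ingress"]

    ingress {
      from {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "woodpecker"
          }
        }
      }
      ports {
        port     = 443
        protocol = "TCP"
      }
    }
  }
}
```

Comparing `tofu plan` output against `kubectl get networkpolicies -n harbor` is what exposes the drift: the plan shows the resource, the cluster does not.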

Repro Steps

  1. Push to a service repo (e.g., westside-app) to trigger CI
  2. Clone step succeeds (internal Forgejo URL)
  3. Build-and-push step fails — cannot reach Harbor

Expected Behavior

CI pods should push images to Harbor via the internal cluster URL (harbor.harbor.svc.cluster.local) without TLS issues.

Environment

  • Cluster/namespace: prod / woodpecker -> harbor
  • Related alerts: pipeline failures on service repos
  • NetworkPolicy: defined in Terraform but NOT deployed in cluster (state drift)
  • Previous related issue: #110 (westside-app Harbor auth — marked done)

File Targets

  • terraform/network-policies.tf — Harbor netpol block (lines 75-95), woodpecker-to-harbor ingress rule
  • mcd-tracker-api/.woodpecker.yaml — migrate Harbor URL from external to internal (child issue needed)
  • mcd-tracker-app/.woodpecker.yaml — migrate Harbor URL from external to internal (child issue needed)
  • minio-api/.woodpecker.yaml — migrate Harbor URL from external to internal (child issue needed)

Multi-repo child issues needed: each of the three repos above requires an independent PR to update its .woodpecker.yaml.
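The migration in each child PR would be a small `.woodpecker.yaml` change along these lines. A sketch only — the plugin image, step name, and repo path are assumptions; only the registry URL and `insecure: true` pattern come from this ticket:

```yaml
# Hypothetical build-and-push step after migration to the internal URL.
steps:
  build-and-push:
    image: woodpeckerci/plugin-docker-buildx   # plugin choice is an assumption
    settings:
      registry: harbor.harbor.svc.cluster.local
      repo: harbor.harbor.svc.cluster.local/library/mcd-tracker-api
      insecure: true   # internal service has no public TLS cert
      tags: latest
```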

Acceptance Criteria

  • Identify root cause (auth vs network vs kube-router ipset vs service resolution)
  • Service repo pipelines can push images to Harbor reliably
  • mcd-tracker repos and minio-api migrated from external to internal Harbor URL
  • All repos use consistent harbor.harbor.svc.cluster.local + insecure: true pattern

Test Expectations

  • Verify kubectl get networkpolicies -n harbor shows the policy after tofu apply
  • Trigger a pipeline on each affected repo and verify the build-and-push step succeeds
  • Confirm at least 3 consecutive successful pushes to rule out intermittent kube-router failures

Constraints

  • BLOCKER: #127 (kube-router ipset sync stale) must be addressed first. The intermittent failure pattern (pipeline #73 failed, #74 succeeded on the same commit) is the exact signature of kube-router not adding short-lived pod IPs to ipsets. If #127 is the root cause, migrating URLs alone will not fix the problem. URL migration is necessary but not sufficient.
  • If the root cause is kube-router ipset stale, ALL CI steps connecting to ANY NetworkPolicy-protected namespace are affected — not just Harbor.

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • #127 — kube-router ipset sync stale (BLOCKER) — intermittent failure pattern points to this as root cause
  • #133 — same class of bug (external URL unreliable from within cluster)
  • #110 — previous Harbor auth fix (westside-app, marked done)
  • #138 — split-horizon DNS (done, but only fixes host-level DNS, not CoreDNS inside cluster)
  • Repos affected: westside-app, basketball-api, pal-e-docs, pal-e-app, mcd-tracker-api, mcd-tracker-app, minio-api
Author
Owner

Update from parallel session: The problem is wider than Harbor. CI pods can't resolve DNS at all — pip install fails with "Temporary failure in name resolution." This is a complete network failure in woodpecker agent pods, not just Harbor connectivity.

Root cause is almost certainly kube-router ipset sync (#127). The sleep 2 workaround in the clone step only helps that one container. Later step containers (build, pip install) may hit the same NetworkPolicy race.

This likely needs:

  • sleep 2 at the start of EVERY step (ugly but works), OR
  • A kube-router fix (ipset pre-population or sync delay), OR
  • NetworkPolicy egress rules that are more permissive for woodpecker namespace

Upgrading priority — this blocks ALL CI across ALL service repos.
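The first option above (sleep in every step) would look like this in a service repo pipeline. This is an illustration of the workaround only — step names and images are placeholders:

```yaml
# Hypothetical step showing the per-step workaround: a short sleep so
# kube-router has time to add the step container's IP to its ipsets
# before the step makes any network call.
steps:
  build:
    image: python:3.12
    commands:
      - sleep 2   # wait for kube-router ipset sync (workaround for #127)
      - pip install -r requirements.txt
```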

Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-254-2026-03-22

Pipeline logs confirm failure is intermittent (westside-app #73 failed, #74 succeeded on same commit), pointing to kube-router ipset stale (#127) as probable root cause rather than a config issue.

Issues found:

  • #127 (kube-router ipset sync stale) is a potential blocker — not documented as a dependency. The intermittent dial tcp 10.43.131.178:443: i/o timeout pattern is the ipset signature. URL migration alone may not fix this.
  • minio-api missing from scope — also uses harbor.tail5b443a.ts.net without insecure flag, same pattern as mcd-tracker repos.
  • Harbor NetworkPolicy is NOT deployed — kubectl get networkpolicies -n harbor returns empty despite being defined in Terraform. Ticket assumes it is active.
  • "Exact error message TBD" — error IS available in pipeline #73 logs. Should be added to ticket body.
  • Missing template sections: User Story, File Targets (structured), Test Expectations, Constraints, Checklist.
  • Multi-repo strategy unclear — fix touches 3+ repos but no guidance on whether to create child issues per repo.
Author
Owner

Root Cause Analysis (2026-03-24)

Root cause identified: No split-horizon DNS or internal URL convention for Harbor.

The platform pipeline uses internal URLs everywhere:

  • forgejo-http.forgejo.svc.cluster.local:80 for git clone/API
  • minio.minio.svc.cluster.local:9000 for MinIO state

But service repo pipelines push images to harbor.tail5b443a.ts.net (external URL). Inside the cluster, this routes through the Tailscale DERP relay — unreliable and slow.

The NetworkPolicy already allows woodpecker → harbor ingress (network-policies.tf line 89). This is a DNS/routing issue, not a firewall issue.

Fix Options

  1. CoreDNS rewrite rule — add rewrite name harbor.tail5b443a.ts.net harbor-core.harbor.svc.cluster.local to CoreDNS config. One-time platform fix but k3s manages CoreDNS and may overwrite.

  2. Service repo convention — update each repo's .woodpecker.yaml to use harbor-core.harbor.svc.cluster.local as the registry URL.

  3. Woodpecker global env — set a HARBOR_INTERNAL env var in Woodpecker agent config, reference it in service repo pipelines.

Recommendation: Option 3 + 2. Centralize the URL, then migrate repos.
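For reference, option 1's rewrite rule would sit in the cluster Corefile roughly as follows. The rewrite line itself is from this analysis; the surrounding stanzas are standard k3s CoreDNS defaults and may differ in this cluster:

```
.:53 {
    errors
    health
    # Rewrite the external Harbor hostname to the in-cluster service
    # before the kubernetes plugin resolves it (option 1 above).
    rewrite name harbor.tail5b443a.ts.net harbor-core.harbor.svc.cluster.local
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
}
```

As noted, k3s manages this ConfigMap and may overwrite it, which is why options 2 and 3 are preferred.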

forgejo_admin 2026-03-24 20:56:21 +00:00