Kaniko build-and-push intermittently fails: cluster-internal Harbor unreachable #82

Closed
opened 2026-06-04 05:22:25 +00:00 by ldraney · 0 comments
Owner

Type

Bug

Lineage

Related to ldraney/landscaping-assistant #23 (CI optimization) and #77 (build-arg regression). This is the deeper root cause that #77's fixes only partially address.

Repo

ldraney/landscaping-assistant

What Broke

Kaniko build-and-push step intermittently fails to connect to harbor.harbor.svc.cluster.local. The failure pattern:

  1. Kaniko tries HTTPS on port 443 — times out (Harbor only serves HTTP on 80)
  2. Falls back to HTTP on port 80 — gets "connection refused"

The --insecure-pull flag (PR #79) does not skip the HTTPS attempt; it only permits falling back to HTTP. Once the HTTPS probe has timed out (~30s), the follow-up HTTP connection on port 80 is refused.

Meanwhile, regular pods in the woodpecker namespace (busybox, test pods) can reach Harbor on port 80 without issues. The problem is specific to Kaniko's connection behavior after the HTTPS timeout.
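To compare Kaniko's behavior with a plain pod, the 443-then-80 sequence can be replayed with a small TCP probe that distinguishes a timeout from a refused connection. This is a minimal sketch, not Kaniko's actual connection logic: the function names `probe` and `probe_sequence` are hypothetical, and the 5-second default timeout is an assumption chosen only to keep the check short.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify one TCP connect attempt as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or a proxy in front of it) answered with a reset.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer within `timeout` seconds, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


def probe_sequence(host: str, ports=(443, 80), timeout: float = 5.0) -> dict:
    """Probe the ports in order (HTTPS first, then HTTP) and return the result per port."""
    return {port: probe(host, port, timeout) for port in ports}
```

Calling `probe_sequence("harbor.harbor.svc.cluster.local")` from a pod in the woodpecker namespace would show whether port 80 is still reachable right after the port 443 attempt. If a plain pod reports `open` on port 80 while Kaniko is refused, that would support the theory that the problem lies in Kaniko's connection handling rather than in the Service.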

Successful builds (#164, #161, #147) pulled via the Tailscale FQDN and pushed via the cluster-internal address, but the FQDN path also fails intermittently when the DERP relay drops the connection.

Repro Steps

  1. Merge any PR to main
  2. Pipeline triggers, lint + test pass
  3. build-and-push starts — ~50% chance of failure
  4. Retry sometimes succeeds (intermittent)
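As a rough worked estimate, the ~50% failure rate above can be related to the >95% reliability target in the acceptance criteria. The sketch below assumes each attempt fails independently with the same probability. That assumption may not hold here (failures tied to a relay drop or a stuck TCP state could be correlated), so treat the result as a lower bound on the retries needed, not a prediction. The function names are hypothetical.

```python
def success_probability(p_fail: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - p_fail ** attempts


def attempts_needed(p_fail: float, target: float) -> int:
    """Smallest number of attempts whose combined success probability exceeds `target`."""
    if not (0.0 <= p_fail < 1.0 and 0.0 <= target < 1.0):
        raise ValueError("p_fail and target must both be in [0, 1)")
    n = 1
    while success_probability(p_fail, n) <= target:
        n += 1
    return n
```

Under these assumptions, `attempts_needed(0.5, 0.95)` returns 5: four attempts give 93.75%, and five give 96.875%. Retries alone would therefore need up to five build attempts per pipeline to meet the target, which suggests fixing the underlying connection failure is the better route.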

Expected Behavior

Kaniko reliably pulls base images and pushes to Harbor on every pipeline run.

Environment

  • Cluster/namespace: prod / woodpecker
  • Woodpecker agent: single replica (issue #62 tracks scaling)
  • Kaniko plugin: woodpeckerci/plugin-kaniko:2.3.0
  • Harbor service: ClusterIP 10.43.131.178:80 (HTTP only, no HTTPS)

Investigation Notes

What was tried:

  • PR #78: Remove build-arg, pull via Tailscale FQDN → DERP relay drops mid-transfer
  • PR #79: Restore build-arg + --insecure-pull → still tries HTTPS first, times out
  • Pipeline #184: build-and-push started successfully but was canceled by superseding pipeline #185

Possible fixes (not yet tried):

  • Add HTTPS (port 443) to Harbor k8s Service — fast TLS probe instead of 30s timeout
  • Use --registry-mirror with HTTP mirror for cluster-internal Harbor
  • Add retry logic to the build-and-push step
  • Investigate why HTTP on port 80 gets "connection refused" after HTTPS timeout (TCP state issue?)
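The "add retry logic" option from the list above could look roughly like the following. This is a generic sketch of retry with exponential backoff, not a Woodpecker or Kaniko feature: the `retry` helper, its defaults, and the idea of wrapping the build invocation in a callable are all assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Here `action` would be whatever launches the build, for example a function that runs the Kaniko executor as a subprocess and raises on a non-zero exit. Retrying masks the symptom and lengthens failed pipelines, so it probably fits best as a stopgap alongside one of the other fixes.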

Auto-cancel interference: Rapid merges to main cause pipelines to supersede each other (pipeline #184 was canceled by #185). Issue #62 tracks Woodpecker agent scaling, which would help.

Acceptance Criteria

  • Kaniko build-and-push succeeds reliably (>95% of runs)
  • No DERP relay dependency for image builds
  • Pipeline is not silently canceled mid-build

Related

  • ldraney/landscaping-assistant #23 — parent CI optimization issue
  • ldraney/landscaping-assistant #62 — Woodpecker agent scaling
  • ldraney/landscaping-assistant #77 — build-arg regression (partially addressed)
  • landscaping-assistant — project this affects