Bug: CI build-and-push fails with Harbor connectivity timeout from Woodpecker agent

forgejo_admin commented

2026-03-27 00:03:58 +00:00

Contributor

Type

Bug

Lineage

standalone — discovered during CI pipeline monitoring for basketball-api PRs #172/#174

Repo

forgejo_admin/pal-e-platform

What Broke

Woodpecker CI pipeline build-and-push step fails with Harbor registry connectivity timeout. Kaniko attempts HTTPS (port 443) on the harbor ClusterIP service which only exposes port 80, then falls back to HTTP which gets connection refused.

error checking push permissions: checking push permission for
"harbor.harbor.svc.cluster.local/basketball-api/api:c3247bf..."
creating push check transport for harbor.harbor.svc.cluster.local failed:
Get "https://harbor.harbor.svc.cluster.local/v2/": dial tcp 10.43.131.178:443: i/o timeout
Get "http://harbor.harbor.svc.cluster.local/v2/": dial tcp 10.43.131.178:80: connect: connection refused

On retry (pipeline #146), the postgres service container also failed to start (empty logs), suggesting broader CI agent resource pressure or networking issues.

Repro Steps

Merge any PR to basketball-api main (or any repo with Woodpecker CI build-and-push)
Tests pass (555/555 on pipeline #145)
build-and-push step times out connecting to Harbor
Retry (pipeline #146) — postgres service container fails to start entirely

Expected Behavior

Kaniko should successfully push images to Harbor via harbor.harbor.svc.cluster.local:80. All Harbor pods are Running, services are up. This worked previously.

Environment

Cluster/namespace: prod / harbor, woodpecker
Harbor pods: all 10 Running (checked 2026-03-26)
Harbor service: ClusterIP 10.43.131.178, port 80/TCP only (no 443)
Affected pipelines: basketball-api #143, #145, #146
CI agent: Woodpecker (k8s or Mac agent — needs investigation)

Acceptance Criteria

Identify whether this is network policy, DNS, or agent resource issue
CI pipeline successfully builds and pushes image to Harbor
basketball-api deploys with migrations 022 + 023

forgejo_admin/basketball-api#170 — jersey sync fix waiting on deploy
forgejo_admin/basketball-api#173 — teams/save fix waiting on deploy
feedback_ci_pipeline_lessons.md — 12 prior CI root causes (Harbor hairpin was one)
sop-ci-pipeline-recovery — CI recovery SOP

### Type Bug ### Lineage standalone — discovered during CI pipeline monitoring for basketball-api PRs #172/#174 ### Repo `forgejo_admin/pal-e-platform` ### What Broke Woodpecker CI pipeline `build-and-push` step fails with Harbor registry connectivity timeout. Kaniko attempts HTTPS (port 443) on the `harbor` ClusterIP service which only exposes port 80, then falls back to HTTP which gets connection refused. ``` error checking push permissions: checking push permission for "harbor.harbor.svc.cluster.local/basketball-api/api:c3247bf..." creating push check transport for harbor.harbor.svc.cluster.local failed: Get "https://harbor.harbor.svc.cluster.local/v2/": dial tcp 10.43.131.178:443: i/o timeout Get "http://harbor.harbor.svc.cluster.local/v2/": dial tcp 10.43.131.178:80: connect: connection refused ``` On retry (pipeline #146), the postgres service container also failed to start (empty logs), suggesting broader CI agent resource pressure or networking issues. ### Repro Steps 1. Merge any PR to `basketball-api` main (or any repo with Woodpecker CI build-and-push) 2. Tests pass (555/555 on pipeline #145) 3. `build-and-push` step times out connecting to Harbor 4. Retry (pipeline #146) — postgres service container fails to start entirely ### Expected Behavior Kaniko should successfully push images to Harbor via `harbor.harbor.svc.cluster.local:80`. All Harbor pods are Running, services are up. This worked previously. ### Environment - Cluster/namespace: prod / `harbor`, `woodpecker` - Harbor pods: all 10 Running (checked 2026-03-26) - Harbor service: ClusterIP `10.43.131.178`, port 80/TCP only (no 443) - Affected pipelines: basketball-api #143, #145, #146 - CI agent: Woodpecker (k8s or Mac agent — needs investigation) ### Acceptance Criteria - [ ] Identify whether this is network policy, DNS, or agent resource issue - [ ] CI pipeline successfully builds and pushes image to Harbor - [ ] basketball-api deploys with migrations 022 + 023 ### Related - `forgejo_admin/basketball-api#170` — jersey sync fix waiting on deploy - `forgejo_admin/basketball-api#173` — teams/save fix waiting on deploy - `feedback_ci_pipeline_lessons.md` — 12 prior CI root causes (Harbor hairpin was one) - `sop-ci-pipeline-recovery` — CI recovery SOP

forgejo_admin commented

2026-03-27 00:07:41 +00:00

Author

Contributor

Scope Review: NEEDS_REFINEMENT

Review note: review-411-2026-03-26

Scope document is strong on context (error output, environment, repro steps) but missing 5 template sections needed for agent execution.

Missing File Targets: No files listed. Key investigation targets: terraform/network-policies.tf, terraform/main.tf, basketball-api/.woodpecker.yaml
Missing Test Expectations: No verification commands for an agent to run post-fix
Acceptance criteria need scoping: "basketball-api deploys with migrations 022+023" is a separate deploy concern -- split to its own issue. "Identify root cause" is investigative, not testable -- rewrite as "Root cause documented in PR description"
Mac agent hypothesis unaddressed: The recently enabled Mac agent (backend: local, board #391) cannot run Kaniko containers. If Woodpecker misrouted pipeline #145 to the Mac agent, that is the root cause -- not network policy. Check agent assignment before investigating network layer.
SOP gap: sop-ci-pipeline-recovery does not cover connectivity timeout failure mode (port 443 on HTTP-only service). Update SOP after root cause confirmed.

## Scope Review: NEEDS_REFINEMENT Review note: `review-411-2026-03-26` Scope document is strong on context (error output, environment, repro steps) but missing 5 template sections needed for agent execution. - **Missing File Targets**: No files listed. Key investigation targets: `terraform/network-policies.tf`, `terraform/main.tf`, `basketball-api/.woodpecker.yaml` - **Missing Test Expectations**: No verification commands for an agent to run post-fix - **Acceptance criteria need scoping**: "basketball-api deploys with migrations 022+023" is a separate deploy concern -- split to its own issue. "Identify root cause" is investigative, not testable -- rewrite as "Root cause documented in PR description" - **Mac agent hypothesis unaddressed**: The recently enabled Mac agent (`backend: local`, board #391) cannot run Kaniko containers. If Woodpecker misrouted pipeline #145 to the Mac agent, that is the root cause -- not network policy. Check agent assignment before investigating network layer. - **SOP gap**: `sop-ci-pipeline-recovery` does not cover connectivity timeout failure mode (port 443 on HTTP-only service). Update SOP after root cause confirmed.

forgejo_admin commented

2026-03-27 00:24:48 +00:00

Author

Contributor

Root Cause Investigation (2026-03-26)

Original hypothesis was Harbor connectivity / Mac agent routing. Both ruled out.

Actual root cause: Woodpecker server DB state corruption

Woodpecker DB (woodpecker-db-1) became briefly unreachable
Server restarted with failed to setup store: dial tcp 10.43.54.87:5432: connection refused
Agent crash-looped 3x waiting for server gRPC (:9000 refused)
After recovery, server logs show persistent errors:
- queue.Done: cannot ack workflow -- sql: no rows in result set
- done: cannot close log stream for step N -- stream: not found
Service containers (postgres) report as "failed" with empty logs -- the agent creates pods fine (verified manually), but the server can't track the workflow state
Rolling restart of server + agent did NOT fix it -- the DB still has stale/missing workflow records
5 consecutive pipeline retries (#145-149) all fail at postgres service container with empty logs

Verified NOT the cause

Mac agent routing: Mac agent has filter_labels: "platform=darwin", basketball-api pipelines have no labels. Agent ran on k8s (confirmed by clone using forgejo-http.forgejo.svc.cluster.local)
Harbor connectivity: Never reached build-and-push step on retries. Original #145 Harbor timeout was during the server instability window
Image pull: postgres:16-alpine starts fine manually in woodpecker namespace
Resources: 19% CPU, 18% memory on archbox -- no pressure

Fix needed

Restart Woodpecker DB → server → agent in order, or investigate woodpecker-db-1 for corrupted workflow/step records. The SOP (sop-ci-pipeline-recovery) should be updated with this failure mode.

## Root Cause Investigation (2026-03-26) Original hypothesis was Harbor connectivity / Mac agent routing. **Both ruled out.** ### Actual root cause: Woodpecker server DB state corruption 1. **Woodpecker DB** (`woodpecker-db-1`) became briefly unreachable 2. **Server** restarted with `failed to setup store: dial tcp 10.43.54.87:5432: connection refused` 3. **Agent** crash-looped 3x waiting for server gRPC (`:9000` refused) 4. After recovery, server logs show persistent errors: - `queue.Done: cannot ack workflow -- sql: no rows in result set` - `done: cannot close log stream for step N -- stream: not found` 5. Service containers (postgres) report as "failed" with empty logs -- the agent creates pods fine (verified manually), but the server can't track the workflow state 6. Rolling restart of server + agent did NOT fix it -- the DB still has stale/missing workflow records 7. 5 consecutive pipeline retries (#145-149) all fail at postgres service container with empty logs ### Verified NOT the cause - **Mac agent routing**: Mac agent has `filter_labels: "platform=darwin"`, basketball-api pipelines have no labels. Agent ran on k8s (confirmed by clone using `forgejo-http.forgejo.svc.cluster.local`) - **Harbor connectivity**: Never reached build-and-push step on retries. Original #145 Harbor timeout was during the server instability window - **Image pull**: `postgres:16-alpine` starts fine manually in woodpecker namespace - **Resources**: 19% CPU, 18% memory on archbox -- no pressure ### Fix needed Restart Woodpecker DB → server → agent in order, or investigate `woodpecker-db-1` for corrupted workflow/step records. The SOP (`sop-ci-pipeline-recovery`) should be updated with this failure mode.

forgejo_admin commented

2026-03-27 00:25:49 +00:00

Author

Contributor

Incident Escalation: Woodpecker DB corruption

Original scope was Harbor connectivity. Root cause is deeper:

What's Actually Broken

Woodpecker server DB has corrupted/stale workflow records
sql: no rows in result set errors on queue.Done
Every pipeline restart gets queued but service containers "fail" with empty logs
Agent connects but can't ack workflows back to server
Caused by restart cascade from Helm rollback during alert triage session

Severity

P2 — CI Degraded. All pipelines are broken. Services that need CI builds are blocked.

Helm rollback of woodpecker to revision 14 (during this session to unblock tofu apply)
PodRestartStorm alert firing on woodpecker namespace
PR #185 CI clone failure was a symptom of this

Options

Check Woodpecker DB directly for corrupted records
Restart whole stack in order: DB → server → agent (likely cleanest)
Defer to next session with full investigation

## Incident Escalation: Woodpecker DB corruption Original scope was Harbor connectivity. Root cause is deeper: ### What's Actually Broken - Woodpecker server DB has corrupted/stale workflow records - `sql: no rows in result set` errors on `queue.Done` - Every pipeline restart gets queued but service containers "fail" with empty logs - Agent connects but can't ack workflows back to server - Caused by restart cascade from Helm rollback during alert triage session ### Severity **P2 — CI Degraded.** All pipelines are broken. Services that need CI builds are blocked. ### Related - Helm rollback of woodpecker to revision 14 (during this session to unblock tofu apply) - PodRestartStorm alert firing on woodpecker namespace - PR #185 CI clone failure was a symptom of this ### Options 1. Check Woodpecker DB directly for corrupted records 2. Restart whole stack in order: DB → server → agent (likely cleanest) 3. Defer to next session with full investigation

forgejo_admin referenced this issue

2026-03-27 00:36:33 +00:00

Critical: Migrate basketball-api Postgres to CNPG shared cluster #187

forgejo_admin referenced this issue

2026-03-27 01:29:13 +00:00

Woodpecker agent label routing — platform-wide pipeline contract #191

forgejo_admin referenced this issue

2026-03-27 01:46:22 +00:00

feat: Woodpecker agent label routing + retry count (#191) #192

forgejo_admin referenced this issue

2026-03-27 02:23:59 +00:00

Fix: Kaniko HTTPS probe timeout — add insecure-registry to internal Harbor pipelines #193

forgejo_admin referenced this issue

2026-03-27 03:22:03 +00:00

Bump Woodpecker agent parallel workflows from 1 to 4 #194

forgejo_admin referenced this issue

2026-03-27 03:27:33 +00:00

feat: bump Woodpecker agent parallel workflows 1→4 (#194) #195

forgejo_admin referenced this issue

2026-03-27 03:28:45 +00:00

feat: bump Woodpecker agent parallel workflows 1→4 (#194) #195

forgejo_admin referenced this issue

2026-03-27 03:59:54 +00:00

Terraform state splitting — modularize monolith main.tf for isolated applies #197

forgejo_admin referenced this issue from ldraney/westside-app

2026-03-27 05:46:37 +00:00

deploy: update image to pipeline #85 build #105

forgejo_admin referenced this issue from ldraney/westside-app

2026-03-27 05:49:39 +00:00

deploy: update image to pipeline #85 build (v2) #106

forgejo_admin referenced this issue

2026-03-27 06:03:45 +00:00

Add argocd namespace to Forgejo network policy #200

forgejo_admin referenced this issue

2026-03-27 06:04:12 +00:00

Migrate all ArgoCD apps to pal-e-deployments centralized overlays #201

forgejo_admin referenced this issue from ldraney/basketball-api

2026-03-27 06:04:33 +00:00

Bug: Alembic migration chain stuck — 020 applied but not stamped, app down #184

forgejo_admin referenced this issue

2026-03-27 06:14:56 +00:00

Add argocd namespace to Forgejo network policy #200

forgejo_admin commented

2026-03-28 22:50:00 +00:00

Author

Contributor

Resolution: RESOLVED (self-healed)

Investigation 2026-03-28:

Harbor fully healthy — HTTP 200, all 9 pods Running, zero restarts in 11+ days
CI pipelines succeeding — basketball-api #231-233 all SUCCESS including image push to Harbor
Root cause (Woodpecker DB corruption from Helm rollback) has self-healed — no stuck workflows remain
Residual sql: no rows in result set / stream: not found log noise (~5/hour) is cosmetic, not blocking

Discovered scope (tracked separately):

Woodpecker server log noise cleanup (low priority)
SOP update: sop-ci-pipeline-recovery missing "DB corruption from Helm rollback" failure mode

Closing.

## Resolution: RESOLVED (self-healed) **Investigation 2026-03-28:** - Harbor fully healthy — HTTP 200, all 9 pods Running, zero restarts in 11+ days - CI pipelines succeeding — basketball-api #231-233 all SUCCESS including image push to Harbor - Root cause (Woodpecker DB corruption from Helm rollback) has self-healed — no stuck workflows remain - Residual `sql: no rows in result set` / `stream: not found` log noise (~5/hour) is cosmetic, not blocking **Discovered scope (tracked separately):** 1. Woodpecker server log noise cleanup (low priority) 2. SOP update: `sop-ci-pipeline-recovery` missing "DB corruption from Helm rollback" failure mode Closing.

forgejo_admin referenced this issue

2026-03-28 22:50:32 +00:00

Woodpecker server log noise: orphaned queue.Done / stream errors #241

forgejo_admin referenced this issue

2026-03-28 22:50:42 +00:00

SOP update: add Helm rollback DB corruption to sop-ci-pipeline-recovery #242