Bug: CI build-and-push fails with Harbor connectivity timeout from Woodpecker agent #184

Open
opened 2026-03-27 00:03:58 +00:00 by forgejo_admin · 3 comments

Type

Bug

Lineage

standalone — discovered during CI pipeline monitoring for basketball-api PRs #172/#174

Repo

forgejo_admin/pal-e-platform

What Broke

The Woodpecker CI build-and-push step fails with a Harbor registry connectivity timeout. Kaniko first attempts HTTPS (port 443) against the harbor ClusterIP service, which only exposes port 80, and times out. It then falls back to HTTP on port 80, where the connection is refused.

error checking push permissions: checking push permission for
"harbor.harbor.svc.cluster.local/basketball-api/api:c3247bf..."
creating push check transport for harbor.harbor.svc.cluster.local failed:
Get "https://harbor.harbor.svc.cluster.local/v2/": dial tcp 10.43.131.178:443: i/o timeout
Get "http://harbor.harbor.svc.cluster.local/v2/": dial tcp 10.43.131.178:80: connect: connection refused

On retry (pipeline #146), the postgres service container also failed to start (empty logs), suggesting broader CI agent resource pressure or networking issues.
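
For triage, the two dial failures in the error above can be pulled apart programmatically. The following is a minimal sketch, assuming the Kaniko output keeps the `Get "<url>": dial tcp <ip>:<port>: <reason>` shape shown in this issue; the `DialFailure` type and `parse_dial_failures` helper are hypothetical names introduced only for illustration.

```python
import re
from typing import NamedTuple


class DialFailure(NamedTuple):
    scheme: str   # "https" or "http"
    address: str  # IP the client dialed
    port: int
    reason: str   # text after the port, e.g. "i/o timeout"


# Matches lines such as:
#   Get "https://host/v2/": dial tcp 10.0.0.1:443: i/o timeout
_DIAL_RE = re.compile(
    r'Get "(?P<scheme>https?)://[^"]+": dial tcp '
    r'(?P<address>[0-9.]+):(?P<port>\d+): (?P<reason>.+)$'
)


def parse_dial_failures(log_text: str) -> list[DialFailure]:
    """Return one DialFailure per 'dial tcp' line found in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = _DIAL_RE.search(line.strip())
        if match:
            failures.append(
                DialFailure(
                    match["scheme"],
                    match["address"],
                    int(match["port"]),
                    match["reason"].strip(),
                )
            )
    return failures


# Error text copied from this issue (image tag left truncated as reported).
KANIKO_LOG = '''error checking push permissions: checking push permission for
"harbor.harbor.svc.cluster.local/basketball-api/api:c3247bf..."
creating push check transport for harbor.harbor.svc.cluster.local failed:
Get "https://harbor.harbor.svc.cluster.local/v2/": dial tcp 10.43.131.178:443: i/o timeout
Get "http://harbor.harbor.svc.cluster.local/v2/": dial tcp 10.43.131.178:80: connect: connection refused
'''

if __name__ == "__main__":
    for failure in parse_dial_failures(KANIKO_LOG):
        print(failure)
```

Separating the HTTPS timeout from the HTTP refusal makes it easier to see that only the port 80 result matters for the expected push path.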

Repro Steps

  1. Merge any PR to basketball-api main (or any repo with Woodpecker CI build-and-push)
  2. Tests pass (555/555 on pipeline #145)
  3. build-and-push step times out connecting to Harbor
  4. Retry (pipeline #146) — postgres service container fails to start entirely

Expected Behavior

Kaniko should successfully push images to Harbor via harbor.harbor.svc.cluster.local:80. All Harbor pods are Running, services are up. This worked previously.

Environment

  • Cluster/namespace: prod / harbor, woodpecker
  • Harbor pods: all 10 Running (checked 2026-03-26)
  • Harbor service: ClusterIP 10.43.131.178, port 80/TCP only (no 443)
  • Affected pipelines: basketball-api #143, #145, #146
  • CI agent: Woodpecker (k8s or Mac agent — needs investigation)
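
The port situation listed above can be checked from inside the cluster with a plain TCP connect. This is a rough sketch using only the Python standard library, not a command taken from the SOP; `probe_tcp` is a hypothetical helper, and it assumes it is run from a pod that resolves the same service name the pipeline uses.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <detail>"
    (the last covers DNS failures and other OS-level errors).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Service name from this issue; 80 is the only port the service exposes.
    host = "harbor.harbor.svc.cluster.local"
    for port in (80, 443):
        print(f"{host}:{port} -> {probe_tcp(host, port)}")
```

A result of "open" on port 80 from a pod in the woodpecker namespace would point away from the network layer, while "refused" or "timeout" there would reproduce the failure independently of Kaniko.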

Acceptance Criteria

  • Identify whether this is network policy, DNS, or agent resource issue
  • CI pipeline successfully builds and pushes image to Harbor
  • basketball-api deploys with migrations 022 + 023

Related

  • forgejo_admin/basketball-api#170: jersey sync fix waiting on deploy
  • forgejo_admin/basketball-api#173: teams/save fix waiting on deploy
  • feedback_ci_pipeline_lessons.md: 12 prior CI root causes (Harbor hairpin was one)
  • sop-ci-pipeline-recovery: CI recovery SOP

Scope Review: NEEDS_REFINEMENT

Review note: review-411-2026-03-26

Scope document is strong on context (error output, environment, repro steps) but missing 5 template sections needed for agent execution.

  • Missing File Targets: No files listed. Key investigation targets: terraform/network-policies.tf, terraform/main.tf, basketball-api/.woodpecker.yaml
  • Missing Test Expectations: No verification commands for an agent to run post-fix
  • Acceptance criteria need scoping: "basketball-api deploys with migrations 022+023" is a separate deploy concern -- split to its own issue. "Identify root cause" is investigative, not testable -- rewrite as "Root cause documented in PR description"
  • Mac agent hypothesis unaddressed: The recently enabled Mac agent (backend: local, board #391) cannot run Kaniko containers. If Woodpecker misrouted pipeline #145 to the Mac agent, that is the root cause -- not network policy. Check agent assignment before investigating network layer.
  • SOP gap: sop-ci-pipeline-recovery does not cover connectivity timeout failure mode (port 443 on HTTP-only service). Update SOP after root cause confirmed.

Root Cause Investigation (2026-03-26)

Original hypothesis was Harbor connectivity / Mac agent routing. Both ruled out.

Actual root cause: Woodpecker server DB state corruption

  1. Woodpecker DB (woodpecker-db-1) became briefly unreachable
  2. Server restarted with the error "failed to setup store: dial tcp 10.43.54.87:5432: connection refused"
  3. Agent crash-looped 3x waiting for server gRPC (:9000 refused)
  4. After recovery, server logs show persistent errors:
    • queue.Done: cannot ack workflow -- sql: no rows in result set
    • done: cannot close log stream for step N -- stream: not found
  5. Service containers (postgres) report as "failed" with empty logs -- the agent creates pods fine (verified manually), but the server can't track the workflow state
  6. Rolling restart of server + agent did NOT fix it -- the DB still has stale/missing workflow records
  7. 5 consecutive pipeline retries (#145-149) all fail at postgres service container with empty logs
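
The error messages in the sequence above can serve as signatures when scanning server logs for this failure mode. The following is a small sketch, assuming the messages appear in the log as substrings in the form quoted here; the real log line layout (timestamps, fields) is not shown in this issue, so the sample lines and the `SIGNATURES` keys are illustrative only.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the messages quoted in this issue.
# The dictionary keys are hypothetical labels, not Woodpecker terminology.
SIGNATURES = {
    "store_unreachable": "failed to setup store",
    "workflow_ack_failed": "queue.Done: cannot ack workflow",
    "log_stream_missing": "cannot close log stream for step",
}


def count_signatures(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each known failure signature."""
    counts = Counter()
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts


# Illustrative lines built from the messages above, not real log output.
SAMPLE_LINES = [
    "failed to setup store: dial tcp 10.43.54.87:5432: connection refused",
    "queue.Done: cannot ack workflow -- sql: no rows in result set",
    "done: cannot close log stream for step N -- stream: not found",
    "queue.Done: cannot ack workflow -- sql: no rows in result set",
]

if __name__ == "__main__":
    for name, count in count_signatures(SAMPLE_LINES).items():
        print(f"{name}: {count}")
```

If the ack and log stream signatures keep appearing after the server has reconnected to the DB, that matches the stale state described in step 6 rather than a transient outage.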

Verified NOT the cause

  • Mac agent routing: Mac agent has filter_labels: "platform=darwin", basketball-api pipelines have no labels. Agent ran on k8s (confirmed by clone using forgejo-http.forgejo.svc.cluster.local)
  • Harbor connectivity: Never reached build-and-push step on retries. Original #145 Harbor timeout was during the server instability window
  • Image pull: postgres:16-alpine starts fine manually in woodpecker namespace
  • Resources: 19% CPU, 18% memory on archbox -- no pressure

Fix needed

Restart Woodpecker DB → server → agent in order, or investigate woodpecker-db-1 for corrupted workflow/step records. The SOP (sop-ci-pipeline-recovery) should be updated with this failure mode.
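
The ordered restart could be sketched as a loop that restarts one component and waits for it to become ready before touching the next. This is an outline under assumptions, not the SOP procedure: `restart_in_order` is a hypothetical helper, and the `restart` and `is_ready` callables stand in for whatever mechanism the cluster actually uses, which this issue does not specify.

```python
import time
from typing import Callable, List, Sequence


def restart_in_order(
    components: Sequence[str],
    restart: Callable[[str], None],
    is_ready: Callable[[str], bool],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Restart each component in sequence, waiting for readiness in between.

    Raises TimeoutError if a component is not ready within timeout_s,
    leaving later components untouched.
    """
    completed = []
    for name in components:
        restart(name)
        deadline = clock() + timeout_s
        while not is_ready(name):
            if clock() >= deadline:
                raise TimeoutError(f"{name} not ready after {timeout_s}s")
            sleep(poll_s)
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder callables; the component names are illustrative labels.
    order = ["woodpecker-db", "woodpecker-server", "woodpecker-agent"]
    done = restart_in_order(
        order,
        restart=lambda name: print(f"restarting {name}"),
        is_ready=lambda name: True,
    )
    print("restarted:", done)
```

Gating each step on readiness is the point of the ordering: the server should only start once the DB accepts connections, and the agent only once the server's gRPC endpoint is up.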


Incident Escalation: Woodpecker DB corruption

Original scope was Harbor connectivity. Root cause is deeper:

What's Actually Broken

  • Woodpecker server DB has corrupted/stale workflow records
  • "sql: no rows in result set" errors on queue.Done
  • Every pipeline restart gets queued but service containers "fail" with empty logs
  • Agent connects but can't ack workflows back to server
  • Caused by restart cascade from Helm rollback during alert triage session

Severity

P2: CI Degraded. All pipelines are broken. Services that need CI builds are blocked.

Related

  • Helm rollback of woodpecker to revision 14 (during this session to unblock tofu apply)
  • PodRestartStorm alert firing on woodpecker namespace
  • PR #185 CI clone failure was a symptom of this

Options

  1. Check Woodpecker DB directly for corrupted records
  2. Restart whole stack in order: DB → server → agent (likely cleanest)
  3. Defer to next session with full investigation