Woodpecker agent label routing — platform-wide pipeline contract #191

Closed
opened 2026-03-27 01:29:13 +00:00 by forgejo_admin · 1 comment

Type

Feature

Lineage

Discovered during the investigation of incident #184. The Mac agent was the root cause of "random" CI failures across all repos.

Repo

forgejo_admin/pal-e-platform (primary — Helm values + this repo's pipeline)
Cross-repo: all repos with .woodpecker.yaml need label updates.

User Story

As a platform operator
I want pipelines routed to capable agents via label contracts
So that adding a second agent (Mac, ARM, etc.) doesn't cause random CI failures across all repos

Context

Woodpecker uses work-stealing scheduling. When the Mac agent (lucass-macbook-air-1, agent ID 3, local backend) registered with no_schedule = false, it began racing the k8s agent for every queued workflow. Every pipeline it won failed because the Mac agent has no Forgejo git credentials, no Harbor access, and no container backend.

Agent-side WOODPECKER_FILTER_LABELS only restricts what an agent accepts. But if pipelines declare no labels: constraint, they match ANY agent. The platform had no routing contract — the k8s agent has no filter labels either.

Additionally, WOODPECKER_CONNECT_RETRY_COUNT=1 makes the k8s agent fragile during pod restart cascades (DB → server → agent ordering).

Immediate mitigation (done 2026-03-27): Mac agent set to no_schedule = true in DB.

File Targets

Files to modify:

  • terraform/main.tf — k8s agent Helm values: add WOODPECKER_FILTER_LABELS, bump CONNECT_RETRY_COUNT
  • .woodpecker.yaml — add labels: { platform: linux } to all workflows

Files NOT to touch:

  • salt/pillar/mac-agent.sls — already has filter_labels: "platform=darwin" (correct)
  • salt/states/mac-agent/ — plist template already renders WOODPECKER_FILTER_LABELS (correct)

Cross-repo follow-ups (separate issues):

  • basketball-api/.woodpecker.yaml
  • pal-e-deployments/.woodpecker.yaml
  • All other repos with Woodpecker pipelines

Acceptance Criteria

  • k8s agent has WOODPECKER_FILTER_LABELS=platform=linux in Helm values
  • k8s agent has WOODPECKER_CONNECT_RETRY_COUNT=10
  • pal-e-platform/.woodpecker.yaml has labels: { platform: linux } on all workflows
  • tofu plan shows only agent env changes (no unrelated drift)
  • Convention note convention-pipeline-labels documents the routing contract
  • Mac agent re-enabled (no_schedule = false) and verified: unlabeled pipelines don't route to it
  • Follow-up issues created for cross-repo pipeline label updates
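
As a rough illustration of the first two criteria, the k8s agent environment in the rendered Helm values might look like the fragment below. This is a sketch only. It assumes the chart exposes agent environment variables as a map under agent.env, and it does not show how terraform/main.tf embeds these values, which this issue does not specify.

```yaml
# Helm values fragment for the k8s agent.
# Assumption: the chart reads agent env vars from a map at agent.env.
agent:
  env:
    # Label contract: this agent advertises platform=linux.
    WOODPECKER_FILTER_LABELS: "platform=linux"
    # Raised from 1 so the agent survives DB -> server -> agent restart ordering.
    WOODPECKER_CONNECT_RETRY_COUNT: "10"
```

Both values are quoted so they stay strings when rendered into environment variables.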

Test Expectations

  • tofu validate passes
  • tofu plan -lock=false shows expected agent env changes only
  • After apply: trigger pipeline on pal-e-platform, verify k8s agent (id=1) handles it
  • After apply: verify Mac agent (id=3) does NOT pick up linux-labeled pipelines
  • Run command: tofu plan -lock=false in terraform/

Constraints

  • Must use tofu not terraform
  • tofu plan must include -lock=false (state lock blocks CI)
  • Salt side is already correct — do not modify salt configs
  • Mac agent re-enablement depends on #174 being far enough along to validate darwin routing
  • Pipeline label syntax: labels: { platform: linux } at the workflow level in .woodpecker.yaml
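
A minimal sketch of the workflow-level labels constraint named in the last item. Only the labels block comes from this issue. The step name, image, and command are hypothetical placeholders, not taken from any pal-e repo.

```yaml
# .woodpecker.yaml (sketch)
# Workflow-level routing constraint from this issue.
labels:
  platform: linux

steps:
  # Hypothetical step, shown only so the file is a complete workflow.
  - name: example
    image: alpine:3.20
    commands:
      - echo "expected to run on an agent labeled platform=linux"
```

The block form above is equivalent to the inline form labels: { platform: linux } used elsewhere in this issue.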

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes
  • Convention note created
  • Cross-repo follow-up issues created

Related

  • #184 — incident that exposed this
  • #174 — Mac build agent Salt setup (next_up on board)
  • #166 — Mac CI agent infrastructure
  • #179 — Woodpecker agent secret duplication
  • project-pal-e-platform
Author
Owner

Scope Review: READY

Review note: review-425-2026-03-26
All file targets verified, template complete, traceability triangle satisfied. Scope is solid — ready to move to next_up.

One sequencing note for the implementation agent: cross-repo .woodpecker.yaml label updates must land before or simultaneously with the k8s agent WOODPECKER_FILTER_LABELS change, otherwise unlabeled pipelines will match no agent.
