Scale Woodpecker agent to 2 replicas and enable auto-cancel #405

Merged
ldraney merged 1 commit from 404-scale-woodpecker-agent-replicas-to-2-and into main 2026-06-04 05:26:20 +00:00
Owner

Summary

Stale branch/PR pipelines were hogging the single Woodpecker agent, blocking prod deploys. This scales the agent deployment to 2 replicas for parallel pipeline execution and enables server-level auto-cancel so new pushes to the same branch automatically kill running pipelines.

Changes

  • terraform/modules/ci/main.tf -- Bumped agent replicaCount from hardcoded 1 to var.woodpecker_agent_replica_count (defaults to 2). Added WOODPECKER_PIPELINE_CANCEL_PREVIOUS=true and WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT=3s to server env vars.
  • terraform/modules/ci/variables.tf -- Added woodpecker_agent_replica_count variable with default of 2 and description. No change needed in root module since the default handles it.

tofu plan Output

Cannot run tofu plan from dev machine -- requires Salt pillar secrets and tofu init with backend config on the server. Expected plan output:

  • helm_release.woodpecker will be updated in-place:
    • Agent replicaCount: 1 -> 2
    • Server env gains WOODPECKER_PIPELINE_CANCEL_PREVIOUS=true
    • Server env gains WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT=3s

Run make tofu-plan on the server to confirm before applying.

Test Plan

  • Run make tofu-plan on the server and verify only the expected Helm release changes appear
  • Run make tofu-apply and confirm both agent pods come up healthy
  • Push twice to the same branch -- first pipeline should auto-cancel
  • Merge a PR -- main pipeline should run without queuing behind stale branch pipelines

Review Checklist

  • tofu fmt -recursive passes
  • Variable has sensible default (2 replicas)
  • No secrets added or modified
  • Server env var is platform-wide -- acceptable for single-developer instance
  • tofu plan reviewed on server before apply
  • Agent resource overhead: each replica requests 50m CPU + 64Mi memory (limits 256Mi). Two replicas total 100m CPU + 128Mi memory -- well within single k3s node headroom.
  • WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT=3s gives a brief grace period before cancelling, avoiding race conditions on rapid successive pushes.

Closes #404

Cross-repo parent: ldraney/landscaping-assistant#62

## Summary Stale branch/PR pipelines were hogging the single Woodpecker agent, blocking prod deploys. This scales the agent deployment to 2 replicas for parallel pipeline execution and enables server-level auto-cancel so new pushes to the same branch automatically kill running pipelines. ## Changes - `terraform/modules/ci/main.tf` -- Bumped agent `replicaCount` from hardcoded `1` to `var.woodpecker_agent_replica_count` (defaults to 2). Added `WOODPECKER_PIPELINE_CANCEL_PREVIOUS=true` and `WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT=3s` to server env vars. - `terraform/modules/ci/variables.tf` -- Added `woodpecker_agent_replica_count` variable with default of 2 and description. No change needed in root module since the default handles it. ## tofu plan Output Cannot run `tofu plan` from dev machine -- requires Salt pillar secrets and `tofu init` with backend config on the server. Expected plan output: - `helm_release.woodpecker` will be updated in-place: - Agent replicaCount: `1` -> `2` - Server env gains `WOODPECKER_PIPELINE_CANCEL_PREVIOUS=true` - Server env gains `WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT=3s` Run `make tofu-plan` on the server to confirm before applying. ## Test Plan - [ ] Run `make tofu-plan` on the server and verify only the expected Helm release changes appear - [ ] Run `make tofu-apply` and confirm both agent pods come up healthy - [ ] Push twice to the same branch -- first pipeline should auto-cancel - [ ] Merge a PR -- main pipeline should run without queuing behind stale branch pipelines ## Review Checklist - [x] `tofu fmt -recursive` passes - [x] Variable has sensible default (2 replicas) - [x] No secrets added or modified - [x] Server env var is platform-wide -- acceptable for single-developer instance - [ ] `tofu plan` reviewed on server before apply ## Related Notes - Agent resource overhead: each replica requests 50m CPU + 64Mi memory (limits 256Mi). Two replicas total 100m CPU + 128Mi memory -- well within single k3s node headroom. - `WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT=3s` gives a brief grace period before cancelling, avoiding race conditions on rapid successive pushes. ## Related Closes #404 Cross-repo parent: ldraney/landscaping-assistant#62
Scale Woodpecker agent to 2 replicas and enable auto-cancel for superseded pipelines
All checks were successful
ci/woodpecker/push/terraform Pipeline was successful
ci/woodpecker/pr/terraform Pipeline was successful
ci/woodpecker/pull_request_closed/terraform Pipeline was successful
2e8b883a5e
Stale branch/PR pipelines were hogging the single agent, blocking prod
deploys. This adds a second agent replica for parallel execution and
enables WOODPECKER_PIPELINE_CANCEL_PREVIOUS so new pushes to the same
branch automatically cancel running pipelines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author
Owner

QA Review

Scope: 2 files, +16/-8 lines. Terraform CI module changes only.

Findings

No issues found. The diff is minimal and correct:

  1. WOODPECKER_PIPELINE_CANCEL_PREVIOUS = "true" -- correct server-level env var name per Woodpecker docs. Enables auto-cancel globally for all repos on the instance.
  2. WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT = "3s" -- sensible grace period to avoid cancelling pipelines during rapid successive pushes (e.g., force-push + immediate follow-up).
  3. Agent replicaCount parameterized via var.woodpecker_agent_replica_count with default = 2 -- clean approach. Root module needs no change since it uses the default.
  4. Variable declaration uses type = number -- correct for replicaCount.
  5. No secrets pipeline impact -- no new entries needed in TF_SECRET_VARS or Salt pillar.
  6. Alignment reformatting of existing env vars is cosmetic but keeps the block consistent with the longer key names.

Notes

  • Platform-wide blast radius acknowledged in PR body. Acceptable for single-developer instance.
  • Resource overhead (2x 50m CPU / 64Mi-256Mi) is negligible on the k3s node.
  • tofu plan must be run on the server before apply -- cannot validate from dev machine.

VERDICT: APPROVE

## QA Review **Scope:** 2 files, +16/-8 lines. Terraform CI module changes only. ### Findings **No issues found.** The diff is minimal and correct: 1. `WOODPECKER_PIPELINE_CANCEL_PREVIOUS = "true"` -- correct server-level env var name per Woodpecker docs. Enables auto-cancel globally for all repos on the instance. 2. `WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT = "3s"` -- sensible grace period to avoid cancelling pipelines during rapid successive pushes (e.g., force-push + immediate follow-up). 3. Agent `replicaCount` parameterized via `var.woodpecker_agent_replica_count` with `default = 2` -- clean approach. Root module needs no change since it uses the default. 4. Variable declaration uses `type = number` -- correct for replicaCount. 5. No secrets pipeline impact -- no new entries needed in `TF_SECRET_VARS` or Salt pillar. 6. Alignment reformatting of existing env vars is cosmetic but keeps the block consistent with the longer key names. ### Notes - Platform-wide blast radius acknowledged in PR body. Acceptable for single-developer instance. - Resource overhead (2x 50m CPU / 64Mi-256Mi) is negligible on the k3s node. - `tofu plan` must be run on the server before apply -- cannot validate from dev machine. VERDICT: APPROVE
Author
Owner

PR #405 Review

DOMAIN REVIEW

Tech stack: Terraform (HCL), Helm values, Kubernetes (Woodpecker CI on k3s).

Variable declaration correctness: The woodpecker_agent_replica_count variable in variables.tf is well-formed -- correct type (number), sensible default (2), descriptive comment. The reference in main.tf as var.woodpecker_agent_replica_count is syntactically correct.

Root module wiring: Since the variable has a default, the root module does not need to pass it explicitly. This is valid. However, if the root module already passes other ci module variables explicitly (which it likely does for secrets), the implicit default here creates an inconsistency in style -- not a blocker, just noted.

Helm values structure: The replicaCount is placed under the agent block and the env vars under server.env, which matches Woodpecker Helm chart conventions.

Auto-cancel scope (WARNING): WOODPECKER_PIPELINE_CANCEL_PREVIOUS=true is a server-level setting that applies to ALL repositories on this Woodpecker instance. The original landscaping-assistant ticket (#62) likely intended per-repo .woodpecker.yml configuration (cancel_previous: true). The tradeoff:

  • Server-level: simpler, no per-repo config needed, but cannot be disabled for specific repos that may need long-running pipelines to complete (e.g., a database migration pipeline you push a fix on top of).
  • Per-repo: more granular, each repo opts in explicitly.

For a single-developer instance with all repos using short CI pipelines, server-level is acceptable. The PR body acknowledges this tradeoff explicitly. If any repo ever needs a pipeline to run to completion regardless of new pushes, this will need to be revisited.

WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT=3s: This is reasonable. The purpose is to avoid cancelling a pipeline that was triggered nearly simultaneously (race condition on rapid pushes). 3 seconds is short enough to not defeat the purpose of auto-cancel, and long enough to handle git push timing races. Woodpecker docs recommend values in the 1-5s range for this.

Resource impact: 2 agent replicas at 50m CPU + 64Mi memory each = 100m CPU + 128Mi total. On a single k3s node this is negligible. No concern here.

BLOCKERS

None. This is a clean, minimal infrastructure change with no security implications.

NITS

  1. Alignment padding: The env var block was re-aligned to accommodate the longer WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT key name. This is fine for readability but will cause a noisy diff (all 7 existing lines changed). Not a problem, just noting it inflates the diff stats.

  2. Variable placement: The new variable is inserted between woodpecker_encryption_key and cnpg_iam_user_name. Grouping by service (all woodpecker vars together) is good -- this placement is appropriate.

  3. Consider documenting the global scope: A brief inline comment in main.tf like # Server-level: applies to ALL repos on this instance next to the cancel vars would help future-you remember the blast radius. Optional.

SOP COMPLIANCE

  • Branch named after issue: 404-scale-woodpecker-agent-replicas-to-2-and (matches issue #404)
  • PR body follows template: Summary, Changes, Test Plan, Related all present
  • Related references cross-repo parent (ldraney/landscaping-assistant#62)
  • No secrets committed (existing secret references unchanged)
  • No scope creep -- changes are limited to the two files described
  • tofu fmt compliance noted in review checklist

PROCESS OBSERVATIONS

  • Deployment risk: Low. Helm release update in-place, agents are stateless. Worst case is a brief period where pipelines queue while pods restart.
  • Rollback path: Simple -- revert the PR or set woodpecker_agent_replica_count = 1 in root module and re-apply. Auto-cancel env vars can be removed similarly.
  • Test plan quality: Good -- includes both the infrastructure verification (tofu plan/apply, pod health) and behavioral verification (push twice, confirm cancellation). The tofu plan step being server-only is a known limitation, not a gap.
  • Documentation: The PR body itself serves as documentation for the resource overhead and design decisions. Well done.

VERDICT: APPROVED

## PR #405 Review ### DOMAIN REVIEW **Tech stack**: Terraform (HCL), Helm values, Kubernetes (Woodpecker CI on k3s). **Variable declaration correctness**: The `woodpecker_agent_replica_count` variable in `variables.tf` is well-formed -- correct type (`number`), sensible default (2), descriptive comment. The reference in `main.tf` as `var.woodpecker_agent_replica_count` is syntactically correct. **Root module wiring**: Since the variable has a default, the root module does not need to pass it explicitly. This is valid. However, if the root module already passes other `ci` module variables explicitly (which it likely does for secrets), the implicit default here creates an inconsistency in style -- not a blocker, just noted. **Helm values structure**: The `replicaCount` is placed under the `agent` block and the env vars under `server.env`, which matches Woodpecker Helm chart conventions. **Auto-cancel scope (WARNING)**: `WOODPECKER_PIPELINE_CANCEL_PREVIOUS=true` is a **server-level setting** that applies to ALL repositories on this Woodpecker instance. The original landscaping-assistant ticket (#62) likely intended per-repo `.woodpecker.yml` configuration (`cancel_previous: true`). The tradeoff: - Server-level: simpler, no per-repo config needed, but cannot be disabled for specific repos that may need long-running pipelines to complete (e.g., a database migration pipeline you push a fix on top of). - Per-repo: more granular, each repo opts in explicitly. For a single-developer instance with all repos using short CI pipelines, server-level is acceptable. The PR body acknowledges this tradeoff explicitly. If any repo ever needs a pipeline to run to completion regardless of new pushes, this will need to be revisited. **WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT=3s**: This is reasonable. The purpose is to avoid cancelling a pipeline that was triggered nearly simultaneously (race condition on rapid pushes). 3 seconds is short enough to not defeat the purpose of auto-cancel, and long enough to handle git push timing races. Woodpecker docs recommend values in the 1-5s range for this. **Resource impact**: 2 agent replicas at 50m CPU + 64Mi memory each = 100m CPU + 128Mi total. On a single k3s node this is negligible. No concern here. ### BLOCKERS None. This is a clean, minimal infrastructure change with no security implications. ### NITS 1. **Alignment padding**: The env var block was re-aligned to accommodate the longer `WOODPECKER_PIPELINE_CANCEL_PREVIOUS_WAIT` key name. This is fine for readability but will cause a noisy diff (all 7 existing lines changed). Not a problem, just noting it inflates the diff stats. 2. **Variable placement**: The new variable is inserted between `woodpecker_encryption_key` and `cnpg_iam_user_name`. Grouping by service (all woodpecker vars together) is good -- this placement is appropriate. 3. **Consider documenting the global scope**: A brief inline comment in `main.tf` like `# Server-level: applies to ALL repos on this instance` next to the cancel vars would help future-you remember the blast radius. Optional. ### SOP COMPLIANCE - [x] Branch named after issue: `404-scale-woodpecker-agent-replicas-to-2-and` (matches issue #404) - [x] PR body follows template: Summary, Changes, Test Plan, Related all present - [x] Related references cross-repo parent (ldraney/landscaping-assistant#62) - [x] No secrets committed (existing secret references unchanged) - [x] No scope creep -- changes are limited to the two files described - [x] `tofu fmt` compliance noted in review checklist ### PROCESS OBSERVATIONS - **Deployment risk**: Low. Helm release update in-place, agents are stateless. Worst case is a brief period where pipelines queue while pods restart. - **Rollback path**: Simple -- revert the PR or set `woodpecker_agent_replica_count = 1` in root module and re-apply. Auto-cancel env vars can be removed similarly. - **Test plan quality**: Good -- includes both the infrastructure verification (tofu plan/apply, pod health) and behavioral verification (push twice, confirm cancellation). The `tofu plan` step being server-only is a known limitation, not a gap. - **Documentation**: The PR body itself serves as documentation for the resource overhead and design decisions. Well done. ### VERDICT: APPROVED
ldraney deleted branch 404-scale-woodpecker-agent-replicas-to-2-and 2026-06-04 05:26:20 +00:00
Sign in to join this conversation.
No description provided.