feat: add lock-aware retry to CI apply step #98

Closed
opened 2026-03-17 04:00:31 +00:00 by forgejo_admin · 0 comments

Lineage

phase-platform-17b-tf-state-governance -> Phase 17b.1 (CI Lock Recovery)

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want the CI apply step to automatically detect and recover from stale state locks
So that a crashed apply does not block all deployments on main

Context

Every merge to main triggers tofu apply. If a previous apply crashed and left a state lock, the pipeline fails and blocks ALL deployments. This happened with pipeline #80. Currently the only recovery is manual intervention: SSH into the cluster, extract the lock ID from the error output, and run tofu force-unlock. This is a DORA MTTR problem -- stale locks should be auto-recovered.

File Targets

Files the agent should modify:

  • .woodpecker.yaml -- replace the single tofu apply command in the apply step with a lock-aware retry script

Files the agent should NOT touch:

  • terraform/ -- no .tf file changes needed
  • salt/ -- no Salt changes needed

Acceptance Criteria

  • When tofu apply fails with "the state is already locked", the script extracts the lock ID, runs tofu force-unlock -force, and retries apply once
  • When tofu apply fails with any other error, the step fails normally with the original exit code
  • When tofu apply succeeds on first attempt, the step succeeds normally
  • The script is POSIX sh compatible (Alpine/BusyBox -- no bash, no grep -P, no PIPESTATUS)

Test Expectations

  • YAML validation: .woodpecker.yaml parses as valid YAML after edit
  • Shell compatibility: script uses only POSIX sh constructs (no bashisms)
  • No .tf files changed, so tofu fmt -check -recursive should still pass

Constraints

  • The apply step image is ghcr.io/opentofu/opentofu:1.9 (Alpine-based, sh not bash)
  • BusyBox grep does not support -P (Perl regex) -- use sed for extraction
  • $? after a pipe returns the exit code of the last command in the pipeline (tee), not that of tofu -- capture tofu's exit code with a subshell + temp file approach
  • Must preserve existing step structure (environment vars, when conditions, etc.)
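The two shell constraints above (no `grep -P`, and `$?` reflecting `tee` rather than `tofu`) can be illustrated with a pair of small helpers. This is a sketch under assumptions: the names `run_teed`, `extract_lock_id`, and `RUN_LOG` are hypothetical, and the `ID:` line format is assumed from typical lock error output rather than confirmed against pipeline #80.

```shell
#!/bin/sh
# run_teed CMD [ARGS...]: stream CMD's combined output live, keep a copy in
# the file named by $RUN_LOG, and return CMD's own exit status, not tee's.
run_teed() {
    RUN_LOG=$(mktemp)
    rc_file=$(mktemp)
    # The brace group is the left side of a pipe, so it runs in a subshell
    # and its variables are lost; the status is passed out via a temp file.
    { rc=0; "$@" 2>&1 || rc=$?; echo "$rc" >"$rc_file"; } | tee "$RUN_LOG"
    run_rc=$(cat "$rc_file")
    rm -f "$rc_file"
    return "$run_rc"
}

# extract_lock_id FILE: print the first "ID:" value found in FILE.
# Uses only basic sed regex syntax, which BusyBox sed supports.
extract_lock_id() {
    sed -n 's/^[[:space:]]*ID:[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$1" | head -n 1
}

# Possible usage inside the apply step (flags are assumptions):
#   rc=0
#   run_teed tofu apply -auto-approve -no-color || rc=$?
#   lock_id=$(extract_lock_id "$RUN_LOG")
```

Writing the status to a temp file avoids `PIPESTATUS` and `set -o pipefail`, neither of which is specified by POSIX sh.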

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • pal-e-platform -- project this affects
  • phase-platform-17b-tf-state-governance -- parent phase