Bug: Woodpecker agent secret drift — 3 conflicting values across tfvars/k8s/statefulset #137

Closed
opened 2026-03-21 17:40:28 +00:00 by forgejo_admin · 3 comments

Type

Bug

Lineage

plan-pal-e-platform — standalone, discovered during operations

Repo

forgejo_admin/pal-e-platform

What Broke

woodpecker_agent_secret is stored in four locations, and they do not all hold the same value. A tofu apply will overwrite the live value, potentially breaking webhook auth (same failure pattern as the March 14 incident).

| Location | Value prefix | Notes |
|----------|--------------|-------|
| secrets.auto.tfvars | 3e053aaa... | Terraform source of truth |
| Live statefulset env | 3e053aaa... | Currently active in pods |
| k8s woodpecker-default-agent-secret | 8ABReJ9K... | Helm-managed k8s Secret; DIFFERENT |
| Woodpecker CI secret store | (via from_secret: tf_var_woodpecker_agent_secret) | Used during pipeline runs; value must be verified |

Note: The k8s Secret woodpecker-default-agent-secret is NOT managed by Terraform directly — the Helm chart creates it internally. The drift may be between the Helm-managed secret and the statefulset env var injected via set_sensitive.

Repro Steps

  1. Compare woodpecker_agent_secret across all four locations: secrets.auto.tfvars, live statefulset env (kubectl exec ... printenv), k8s Secret (kubectl get secret), and Woodpecker CI secret store
  2. At minimum, k8s Secret differs from tfvars and statefulset env
  3. A tofu apply will overwrite live with tfvars value — if the active value differs, this breaks agent auth
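
One way to run the comparison in step 1 without printing plaintext secrets is to compare short SHA-256 fingerprints instead. The following is only a sketch under assumptions: `fingerprint` and `report_drift` are hypothetical helper names (not part of the repo), `sha256sum` from GNU coreutils is assumed to be installed, and the commented usage lines assume the tfvars entry has the form `woodpecker_agent_secret = "..."`.

```shell
# fingerprint VALUE -> first 12 hex characters of sha256(VALUE)
fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c1-12
}

# report_drift NAME=VALUE... -> prints "NAME fingerprint" per location;
# returns 1 if any fingerprint differs from the first one, 0 otherwise.
report_drift() {
  local first status pair name fp
  first=""
  status=0
  for pair in "$@"; do
    name=${pair%%=*}                  # text before the first '='
    fp=$(fingerprint "${pair#*=}")    # hash of the text after the first '='
    printf '%-14s %s\n' "$name" "$fp"
    if [ -z "$first" ]; then
      first=$fp
    elif [ "$fp" != "$first" ]; then
      status=1
    fi
  done
  return "$status"
}

# Hypothetical usage (not executed here; tfvars line format is an assumption):
#   tfvars=$(sed -n 's/^woodpecker_agent_secret *= *"\(.*\)"$/\1/p' terraform/secrets.auto.tfvars)
#   pod=$(kubectl -n woodpecker exec woodpecker-agent-0 -- printenv WOODPECKER_AGENT_SECRET)
#   report_drift "tfvars=$tfvars" "agent-pod=$pod" || echo "drift detected"
```

Comparing fingerprints rather than raw values keeps the secret out of terminal scrollback and CI logs while still showing which locations disagree.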

Expected Behavior

All four locations should hold the same value. The single source of truth should be secrets.auto.tfvars, propagated through Terraform to the k8s Secret and consumed by the statefulset. The Woodpecker CI secret store must also match.

Environment

  • Cluster/namespace: prod / woodpecker
  • Related alerts: potential webhook auth failure on next apply
  • Previous incident: March 14 — same drift pattern caused outage

File Targets

  • terraform/secrets.auto.tfvars — line 12, woodpecker_agent_secret value
  • terraform/variables.tf — line 157, variable definition (marked sensitive)
  • terraform/main.tf — lines 773-783, set_sensitive blocks injecting value into server + agent env
  • .woodpecker.yaml — lines 70-71 and 159-160, from_secret: tf_var_woodpecker_agent_secret
  • Woodpecker CI secret store — tf_var_woodpecker_agent_secret (via Woodpecker API or UI)
  • k8s Secret woodpecker-default-agent-secret in woodpecker namespace (Helm-managed, not directly in Terraform)

Acceptance Criteria

  • Identify which value is currently active and working
  • Reconcile all four locations to a single value (tfvars, k8s Secret, statefulset env, Woodpecker CI secret store)
  • Verify Woodpecker agent auth works after reconciliation

Rotation SOP deferred to separate issue. Phase 17a-9 has a pending SOP (sop-woodpecker-db-migration) that was also deferred. Secret rotation documentation should be tracked independently.

Test Expectations

  • Verify all 4 locations have matching values after reconciliation:
      • tfvars: grep woodpecker_agent_secret terraform/secrets.auto.tfvars
      • k8s Secret: kubectl get secret woodpecker-default-agent-secret -n woodpecker -o jsonpath='{.data.*}' | base64 -d
      • Agent pod: kubectl exec into the agent pod and check printenv WOODPECKER_AGENT_SECRET
      • Woodpecker CI secret store: check via API
  • Trigger a test pipeline and verify it completes successfully (agent connects, build runs)
  • Check Woodpecker server logs for agent auth errors after reconciliation
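
The k8s Secret check above pipes `jsonpath='{.data.*}'` straight into `base64 -d`. That is fine when the Secret holds a single key, but if it holds several, kubectl emits the encoded values separated by spaces and the combined stream may not decode cleanly. A small sketch of a workaround, assuming nothing about how many keys the Helm chart actually writes; `decode_each` is a hypothetical helper name:

```shell
# decode_each: read whitespace-separated base64 tokens on stdin and
# print each decoded value on its own line.
decode_each() {
  local token
  for token in $(cat); do
    printf '%s' "$token" | base64 -d
    printf '\n'
  done
}

# Hypothetical usage (not executed here):
#   kubectl get secret woodpecker-default-agent-secret -n woodpecker \
#     -o jsonpath='{.data.*}' | decode_each
```

Note that this prints plaintext values, so it should only be run in a session where that is acceptable.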

Constraints

  • 17 other secrets flow through the same tfvars-to-Woodpecker-to-Helm pipeline. If this drift happened to woodpecker_agent_secret, it could happen to woodpecker_encryption_key, woodpecker_db_password, or any other secret. Reconciliation must not disturb those 17 other secrets.
  • Wrong reconciliation direction = platform-wide outage. Overwriting the live active value with a stale one breaks ALL Woodpecker CI pipelines across every repo. Must verify which value is actually active before choosing reconciliation direction.
  • The Helm chart manages the k8s Secret internally — a tofu apply may or may not reconcile it depending on how the chart handles existing secrets.
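
To check the first constraint, one option is to snapshot a fingerprint of every secret before and after reconciliation and confirm that only the intended one changed. A minimal sketch, assuming each snapshot is a text file with lines of the form `name fingerprint`; the helper name `changed_keys` and the file layout are hypothetical, and how the snapshots are produced is left open:

```shell
# changed_keys BEFORE_FILE AFTER_FILE -> names present in both files
# whose fingerprint differs between the two snapshots.
changed_keys() {
  awk 'NR == FNR { before[$1] = $2; next }
       ($1 in before) && before[$1] != $2 { print $1 }' "$1" "$2"
}

# Hypothetical usage (not executed here):
#   changed_keys before.txt after.txt
# Expect at most woodpecker_agent_secret in the output; any other name
# means the reconciliation touched a secret it should not have.
```

Names that appear in only one snapshot are ignored by this sketch, so added or removed secrets would need a separate check.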

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • Same failure pattern as March 14 incident
  • Blocks safe tofu apply — must resolve before any merge triggers apply-on-main
  • Phase 17a (Woodpecker Secrets Hardening) — thought to be resolved
  • Board item #188 (Issue #109: Platform cleanup) — parent cleanup umbrella

Scope Review: NEEDS_REFINEMENT

Review note: review-256-2026-03-22

Ticket describes the problem well but has gaps that would confuse an executing agent.

  • Stale drift table: secrets.auto.tfvars currently has prefix 3e053aaa (matches statefulset), not 597ea9dc as listed. The 597ea9dc value does not exist anywhere in the repo.
  • Missing 4th location: Woodpecker CI secret store (from_secret: tf_var_woodpecker_agent_secret) is a distinct location not listed in the table.
  • No File Targets section: Agent needs explicit file paths and the reconciliation mechanism (tofu apply vs manual).
  • No Test Expectations: No verification commands for kubectl, pipeline test, or agent connection check.
  • AC #4 (rotation SOP) should be split: Phase 17a-9 already deferred sop-woodpecker-db-migration. This is a separate deliverable, not part of a secret reconciliation bug fix.

Scoping Notes (2026-03-24)

Context: This is likely WHY the Helm release got stuck. A previous tofu apply tried to change the agent secret via Helm upgrade, something went wrong mid-upgrade (timeout/OOM), and the release got stuck in pending-upgrade state.

Where the secret is used in terraform:

  • terraform/main.tf:773-774 sets server.env.WOODPECKER_AGENT_SECRET (set_sensitive)
  • terraform/main.tf:779-780 sets agent.env.WOODPECKER_AGENT_SECRET (set_sensitive)
  • Both use var.woodpecker_agent_secret from Salt pillar

The 3 values to reconcile:

  1. secrets.auto.tfvars — what Salt pillar says (canonical source)
  2. k8s secret (Helm-managed) — what Helm last successfully deployed
  3. Running pod env — what the Woodpecker server/agent are actually using

Fix procedure (after Helm unstick):

  1. helm -n woodpecker get values woodpecker -o yaml | grep AGENT_SECRET — see what Helm thinks
  2. Check Salt pillar value: make tofu-secrets && grep agent_secret terraform/secrets.auto.tfvars
  3. tofu plan -lock=false — see what tofu wants to change
  4. If values align: just apply. If not: decide which is canonical and update the others.
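
Steps 3 and 4 can be made less of a judgment call by letting the plan's exit status decide. With `-detailed-exitcode`, `tofu plan` exits 0 when there is nothing to change, 2 when changes are pending, and 1 on error. The sketch below only maps those codes to a next step; `interpret_plan_exit` is a hypothetical helper name and the messages are illustrative:

```shell
# interpret_plan_exit CODE -> prints the suggested next step for a
# `tofu plan -detailed-exitcode` exit status; returns 1 on plan failure.
interpret_plan_exit() {
  case "$1" in
    0) echo "no changes: values already aligned" ;;
    2) echo "changes pending: review the diff before applying" ;;
    *) echo "plan failed: fix the error before doing anything else"
       return 1 ;;
  esac
}

# Hypothetical usage from the terraform/ directory (not executed here):
#   tofu plan -lock=false -detailed-exitcode
#   interpret_plan_exit $?
```

An exit status of 2 does not say which direction is correct; it only signals that the canonical value still has to be chosen before applying.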

Pairs well with #86 (rotate Woodpecker API token) — both are secret reconciliation tasks.


Resolved (2026-03-24)

Verified all three values are aligned:

  • Salt pillar (secrets.auto.tfvars): 597ea9dc...5432
  • Running server pod (woodpecker-server-0): same
  • Running agent pod (woodpecker-agent-0): same
  • tofu plan: NO changes to helm_release.woodpecker

The drift was reconciled by a previous partial apply (pipeline #260). No further action needed.
