Terraform state splitting — modularize monolith main.tf for isolated applies #197

Closed
opened 2026-03-27 03:59:54 +00:00 by forgejo_admin · 4 comments

Type

Feature

Lineage

Discovered during incident #184 session. #196 documents the symptom (MinIO blocking Helm applies). This ticket is the permanent fix.

Repo

forgejo_admin/pal-e-platform

User Story

Story 1: Isolated deploys
As a platform operator
I want Terraform changes scoped to one service to only refresh that service's state
So that a MinIO hiccup doesn't block a Woodpecker Helm change (or vice versa)

Story 2: Blast radius containment
As a platform operator
I want each infrastructure module to have its own apply path
So that a bad apply in one module doesn't corrupt or block unrelated modules

Context

terraform/main.tf is a ~2200-line monolith with one state file. Every tofu apply refreshes ALL resources across 4 providers (kubernetes, helm, tailscale, minio). When any provider has connectivity issues, the entire apply fails — even if the change only touches one Helm release.

This caused 3 consecutive apply failures for PR #195 (a 1-line Woodpecker env var change) because the MinIO provider couldn't reach minio.minio.svc.cluster.local during the refresh phase.

Architecture

Current (monolith):

main.tf (~2200 lines, 1 state, 4 providers)
  → every apply refreshes everything
  → every provider is a single point of failure (SPOF) for every change

Target (modular, Option B first → Option A later):

terraform/
├── modules/
│   ├── ci/           ← Woodpecker Helm, agent config, gRPC funnel
│   ├── storage/      ← MinIO buckets, IAM users, policies
│   ├── monitoring/   ← kube-prometheus-stack, Grafana, alert rules
│   ├── forgejo/      ← Forgejo Helm, OAuth config
│   ├── keycloak/     ← Keycloak Helm
│   ├── harbor/       ← Harbor Helm, projects, registry config
│   ├── networking/   ← Tailscale funnels, subnet routers
│   ├── database/     ← CNPG clusters, backups, secrets
│   └── ops/          ← DORA exporter, Ollama/NVIDIA, embedding worker,
│                       tf-state-backup CronJob + MinIO IAM, cnpg-backup-verify,
│                       paledocs_db_url secret, assets bucket + policy
├── main.tf           ← thin orchestrator (module calls + shared resources)
├── variables.tf
├── outputs.tf        ← resource addresses change to module outputs
├── providers.tf      ← each module also declares required_providers
├── versions.tf       ← version constraints shared or per-module
└── network-policies.tf ← namespace references move with their modules

Migration path (each step independently valuable):

  1. Option B (this ticket): Wrap resource groups in module {} blocks, same state. CI uses -target=module.X for scoped applies.
  2. Option A (future ticket): Split state per module using tofu state mv. Each module gets its own backend. Full blast radius isolation.
  3. CI targeting (sub-ticket #198): Update .woodpecker.yaml with path-based module detection and targeted apply. Deferred from this ticket due to non-trivial CI pipeline complexity (328 lines, kubeconfig, 15+ secrets, lock retry).
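
As a rough illustration of step 1, the root configuration might end up looking something like the sketch below. The resource name `helm_release.woodpecker` and the variable `woodpecker_agent_secret` are hypothetical stand-ins, not taken from the actual main.tf.

```hcl
# terraform/main.tf (root): thin orchestrator after the refactor.
# All resource and variable names here are hypothetical.

variable "woodpecker_agent_secret" {
  type      = string
  sensitive = true
}

module "ci" {
  source = "./modules/ci"

  # Values still come from the existing tfvars files; the root only passes them through.
  woodpecker_agent_secret = var.woodpecker_agent_secret
}

# Records the address change so the existing state entry is re-addressed
# instead of being planned as a destroy at the old address and a create at the new one.
moved {
  from = helm_release.woodpecker
  to   = module.ci.helm_release.woodpecker
}
```

With one `moved` block per relocated resource, a plan after the refactor should report the moves and no additions or destructions, which is what the acceptance criteria check for.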

File Targets

Files to create:

  • terraform/modules/ci/main.tf — Woodpecker Helm release, agent env, gRPC funnel, namespace
  • terraform/modules/ci/variables.tf + outputs.tf
  • terraform/modules/storage/main.tf — MinIO buckets, IAM users, policies
  • terraform/modules/storage/variables.tf + outputs.tf
  • terraform/modules/monitoring/main.tf — kube-prometheus-stack Helm, alert rules, PodMonitors
  • terraform/modules/monitoring/variables.tf + outputs.tf
  • terraform/modules/forgejo/main.tf — Forgejo Helm, OAuth
  • terraform/modules/forgejo/variables.tf + outputs.tf
  • terraform/modules/keycloak/main.tf — Keycloak Helm
  • terraform/modules/keycloak/variables.tf + outputs.tf
  • terraform/modules/harbor/main.tf — Harbor Helm, projects
  • terraform/modules/harbor/variables.tf + outputs.tf
  • terraform/modules/networking/main.tf — Tailscale funnels, subnet router
  • terraform/modules/networking/variables.tf + outputs.tf
  • terraform/modules/database/main.tf — CNPG clusters, backups, secrets
  • terraform/modules/database/variables.tf + outputs.tf
  • terraform/modules/ops/main.tf — DORA exporter, Ollama, embedding, backup jobs, misc
  • terraform/modules/ops/variables.tf + outputs.tf

Files to modify:

  • terraform/main.tf — replace inline resources with module calls + moved {} blocks
  • terraform/variables.tf — pass vars through to modules
  • terraform/outputs.tf — all 11 outputs re-pointed to module resource addresses
  • terraform/providers.tf — each module declares own required_providers
  • terraform/versions.tf — version constraints shared or per-module
  • terraform/network-policies.tf — 9 policies referencing namespace resources that move to modules
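
A minimal sketch of what re-pointing one output could look like, assuming a hypothetical `kubernetes_namespace_v1.woodpecker` resource inside the ci module and hypothetical output names. Root-level expressions can only read a child module's declared outputs, not the resources inside it, so each re-pointed root output needs a matching output in the module.

```hcl
# terraform/modules/ci/outputs.tf
# Exposes the namespace name so callers never reach into the module's resources.
output "namespace" {
  description = "Namespace the Woodpecker release is installed into."
  value       = kubernetes_namespace_v1.woodpecker.metadata[0].name
}

# terraform/outputs.tf (root)
# Before the refactor this would have read the resource attribute directly;
# afterwards it reads the module output.
output "woodpecker_namespace" {
  value = module.ci.namespace
}
```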

Files NOT to touch:

  • salt/ — Salt is a separate IaC pillar
  • terraform/k3s.tfvars / secrets.auto.tfvars — vars unchanged
  • .woodpecker.yaml — CI targeting deferred to sub-ticket #198

Acceptance Criteria

  • 9 modules created (ci, storage, monitoring, forgejo, keycloak, harbor, networking, database, ops)
  • Each module is self-contained with main.tf, variables.tf, outputs.tf
  • All 81 moved {} blocks enumerated and verified — tofu plan shows 0 add/0 destroy
  • tofu plan -target=module.ci only refreshes CI resources (no MinIO, no Tailscale)
  • tofu plan -target=module.storage only refreshes MinIO resources
  • tofu apply -target=module.ci succeeds even when MinIO is unreachable
  • Full tofu apply (no target) still works for complete reconciliation
  • Network policies continue to reference correct namespaces post-migration
  • All 11 outputs resolve to correct module resource addresses
  • tofu validate passes after restructure
  • tofu plan shows 0 changes after migration (state preserved, no resources destroyed/recreated)

Test Expectations

  • tofu validate passes
  • tofu plan -lock=false shows 0 changes after migration (refactor, not rebuild)
  • tofu plan -target=module.ci -lock=false completes without MinIO errors
  • Merge a CI-only PR, verify apply succeeds without MinIO refresh
  • Run command: tofu plan -lock=false -target=module.ci in terraform/

Constraints

  • Migration must be zero-downtime — no resources destroyed/recreated
  • Use moved {} blocks to tell Tofu about resource address changes (81 blocks required)
  • Agent must generate the full moved {} manifest before any resource relocation
  • Start with Option B (modules in same state) — Option A (separate state) is a future ticket
  • CI targeting deferred to sub-ticket #198
  • tofu plan must include -lock=false
  • Each module must declare its own required_providers block
  • Cross-module references use module outputs, not direct resource references
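
The last two constraints could be sketched as follows. The provider source `tailscale/tailscale`, the output `cluster_name`, and the variable `cnpg_cluster_name` are assumptions for illustration; the sources actually pinned in providers.tf and versions.tf should be used instead.

```hcl
# terraform/modules/networking/versions.tf
# A module that uses a non-hashicorp provider needs its own required_providers entry;
# without it, the provider source is assumed to live under the hashicorp/ namespace.
terraform {
  required_providers {
    kubernetes = {
      source = "hashicorp/kubernetes"
    }
    tailscale = {
      source = "tailscale/tailscale" # assumed source address
    }
  }
}

# terraform/main.tf (root): cross-module wiring through outputs only.
module "database" {
  source = "./modules/database"
}

module "ops" {
  source = "./modules/ops"

  # Hypothetical variable fed from a hypothetical output of the database module.
  cnpg_cluster_name = module.database.cluster_name
}
```

One design consequence worth keeping in mind: `-target` also pulls in whatever the targeted module depends on, so every cross-module reference like the one above widens a targeted plan. Keeping such references few keeps scoped applies narrow.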

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes
  • No resources destroyed during migration
  • Sub-ticket #198 created for CI pipeline targeting

Related

  • #196 — symptom this fixes (MinIO blocking Helm applies)
  • #194 — blocked by #196 (MAX_WORKFLOWS apply failing)
  • #198 — sub-ticket: CI pipeline targeted apply (deferred)
  • #191 — agent label routing (same Helm values block, moves to modules/ci/)
  • project-pal-e-platform
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-436-2026-03-26
Well-structured issue with complete template, full traceability, and sound architecture -- but 4 files and ~25 resources are missing from scope.

  • network-policies.tf (228 lines, 9 policies) not in file targets -- references namespace resources that will move to modules
  • outputs.tf (11 outputs), providers.tf, versions.tf not in file targets -- outputs must be rewritten for module references
  • ~25 orphaned resources don't map to the 8 proposed modules (DORA exporter, Ollama/NVIDIA, embedding metrics, paledocs secret, tf-state backup, cnpg backup verify, assets bucket) -- need a 9th module or explicit root placement
  • CI detection logic is pseudocode only -- actual .woodpecker.yaml change is non-trivial (15+ secrets, kubeconfig setup, lock retry)
  • 81 moved blocks needed (72 main.tf + 9 network-policies.tf) -- high risk for zero-downtime constraint
  • Minor: issue says "5 providers" but providers.tf has 4
Author
Owner

Refinement Update

Per review (review-436-2026-03-26):

Fix 1: Missing file targets

Added to scope:

  • terraform/network-policies.tf (228 lines, 9 policies) — namespace references move with their modules
  • terraform/outputs.tf (11 outputs) — resource addresses change to module outputs
  • terraform/providers.tf — each module declares required_providers
  • terraform/versions.tf — version constraints shared or per-module

Fix 2: Orphaned resources → 9th "ops" module

~25 resources don't fit the 8 service modules. Adding modules/ops/:

  • DORA exporter (4 resources)
  • Ollama/NVIDIA (3 + namespace)
  • Embedding worker metrics (2)
  • tf-state-backup CronJob + MinIO IAM (8)
  • cnpg-backup-verify CronJob
  • paledocs_db_url secret
  • assets bucket + policy

Fix 3: CI pipeline change → sub-ticket

The .woodpecker.yaml targeted apply logic is non-trivial (328 lines, kubeconfig setup, 15+ secrets, lock retry). Split into a follow-up ticket rather than bundling with the module restructure. This ticket delivers the modules; CI targeting comes after.

Fix 4: 81 moved blocks

Added acceptance criterion: "All moved blocks enumerated and verified — tofu plan shows 0 add/destroy after migration." Agent executing this ticket must generate the full moved {} manifest before any resource relocation.
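
A manifest-first approach might look something like the sketch below: all `moved` blocks are written and reviewed in one place before any resource block is relocated. The file name `moved.tf` and every address are hypothetical; the real list would be enumerated from `tofu state list`. The last entry assumes each network policy moves with its namespace's module, as the 72 + 9 count in the review suggests.

```hcl
# terraform/moved.tf (hypothetical file name)
# One block per relocated resource; the review counts 81 in total.

# Resources currently in main.tf (72 of the 81)
moved {
  from = helm_release.kube_prometheus_stack
  to   = module.monitoring.helm_release.kube_prometheus_stack
}

moved {
  from = kubernetes_secret_v1.paledocs_db_url
  to   = module.ops.kubernetes_secret_v1.paledocs_db_url
}

# Resources currently in network-policies.tf (9 of the 81)
moved {
  from = kubernetes_network_policy_v1.woodpecker_default_deny
  to   = module.ci.kubernetes_network_policy_v1.woodpecker_default_deny
}
```

Comparing the number of `moved` blocks against the number of addresses in the pre-migration state listing gives a quick completeness check before running the plan.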

Fix 5: Missing acceptance criteria

Added:

  • Network policies continue to reference correct namespaces post-migration
  • All 11 outputs resolve to correct module resource addresses
  • tofu plan shows 0 add, 0 destroy after restructure

Fix 6: Provider count

Corrected: 4 providers (kubernetes, helm, tailscale, minio), not 5.

Author
Owner

Scope Re-Review: NEEDS_REFINEMENT

Review note: review-436-2026-03-26 (updated)

The refinement comment correctly addresses all 6 original concerns -- technically sound across the board. However, the issue body (the spec an executing agent reads) was not edited. All fixes live only in the comment.

Two actions remain before READY:

  1. Edit the issue body to incorporate all 6 fixes:
    • Add network-policies.tf, outputs.tf, providers.tf, versions.tf to File Targets
    • Add modules/ops/ to architecture diagram and File Targets (9th module for ~25 orphaned resources)
    • Remove .woodpecker.yaml from File Targets (CI split to sub-ticket)
    • Remove "CI pipeline detects changed module and runs targeted apply" from ACs
    • Add 3 new ACs: namespace refs, output addresses, moved-block manifest
    • Fix "5 providers" → "4 providers" in Context and Architecture
  2. Create the CI targeting sub-ticket -- refinement says "split into a follow-up ticket" but no Forgejo issue exists yet

VERDICT: NEEDS_REFINEMENT -- refinement content is correct, just needs to be applied to the issue body and the sub-ticket created.

Author
Owner

Scope Review: READY

Review note: review-436-2026-03-26

All 6 refinement items from the previous NEEDS_REFINEMENT review have been incorporated into the issue body. Sub-ticket #198 created and tracked on board (#437).

Refinement resolution:

| # | Concern | Status |
|---|---------|--------|
| 1 | 4 missing file targets | All 4 added to "Files to modify" |
| 2 | Orphaned resources unmapped | 9th "ops" module added throughout |
| 3 | CI pipeline deferred | .woodpecker.yaml in "NOT to touch", #198 created |
| 4 | 81 moved blocks risk | AC + constraint for manifest-first approach |
| 5 | Missing ACs for net-policies/outputs | 3 new ACs added |
| 6 | Provider count 5 vs 4 | Corrected to 4 in Context + Architecture |

VERDICT: READY -- this ticket can move from todo to next_up.
