Fix westside-app prod: Harbor robot account scope mismatch (4 alerts) #110

Closed
opened 2026-03-18 17:03:23 +00:00 by forgejo_admin · 4 comments

Type

Bug

Lineage

plan-pal-e-platform → Platform Hardening → Harbor robot account scoping fix

Repo

forgejo_admin/pal-e-services

What Broke

westside-app prod has been down ~36 hours. Pod is in ImagePullBackOff with 401 Unauthorized. The image exists in Harbor project westside-app, but the robot account is scoped to Harbor project westsidekingsandqueens (derived from the service key in k3s.tfvars).

Root cause: services.tf creates Harbor projects named each.key and scopes robot accounts to that project. But image_repo can point to a different Harbor project (e.g., westside-app/app → project westside-app, not westsidekingsandqueens). The naming convention is implicit, not enforced in code.

Affected services:

  • westsidekingsandqueens: key=westsidekingsandqueens, image_repo=westside-app/app → BROKEN (401)
  • mcd-tracker-app: key=mcd-tracker-app, image_repo=mcd-tracker/app → LATENT (will break on next CI deploy)
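
A minimal sketch of how this mismatch could be audited, written in Python purely for illustration. The two entries are the ones listed above; the names `harbor_project` and `find_mismatches` are hypothetical and do not exist in the repo. It takes the first path component of each image_repo and reports every service key that differs from it.

```python
# Service key -> image_repo, as listed in this ticket (other services omitted).
SERVICES = {
    "westsidekingsandqueens": "westside-app/app",
    "mcd-tracker-app": "mcd-tracker/app",
}


def harbor_project(image_repo: str) -> str:
    """Return the Harbor project component of an image_repo like 'westside-app/app'."""
    return image_repo.split("/")[0]


def find_mismatches(services: dict) -> dict:
    """Map each mismatched service key to the project its image_repo points at."""
    return {
        key: harbor_project(repo)
        for key, repo in services.items()
        if harbor_project(repo) != key
    }


if __name__ == "__main__":
    for key, project in find_mismatches(SERVICES).items():
        print(f"{key}: robot scoped to '{key}', image lives in '{project}'")
```

Run against the real service map, a check like this would surface latent cases such as mcd-tracker-app before they cause an outage.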

Repro Steps

  1. kubectl get pods -n westsidekingsandqueens → ImagePullBackOff
  2. kubectl describe pod -n westsidekingsandqueens <pod> → 401 Unauthorized
  3. skopeo inspect --creds 'robot$westsidekingsandqueens-pull:<secret>' docker://harbor.tail5b443a.ts.net/westside-app/app:latest → unauthorized

Expected Behavior

Robot accounts should be scoped to the Harbor project that image_repo points to, not to a project named after the service key.

Environment

  • Cluster: prod (k3s)
  • Services affected: westsidekingsandqueens, mcd-tracker-app (latent)
  • Related alerts: EndpointDown, RolloutStuck, ReplicasMismatch, PodNotReady (4 alerts)
  • Down since: ~2026-03-17 08:00 UTC

Design Decision: Option B (systemic fix)

Decision: Fix services.tf to derive the Harbor project name from image_repo instead of each.key. This makes the mismatch structurally impossible.

Use split("/", each.value.image_repo)[0] to extract the Harbor project name. Apply to:

  • harbor_project.service name
  • harbor_robot_account.service_ci namespace
  • harbor_robot_account.service_pull namespace

This fixes westside AND mcd-tracker AND any future service with a mismatched key/image_repo.
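
To illustrate why deriving the project from image_repo removes the mismatch, here is a simplified Python model. It is not the Terraform change itself. It assumes a project-scoped robot can only pull from its own project, and the names `scope_by_key`, `scope_by_image_repo`, `can_pull`, and `pull_results` are hypothetical.

```python
# Service key -> image_repo, as listed in this ticket.
SERVICES = {
    "westsidekingsandqueens": "westside-app/app",
    "mcd-tracker-app": "mcd-tracker/app",
}


def scope_by_key(key: str, image_repo: str) -> str:
    """Current behaviour: the robot is scoped to a project named after the service key."""
    return key


def scope_by_image_repo(key: str, image_repo: str) -> str:
    """Option B: mirrors split("/", each.value.image_repo)[0]."""
    return image_repo.split("/")[0]


def can_pull(robot_project: str, image_repo: str) -> bool:
    """Assumed model: a robot may pull only from the project it is scoped to."""
    return image_repo.split("/")[0] == robot_project


def pull_results(scope_fn) -> dict:
    """Evaluate whether each service's robot could pull its own image."""
    return {
        key: can_pull(scope_fn(key, repo), repo)
        for key, repo in SERVICES.items()
    }


if __name__ == "__main__":
    print("scoped by key:       ", pull_results(scope_by_key))
    print("scoped by image_repo:", pull_results(scope_by_image_repo))
```

Under this model, key-based scoping fails for both services (mcd-tracker-app is flagged even though its failure is currently masked), while scoping by image_repo succeeds by construction because the scope and the pull target come from the same string.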

File Targets

Files to modify:

  • ~/pal-e-services/terraform/services.tf — derive Harbor project name from image_repo using split("/", each.value.image_repo)[0] for project name and robot account scoping (lines 11, 48, 72)

Files NOT to touch:

  • ~/pal-e-services/terraform/k3s.tfvars — service keys stay as-is
  • ~/pal-e-platform/terraform/ — this is a pal-e-services issue
  • ~/westside-app/ — no CI config changes needed with Option B
  • ~/pal-e-deployments/ — image references are correct already

Acceptance Criteria

  • curl -s https://westsidekingsandqueens.tail5b443a.ts.net returns SPA HTML
  • kubectl get pods -n westsidekingsandqueens shows Running, Ready
  • All 4 westside alerts clear
  • skopeo inspect with robot credentials succeeds against westside-app/app
  • Harbor project names match image_repo project component for ALL services
  • tofu plan shows no robot account scope drift for any service
  • mcd-tracker robot account also scoped correctly (preventive)

Test Expectations

  • tofu plan -lock=false shows expected changes (project renames + robot rescoping)
  • tofu apply -lock=false succeeds
  • kubectl get pods -n westsidekingsandqueens shows Running after apply
  • kubectl get pods -n mcd-tracker-app still Running (no regression)

Constraints

  • tofu apply -lock=false required (state lock blocks CI)
  • Harbor project rename may require force_destroy + recreate — check plan output
  • Robot account secret rotation will update k8s imagePullSecret automatically (terraform manages it)
  • This is the highest priority fix — westside prod is down

Checklist

  • PR opened
  • tofu plan reviewed
  • tofu apply succeeds
  • Pods running
  • Tests pass
  • No unrelated changes

Related

  • pal-e-platform — project board tracking
  • Issue #109 — umbrella alert cleanup issue
  • Board item #171 — harbor-pull-secret-drift (absorbed into this fix)
  • PR #37 — SPA build that produced the westside image
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-189-2026-03-18
Root cause verified — robot scope mismatch confirmed in code. Three refinements needed:

  • CI breakage undocumented: Woodpecker pipelines #19/#20 on westside-app are also failing (push side of the same robot scope bug). Add acceptance criterion for CI success.
  • Design decision not made: Option A vs Option B must be decided before agent dispatch. An agent cannot make this architectural choice.
  • Option A file targets incomplete: If Option A, also need changes to westside-app/.woodpecker.yaml and pal-e-deployments kustomization image refs. Enumerate all affected files explicitly.
Author
Owner

Scope Refinement (post-review)

Decision: Option A (targeted fix). Align image_repo to match the Terraform-created Harbor project. Option B (fix services.tf robot scoping) deferred as a separate hardening ticket.

Changes from review feedback:

Added: CI push side

Woodpecker pipelines #19 and #20 on forgejo_admin/westside-app are also failing because the CI push robot has the same scope mismatch. Added acceptance criterion for CI pipeline success.

Decision made: Option A

Change image_repo from westside-app/app to westsidekingsandqueens/app in k3s.tfvars. Update CI push target and kustomization image references to match.

Complete file targets (Option A)

  • ~/pal-e-services/terraform/k3s.tfvars line 34 — change image_repo = "westside-app/app" to image_repo = "westsidekingsandqueens/app"
  • ~/pal-e-deployments/overlays/westsidekingsandqueens/prod/kustomization.yaml — update image reference from harbor.tail5b443a.ts.net/westside-app/app to harbor.tail5b443a.ts.net/westsidekingsandqueens/app
  • ~/westside-app/.woodpecker.yaml — update CI push target from westside-app/app to westsidekingsandqueens/app
  • Run tofu apply -lock=false in pal-e-services after tfvars change to create the new Harbor project and robot
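
Because Option A touches three repos, the references can drift apart if one change is missed. The following Python sketch shows one way to check that they agree. The helper names `repo_path` and `stale_references` are hypothetical, and the reference strings are supplied by hand here rather than parsed from the files.

```python
REGISTRY = "harbor.tail5b443a.ts.net"


def repo_path(image_ref: str) -> str:
    """Strip an optional registry host and tag, leaving 'project/repo'."""
    ref = image_ref.removeprefix(REGISTRY + "/")  # requires Python 3.9+
    return ref.split(":")[0]


def stale_references(image_repo: str, references: dict) -> list:
    """Return the files whose image reference does not match image_repo."""
    expected = repo_path(image_repo)
    return [name for name, ref in references.items() if repo_path(ref) != expected]


if __name__ == "__main__":
    # Hand-entered state after the tfvars change but before the kustomization update.
    refs = {
        "westside-app/.woodpecker.yaml": "westsidekingsandqueens/app",
        "pal-e-deployments/overlays/westsidekingsandqueens/prod/kustomization.yaml":
            "harbor.tail5b443a.ts.net/westside-app/app",
    }
    print(stale_references("westsidekingsandqueens/app", refs))
```

An empty list means every reference points at the same Harbor project and repository as image_repo.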

Updated acceptance criteria

  • curl -s https://westsidekingsandqueens.tail5b443a.ts.net returns SPA HTML
  • Pod running, no ImagePullBackOff
  • All 4 alerts clear
  • Next Woodpecker pipeline on westside-app succeeds (clone + build + push + deploy)
Author
Owner

Scope Review: NEEDS_REFINEMENT

Review note: review-189-2026-03-18-westside-harbor
Ticket is well-structured (all template sections present, file targets verified, acceptance criteria automatable) but cannot be dispatched without a design decision and blast radius update.

  • Design decision required: Option A vs Option B not committed — agent cannot proceed without knowing which to implement
  • Blast radius understated: mcd-tracker-app has the identical key/image_repo mismatch (mcd-tracker-app key, mcd-tracker/app image_repo) — currently masked but will break on next CI deploy
  • Board item #171 (todo-harbor-pull-secret-drift) in next_up addresses the same class of bug — should be referenced and coordinated
  • If Option A: ~/westside-app/.woodpecker.yaml must be added to File Targets (pushes to westside-app/app, would need second PR in different repo)
  • If Option B: Scope must include mcd-tracker-app fix and document how services.tf will parse project prefix from image_repo
Author
Owner

Scope Review: READY

Review note: review-189-2026-03-18-v2
Refined scope (Option A, complete file targets, CI push side added) addresses all v1 findings. All 4 file targets verified against codebase. Design decision made. Acceptance criteria automatable.

  • Advisory: Agent should execute in order: (1) tfvars + tofu apply, (2) .woodpecker.yaml merge + CI push, (3) kustomization update. Empty Harbor project will 401 if kustomization changes first.
  • Discovered scope: mcd-tracker-app has the same key/image_repo mismatch (masked today, image updater annotation wrong). Flag for #171 or new ticket.
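
The ordering advisory above can be expressed as a small dependency check. This Python sketch is illustrative only: the step names and the `PREREQS` table are hypothetical labels for the three steps in the advisory, not identifiers from any pipeline.

```python
# Each step maps to the steps that must already be done before it runs.
PREREQS = {
    "tofu_apply": set(),                  # (1) tfvars change + tofu apply
    "ci_push": {"tofu_apply"},            # (2) .woodpecker.yaml merge + CI push
    "kustomization_update": {"ci_push"},  # (3) kustomization update
}


def first_violation(order: list):
    """Return a description of the first out-of-order step, or None if the order is valid."""
    done = set()
    for step in order:
        missing = PREREQS[step] - done
        if missing:
            return f"{step} ran before {', '.join(sorted(missing))}"
        done.add(step)
    return None


if __name__ == "__main__":
    print(first_violation(["tofu_apply", "ci_push", "kustomization_update"]))
    print(first_violation(["tofu_apply", "kustomization_update", "ci_push"]))
```

The second call reports the case the advisory warns about: the kustomization pointing at the new project before CI has pushed an image into it.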
Reference
forgejo_admin/pal-e-platform#110