SOP: Harbor robot import recovery — find the real robot ID without breaking live infra #274

Closed
opened 2026-04-26 17:14:36 +00:00 by forgejo_admin · 2 comments
Contributor

Type

Feature

Lineage

Standalone — captured 2026-04-26 after a near-miss during apply-backlog reconciliation where the wrong Harbor robot was imported into terraform state, briefly putting westsidekingsandqueens-pull at risk of deletion.

Repo

forgejo_admin/pal-e-api (deliverable is a note in pal-e-docs, owned by this project)

User Story

As a platform operator recovering from a Harbor 409 conflict during tofu apply
I want a documented procedure for finding the correct robot ID and importing safely
So that I don't misread Harbor's error messages and import the wrong robot, which would mark a live robot for replacement and break image pulls for an unrelated service.

Context

On 2026-04-26 a tofu apply returned Error: 409 robot account 27:playme2k+playme2k-ci already exists while attempting to create the playme2k CI robot. The 27 in that message is the Harbor project ID, not the robot ID — but the format invites misreading. The operator imported /robots/27 into state on the assumption it was the robot ID. That ID actually belongs to westsidekingsandqueens-pull, which then had two terraform resources pointing to it. The next plan reported must be replaced with a name change from westsidekingsandqueens-pull to playme2k-ci, which would have destroyed the live pull robot and broken westsidekingsandqueens image pulls. Caught on plan inspection before apply, recovered via tofu state rm.
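
The misreading described above can be made mechanical. The following is a minimal sketch, assuming the 409 message always has the shape quoted in this issue (other Harbor versions may word it differently); `Conflict`, `parse_conflict`, and `expected_full_name` are hypothetical helper names, not part of Harbor or OpenTofu. It labels the leading number as a project ID and derives the robot `name` to search for.

```python
import re
from typing import NamedTuple


class Conflict(NamedTuple):
    project_id: int   # the number before the colon: a PROJECT ID, not a robot ID
    project: str      # project name, e.g. "playme2k"
    robot_name: str   # short robot name, e.g. "playme2k-ci"


# Assumption: the message shape matches the one quoted in this issue.
_CONFLICT_RE = re.compile(
    r"robot account (?P<project_id>\d+):(?P<project>[^+\s]+)\+(?P<robot>\S+) already exists"
)


def parse_conflict(message: str) -> Conflict:
    """Split a Harbor 409 robot conflict message into its labelled parts."""
    match = _CONFLICT_RE.search(message)
    if match is None:
        raise ValueError(f"not a recognised 409 robot conflict: {message!r}")
    return Conflict(int(match["project_id"]), match["project"], match["robot"])


def expected_full_name(conflict: Conflict) -> str:
    """Full robot name as Harbor stores it: robot$<project>+<robot-name>."""
    return f"robot${conflict.project}+{conflict.robot_name}"


if __name__ == "__main__":
    c = parse_conflict("Error: 409 robot account 27:playme2k+playme2k-ci already exists")
    print(c)
    print(expected_full_name(c))
```

Nothing in the parsed result is usable as an import ID; the only value worth carrying forward is the full robot name.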

Additional finding: the paginated GET /api/v2.0/robots?page=N&page_size=100 endpoint returns system-level robots only. Project-scoped robots (which are what every service uses for CI and pull access) do not appear in that listing. Direct GET /api/v2.0/robots/{id} lookups do return them. The real playme2k-ci robot was found by scanning IDs 1-400.
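
The ID scan described above could look roughly like this. It is a sketch under stated assumptions: HTTP Basic auth, a 404 for IDs that do not exist, and a `name` field in the response holding the full robot name. `make_fetch` and `find_robot_ids` are hypothetical helpers introduced only for illustration.

```python
import base64
import json
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

# A fetcher maps a robot ID to the robot's JSON body, or None if there is no such robot.
Fetch = Callable[[int], Optional[dict]]


def make_fetch(base_url: str, user: str, password: str) -> Fetch:
    """Build a fetcher for GET /api/v2.0/robots/{id} (Basic auth is an assumption)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()

    def fetch(robot_id: int) -> Optional[dict]:
        request = urllib.request.Request(
            f"{base_url}/api/v2.0/robots/{robot_id}",
            headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
        )
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                return json.load(response)
        except urllib.error.HTTPError as err:
            if err.code == 404:  # assumption: unknown IDs answer 404
                return None
            raise  # anything else (401, 500, ...) should stop the scan

    return fetch


def find_robot_ids(fetch: Fetch, full_name: str, ids: Iterable[int]) -> list[int]:
    """Return every ID in `ids` whose robot has exactly the given full name."""
    matches = []
    for robot_id in ids:
        robot = fetch(robot_id)
        if robot is not None and robot.get("name") == full_name:
            matches.append(robot_id)
    return matches
```

A call such as `find_robot_ids(make_fetch("https://harbor.example.com", user, password), "robot$playme2k+playme2k-ci", range(1, 401))` would repeat the 1-400 scan; the hostname is a placeholder. Returning a list rather than the first hit makes an unexpected duplicate visible instead of silently picking one.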

This recovery pattern is not covered by any existing SOP. service-onboarding-sop covers happy-path onboarding but not import recovery.

File Targets

  • New SOP note in pal-e-docs: slug sop-harbor-robot-import, tagged sop,active.
  • Update sop-index to reference the new SOP.

Files the agent should NOT touch:

  • service-onboarding-sop — happy-path onboarding belongs there; recovery belongs in its own note.
  • Existing terraform code in pal-e-services — separate ticket.

Acceptance Criteria

  • SOP note created with the steps captured under "Procedure" below.
  • SOP includes the explicit warning that Harbor's 409 error format <project_id>:<robot_full_name> is misleading — <project_id> is NOT the robot ID.
  • SOP documents the gap that paginated /robots listing hides project-scoped robots.
  • SOP includes the safety check: always run tofu plan after import; if it shows must be replaced, immediately tofu state rm to back out.
  • sop-index references the new SOP.

Procedure (content the SOP must include)

  1. When apply fails with Error: 409 robot account <X>:<robot-name> already exists, treat <X> as a Harbor project ID and ignore it for import purposes.
  2. Find the real robot ID by direct lookup against /api/v2.0/robots/{id} for an ID range, filtering by name match for <robot-name>. The robot's name is robot$<project>+<robot-name>.
  3. Import with tofu import -var-file=k3s.tfvars 'harbor_robot_account.service_ci["<service>"]' '/robots/<real-id>'.
  4. Always re-plan immediately: tofu plan -var-file=k3s.tfvars -lock=false. If output shows must be replaced or any ~ name = "..." -> "..." for the imported resource, the wrong robot was imported. Run tofu state rm 'harbor_robot_account.service_ci["<service>"]' immediately to back out (this only edits state, does not touch the live robot).
  5. Repeat from step 2 with the correct ID.
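
The check in step 4 can be sketched as a function over the plan text. This assumes the plan was captured without colour codes (for example with `-no-color`) and that resource headers appear as `# <address> ...` comment lines; `wrong_robot_imported` and `back_out_command` are hypothetical names, and the sketch only builds the back-out command, it does not run it.

```python
import re
import shlex

# A changed name attribute, e.g.:  ~ name = "old" -> "new"
NAME_CHANGE_RE = re.compile(r'^\s*~\s*name\s*=\s*".*"\s*->\s*".*"')


def wrong_robot_imported(plan_text: str, address: str) -> bool:
    """True if the plan replaces or renames the resource at `address`."""
    in_block = False
    for line in plan_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# ("):
            # Notes like "# (3 unchanged attributes hidden)" are not resource headers.
            continue
        if stripped.startswith("# "):
            # A "# <address> ..." line opens the block for that resource.
            in_block = stripped.startswith(f"# {address} ")
            if in_block and "must be replaced" in stripped:
                return True
        elif in_block and NAME_CHANGE_RE.match(line):
            return True
    return False


def back_out_command(address: str) -> str:
    """Shell command that removes the resource from state only."""
    return f"tofu state rm {shlex.quote(address)}"


if __name__ == "__main__":
    address = 'harbor_robot_account.service_ci["playme2k"]'
    plan = (
        f"  # {address} must be replaced\n"
        '      ~ name = "westsidekingsandqueens-pull" -> "playme2k-ci"\n'
    )
    if wrong_robot_imported(plan, address):
        print(back_out_command(address))
```

Matching on the resource address keeps an unrelated replacement elsewhere in the plan from triggering a back-out of the robot that was just imported.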

Test Expectations

  • Note renders cleanly in pal-e-docs (no broken markdown, frontmatter valid).
  • list_notes(tags="sop,active") returns the new SOP.
  • sop-index displays the new entry.

Constraints

  • Follow the existing SOP note convention.
  • Use plain language, executable commands, and concrete examples — not abstract advice.
  • Cross-reference feedback_never_alter_prod_directly and the import command formats from service-onboarding-sop.

Checklist

  • SOP note created
  • sop-index updated
  • Tagged sop,active
  • Verified via session-startup SOP injection on next session

Related

  • pal-e-docs — project this affects
  • pal-e-platform — recovery applies to platform terraform operations
  • Discovered 2026-04-26 during import-and-apply session
Author
Contributor

Scope Review: APPROVED

Review note: review-1108-2026-04-26

Scope is solid — comprehensive issue body, traceability triangle intact (story:superuser-recover verified on project-pal-e-platform), zero blast radius, fits the 5-minute rule. The Procedure section already contains the SOP content; dev agent only needs to structure it into the standard SOP shape and add the sop-index entry.

Non-blocking observations:

  • [SCOPE] arch:harbor — no arch-harbor note in pal-e-docs (platform-wide gap, separate ticket)
  • [SCOPE] arch:terraform — no arch-terraform note in pal-e-docs (platform-wide gap, separate ticket)

Foundational doc work ships without complete arch backing notes — gating SOPs on missing arch notes would block all platform documentation. Ready to advance to next_up.

Author
Contributor

SOP delivered.

SOP note created: sop-harbor-robot-import (note_type: sop, status: active, project: pal-e-platform, tagged sop,active).

Sections (matches template-sop requirements):

  • Purpose — when this SOP applies, who uses it, expected outcome
  • Background — the 2026-04-26 incident details and the two traps (project-id misread + paginated listing hides project robots)
  • Steps — 5 ordered actionable steps with exact commands
  • Recovery — tofu state rm back-out procedure
  • Rules — 6 invariants (never treat 409 number as robot ID, never use paginated listing for project robots, always plan after import, etc.)
  • Related — links to service-onboarding-sop, sop-platform-tf-changes, sop-incident-response, plus the underlying behavioral feedback memories

Discoverability verified:

  • list_notes(tags="sop,active") returns the new SOP.
  • sop-index updated: new entry added to the "Deployment & Onboarding" table (now 7 rows).
  • Will be auto-injected on next session start via the SOP listing.

Acceptance criteria status: all 5 from the issue body satisfied:

  • Procedure captured (5 steps + recovery)
  • Project-id misread warning explicit in Background
  • Paginated /robots gap documented in Background
  • Safety check (re-plan after import → state rm if drift) is step 4 + Recovery section
  • sop-index references the new SOP

Board item #1108 advancing to needs_approval for your sign-off.

Reference
ldraney/pal-e-api#274