[CRITICAL] Migration 041 dual-revision collision — root cause of 20h CrashLoopBackOff #443

Closed
opened 2026-04-11 20:08:25 +00:00 by forgejo_admin · 0 comments

Type

Bug

Lineage

Discovered 2026-04-11 during validation after merging #442. PR #442 fixed the 040 collision but did NOT unblock the deploy because a SECOND collision exists at revision 041. Related to PR #426 (contract audit log) and PR #5/#435 (westside-streamlit RO role).

Repo

forgejo_admin/basketball-api

What Broke

alembic/versions/ on main contains TWO files claiming revision = "041" with down_revision = "040":

  1. 041_add_contract_audit_log.py — APPLIED in prod (contract_audit_log table exists)
  2. 041_add_westside_streamlit_ro_role.py — NEVER APPLIED (westside_streamlit_ro postgres role does NOT exist)

This is the root cause of the 20-hour basketball-api-5cb84b9b67-vxpx9 CrashLoopBackOff (240 restarts). The crashing pod's startup logs show:

UserWarning: Revision 041 is present more than once
FAILED: Multiple head revisions are present for given argument 'head'

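The currently serving pod (basketball-api-5c4b9bcc-vvfsx) was started before the second 041 file was added, so it migrated cleanly to alembic_version=042 and is still healthy. With RollingUpdate maxUnavailable=0, Kubernetes keeps the old pod running until a new pod becomes Ready, so the crashing pod never caused an outage and the failure went unnoticed.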

Repro Steps

  1. kubectl -n basketball-api logs basketball-api-5cb84b9b67-vxpx9 --tail=10 — shows alembic dual-revision error
  2. curl -sS "$FORGEJO_URL/api/v1/repos/forgejo_admin/basketball-api/contents/alembic/versions?ref=main" — shows both 041_add_contract_audit_log.py and 041_add_westside_streamlit_ro_role.py
  3. kubectl -n basketball-api exec postgres-9b5b87b5-5nccx -- psql -U basketball -d basketball -tc "SELECT to_regclass('public.contract_audit_log');" returns the table name
  4. kubectl -n basketball-api exec postgres-9b5b87b5-5nccx -- psql -U basketball -d basketball -tc "SELECT 1 FROM pg_roles WHERE rolname = 'westside_streamlit_ro';" returns empty

Expected Behavior

Exactly one file claims revision = "041". The westside_streamlit_ro role migration, which was never applied, should be renumbered to a unique revision at the end of the chain. The existing 041_add_contract_audit_log.py, which is already applied in production, stays as it is.
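A minimal sketch of that renumbering, assuming the migration declares its identifiers as plain top-level `revision = "..."` and `down_revision = "..."` assignments (no type annotations). The helper name `renumber_migration` and the `SAMPLE` file text are hypothetical, not taken from the repo. Only the two assignment lines are rewritten, so the upgrade/downgrade body stays unchanged.

```python
import re

# Matches a top-level `revision = "..."` or `down_revision = "..."` line.
# Assumption: plain assignments, single- or double-quoted, no annotations.
_ASSIGN = re.compile(
    r'^(revision|down_revision)(\s*=\s*)(["\'])([^"\']*)(["\'])\s*$'
)


def renumber_migration(source: str, revision: str, down_revision: str) -> str:
    """Return `source` with only the two identifier assignments rewritten."""
    new_values = {"revision": revision, "down_revision": down_revision}
    seen = set()
    out = []
    for line in source.splitlines(keepends=True):
        stripped = line.rstrip("\r\n")
        match = _ASSIGN.match(stripped)
        if match:
            name, eq, quote = match.group(1), match.group(2), match.group(3)
            ending = line[len(stripped):]  # keep the original line ending
            line = f"{name}{eq}{quote}{new_values[name]}{quote}{ending}"
            seen.add(name)
        out.append(line)
    missing = sorted(set(new_values) - seen)
    if missing:
        raise ValueError(f"assignments not found: {missing}")
    return "".join(out)


# Hypothetical stand-in for 041_add_westside_streamlit_ro_role.py.
SAMPLE = '''"""add westside_streamlit_ro role"""
revision = "041"
down_revision = "040"


def upgrade():
    pass  # real body elided in this sketch
'''

renumbered = renumber_migration(SAMPLE, revision="044", down_revision="043")
```

The result would then be written to the new file name (044_add_westside_streamlit_ro_role.py per the acceptance criteria). Any "Revision ID" or "Revises" lines in the file's docstring are cosmetic and would need a separate manual edit.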

Environment

  • Cluster/namespace: basketball-api
  • Crashing pod: basketball-api-5cb84b9b67-vxpx9, image harbor.tail5b443a.ts.net/basketball-api/api:7ccc4b3020797c0b59544493194de837c19441fe
  • Healthy pod: basketball-api-5c4b9bcc-vvfsx (still serving, alembic_version=042)
  • ArgoCD application: Synced / Degraded (has been Degraded for 20 hours)
  • Git head (after PR #442 merge): includes both 041 files

Acceptance Criteria

  • Only one file with revision = "041" remains in alembic/versions/
  • 041_add_westside_streamlit_ro_role.py renamed to 044_add_westside_streamlit_ro_role.py (after the 043 jersey migration), with revision = "044", down_revision = "043"
  • 041_add_contract_audit_log.py is NOT touched
  • Schema body of the renamed file is byte-identical (only metadata changes)
  • alembic heads returns a single head (044)
  • After rollout, westside_streamlit_ro postgres role exists in prod
  • After rollout, jersey_public_orders table exists in prod
  • ArgoCD application returns to Synced / Healthy
  • CrashLoopBackOff pod replaced by a healthy new pod
  • alembic_version table reaches 044

Related

  • pal-e-platform — project tracking
  • forgejo_admin/basketball-api#441 — sister bug for the 040 collision (PR #442 fix)
  • PR #426 — contract audit log (the 041 that was applied)
  • PR #5 / #435 — westside-streamlit RO role (the 041 that became a ghost)
  • 20-hour ArgoCD Degraded state was masked because RollingUpdate kept the old pod serving
  • Process gap: same as #441 — Woodpecker should run alembic heads before build to fail-fast on collisions
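
The process gap above could also be covered by a standalone file scan that needs no Alembic configuration or database. This is a sketch of an alternative check, not the `alembic heads` step the issue proposes: it only looks for revision ids claimed by more than one file. The function names and the default `alembic/versions` path are assumptions for illustration.

```python
import re
import sys
from collections import defaultdict
from pathlib import Path

# Matches a top-level `revision = "..."` line, with or without an annotation
# such as `revision: str = "..."`. `down_revision` is not matched because the
# pattern is anchored to the start of the line.
_REVISION = re.compile(
    r'^revision\b[^=\n]*=\s*["\']([^"\']+)["\']', re.MULTILINE
)


def find_duplicate_revisions(versions_dir):
    """Map each revision id claimed by more than one file to those file names."""
    claimed = defaultdict(list)
    for path in sorted(Path(versions_dir).glob("*.py")):
        match = _REVISION.search(path.read_text(encoding="utf-8"))
        if match:
            claimed[match.group(1)].append(path.name)
    return {rev: names for rev, names in claimed.items() if len(names) > 1}


def main(argv):
    """Return 1 (and report on stderr) if any revision id is duplicated."""
    versions_dir = argv[1] if len(argv) > 1 else "alembic/versions"
    duplicates = find_duplicate_revisions(versions_dir)
    for rev, names in sorted(duplicates.items()):
        print(f"revision {rev} claimed by: {', '.join(names)}", file=sys.stderr)
    return 1 if duplicates else 0
```

A pipeline step could call `sys.exit(main(sys.argv))` from a small wrapper script before the image build, so that a second file claiming an existing revision fails the build instead of reaching the cluster.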