[CRITICAL] Migration 041 dual-revision collision — root cause of 20h CrashLoopBackOff #443
Type
Bug
Lineage
Discovered 2026-04-11 during validation after merging #442. PR #442 fixed the 040 collision but did NOT unblock the deploy because a SECOND collision exists at revision 041. Related to PR #426 (contract audit log) and PR #5/#435 (westside-streamlit RO role).
Repo
forgejo_admin/basketball-api
What Broke
alembic/versions/ on main contains TWO files claiming revision = "041" with down_revision = "040":
- 041_add_contract_audit_log.py: APPLIED in prod (contract_audit_log table exists)
- 041_add_westside_streamlit_ro_role.py: NEVER APPLIED (westside_streamlit_ro postgres role does NOT exist)
This is the root cause of the 20-hour basketball-api-5cb84b9b67-vxpx9 CrashLoopBackOff (240 restarts). The crashing pod's startup logs show the alembic dual-revision error.
The currently-serving pod (basketball-api-5c4b9bcc-vvfsx) was started before the second 041 file was added, so it migrated cleanly to alembic_version=042 and is still healthy. RollingUpdate maxUnavailable=0 has been hiding the failure.
Repro Steps
1. kubectl -n basketball-api logs basketball-api-5cb84b9b67-vxpx9 --tail=10 (shows the alembic dual-revision error)
2. curl -sS "$FORGEJO_URL/api/v1/repos/forgejo_admin/basketball-api/contents/alembic/versions?ref=main" (shows both 041_add_contract_audit_log.py and 041_add_westside_streamlit_ro_role.py)
3. kubectl -n basketball-api exec postgres-9b5b87b5-5nccx -- psql -U basketball -d basketball -tc "SELECT to_regclass('public.contract_audit_log');" (returns the table name)
4. kubectl -n basketball-api exec postgres-9b5b87b5-5nccx -- psql -U basketball -d basketball -tc "SELECT 1 FROM pg_roles WHERE rolname = 'westside_streamlit_ro';" (returns empty)
Expected Behavior
Exactly one file claims revision = "041". The streamlit_ro_role migration (which was never applied) should be renumbered to a unique revision at the end of the chain, preserving the existing 041_add_contract_audit_log.py, which is already in production.
Environment
- Namespace: basketball-api
- Crashing pod: basketball-api-5cb84b9b67-vxpx9, image harbor.tail5b443a.ts.net/basketball-api/api:7ccc4b3020797c0b59544493194de837c19441fe
- Healthy pod: basketball-api-5c4b9bcc-vvfsx (still serving, alembic_version=042)
- Sync/health status: Synced / Degraded (has been Degraded for 20 hours)
Acceptance Criteria
- Exactly one file claiming revision = "041" remains in alembic/versions/
- 041_add_westside_streamlit_ro_role.py renamed to 044_add_westside_streamlit_ro_role.py (after the 043 jersey migration), with revision = "044", down_revision = "043"
- 041_add_contract_audit_log.py is NOT touched
- alembic heads returns a single head (044)
- westside_streamlit_ro postgres role exists in prod
- jersey_public_orders table exists in prod
- Status: Synced / Healthy
Related
- pal-e-platform: project tracking
- forgejo_admin/basketball-api#441: sister bug for the 040 collision (PR #442 fix)
- alembic heads before build to fail-fast on collisions
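The fail-fast idea in the last Related item could be complemented by a small pre-build scan for duplicate revision IDs. The following is a minimal sketch, not the project's actual CI step: the names find_duplicate_revisions and main are hypothetical, and the check only pattern-matches top-level revision = "..." assignments in each file rather than using Alembic's own API, so alembic heads remains the authoritative check.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches a top-level assignment such as:  revision = "041"  or  revision: str = '041'
# The ^ anchor (with MULTILINE) keeps it from matching down_revision lines.
REVISION_RE = re.compile(
    r"""^revision\s*(?::\s*[\w\[\], |.]+)?\s*=\s*['"]([^'"]+)['"]""",
    re.MULTILINE,
)


def find_duplicate_revisions(versions_dir):
    """Return {revision_id: [filenames]} for every id claimed by more than one file."""
    claimed = defaultdict(list)
    for path in sorted(Path(versions_dir).glob("*.py")):
        match = REVISION_RE.search(path.read_text(encoding="utf-8"))
        if match:
            claimed[match.group(1)].append(path.name)
    return {rev: files for rev, files in claimed.items() if len(files) > 1}


def main(argv):
    """Print any collisions and return a shell-style exit code (1 = collision found)."""
    # Assumption: the migrations live in alembic/versions unless a path is given.
    versions_dir = argv[1] if len(argv) > 1 else "alembic/versions"
    duplicates = find_duplicate_revisions(versions_dir)
    for rev, files in duplicates.items():
        print(f"revision {rev!r} is claimed by: {', '.join(files)}")
    return 1 if duplicates else 0
```

Run against a directory holding both 041 files, find_duplicate_revisions would map "041" to the two filenames, and a build step could call sys.exit(main(sys.argv)) to stop before an image is pushed.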