fix: rename migration 041 streamlit_ro → 044 to resolve dual-revision collision [CRITICAL] #444

Merged
forgejo_admin merged 1 commit from fix-041-streamlit-ro-rename into main 2026-04-11 20:12:24 +00:00

Summary

Second hot-fix in the migration-collision pair. PR #442 fixed the dual-040 collision but the basketball-api pod is STILL CrashLoopBackOff because there's a SECOND collision at revision 041:

  • 041_add_contract_audit_log.py — applied in production (contract_audit_log table exists)
  • 041_add_westside_streamlit_ro_role.py — never applied (westside_streamlit_ro role does NOT exist)

The crashing pod (20 hours old, 240 restarts) shows "Revision 041 is present more than once" in its alembic startup logs. This collision predated PR #433/#442; basketball-api has been silently degraded for 20 hours, masked by RollingUpdate keeping the old pod serving.

This PR renames the never-applied streamlit_ro_role file to 044 (after my 043 jersey migration), placing it at the end of the chain. The contract_audit_log file (which IS in production) is untouched.

Changes

  • Rename alembic/versions/041_add_westside_streamlit_ro_role.py → alembic/versions/044_add_westside_streamlit_ro_role.py
  • Update revision = "041" → revision = "044"
  • Update down_revision = "040" → down_revision = "043"
  • Update docstring header Revision ID: 041 → 044, Revises: 040 → Revises: 043
  • Diff: 4 insertions / 4 deletions, similarity 98%, all in revision metadata + docstring header
  • Schema body unchanged

New chain: 039 → 040 queens → 041 contract_audit → 042 alice_dedupe → 043 jersey_public_orders → 044 streamlit_ro_role
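As a rough illustration of what "a single linear chain with one head" means here, the sketch below walks a revision-to-parent map from base to head. It is not Alembic's own resolution logic; `linearize` and `NEW_CHAIN` are hypothetical names, and treating 039 as the base (parent `None`) is an assumption made only to keep the toy map self-contained.

```python
from typing import Optional


def linearize(parents: dict[str, Optional[str]]) -> list[str]:
    """Return revisions ordered base -> head, or raise if the chain is not linear."""
    # Invert the map: for each parent, collect the revisions that revise it.
    children: dict[Optional[str], list[str]] = {}
    for rev, parent in parents.items():
        children.setdefault(parent, []).append(rev)

    chain: list[str] = []
    current: Optional[str] = None  # None stands for "no parent" (the base)
    while current in children:
        nxt = children[current]
        if len(nxt) != 1:
            # Two revisions sharing a parent means a branch, i.e. multiple heads.
            raise ValueError(f"{current!r} has {len(nxt)} children: {sorted(nxt)}")
        current = nxt[0]
        chain.append(current)

    if len(chain) != len(parents):
        raise ValueError("some revisions are unreachable from the base")
    return chain


# revision -> down_revision after this PR.
# Assumption: 039 is shown as the base; its real parent is outside this PR.
NEW_CHAIN: dict[str, Optional[str]] = {
    "039": None,
    "040": "039",
    "041": "040",
    "042": "041",
    "043": "042",
    "044": "043",
}

if __name__ == "__main__":
    print(" -> ".join(linearize(NEW_CHAIN)))
```

If 044 were instead chained off 040, the same function would raise, since 040 would then have two children.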

Test Plan

  • alembic heads returns a single head → expect 044
  • alembic upgrade head from 042 (current prod state) → applies 043, then 044, cleanly
  • alembic downgrade -1 reverses 044 → 043 cleanly (drops the postgres role)
  • alembic upgrade head re-applies cleanly
  • Post-rollout: kubectl -n basketball-api exec postgres-... -- psql -tc "SELECT 1 FROM pg_roles WHERE rolname='westside_streamlit_ro';" returns 1
  • Post-rollout: kubectl -n basketball-api exec postgres-... -- psql -tc "\dt jersey_public_orders" shows the table
  • Post-rollout: ArgoCD basketball-api → Synced / Healthy
  • CrashLoopBackOff pod replaced by a healthy new pod

Review Checklist

  • Only one file changed (renamed) + revision metadata
  • Schema body unchanged
  • 041_add_contract_audit_log.py NOT touched (it's already applied in prod)
  • No model changes, no route changes, no application code changes
  • Streamlit_ro_role migration body is independent of jersey_public_orders — re-chaining is safe
  • Reversible
  • Pre-merge: dev or QA verifies alembic heads returns single head locally

Related Notes

  • arch-jersey-intake (architecture doc)
  • story:WS-S31 (parent user story)
  • forgejo_admin/basketball-api#441 (sister 040 collision bug, fixed by PR #442)

Closes #443

fix: rename migration 041 streamlit_ro -> 044 to resolve dual-revision collision
Some checks failed
ci/woodpecker/pr/woodpecker Pipeline failed
8d1b9be16b
Discovered after merging #442: main contained TWO migration files with revision="041":
- 041_add_contract_audit_log.py (applied to prod, contract_audit_log table exists)
- 041_add_westside_streamlit_ro_role.py (ghost, never applied, role does NOT exist)

This is the actual root cause of the 20-hour basketball-api CrashLoopBackOff that
PR #442 alone could not fix. PR #433's dual-040 collision was a second collision
on top of an already-broken chain.

Fix: rename the never-applied streamlit_ro_role migration to revision 044, after
my newly-renamed 043 jersey_public_orders migration. New chain:
039 -> 040 queens -> 041 contract_audit -> 042 alice_dedupe -> 043 jersey_public_orders -> 044 streamlit_ro

Schema body unchanged. The contract_audit_log 041 (already applied) is untouched.

PR #444 Review

DOMAIN REVIEW

Stack: Alembic migration (Python). Emergency hot-fix #2 in the dual-collision pair (companion to PR #442). Verified against the diff returned by review_pr:

  1. Single file renamed: alembic/versions/041_add_westside_streamlit_ro_role.py -> alembic/versions/044_add_westside_streamlit_ro_role.py. changed_files: 1, additions: 4, deletions: 4, similarity index 98%. Confirmed.
  2. Revision metadata updated verbatim: revision = "041" -> revision = "044", down_revision = "040" -> down_revision = "043". Confirmed.
  3. Docstring header updated: Revision ID: 041 -> 044, Revises: 040 -> Revises: 043. Confirmed.
  4. Schema body byte-identical: The only hunks in the diff are the 4 docstring/metadata lines. No op.* or function body hunks appear. The 98% similarity index corroborates this — rename + metadata only.
  5. 041_add_contract_audit_log.py untouched: Not in the diff. The production-applied migration is preserved. Confirmed.
  6. No other migration files touched: changed_files: 1. Confirmed.
  7. No models / routes / services changes: changed_files: 1. Confirmed.
  8. Streamlit RO body is jersey-independent: Docstring describes the migration as creating a SELECT-only Postgres role (westside_streamlit_ro) via op + environment-sourced password. It is a role/grant migration, not a table migration, so chaining it after 043_jersey_public_orders is safe regardless of grant scope (role creation does not require the jersey table to exist; any grants on specific tables would still resolve because 043 runs first).

New chain: 039 -> 040 queens -> 041 contract_audit -> 042 alice_dedupe -> 043 jersey_public_orders -> 044 streamlit_ro_role. Single head at 044.

This is the correct surgical fix for the "Revision 041 is present more than once" alembic startup failure that has kept the pod in CrashLoopBackOff for 20 hours (masked by RollingUpdate). Renames the never-applied sibling to the tail of the chain; leaves the production-applied 041_add_contract_audit_log exactly where prod's alembic_version table already points.

BLOCKERS

None.

NITS

None raised — same emergency profile as PR #442. Cosmetic nits are deferred to a post-recovery cleanup ticket per the nits-to-epilogue convention.

SOP COMPLIANCE

  • Branch named after issue (fix-041-streamlit-ro-rename, references #443)
  • PR body has Summary / Changes / Test Plan / Related
  • Related references parent story (story:WS-S31) and sister PR (#442)
  • No secrets committed
  • Closes #443

PROCESS OBSERVATIONS

  • 20-hour silent degradation is a DORA MTTR hit. Once recovered, a follow-up should add an alembic-heads smoke test or CI gate so a dual-revision collision fails the pipeline instead of the pod.
  • RollingUpdate masked the CrashLoopBackOff — readiness/liveness probe tuning or a startupProbe on alembic completion should be considered so the old pod doesn't keep serving indefinitely while new pods loop.
  • Both of these belong in the PR #442 / #444 epilogue, not this PR.
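One possible shape for the CI gate suggested above is a small stdlib-only script that scans the versions directory for revision IDs declared by more than one file. This is a sketch under assumptions, not an existing project script: `find_duplicate_revisions` and `REVISION_RE` are hypothetical names, and the regex assumes each migration declares its ID as a top-level `revision = "..."` assignment, as the files in this PR do.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches a top-level assignment such as: revision = "041"
# (the ^ anchor keeps it from matching down_revision).
REVISION_RE = re.compile(
    r"""^revision\s*(?::\s*str\s*)?=\s*['"]([^'"]+)['"]""", re.MULTILINE
)


def find_duplicate_revisions(versions_dir: Path) -> dict[str, list[str]]:
    """Map each revision ID declared by more than one file to those file names."""
    seen: dict[str, list[str]] = defaultdict(list)
    for path in sorted(versions_dir.glob("*.py")):
        match = REVISION_RE.search(path.read_text(encoding="utf-8"))
        if match:
            seen[match.group(1)].append(path.name)
    # Keep only IDs that appear in two or more files.
    return {rev: files for rev, files in seen.items() if len(files) > 1}
```

A pipeline step could call `find_duplicate_revisions(Path("alembic/versions"))` and fail the build when the result is non-empty, so that a repeated revision ID is caught before a pod ever starts.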

VERDICT: APPROVED

forgejo_admin deleted branch fix-041-streamlit-ro-rename 2026-04-11 20:12:24 +00:00