#158 - INCIDENT: Fix migration crash — drop HNSW index before vector dimension ALTER - forgejo_admin/pal-e-api

forgejo_admin commented

2026-03-14 17:14:27 +00:00

Owner

Lineage

plan-pal-e-docs → Phase 5 (Activate Semantic Search) → Phase 5a (Dimension Fix) → INCIDENT

Repo

forgejo_admin/pal-e-docs

User Story

As the platform
I need the API to stop crashing on startup
So that pal-e-docs is accessible again

Context

INCIDENT: API is in CrashLoopBackOff. Both pal-e-docs pods are crashing because the Alembic migration from PR #157 runs on startup and fails:

sqlalchemy.exc.OperationalError: (psycopg2.errors.ProgramLimitExceeded) 
column cannot have more than 2000 dimensions for hnsw index
[SQL: ALTER TABLE blocks ALTER COLUMN embedding TYPE vector(2560)]

The blocks.embedding column has an HNSW index, and HNSW has a hard 2000-dimension limit in pgvector. The migration tries to ALTER to vector(2560) without dropping the index first.

File Targets

Files the agent should modify:

alembic/versions/o5j6k7l8m9n0_fix_embedding_vector_dimension.py — the migration that's crashing. Fix the upgrade() function to:
1. DROP INDEX IF EXISTS ix_blocks_embedding (or whatever the index name is — check with SELECT indexname FROM pg_indexes WHERE tablename = 'blocks' AND indexdef LIKE '%vector%')
2. ALTER TABLE blocks ALTER COLUMN embedding TYPE vector(2560)
3. UPDATE blocks SET embedding_status = 'pending' WHERE embedding_status = 'error' (already exists)
4. Do NOT recreate any index in upgrade(): 5,643 blocks do not need one, and a sequential scan is fine.
Fix the downgrade() function to match if it exists; per the Constraints below, it should restore the HNSW index on vector(768).
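
The steps above could be sketched as the body of upgrade(). This is a minimal sketch, not the actual migration file. The index name ix_blocks_embedding is the issue's own guess and should be confirmed against pg_indexes first. Keeping the statements in a list, with Alembic imported lazily, is only a choice made here so the ordering can be inspected without a database.

```python
# Sketch of upgrade() for o5j6k7l8m9n0_fix_embedding_vector_dimension.py.
# ASSUMPTION: the HNSW index is named ix_blocks_embedding; confirm with
#   SELECT indexname FROM pg_indexes
#   WHERE tablename = 'blocks' AND indexdef LIKE '%vector%';
INDEX_NAME = "ix_blocks_embedding"

# Order matters: the index must be gone before the column type changes,
# because pgvector rejects HNSW indexes on columns above 2000 dimensions.
UPGRADE_STATEMENTS = [
    f"DROP INDEX IF EXISTS {INDEX_NAME}",
    "ALTER TABLE blocks ALTER COLUMN embedding TYPE vector(2560)",
    "UPDATE blocks SET embedding_status = 'pending' "
    "WHERE embedding_status = 'error'",
    # Deliberately no CREATE INDEX here.
]


def upgrade():
    # Imported inside the function only so this sketch can be loaded
    # without Alembic installed; a real migration imports op at module top.
    from alembic import op

    for statement in UPGRADE_STATEMENTS:
        op.execute(statement)
```

DROP INDEX IF EXISTS keeps the migration safe to run on a database where the index was already removed by hand during the incident.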

Files the agent should NOT touch:

src/pal_e_docs/models.py — already correct (Vector(2560))
src/pal_e_docs/embedding_worker.py — already correct
k8s/ — deployment is fine, the migration is the problem

Acceptance Criteria

Migration drops HNSW index before ALTER
Migration upgrade() does NOT recreate any index
Tests pass
After deploy, API pods are Running (no CrashLoopBackOff)
After deploy, embedding worker starts processing blocks successfully

Test Expectations

Run: pytest tests/ -v
Verify no test creates an HNSW index that would break

Constraints

INCIDENT — speed matters. This is an MTTR fix, not an optimization.
Do NOT recreate any index in upgrade(); whether to index the 2560-dimension column is a future design decision
The downgrade should add back the HNSW index on vector(768) for reversibility
Alembic runs on app startup — this migration must succeed for the API to start
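
For the reversibility constraint, downgrade() might look roughly like the sketch below. Several details are assumptions rather than facts from this issue: the index name, the vector_cosine_ops operator class, and the step that clears existing embeddings (added because stored 2560-dimension vectors would not fit a vector(768) column). Check the migration that originally created the index before copying any of it.

```python
# Sketch of downgrade(); every choice marked ASSUMPTION is not from the issue.
INDEX_NAME = "ix_blocks_embedding"  # ASSUMPTION: same guessed name as upgrade()

DOWNGRADE_STATEMENTS = [
    # ASSUMPTION: 2560-dimension vectors cannot be kept in a vector(768)
    # column, so they are cleared and queued for re-embedding first.
    "UPDATE blocks SET embedding = NULL, embedding_status = 'pending' "
    "WHERE embedding IS NOT NULL",
    "ALTER TABLE blocks ALTER COLUMN embedding TYPE vector(768)",
    # ASSUMPTION: the original index used cosine distance; the real
    # operator class is in the migration that first created it.
    f"CREATE INDEX IF NOT EXISTS {INDEX_NAME} ON blocks "
    "USING hnsw (embedding vector_cosine_ops)",
]


def downgrade():
    # Lazy import so the sketch loads without Alembic installed;
    # a real migration imports op at module top.
    from alembic import op

    for statement in DOWNGRADE_STATEMENTS:
        op.execute(statement)
```

The index is created last, after the column is back at 768 dimensions, which is under the 2000-dimension HNSW limit.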

Checklist

PR opened
Tests pass
No unrelated changes

Related

pal-e-docs: project this affects
