INCIDENT: Fix migration crash — drop HNSW index before vector dimension ALTER #158

Closed
opened 2026-03-14 17:14:27 +00:00 by forgejo_admin · 0 comments

Lineage

plan-pal-e-docs → Phase 5 (Activate Semantic Search) → Phase 5a (Dimension Fix) → INCIDENT

Repo

forgejo_admin/pal-e-docs

User Story

As the platform
I need the API to stop crashing on startup
So that pal-e-docs is accessible again

Context

INCIDENT: API is in CrashLoopBackOff. Both pal-e-docs pods are crashing because the Alembic migration from PR #157 runs on startup and fails:

sqlalchemy.exc.OperationalError: (psycopg2.errors.ProgramLimitExceeded) 
column cannot have more than 2000 dimensions for hnsw index
[SQL: ALTER TABLE blocks ALTER COLUMN embedding TYPE vector(2560)]

The blocks.embedding column has an HNSW index, and HNSW has a hard 2000-dimension limit in pgvector. The migration tries to ALTER to vector(2560) without dropping the index first.
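
The failure condition described above can be written as a small check. This is only an illustrative sketch: the function name and the constant are hypothetical, and the 2000 figure is taken from the error message quoted in this issue.

```python
# Limit quoted in the error message above for HNSW indexes on vector columns.
HNSW_MAX_DIMENSIONS = 2000


def must_drop_hnsw_index(new_dimensions: int, has_hnsw_index: bool) -> bool:
    """Return True when an HNSW index blocks resizing the column.

    The ALTER fails only if the column is HNSW-indexed and the new
    dimension count exceeds the limit the index can hold.
    """
    return has_hnsw_index and new_dimensions > HNSW_MAX_DIMENSIONS


# The incident case: an HNSW-indexed column being resized to 2560 dimensions.
print(must_drop_hnsw_index(2560, has_hnsw_index=True))
```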

File Targets

Files the agent should modify:

  • alembic/versions/o5j6k7l8m9n0_fix_embedding_vector_dimension.py — the migration that's crashing. Fix the upgrade() function to:
    1. DROP INDEX IF EXISTS ix_blocks_embedding (or whatever the index name is — check with SELECT indexname FROM pg_indexes WHERE tablename = 'blocks' AND indexdef LIKE '%vector%')
    2. ALTER TABLE blocks ALTER COLUMN embedding TYPE vector(2560)
    3. UPDATE blocks SET embedding_status = 'pending' WHERE embedding_status = 'error' (already exists)
    4. Do NOT recreate any index: a table of 5,643 blocks doesn't need one. Sequential scan is fine.
  • Fix the downgrade() function symmetrically if it exists.
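
The steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual migration. The index name `ix_blocks_embedding` is the guess from this issue and should be confirmed against `pg_indexes`. The `vector_cosine_ops` operator class in the downgrade is an assumption about how the original index was defined. The helper names are hypothetical. The statements are built as plain strings so the ordering can be checked without a database.

```python
from typing import Callable, List

# Assumed index name; confirm it with the pg_indexes query before relying on it.
INDEX_NAME = "ix_blocks_embedding"


def upgrade_statements(index_name: str = INDEX_NAME) -> List[str]:
    """SQL for upgrade(), in the order the issue specifies."""
    return [
        # 1. The HNSW index must be gone before the column is widened.
        f"DROP INDEX IF EXISTS {index_name}",
        # 2. Widen the column.
        "ALTER TABLE blocks ALTER COLUMN embedding TYPE vector(2560)",
        # 3. Requeue blocks that previously failed to embed.
        "UPDATE blocks SET embedding_status = 'pending' "
        "WHERE embedding_status = 'error'",
        # 4. Deliberately no index is recreated here.
    ]


def downgrade_statements(index_name: str = INDEX_NAME) -> List[str]:
    """SQL for downgrade(): back to vector(768) with an HNSW index.

    Assumption: stored embeddings are NULL or already 768-dimensional,
    otherwise the type change itself would likely be rejected.
    """
    return [
        f"DROP INDEX IF EXISTS {index_name}",
        "ALTER TABLE blocks ALTER COLUMN embedding TYPE vector(768)",
        # Operator class is an assumption about the original index.
        f"CREATE INDEX IF NOT EXISTS {index_name} ON blocks "
        "USING hnsw (embedding vector_cosine_ops)",
    ]


def apply_statements(execute: Callable[[str], object], statements: List[str]) -> None:
    """Run each statement in order through the given executor."""
    for statement in statements:
        execute(statement)
```

Inside the migration file, `upgrade()` would then reduce to `apply_statements(op.execute, upgrade_statements())`, with `op` imported from `alembic`, and `downgrade()` would do the same with `downgrade_statements()`.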

Files the agent should NOT touch:

  • src/pal_e_docs/models.py — already correct (Vector(2560))
  • src/pal_e_docs/embedding_worker.py — already correct
  • k8s/ — deployment is fine, the migration is the problem

Acceptance Criteria

  • Migration drops HNSW index before ALTER
  • Migration upgrade() does NOT recreate any index
  • Tests pass
  • After deploy, API pods are Running (no CrashLoopBackOff)
  • After deploy, embedding worker starts processing blocks successfully

Test Expectations

  • Run: pytest tests/ -v
  • Verify no test creates an HNSW index that would break

Constraints

  • INCIDENT — speed matters. This is an MTTR fix, not an optimization.
  • Do NOT add any index recreation to upgrade(): that's a future design decision
  • The downgrade should add back the HNSW index on vector(768) for reversibility
  • Alembic runs on app startup — this migration must succeed for the API to start

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • pal-e-docs — project this affects