Add pgvector extension + embedding schema migration #121
Reference
forgejo_admin/pal-e-api#121
Lineage
plan-2026-02-26-tf-modularize-postgres → Phase 6 → Phase 6b (pgvector extension + schema migration)

Repo
forgejo_admin/pal-e-docs

User Story
As a platform operator
I want the blocks table to support vector embeddings
So that semantic search can find conceptually related content, not just keyword matches
Context
Phase 6a deployed Ollama with GPU acceleration (PR #27 on pal-e-platform). The embedding model `qwen3-embedding:4b` is live and returning 768-dimensional vectors. This phase adds the database schema to store those vectors.

The CNPG cluster uses `ghcr.io/cloudnative-pg/postgresql:17.4-1`, which includes pgvector 0.8.0 (confirmed available in the image, but the extension is not yet installed). The current Alembic head is `j0e1f2g3h4i5` (blocks + compiled pages).

Key decisions from decision-phase6-vector-search-architecture:

- Follow the same pattern as `i9d0e1f2g3h4_add_search_vector.py` — an Alembic migration using `op.execute()` for raw SQL.

File Targets
Files the agent should create:

- `alembic/versions/k1f2g3h4i5j6_add_vector_embeddings.py` — new Alembic migration (chain from `j0e1f2g3h4i5`)

Files the agent should modify:

- `src/pal_e_docs/models.py` — add `embedding` and `embedding_status` columns to the Block model (SQLAlchemy column definitions for ORM awareness, even though the migration handles DDL)

Files the agent should NOT touch:

- `k8s/` — the CNPG Cluster CRD does not need modification (pgvector is available in the image; it just needs `CREATE EXTENSION`)
- `alembic/env.py` — should not need changes

Acceptance Criteria
- `CREATE EXTENSION IF NOT EXISTS vector`
- `embedding vector(768)` column added to the `blocks` table (nullable)
- `embedding_status varchar(20)` column added to the `blocks` table with default `'pending'`
- `CREATE INDEX ... USING hnsw (embedding vector_cosine_ops)`
- A trigger that, on changes to `content`, sets `embedding_status = 'pending'` and calls `NOTIFY embedding_queue`
- Non-embeddable block types are marked instead (`embedding_status = 'skipped'`)

Test Expectations
- `alembic upgrade head` succeeds against the running database
- `alembic downgrade -1` reverses cleanly
- `pytest tests/ -v` passes
- Content updates set `embedding_status = 'pending'` and fire NOTIFY

Constraints
- Use `op.execute()` for raw SQL in the Alembic migration (same pattern as the search_vector migration)
- `down_revision = "j0e1f2g3h4i5"`
- The `pgvector` Python package (`from pgvector.sqlalchemy import Vector`) may be needed for the SQLAlchemy model — add it to `pyproject.toml` if so

Checklist
Related
- phase-postgres-6-vector-search — parent phase
- decision-phase6-vector-search-architecture — architectural decisions
- `i9d0e1f2g3h4_add_search_vector.py` — reference migration (same pattern)

PR #122 Review
BLOCKERS
1. Alembic revision ID collision -- the migration will crash on `alembic upgrade head`

The new migration file `k1f2g3h4i5j6_add_vector_embeddings.py` uses revision ID `k1f2g3h4i5j6`, but this revision ID is already taken on main by `k1f2g3h4i5j6_add_block_anchor_ids_and_unique_constraint.py`. Both migrations also declare `down_revision = "j0e1f2g3h4i5"`. Alembic will fail with a duplicate revision error, so running `alembic upgrade head` is impossible.

Fix required:

- Assign a new, unique revision ID (e.g. `l2g3h4i5j6k7`).
- Set `down_revision` to `k1f2g3h4i5j6` (the anchor_ids migration), since that is the actual current head of the migration chain.

2. Trigger only fires on `content` changes, but `block_type` changes can also affect embedding eligibility

The trigger is defined as:
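(a reconstruction inferred from the acceptance criteria and the discussion below, not the PR's verbatim code — the function and trigger names and the `pg_notify` payload are assumptions):

```sql
-- Sketch of the PR's trigger as described in this review: content-only
-- column list, skip list including the nonexistent 'raw' type (see NITS).
CREATE OR REPLACE FUNCTION blocks_embedding_refresh() RETURNS trigger AS $$
BEGIN
    IF NEW.block_type IN ('mermaid', 'raw') THEN
        NEW.embedding_status := 'skipped';
    ELSE
        NEW.embedding_status := 'pending';
        -- NOTIFY is delivered to listeners when the transaction commits
        PERFORM pg_notify('embedding_queue', NEW.id::text);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blocks_embedding_refresh_trigger
    BEFORE INSERT OR UPDATE OF content ON blocks
    FOR EACH ROW EXECUTE FUNCTION blocks_embedding_refresh();
```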
The `BlockUpdate` schema allows changing `block_type` without changing `content`. If a block's type is changed from `paragraph` to `mermaid` (or vice versa), the trigger will NOT fire, and `embedding_status` will not be updated to `skipped` (or back to `pending`). This creates inconsistent state.

Fix required: Either:
- Add `block_type` to the trigger column list: `BEFORE INSERT OR UPDATE OF content, block_type`

NITS
1. `raw` is not a valid block type in the system

The trigger function skips blocks with `block_type IN ('mermaid', 'raw')`, but the `BlockType` Literal in `schemas.py` only defines: `heading`, `paragraph`, `list`, `table`, `code`, `mermaid`. There is no `raw` type anywhere in the parser, schemas, or models. Including `raw` in the skip list is harmless dead code, but it suggests the spec references a block type that doesn't exist. Consider removing it, or adding a code comment explaining that it is a forward-looking guard.

2. Test Plan has unchecked items that require live DB validation
The PR body lists 3 unchecked test plan items (alembic upgrade, downgrade, NOTIFY behavior). These require a live CNPG database and cannot be validated in the SQLite test suite. This follows the same pattern as the search_vector migration (which also has no SQLite tests), so it's consistent -- but worth noting that these remain unverified until deployment.
3. Model column uses bare `mapped_column` for `embedding` instead of the `Mapped` type annotation

The `embedding` column is defined with a bare `mapped_column(...)` assignment, while all other columns on `Block` use the `Mapped[T]` annotation pattern (e.g., `block_type: Mapped[str] = mapped_column(...)`). The `Vector` type from pgvector may not support `Mapped` typing, which would make this acceptable, but it's worth confirming. If `Mapped` can be used, it should be, for consistency.

SOP COMPLIANCE
- Branch name is `feat/pgvector-embedding-schema`, not `issue-121` or `121-*`. This deviates from convention but is descriptive.
- Plan lineage is referenced (`plan-2026-02-26-tf-modularize-postgres`).

VERDICT: NOT APPROVED
Two blockers must be resolved before merge:
- The Alembic revision ID collision (the migration cannot run at all).
- Trigger coverage of `block_type` (semantic correctness issue).
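For concreteness, a minimal sketch of what the corrected migration could look like once both blockers are addressed. The revision IDs come from this review; the index name, trigger/function names, and exact DDL are assumptions based on the issue's acceptance criteria, not the PR's actual code. The `alembic` imports are deferred into the functions only so the sketch stands alone; a real migration imports `op` at module top.

```python
"""Corrected migration sketch: unique revision ID, chained from the real head,
trigger fires on block_type changes too, 'raw' dropped from the skip list."""

revision = "l2g3h4i5j6k7"        # new unique ID (blocker 1)
down_revision = "k1f2g3h4i5j6"   # the anchor_ids migration, the actual head
branch_labels = None
depends_on = None


def upgrade():
    from alembic import op  # deferred so this sketch is importable standalone

    op.execute("CREATE EXTENSION IF NOT EXISTS vector")
    op.execute("ALTER TABLE blocks ADD COLUMN embedding vector(768)")
    op.execute(
        "ALTER TABLE blocks ADD COLUMN embedding_status varchar(20) "
        "NOT NULL DEFAULT 'pending'"
    )
    op.execute(
        "CREATE INDEX ix_blocks_embedding ON blocks "
        "USING hnsw (embedding vector_cosine_ops)"
    )
    op.execute("""
        CREATE OR REPLACE FUNCTION blocks_embedding_refresh() RETURNS trigger AS $body$
        BEGIN
            IF NEW.block_type IN ('mermaid') THEN  -- 'raw' removed per NIT 1
                NEW.embedding_status := 'skipped';
            ELSE
                NEW.embedding_status := 'pending';
                PERFORM pg_notify('embedding_queue', NEW.id::text);
            END IF;
            RETURN NEW;
        END;
        $body$ LANGUAGE plpgsql
    """)
    # Blocker 2: block_type is in the column list so type changes re-fire the trigger
    op.execute(
        "CREATE TRIGGER blocks_embedding_refresh_trigger "
        "BEFORE INSERT OR UPDATE OF content, block_type ON blocks "
        "FOR EACH ROW EXECUTE FUNCTION blocks_embedding_refresh()"
    )


def downgrade():
    from alembic import op

    op.execute("DROP TRIGGER IF EXISTS blocks_embedding_refresh_trigger ON blocks")
    op.execute("DROP FUNCTION IF EXISTS blocks_embedding_refresh()")
    op.execute("DROP INDEX IF EXISTS ix_blocks_embedding")
    op.execute("ALTER TABLE blocks DROP COLUMN embedding_status")
    op.execute("ALTER TABLE blocks DROP COLUMN embedding")
    # The vector extension is left installed; other schemas may depend on it.
```

Note the `down_revision` chain: because main has moved past `j0e1f2g3h4i5`, rebasing the migration onto `k1f2g3h4i5j6` is what keeps `alembic upgrade head` linear.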