Add pgvector extension + embedding schema migration #121

Closed
opened 2026-03-09 03:16:29 +00:00 by forgejo_admin · 1 comment

Lineage

plan-2026-02-26-tf-modularize-postgres → Phase 6 → Phase 6b (pgvector extension + schema migration)

Repo

forgejo_admin/pal-e-docs

User Story

As a platform operator
I want the blocks table to support vector embeddings
So that semantic search can find conceptually related content, not just keyword matches

Context

Phase 6a deployed Ollama with GPU acceleration (PR #27 on pal-e-platform). The embedding model qwen3-embedding:4b is live and returning 768-dimensional vectors. This phase adds the database schema to store those vectors.

The CNPG cluster uses ghcr.io/cloudnative-pg/postgresql:17.4-1, which includes pgvector 0.8.0 (confirmed available but not yet installed). Current Alembic head is j0e1f2g3h4i5 (blocks + compiled pages).

Key decisions from decision-phase6-vector-search-architecture:

  • 768 dimensions (configurable, Qwen3-Embedding-4B default)
  • Per-block embedding (not per-note) — paragraph, list, heading, table, code blocks get embedded. Mermaid and raw HTML are skipped.
  • Async pipeline via LISTEN/NOTIFY — a Postgres trigger sets status and notifies on block changes. A separate worker (Phase 6c) consumes these.

Follow the same pattern as i9d0e1f2g3h4_add_search_vector.py — Alembic migration using op.execute() for raw SQL.
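Following that pattern, the migration might be skeletoned as below. This is a sketch only: the file name and revision IDs come from this issue, the index name `ix_blocks_embedding` is an assumption, and the trigger DDL is omitted for brevity. The `alembic` import is deferred into the functions so the sketch can be read standalone.

```python
"""Sketch of alembic/versions/k1f2g3h4i5j6_add_vector_embeddings.py.

Follows the op.execute() raw-SQL pattern of the search_vector migration.
Index name is illustrative; trigger DDL omitted here.
"""

revision = "k1f2g3h4i5j6"
down_revision = "j0e1f2g3h4i5"
branch_labels = None
depends_on = None


def upgrade() -> None:
    # Deferred import so the sketch can be inspected without Alembic.
    from alembic import op

    op.execute("CREATE EXTENSION IF NOT EXISTS vector")
    op.execute("ALTER TABLE blocks ADD COLUMN embedding vector(768)")
    op.execute(
        "ALTER TABLE blocks ADD COLUMN embedding_status varchar(20) "
        "DEFAULT 'pending'"
    )
    op.execute(
        "CREATE INDEX ix_blocks_embedding ON blocks "
        "USING hnsw (embedding vector_cosine_ops)"
    )


def downgrade() -> None:
    from alembic import op

    op.execute("DROP INDEX IF EXISTS ix_blocks_embedding")
    op.execute("ALTER TABLE blocks DROP COLUMN embedding_status")
    op.execute("ALTER TABLE blocks DROP COLUMN embedding")
    op.execute("DROP EXTENSION IF EXISTS vector")
```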

File Targets

Files the agent should create:

  • alembic/versions/k1f2g3h4i5j6_add_vector_embeddings.py — new Alembic migration (chain from j0e1f2g3h4i5)

Files the agent should modify:

  • src/pal_e_docs/models.py — add embedding and embedding_status columns to Block model (SQLAlchemy column definitions for ORM awareness, even though migration handles DDL)

Files the agent should NOT touch:

  • k8s/ — CNPG Cluster CRD does not need modification (pgvector is available in the image, just needs CREATE EXTENSION)
  • API routes — no new endpoints in this phase
  • alembic/env.py — should not need changes

Acceptance Criteria

  • Migration creates the pgvector extension: CREATE EXTENSION IF NOT EXISTS vector
  • embedding vector(768) column added to blocks table (nullable)
  • embedding_status varchar(20) column added to blocks table with default 'pending'
  • HNSW index created on embedding column: CREATE INDEX ... USING hnsw (embedding vector_cosine_ops)
  • Postgres trigger function: on block INSERT or UPDATE of content, sets embedding_status = 'pending' and calls NOTIFY embedding_queue
  • Trigger skips mermaid and raw block types (sets embedding_status = 'skipped')
  • Downgrade function reverses all changes cleanly
  • SQLAlchemy Block model updated with column definitions
  • Existing tests still pass
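The trigger criteria above could fit together roughly as follows. The function and trigger names are illustrative, as is the NOTIFY payload (`NEW.id::text` assumes the blocks table has an `id` column); `pg_notify()` is the function form of `NOTIFY embedding_queue`.

```sql
-- Sketch of the trigger function and trigger (names illustrative).
CREATE OR REPLACE FUNCTION notify_embedding_queue() RETURNS trigger AS $$
BEGIN
    IF NEW.block_type IN ('mermaid', 'raw') THEN
        NEW.embedding_status := 'skipped';
    ELSE
        NEW.embedding_status := 'pending';
        -- Equivalent to NOTIFY embedding_queue, with the block id as payload.
        PERFORM pg_notify('embedding_queue', NEW.id::text);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER blocks_embedding_trigger
    BEFORE INSERT OR UPDATE OF content ON blocks
    FOR EACH ROW EXECUTE FUNCTION notify_embedding_queue();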

Test Expectations

  • alembic upgrade head succeeds against the running database
  • alembic downgrade -1 reverses cleanly
  • Existing tests pass: pytest tests/ -v
  • After migration, inserting a block row should set embedding_status = 'pending' and fire NOTIFY
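The NOTIFY expectation can be checked by hand in a psql session against the migrated database. The column list and values here are hypothetical and would need to match the real blocks schema:

```sql
-- Illustrative manual check; column names/values are assumed.
LISTEN embedding_queue;
INSERT INTO blocks (note_id, block_type, content)
VALUES ('00000000-0000-0000-0000-000000000001', 'paragraph', 'hello');
-- psql should then report an asynchronous notification on "embedding_queue".
SELECT embedding_status FROM blocks WHERE content = 'hello';  -- expect 'pending'
```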

Constraints

  • Use op.execute() for raw SQL in the Alembic migration (same pattern as search_vector migration)
  • Revision ID should follow the existing pattern (sequential hex-like IDs)
  • Chain from down_revision = "j0e1f2g3h4i5"
  • The pgvector Python package (from pgvector.sqlalchemy import Vector) may be needed for the SQLAlchemy model — add to pyproject.toml if so
  • Do NOT create the embedding worker or API endpoints — that's Phase 6c/6d

Checklist

  • PR opened
  • Tests pass
  • No unrelated changes

Related

  • phase-postgres-6-vector-search — parent phase
  • decision-phase6-vector-search-architecture — architectural decisions
  • i9d0e1f2g3h4_add_search_vector.py — reference migration (same pattern)
  • Phase 6a (PR #27, pal-e-platform) — Ollama + GPU now live
PR #122 Review

BLOCKERS

1. Alembic revision ID collision -- migration will crash on alembic upgrade head

The new migration file k1f2g3h4i5j6_add_vector_embeddings.py uses revision ID k1f2g3h4i5j6, but this revision ID is already taken on main by k1f2g3h4i5j6_add_block_anchor_ids_and_unique_constraint.py. Both migrations also declare down_revision = "j0e1f2g3h4i5".

This means Alembic will fail with a duplicate revision error. Running alembic upgrade head is impossible.

Fix required:

  • Assign a new, unique revision ID to the vector embeddings migration (e.g., l2g3h4i5j6k7).
  • Set down_revision to k1f2g3h4i5j6 (the anchor_ids migration), since that is the actual current head of the migration chain.
  • Rename the file accordingly.
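Concretely, the corrected migration header would look something like this; the new revision ID is the reviewer's example, not a requirement:

```python
# Corrected revision header for the renamed migration file
# l2g3h4i5j6k7_add_vector_embeddings.py (new ID is illustrative).
revision = "l2g3h4i5j6k7"       # new, unique revision ID
down_revision = "k1f2g3h4i5j6"  # the anchor_ids migration, the actual head
```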

2. Trigger only fires on content changes, but block_type changes can also affect embedding eligibility

The trigger is defined as:

BEFORE INSERT OR UPDATE OF content ON blocks

The BlockUpdate schema allows changing block_type without changing content. If a block's type is changed from paragraph to mermaid (or vice versa), the trigger will NOT fire, and embedding_status will not be updated to skipped (or back to pending). This creates inconsistent state.

Fix required: Either:

  • Add block_type to the trigger column list: BEFORE INSERT OR UPDATE OF content, block_type
  • Or document this as an accepted limitation and handle it in the embedding worker.
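If the first option is taken, the trigger definition would change to something like the following (trigger and function names assumed):

```sql
-- Fire on block_type changes too, so toggling a block to or from
-- 'mermaid' re-evaluates its embedding_status.
CREATE TRIGGER blocks_embedding_trigger
    BEFORE INSERT OR UPDATE OF content, block_type ON blocks
    FOR EACH ROW EXECUTE FUNCTION notify_embedding_queue();
```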

NITS

1. raw is not a valid block type in the system

The trigger function skips blocks with block_type IN ('mermaid', 'raw'), but the BlockType Literal in schemas.py only defines: heading, paragraph, list, table, code, mermaid. There is no raw type anywhere in the parser, schemas, or models. Including raw in the skip list is harmless dead code, but it suggests the spec references a block type that doesn't exist. Consider removing it or adding a code comment explaining that it's a forward-looking guard.

2. Test Plan has unchecked items that require live DB validation

The PR body lists 3 unchecked test plan items (alembic upgrade, downgrade, NOTIFY behavior). These require a live CNPG database and cannot be validated in the SQLite test suite. This follows the same pattern as the search_vector migration (which also has no SQLite tests), so it's consistent -- but worth noting that these remain unverified until deployment.

3. Model column uses bare mapped_column for embedding instead of Mapped type annotation

The embedding column is defined as:

embedding = mapped_column(Vector(768), nullable=True)

All other columns on Block use the Mapped[T] annotation pattern (e.g., block_type: Mapped[str] = mapped_column(...)). The Vector type from pgvector may not support Mapped typing, which would make this acceptable, but it's worth confirming. If Mapped can be used, it should be for consistency.

SOP COMPLIANCE

  • [ ] Branch named after issue -- Branch is feat/pgvector-embedding-schema, not issue-121 or 121-*. This deviates from convention but is descriptive.
  • [x] PR body has Summary, Changes, Test Plan, Related sections
  • [x] Related section references the plan slug (plan-2026-02-26-tf-modularize-postgres)
  • [x] No secrets committed
  • [x] No unnecessary file changes (3 files, all directly relevant)
  • [x] Commit messages are descriptive
  • [ ] Migration chains correctly from current head -- BROKEN (see Blocker #1)

VERDICT: NOT APPROVED

Two blockers must be resolved before merge:

  1. Revision ID collision and broken migration chain (critical -- deployment will fail).
  2. Trigger column list missing block_type (semantic correctness issue).