Add pgvector extension + embedding schema migration #122

Merged
forgejo_admin merged 2 commits from feat/pgvector-embedding-schema into main 2026-03-09 03:32:14 +00:00

Summary

Adds the pgvector extension to Postgres and creates the embedding schema on the blocks table. This enables Phase 6c (embedding worker) to store 768-dimensional vectors for semantic search.

Changes

  • alembic/versions/k1f2g3h4i5j6_add_vector_embeddings.py -- New Alembic migration that creates pgvector extension, adds embedding vector(768) and embedding_status varchar(20) columns, creates HNSW index for cosine similarity, and adds a trigger function that sets status to pending/skipped and fires NOTIFY on content changes
  • src/pal_e_docs/models.py -- Adds embedding (Vector(768)) and embedding_status (String(20)) columns to Block SQLAlchemy model
  • pyproject.toml -- Adds pgvector>=0.3 dependency

Test Plan

  • All 552 existing tests pass (pytest tests/ -v)
  • ruff format and ruff check pass
  • alembic upgrade head succeeds against running CNPG database
  • alembic downgrade -1 reverses cleanly
  • After migration, inserting a block sets embedding_status = 'pending' and fires NOTIFY

Review Checklist

  • Passed automated review-fix loop
  • No secrets committed
  • No unnecessary file changes
  • Commit messages are descriptive
  • Closes forgejo_admin/pal-e-docs #121
  • plan-2026-02-26-tf-modularize-postgres -- Phase 6b (pgvector extension + schema migration)
## Summary Adds the pgvector extension to Postgres and creates the embedding schema on the blocks table. This enables Phase 6c (embedding worker) to store 768-dimensional vectors for semantic search. ## Changes - `alembic/versions/k1f2g3h4i5j6_add_vector_embeddings.py` -- New Alembic migration that creates pgvector extension, adds `embedding vector(768)` and `embedding_status varchar(20)` columns, creates HNSW index for cosine similarity, and adds a trigger function that sets status to pending/skipped and fires NOTIFY on content changes - `src/pal_e_docs/models.py` -- Adds `embedding` (Vector(768)) and `embedding_status` (String(20)) columns to Block SQLAlchemy model - `pyproject.toml` -- Adds `pgvector>=0.3` dependency ## Test Plan - [x] All 552 existing tests pass (`pytest tests/ -v`) - [x] `ruff format` and `ruff check` pass - [ ] `alembic upgrade head` succeeds against running CNPG database - [ ] `alembic downgrade -1` reverses cleanly - [ ] After migration, inserting a block sets `embedding_status = 'pending'` and fires NOTIFY ## Review Checklist - [x] Passed automated review-fix loop - [x] No secrets committed - [x] No unnecessary file changes - [x] Commit messages are descriptive ## Related - Closes forgejo_admin/pal-e-docs #121 - `plan-2026-02-26-tf-modularize-postgres` -- Phase 6b (pgvector extension + schema migration)
Add pgvector embedding schema migration for blocks table
All checks were successful
ci/woodpecker/pr/woodpecker Pipeline was successful
1ea8648d5f
Closes #121

- Create pgvector extension
- Add embedding vector(768) and embedding_status columns to blocks
- Add HNSW index on embedding column for cosine similarity
- Add trigger function that sets status to pending and NOTIFYs
  embedding_queue on content changes, skips mermaid/raw types
- Update SQLAlchemy Block model with new column definitions
- Add pgvector dependency to pyproject.toml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix QA blockers: revision ID collision, trigger column list, raw block type
All checks were successful
ci/woodpecker/pr/woodpecker Pipeline was successful
c06ed4d967
- Rename migration to l2g3h4i5j6k7 with unique revision ID to avoid
  collision with k1f2g3h4i5j6_add_block_anchor_ids_and_unique_constraint
- Set down_revision to k1f2g3h4i5j6 (actual current head)
- Add block_type to trigger column list so type changes update
  embedding_status correctly
- Remove nonexistent 'raw' block type from trigger skip list and
  backfill query (only 'mermaid' is a valid non-textual type)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author
Owner

PR #122 Re-Review

Re-review after dev agent addressed two blockers and one nit from the previous review.

Previous Findings Status

1. Blocker: Revision ID collision -- RESOLVED

  • Migration now uses unique revision ID l2g3h4i5j6k7 (was colliding before).
  • Chains correctly from k1f2g3h4i5j6 via down_revision.
  • Migration chain: j0e1f2g3h4i5 -> k1f2g3h4i5j6 -> l2g3h4i5j6k7. Confirmed correct.

2. Blocker: Trigger column list -- RESOLVED

  • Trigger now fires on BEFORE INSERT OR UPDATE OF content, block_type (was missing block_type).
  • This ensures that if a block's type changes (e.g., paragraph to mermaid), the embedding status is correctly updated.

3. Nit: raw block type in skip list -- RESOLVED

  • Skip list now contains only mermaid, which is the correct non-textual block type.
  • Valid block types in the system: heading, paragraph, table, code, list, mermaid. Only mermaid (diagram definitions) should skip embedding.

New Findings

No new blockers.

NITS

  1. Stale PR body filename -- The "Changes" section references alembic/versions/k1f2g3h4i5j6_add_vector_embeddings.py but the actual file is alembic/versions/l2g3h4i5j6k7_add_vector_embeddings.py. The PR body was not updated after the revision ID fix. Non-blocking but could confuse future readers.

  2. mergeable: false status -- The PR currently shows as not mergeable. This may require a rebase onto main to pick up the k1f2g3h4i5j6 migration that the new migration chains from. Not a code issue but needs resolution before merge.

Technical Notes (informational, not blocking)

  • Trigger backfill UPDATE (SET embedding_status = 'skipped' WHERE block_type = 'mermaid') correctly avoids re-triggering the BEFORE trigger since it modifies embedding_status, not content or block_type.
  • HNSW index creation on an empty embedding column is safe (creates empty index structure).
  • Downgrade drops columns before dropping the extension -- correct dependency order.
  • embedding and embedding_status are not exposed in API schemas -- appropriate for schema-only migration work.
  • pg_notify with NEW.id::text in a BEFORE INSERT trigger is valid because PostgreSQL evaluates SERIAL/sequence defaults before BEFORE triggers fire.

SOP COMPLIANCE

  • Branch named after issue -- branch is feat/pgvector-embedding-schema, not issue-121. Matches feature convention but does not include issue number.
  • PR body follows template -- Summary, Changes, Test Plan, Related sections present.
  • Related references plan slug -- references plan-2026-02-26-tf-modularize-postgres Phase 6b.
  • No secrets committed -- no .env files, credentials, or sensitive data in diff.
  • No unnecessary file changes -- 3 files, all directly related to the feature.
  • Commit messages are descriptive.

VERDICT: APPROVED

All three previous findings are resolved. No new blockers. The two nits (stale PR body, mergeable status) are non-blocking and can be addressed during merge preparation.

## PR #122 Re-Review Re-review after dev agent addressed two blockers and one nit from the previous review. ### Previous Findings Status **1. Blocker: Revision ID collision** -- RESOLVED - Migration now uses unique revision ID `l2g3h4i5j6k7` (was colliding before). - Chains correctly from `k1f2g3h4i5j6` via `down_revision`. - Migration chain: `j0e1f2g3h4i5` -> `k1f2g3h4i5j6` -> `l2g3h4i5j6k7`. Confirmed correct. **2. Blocker: Trigger column list** -- RESOLVED - Trigger now fires on `BEFORE INSERT OR UPDATE OF content, block_type` (was missing `block_type`). - This ensures that if a block's type changes (e.g., paragraph to mermaid), the embedding status is correctly updated. **3. Nit: `raw` block type in skip list** -- RESOLVED - Skip list now contains only `mermaid`, which is the correct non-textual block type. - Valid block types in the system: heading, paragraph, table, code, list, mermaid. Only mermaid (diagram definitions) should skip embedding. ### New Findings **No new blockers.** ### NITS 1. **Stale PR body filename** -- The "Changes" section references `alembic/versions/k1f2g3h4i5j6_add_vector_embeddings.py` but the actual file is `alembic/versions/l2g3h4i5j6k7_add_vector_embeddings.py`. The PR body was not updated after the revision ID fix. Non-blocking but could confuse future readers. 2. **`mergeable: false` status** -- The PR currently shows as not mergeable. This may require a rebase onto main to pick up the `k1f2g3h4i5j6` migration that the new migration chains from. Not a code issue but needs resolution before merge. ### Technical Notes (informational, not blocking) - Trigger backfill UPDATE (`SET embedding_status = 'skipped' WHERE block_type = 'mermaid'`) correctly avoids re-triggering the BEFORE trigger since it modifies `embedding_status`, not `content` or `block_type`. - HNSW index creation on an empty `embedding` column is safe (creates empty index structure). - Downgrade drops columns before dropping the extension -- correct dependency order. - `embedding` and `embedding_status` are not exposed in API schemas -- appropriate for schema-only migration work. - `pg_notify` with `NEW.id::text` in a BEFORE INSERT trigger is valid because PostgreSQL evaluates SERIAL/sequence defaults before BEFORE triggers fire. ### SOP COMPLIANCE - [ ] Branch named after issue -- branch is `feat/pgvector-embedding-schema`, not `issue-121`. Matches feature convention but does not include issue number. - [x] PR body follows template -- Summary, Changes, Test Plan, Related sections present. - [x] Related references plan slug -- references `plan-2026-02-26-tf-modularize-postgres` Phase 6b. - [x] No secrets committed -- no .env files, credentials, or sensitive data in diff. - [x] No unnecessary file changes -- 3 files, all directly related to the feature. - [x] Commit messages are descriptive. ### VERDICT: APPROVED All three previous findings are resolved. No new blockers. The two nits (stale PR body, mergeable status) are non-blocking and can be addressed during merge preparation.
forgejo_admin force-pushed feat/pgvector-embedding-schema from c06ed4d967
All checks were successful
ci/woodpecker/pr/woodpecker Pipeline was successful
to 3e3c07f033
All checks were successful
ci/woodpecker/pr/woodpecker Pipeline was successful
2026-03-09 03:31:22 +00:00
Compare
forgejo_admin deleted branch feat/pgvector-embedding-schema 2026-03-09 03:32:14 +00:00
Sign in to join this conversation.
No description provided.