ldraney/pal-e-api

Archived

Fork 0

Add full-text search: tsvector column, GIN index, trigger, search endpoint #83

New issue

Closed

opened 2026-03-06 20:33:57 +00:00 by forgejo_admin · 2 comments

forgejo_admin commented

2026-03-06 20:33:57 +00:00

Contributor

Plan

plan-2026-02-26-tf-modularize-postgres -- Phase 5

Repo

forgejo_admin/pal-e-docs

User Story

As an AI agent querying platform knowledge
I want to search notes by text content with ranked results and snippets
So that I can find relevant knowledge in one call instead of enumerating all notes

Context

There is currently no search anywhere — not in the API, not in MCP tools. AI agents call list_notes(tags=...) then get_note() on each result — 12+ API calls and thousands of tokens per knowledge lookup. This is unsustainable as the knowledge base grows.

Postgres is live (Phase 3 completed). We can now use tsvector for full-text search with weighted ranking and snippet extraction.

Design decisions (from planning):

Postgres trigger auto-builds the tsvector on every INSERT/UPDATE — no app code needed to keep it in sync
Regex HTML stripping inside the trigger (regexp_replace(html_content, '<[^>]+>', ' ', 'g')) — simpler than Python-side extraction, sufficient for nh3-sanitized HTML
Weighted tsvector: title (A), stripped html_content (B), slug with hyphens replaced by spaces (C)
Separate search endpoint (not a filter on list_notes) — search has ranking, snippets, different performance characteristics
Returns summaries + snippets, NOT full html_content — token efficiency is the whole point

File Targets

Files to modify or create:

alembic/versions/XXXX_add_search_vector.py — new migration: add search_vector column (tsvector), GIN index, trigger function, backfill existing rows
app/models.py — add search_vector column to Note model (server-managed, no app writes)
app/routes/notes.py — add GET /notes/search endpoint
app/schemas.py — add NoteSearchResult schema (slug, title, note_type, status, project, headline, rank)

Files NOT to touch:

app/autolink.py, app/wrap_tables.py — rendering transforms, unrelated
Existing endpoints — don't modify list_notes or get_note behavior

Acceptance Criteria

GET /notes/search?q=secrets+management returns relevant notes ranked by ts_rank
Results include headline field with match context (ts_headline, ~50 words)
Results include rank score
Results do NOT include full html_content (summaries only)
Optional filters work: note_type, project, status, tags (all optional)
limit parameter works (default 10)
Creating or updating a note auto-updates search_vector (trigger)
All 246 existing notes are backfilled during migration
Existing tests pass

Test Expectations

Unit test: search returns expected notes for known query terms
Unit test: search respects filters (note_type, project)
Unit test: search returns empty list for nonsense query
Test: creating a new note makes it searchable immediately
Run: pytest tests/

Constraints

Match existing route style in routes/notes.py
Match existing schema style in schemas.py
Migration must use op.execute() for raw SQL (trigger, function, backfill)
Use to_tsquery('english', ...) — handle multi-word queries by joining with & (AND)
The search_vector column should be nullable and server-managed — don't set it in app code
If existing tests use SQLite, the search endpoint should return 501 or be skipped on SQLite (tsvector is Postgres-only)

Draft SQL (starting point)

Trigger function:

CREATE OR REPLACE FUNCTION notes_search_vector_update() RETURNS trigger AS $$
BEGIN
  NEW.search_vector :=
    setweight(to_tsvector('english', COALESCE(NEW.title, '')), 'A') ||
    setweight(to_tsvector('english', COALESCE(
      regexp_replace(NEW.html_content, '<[^>]+>', ' ', 'g'), ''
    )), 'B') ||
    setweight(to_tsvector('english', COALESCE(
      replace(NEW.slug, '-', ' '), ''
    )), 'C');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER notes_search_vector_trigger
  BEFORE INSERT OR UPDATE OF title, html_content, slug
  ON notes
  FOR EACH ROW
  EXECUTE FUNCTION notes_search_vector_update();

Search query:

SELECT
  n.slug, n.title, n.note_type, n.status,
  ts_rank(n.search_vector, query) AS rank,
  ts_headline('english', regexp_replace(n.html_content, '<[^>]+>', ' ', 'g'), query,
    'MaxWords=50, MinWords=20, StartSel=**, StopSel=**') AS headline
FROM notes n, to_tsquery('english', :q) query
WHERE n.search_vector @@ query
ORDER BY rank DESC
LIMIT :limit;

Checklist

PR opened
Tests pass
No unrelated changes

pal-e-docs project
Phase note: phase-postgres-5-fulltext-search
Depends-on: Phase 3 (Postgres migration) — COMPLETED
Blocks: forgejo_admin/pal-e-docs-mcp search_notes tool (separate issue)

### Plan `plan-2026-02-26-tf-modularize-postgres` -- Phase 5 ### Repo `forgejo_admin/pal-e-docs` ### User Story As an AI agent querying platform knowledge I want to search notes by text content with ranked results and snippets So that I can find relevant knowledge in one call instead of enumerating all notes ### Context There is currently **no search** anywhere — not in the API, not in MCP tools. AI agents call `list_notes(tags=...)` then `get_note()` on each result — 12+ API calls and thousands of tokens per knowledge lookup. This is unsustainable as the knowledge base grows. Postgres is live (Phase 3 completed). We can now use `tsvector` for full-text search with weighted ranking and snippet extraction. **Design decisions (from planning):** - **Postgres trigger** auto-builds the tsvector on every INSERT/UPDATE — no app code needed to keep it in sync - **Regex HTML stripping** inside the trigger (`regexp_replace(html_content, '<[^>]+>', ' ', 'g')`) — simpler than Python-side extraction, sufficient for nh3-sanitized HTML - **Weighted tsvector**: title (A), stripped html_content (B), slug with hyphens replaced by spaces (C) - **Separate search endpoint** (not a filter on list_notes) — search has ranking, snippets, different performance characteristics - **Returns summaries + snippets**, NOT full html_content — token efficiency is the whole point ### File Targets Files to modify or create: - `alembic/versions/XXXX_add_search_vector.py` — new migration: add `search_vector` column (tsvector), GIN index, trigger function, backfill existing rows - `app/models.py` — add `search_vector` column to Note model (server-managed, no app writes) - `app/routes/notes.py` — add `GET /notes/search` endpoint - `app/schemas.py` — add `NoteSearchResult` schema (slug, title, note_type, status, project, headline, rank) Files NOT to touch: - `app/autolink.py`, `app/wrap_tables.py` — rendering transforms, unrelated - Existing endpoints — don't modify list_notes or get_note behavior ### Acceptance Criteria - [ ] `GET /notes/search?q=secrets+management` returns relevant notes ranked by ts_rank - [ ] Results include `headline` field with match context (ts_headline, ~50 words) - [ ] Results include `rank` score - [ ] Results do NOT include full `html_content` (summaries only) - [ ] Optional filters work: `note_type`, `project`, `status`, `tags` (all optional) - [ ] `limit` parameter works (default 10) - [ ] Creating or updating a note auto-updates search_vector (trigger) - [ ] All 246 existing notes are backfilled during migration - [ ] Existing tests pass ### Test Expectations - [ ] Unit test: search returns expected notes for known query terms - [ ] Unit test: search respects filters (note_type, project) - [ ] Unit test: search returns empty list for nonsense query - [ ] Test: creating a new note makes it searchable immediately - Run: `pytest tests/` ### Constraints - Match existing route style in `routes/notes.py` - Match existing schema style in `schemas.py` - Migration must use `op.execute()` for raw SQL (trigger, function, backfill) - Use `to_tsquery('english', ...)` — handle multi-word queries by joining with `&` (AND) - The `search_vector` column should be nullable and server-managed — don't set it in app code - If existing tests use SQLite, the search endpoint should return 501 or be skipped on SQLite (tsvector is Postgres-only) ### Draft SQL (starting point) **Trigger function:** ```sql CREATE OR REPLACE FUNCTION notes_search_vector_update() RETURNS trigger AS $$ BEGIN NEW.search_vector := setweight(to_tsvector('english', COALESCE(NEW.title, '')), 'A') || setweight(to_tsvector('english', COALESCE( regexp_replace(NEW.html_content, '<[^>]+>', ' ', 'g'), '' )), 'B') || setweight(to_tsvector('english', COALESCE( replace(NEW.slug, '-', ' '), '' )), 'C'); RETURN NEW; END; $$ LANGUAGE plpgsql; CREATE TRIGGER notes_search_vector_trigger BEFORE INSERT OR UPDATE OF title, html_content, slug ON notes FOR EACH ROW EXECUTE FUNCTION notes_search_vector_update(); ``` **Search query:** ```sql SELECT n.slug, n.title, n.note_type, n.status, ts_rank(n.search_vector, query) AS rank, ts_headline('english', regexp_replace(n.html_content, '<[^>]+>', ' ', 'g'), query, 'MaxWords=50, MinWords=20, StartSel=**, StopSel=**') AS headline FROM notes n, to_tsquery('english', :q) query WHERE n.search_vector @@ query ORDER BY rank DESC LIMIT :limit; ``` ### Checklist - [ ] PR opened - [ ] Tests pass - [ ] No unrelated changes ### Related - `pal-e-docs` project - Phase note: `phase-postgres-5-fulltext-search` - Depends-on: Phase 3 (Postgres migration) — COMPLETED - Blocks: `forgejo_admin/pal-e-docs-mcp` search_notes tool (separate issue)

forgejo_admin referenced this issue from a commit

2026-03-06 20:40:51 +00:00

Add full-text search: tsvector column, GIN index, trigger, search endpoint

forgejo_admin referenced this issue

2026-03-06 20:41:23 +00:00

Add full-text search: tsvector column, GIN index, trigger, search endpoint #84

forgejo_admin commented

2026-03-06 20:44:55 +00:00

Author

Contributor

PR #84 Review

BLOCKERS

None. The code is well-structured, the migration is correct, and the security-sensitive tsquery sanitization is properly handled.

NITS

Empty result on all-special-char queries (/home/ldraney/pal-e-docs/src/pal_e_docs/routes/notes.py, line 195-196): If someone searches q=!!!+???, the min_length=1 validation passes, but after stripping non-word characters the words list is empty and the endpoint returns [] with 200. Consider returning 422 for consistency with the validation intent, or document this as intentional behavior.
\w includes underscores (line 193): The regex r"[^\w]" preserves underscores and digits, so q=___ produces a tsquery for ___. Not harmful, but worth noting -- Postgres will try to match the literal token ___ which will never match anything meaningful.
HTML stripping regex is approximate (migration line 36 and route line 239): regexp_replace(html_content, '<[^>]+>', ' ', 'g') won't correctly strip tags containing > inside attributes (e.g., <div data-x="a>b">). This is fine for search indexing quality -- it is not a security boundary -- but could produce minor noise in headline snippets.
Test coverage is thin: Only 3 tests, all exercising the SQLite 501 guard and FastAPI validation. No Postgres integration tests exist for: actual ranked search results, filter parameters (note_type, project, status, tags), headline snippet generation, the trigger function, or sanitization edge cases. The PR body correctly marks the Postgres test items as unchecked. Consider whether a Postgres test fixture (or a follow-up issue for integration tests) is warranted.
Hardcoded 'english' text search configuration: The trigger, backfill, and query all use to_tsvector('english', ...). If the knowledge base ever needs multilingual support, this will need to be parameterized. Fine for now, but worth a comment in the migration.

SOP COMPLIANCE

Branch named after issue (83-add-fulltext-search-tsvector references issue #83)
PR body has: Summary, Changes, Test Plan, Related
Related section references the plan slug (plan-2026-02-26-tf-modularize-postgres)
No secrets committed
No unnecessary file changes (5 files, all scoped to the feature)
Commit messages are descriptive

NOTES

Route ordering is correct: /search (line 175) is registered before /{slug} (line 252), preventing slug-match conflicts.
The db.bind.dialect.name check correctly gates Postgres-only functionality. Verified the test conftest binds to SQLite.
Alembic migration chain is valid: h8c9d0e1f2g3 -> i9d0e1f2g3h4.
Downgrade is clean and idempotent (IF EXISTS on all drops).
tsquery sanitization (split on whitespace, strip non-word chars, filter empties, join with &) is a solid defense against tsquery operator injection. All filter parameters use SQLAlchemy :param bindings -- no SQL injection vectors.
The decision to NOT map search_vector in the ORM model is sound -- keeps SQLite compatibility for the test suite.

VERDICT: APPROVED

## PR #84 Review ### BLOCKERS None. The code is well-structured, the migration is correct, and the security-sensitive tsquery sanitization is properly handled. ### NITS 1. **Empty result on all-special-char queries** (`/home/ldraney/pal-e-docs/src/pal_e_docs/routes/notes.py`, line 195-196): If someone searches `q=!!!+???`, the `min_length=1` validation passes, but after stripping non-word characters the `words` list is empty and the endpoint returns `[]` with 200. Consider returning 422 for consistency with the validation intent, or document this as intentional behavior. 2. **`\w` includes underscores** (line 193): The regex `r"[^\w]"` preserves underscores and digits, so `q=___` produces a tsquery for `___`. Not harmful, but worth noting -- Postgres will try to match the literal token `___` which will never match anything meaningful. 3. **HTML stripping regex is approximate** (migration line 36 and route line 239): `regexp_replace(html_content, '<[^>]+>', ' ', 'g')` won't correctly strip tags containing `>` inside attributes (e.g., `<div data-x="a>b">`). This is fine for search indexing quality -- it is not a security boundary -- but could produce minor noise in headline snippets. 4. **Test coverage is thin**: Only 3 tests, all exercising the SQLite 501 guard and FastAPI validation. No Postgres integration tests exist for: actual ranked search results, filter parameters (`note_type`, `project`, `status`, `tags`), headline snippet generation, the trigger function, or sanitization edge cases. The PR body correctly marks the Postgres test items as unchecked. Consider whether a Postgres test fixture (or a follow-up issue for integration tests) is warranted. 5. **Hardcoded `'english'` text search configuration**: The trigger, backfill, and query all use `to_tsvector('english', ...)`. If the knowledge base ever needs multilingual support, this will need to be parameterized. Fine for now, but worth a comment in the migration. ### SOP COMPLIANCE - [x] Branch named after issue (`83-add-fulltext-search-tsvector` references issue #83) - [x] PR body has: Summary, Changes, Test Plan, Related - [x] Related section references the plan slug (`plan-2026-02-26-tf-modularize-postgres`) - [x] No secrets committed - [x] No unnecessary file changes (5 files, all scoped to the feature) - [x] Commit messages are descriptive ### NOTES - Route ordering is correct: `/search` (line 175) is registered before `/{slug}` (line 252), preventing slug-match conflicts. - The `db.bind.dialect.name` check correctly gates Postgres-only functionality. Verified the test conftest binds to SQLite. - Alembic migration chain is valid: `h8c9d0e1f2g3` -> `i9d0e1f2g3h4`. - Downgrade is clean and idempotent (`IF EXISTS` on all drops). - tsquery sanitization (split on whitespace, strip non-word chars, filter empties, join with `&`) is a solid defense against tsquery operator injection. All filter parameters use SQLAlchemy `:param` bindings -- no SQL injection vectors. - The decision to NOT map `search_vector` in the ORM model is sound -- keeps SQLite compatibility for the test suite. ### VERDICT: APPROVED

forgejo_admin commented

2026-03-06 20:48:57 +00:00

Author

Contributor

Closed by PR #84 (squash merged). tsvector column, GIN index, trigger, and GET /notes/search endpoint deployed.

Next: deploy to cluster, then spawn dev agent for pal-e-docs-mcp #18 (search_notes MCP tool).

Closed by PR #84 (squash merged). tsvector column, GIN index, trigger, and `GET /notes/search` endpoint deployed. Next: deploy to cluster, then spawn dev agent for pal-e-docs-mcp #18 (search_notes MCP tool).

forgejo_admin closed this issue

2026-03-07 12:36:45 +00:00

Commenting is not possible because the repository is archived.