6e: Add hybrid ranking to search endpoint (tsvector + pgvector)

forgejo_admin commented

2026-03-09 17:31:16 +00:00

Owner

Lineage

plan-2026-02-26-tf-modularize-postgres → Phase 6 (Vector Search) → Phase 6e (Hybrid Ranking)

Repo

forgejo_admin/pal-e-docs

User Story

As an agent querying pal-e-docs
I want a unified search endpoint that combines keyword and semantic relevance
So that search results are ranked by both exact term matches and meaning similarity

Context

Phase 5 added full-text search (tsvector, GET /notes/search). Phase 6d added semantic search (pgvector + Ollama embeddings, GET /notes/semantic-search). Currently agents must choose one or the other. Hybrid ranking combines both signals — a note that matches both keywords AND meaning should rank higher than one matching only one signal.

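The recommended approach is Reciprocal Rank Fusion (RRF). RRF is simpler than a weighted linear combination of scores (no score normalization needed) and is well established in information retrieval. Formula: RRF(d) = Σ_i 1/(k + rank_i(d)), where rank_i(d) is the 1-based rank of note d in result list i (keyword or semantic) and k is a smoothing constant, typically 60.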
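To make the formula concrete, here is a minimal sketch of RRF over two ranked lists of note ids. The function name `rrf_fuse` and the sample ids are hypothetical and not taken from the pal-e-docs codebase. The sketch assumes each backend returns its ids already ordered best-first.

```python
from __future__ import annotations

from collections import defaultdict
from collections.abc import Hashable, Iterable, Sequence


def rrf_fuse(
    rankings: Iterable[Sequence[Hashable]], k: int = 60
) -> list[tuple[Hashable, float]]:
    """Fuse best-first lists of ids with Reciprocal Rank Fusion."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))


keyword_hits = ["note-a", "note-b", "note-c"]   # hypothetical tsvector order
semantic_hits = ["note-b", "note-d", "note-a"]  # hypothetical pgvector order
fused = rrf_fuse([keyword_hits, semantic_hits])
# note-b (ranks 2 and 1) edges out note-a (ranks 1 and 3);
# notes present in only one list trail behind both.
```

Because only ranks enter the sum, the ts_rank values and vector distances never need to be put on a common scale.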

File Targets

Files to modify:

src/pal_e_docs/routes/notes.py — add mode query parameter to search endpoint (keyword/semantic/hybrid)
src/pal_e_docs/services/ or equivalent — implement hybrid ranking logic (RRF combination of tsvector and pgvector results)
tests/ — unit + integration tests for hybrid mode

Files NOT to touch:

src/pal_e_docs/services/embedding_worker.py — embeddings are already computed
Alembic migrations — no schema changes needed (tsvector and pgvector columns already exist)

Acceptance Criteria

GET /notes/search?q=hello&mode=keyword returns same results as current behavior (backward compatible)
GET /notes/search?q=hello&mode=semantic returns semantically similar results
GET /notes/search?q=hello&mode=hybrid combines both signals using RRF
Default mode is keyword (backward compatible when mode is omitted)
alpha parameter (0.0-1.0) controls weighting — 0.0 = pure keyword, 1.0 = pure semantic, 0.5 = balanced
Results include score/rank metadata
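Plain RRF has no weighting term, so the alpha criterion needs an interpretation. One possible reading, sketched below as an assumption rather than the agreed design, scales the keyword contribution by (1 - alpha) and the semantic contribution by alpha. The names `HybridHit` and `weighted_rrf` are hypothetical. The per-list ranks are returned alongside the fused score as one way to satisfy the score/rank metadata criterion.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class HybridHit:
    note_id: str
    score: float
    keyword_rank: int | None   # None when the note is absent from the keyword list
    semantic_rank: int | None  # None when the note is absent from the semantic list


def weighted_rrf(keyword_ids, semantic_ids, alpha=0.5, k=60) -> list[HybridHit]:
    """Assumed weighting: (1 - alpha) on the keyword term, alpha on the semantic term."""
    kw_rank = {nid: r for r, nid in enumerate(keyword_ids, start=1)}
    sem_rank = {nid: r for r, nid in enumerate(semantic_ids, start=1)}
    hits = []
    for nid in kw_rank.keys() | sem_rank.keys():
        score = 0.0
        if nid in kw_rank:
            score += (1.0 - alpha) / (k + kw_rank[nid])
        if nid in sem_rank:
            score += alpha / (k + sem_rank[nid])
        hits.append(HybridHit(nid, score, kw_rank.get(nid), sem_rank.get(nid)))
    # Best score first; ties broken by id so the order is deterministic.
    hits.sort(key=lambda h: (-h.score, h.note_id))
    return hits


kw = ["note-a", "note-b", "note-c"]   # hypothetical keyword order
sem = ["note-b", "note-d", "note-a"]  # hypothetical semantic order
pure_keyword = weighted_rrf(kw, sem, alpha=0.0)
balanced = weighted_rrf(kw, sem, alpha=0.5)
pure_semantic = weighted_rrf(kw, sem, alpha=1.0)
```

Under this reading, alpha=0.0 reproduces the keyword order and alpha=1.0 the semantic order, with notes found only by the other signal kept at score 0 at the tail. Whether such zero-score notes should be dropped is left open here.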

Test Expectations

Unit test: RRF ranking function correctly combines two ranked lists
Unit test: mode parameter validation (keyword/semantic/hybrid only)
Unit test: alpha parameter clamping (0.0-1.0)
Integration test: hybrid search returns results, mode=keyword matches existing behavior
Run: pytest tests/ -v -k hybrid
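The validation and clamping tests could look roughly like the sketch below. `SearchMode`, `parse_mode`, and `clamp_alpha` are hypothetical helpers invented for illustration; the real route may validate through its framework instead. The test names contain "hybrid" so that the `-k hybrid` filter above would select them.

```python
from __future__ import annotations

from enum import Enum


class SearchMode(str, Enum):
    KEYWORD = "keyword"
    SEMANTIC = "semantic"
    HYBRID = "hybrid"


def parse_mode(raw: str | None) -> SearchMode:
    """Omitted mode falls back to keyword; unknown values raise ValueError."""
    if raw is None:
        return SearchMode.KEYWORD
    return SearchMode(raw)


def clamp_alpha(alpha: float) -> float:
    """Clamp alpha into the closed interval [0.0, 1.0]."""
    return min(1.0, max(0.0, alpha))


def test_hybrid_mode_validation():
    assert parse_mode(None) is SearchMode.KEYWORD
    assert parse_mode("hybrid") is SearchMode.HYBRID
    try:
        parse_mode("fuzzy")
    except ValueError:
        pass  # expected: only keyword/semantic/hybrid are accepted
    else:
        raise AssertionError("unknown mode should be rejected")


def test_hybrid_alpha_clamping():
    assert clamp_alpha(-0.3) == 0.0
    assert clamp_alpha(0.25) == 0.25
    assert clamp_alpha(7.0) == 1.0
```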

Constraints

Backward compatible — existing GET /notes/search behavior must not change when mode is omitted
The /notes/semantic-search endpoint remains unchanged (deprecation is a separate decision)
RRF with k=60 is the recommended approach — simpler than score normalization
SDK and MCP tool updates (adding mode parameter pass-through) will be separate follow-up sub-phases
Follow existing route/service patterns in routes/notes.py
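One way to keep the default path untouched is to branch on mode before any semantic work happens, so an omitted mode never reaches the embedding backend. The sketch below is an illustration only: `search_notes` and the injected callables are hypothetical stand-ins, not the existing service functions, and the placeholder fuse step is not RRF.

```python
from __future__ import annotations

from collections.abc import Callable, Sequence

Ranker = Callable[[str], Sequence[str]]


def search_notes(
    q: str,
    mode: str = "keyword",
    *,
    keyword_search: Ranker,
    semantic_search: Ranker,
    fuse: Callable[[Sequence[str], Sequence[str]], Sequence[str]],
) -> list[str]:
    """Dispatch on mode; the keyword branch returns the keyword results as-is."""
    if mode == "keyword":
        return list(keyword_search(q))
    if mode == "semantic":
        return list(semantic_search(q))
    if mode == "hybrid":
        return list(fuse(keyword_search(q), semantic_search(q)))
    raise ValueError(f"unsupported mode: {mode!r}")


calls: list[str] = []


def fake_keyword(q: str) -> list[str]:
    calls.append("keyword")
    return ["note-a", "note-b"]


def fake_semantic(q: str) -> list[str]:
    calls.append("semantic")
    return ["note-b", "note-c"]


def fake_fuse(kw: Sequence[str], sem: Sequence[str]) -> list[str]:
    # Placeholder: order-preserving union, standing in for the real fusion step.
    return list(dict.fromkeys([*kw, *sem]))


default_result = search_notes(
    "hello",
    keyword_search=fake_keyword,
    semantic_search=fake_semantic,
    fuse=fake_fuse,
)
```

An integration test for the backward-compatibility constraint could compare `default_result` against the output of the current endpoint for the same query.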

Checklist

PR opened
Tests pass
No unrelated changes

Related

pal-e-docs — project
phase-postgres-6e-hybrid-ranking — phase note
