6c: Async embedding pipeline + backfill #129

Closed
opened 2026-03-09 03:56:52 +00:00 by forgejo_admin · 0 comments

Lineage

plan-2026-02-26-tf-modularize-postgres → Phase 6 (Vector Search) → Phase 6c

Repo

forgejo_admin/pal-e-docs

User Story

As an AI agent on the pal-e platform
I want block content automatically embedded as vectors when notes are created or updated
So that semantic search can find relevant knowledge without brute-force enumeration

Context

Phase 6b deployed pgvector schema: embedding vector(768) and embedding_status varchar(20) columns on the blocks table, plus a Postgres trigger that fires NOTIFY embedding_queue on block INSERT/UPDATE and sets embedding_status = 'pending'. Ollama is live as a platform service (Phase 6a) at http://ollama.ollama.svc.cluster.local:11434 with qwen3-embedding:4b loaded in VRAM.

The trigger is firing but nothing is listening. This phase builds the worker that consumes those notifications, calls Ollama for embeddings, and stores the vectors. It also backfills all ~5K existing blocks.

Decisions already made (from decision-phase6-vector-search-architecture):

  • Async via PostgreSQL LISTEN/NOTIFY (no Redis/Celery)
  • Per-block embedding (not per-note)
  • Separate k8s Deployment (independent failure domain, no GPU request — calls Ollama over HTTP)
  • Instruction prefix: "Represent this platform knowledge base section for retrieval: {block_text}"
  • Same Docker image as API pod, different entrypoint
  • Embed: paragraph, list, heading (with parent context), table (flattened), code. Skip: mermaid (already skipped by trigger)
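The instruction-prefix decision pins down the embedding call almost completely. A hedged sketch, using Ollama's `/api/embeddings` request shape (`model` + `prompt` returning `{"embedding": [...]}`) — the endpoint shape should be verified against the deployed Ollama version, and the function names here are illustrative:

```python
PREFIX = "Represent this platform knowledge base section for retrieval: "

def build_prompt(block_text: str) -> str:
    """Apply the agreed instruction prefix to raw block text."""
    return PREFIX + block_text

def embed(
    block_text: str,
    ollama_url: str = "http://ollama.ollama.svc.cluster.local:11434",
) -> list:
    """Fetch one 768-dim embedding vector from Ollama for a block's text."""
    # Deferred import so the sketch loads without httpx installed.
    import httpx

    resp = httpx.post(
        f"{ollama_url}/api/embeddings",
        json={"model": "qwen3-embedding:4b", "prompt": build_prompt(block_text)},
        timeout=30.0,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]
```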

File Targets

Files to create:

  • src/pal_e_docs/embedding_worker.py — main worker process (LISTEN loop, poll fallback, batch processor, health endpoint, metrics, backfill mode)
  • k8s/embedding-worker.yaml — k8s Deployment manifest

Files to modify:

  • src/pal_e_docs/config.py — add ollama_url setting (PALDOCS_OLLAMA_URL)
  • k8s/kustomization.yaml — add embedding-worker.yaml resource
  • pyproject.toml — move httpx from dev to main deps

Files NOT to touch:

  • alembic/versions/ — no new migrations (6b already created the schema + trigger)
  • src/pal_e_docs/routes/ — no API changes (that's 6d)
  • src/pal_e_docs/models.py — model already has embedding and embedding_status columns

Acceptance Criteria

  • Worker starts, connects to Postgres, issues LISTEN embedding_queue
  • On NOTIFY: queries pending blocks, extracts text (block_type-aware), calls Ollama, stores vector, sets embedding_status = 'completed'
  • Poll fallback: every 60s, sweeps blocks where embedding_status = 'pending' (catches missed notifications)
  • State machine: pending → processing → completed | error. processing prevents duplicate work on restart
  • Retry with exponential backoff on Ollama transient errors
  • Graceful SIGTERM: finishes current batch, resets any processing blocks to pending
  • Health endpoint on a lightweight HTTP server (e.g., port 8001) for k8s probes
  • Prometheus metrics: embedding_total, embedding_errors_total, embedding_duration_seconds, embedding_queue_depth
  • --backfill flag: processes all pending blocks in rate-limited batches with progress logging
  • k8s Deployment: same image, entrypoint python -m pal_e_docs.embedding_worker, no GPU request, minimal resources (10m/64Mi req, 256Mi limit)
  • Backfill run completes: all ~5K embeddable blocks have embedding_status = 'completed' and non-null embedding
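Two of the criteria above (duplicate-safe state transitions and retry with backoff) have natural small sketches. The claim query and the helper names below are illustrative choices under the 6b schema, not settled implementation:

```python
import time

# Illustrative claim query: moves a batch of pending blocks to 'processing'
# atomically, so a restarted (or second) worker can't grab the same rows.
CLAIM_SQL = """
UPDATE blocks SET embedding_status = 'processing'
WHERE id IN (
    SELECT id FROM blocks
    WHERE embedding_status = 'pending'
    ORDER BY id
    LIMIT %(batch)s
    FOR UPDATE SKIP LOCKED
)
RETURNING id
"""

def backoff_delays(base=1.0, factor=2.0, retries=5, cap=30.0):
    """Yield capped exponential delays: base, base*2, base*4, ... up to cap."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor

def with_retries(fn, is_transient, retries=5, base=1.0):
    """Run fn(); retry transient failures with backoff, re-raise the rest."""
    last = None
    for delay in backoff_delays(base=base, retries=retries):
        try:
            return fn()
        except Exception as exc:
            if not is_transient(exc):
                raise
            last = exc
            time.sleep(delay)
    raise last
```

`FOR UPDATE SKIP LOCKED` also covers the graceful-restart criterion: rows stuck in `processing` after a crash are reset to `pending` by the SIGTERM handler or swept up by an operator, and concurrent claimers never block each other.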

Test Expectations

  • Unit test: block text extraction — each block_type produces expected plain text
  • Unit test: mermaid blocks skipped, empty blocks handled gracefully
  • Unit test: heading text includes parent note title context
  • Integration test: end-to-end — create block → worker picks up → embedding stored (requires Ollama, may need to mock in CI)
  • Run command: pytest tests/ -k test_embedding
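The first three unit tests above imply a shape for the extractor. A hypothetical sketch — the function name, the `None`-means-skip convention, and the `"title > heading"` join are all guesses, only the block_type names come from the decision list:

```python
def extract_text(block_type, content, note_title=None):
    """Return plain text to embed for a block, or None to skip it."""
    if block_type == "mermaid":
        return None  # diagrams are never embedded (trigger already skips them)
    text = (content or "").strip()
    if not text:
        return None  # empty blocks handled gracefully: nothing to embed
    if block_type == "heading" and note_title:
        # Headings carry parent note title as context per the 6c decisions.
        return f"{note_title} > {text}"
    return text
```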

Constraints

  • Use raw psycopg2 connection for LISTEN (SQLAlchemy doesn't expose it)
  • Use httpx for Ollama HTTP calls (async-capable, already in dev deps)
  • Match structured logging style of existing codebase
  • Health endpoint should be minimal (not full FastAPI — a simple http.server or similar)
  • Batch size: 10 blocks per cycle (live), configurable via env var for backfill
  • The worker does NOT need runtimeClassName: nvidia — it calls Ollama over HTTP, Ollama owns the GPU

Checklist

  • PR opened
  • Tests pass
  • Backfill verified (all blocks embedded)
  • No unrelated changes

Related

  • project-pal-e-docs — project this affects
  • Issue #126 — 6b-1 extension ownership (independent, doesn't block this)
  • PR #122 — 6b schema migration (predecessor)
  • PR #27 (pal-e-platform) — 6a Ollama deployment (predecessor)