F12: Ollama hostPath volume + embedding alerting #89

Closed
opened 2026-03-16 02:47:54 +00:00 by forgejo_admin · 0 comments

Lineage

plan-pal-e-docs → Phase F12 (phase-pal-e-docs-f12-semantic-search-recovery)

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want Ollama models to persist across pod restarts and embedding failures to trigger alerts
So that semantic search stays healthy without manual intervention

Context

Semantic search was down for 6+ days. Root cause: the Ollama PVC was recreated and the qwen3-embedding:4b model was never re-pulled. An immediate fix was applied (model re-pulled, chat models removed to free memory under the 6Gi cgroup limit). This issue is the durable fix: a hostPath volume so models survive any k8s lifecycle event, plus Prometheus alerting so failures are detected within 10 minutes.

Key findings from diagnosis:

  • Ollama image: ollama/ollama:0.17.6, GPU: GTX 1070 (8GB VRAM)
  • Current 6Gi memory limit — mmap'd model files count against cgroup. With hostPath, page cache behavior may differ.
  • The embedding_queue_depth metric is NOT sufficient for alerting — failed blocks get marked error and leave the queue, so the metric reads 0 during failures. Must alert on the embedding_errors_total rate instead.
  • 152 blocks stuck in error status need backfill.

File Targets

Files to modify:

  • terraform/modules/ollama/main.tf (or wherever Ollama deployment is defined) — swap PVC for hostPath volume mount (/var/lib/ollama)
  • terraform/modules/prometheus/alerts.tf (or alert rules config) — add embedding error rate alerts
  • terraform/modules/prometheus/scrape.tf (or scrape config) — add scrape target for embedding worker :8001/metrics

Files NOT to touch:

  • ~/pal-e-docs/ — backfill is a manual step, not a code change
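
A minimal sketch of the PVC-to-hostPath swap, assuming the deployment is managed with the Terraform kubernetes provider. The resource name, labels, and mount path are hypothetical and must be matched to the real module:

```hcl
# Hypothetical sketch: replace the PVC-backed volume with a hostPath volume.
# Resource/volume names and the container spec must match the actual module.
resource "kubernetes_deployment_v1" "ollama" {
  # ... metadata and the rest of the pod spec unchanged ...
  spec {
    template {
      spec {
        container {
          name  = "ollama"
          image = "ollama/ollama:0.17.6"
          volume_mount {
            name       = "ollama-models"
            mount_path = "/root/.ollama" # Ollama's default model store (OLLAMA_MODELS)
          }
        }
        volume {
          name = "ollama-models"
          host_path {
            path = "/var/lib/ollama" # survives any pod/PVC lifecycle event
            type = "DirectoryOrCreate"
          }
        }
        # Previously: volume { persistent_volume_claim { claim_name = "..." } }
      }
    }
  }
}
```

Since the pod is pinned to the GPU node by its nvidia.com/gpu request, the hostPath directory only needs to exist on that one node.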

Acceptance Criteria

  • Ollama deployment uses hostPath volume (/var/lib/ollama), PVC removed
  • tofu plan -lock=false shows only the expected changes (PVC removed, hostPath added)
  • qwen3-embedding:4b survives pod restart (delete pod, verify model still loaded after restart)
  • Prometheus scrapes embedding worker metrics on :8001
  • Alert rule: rate(embedding_errors_total[5m]) > 0 → warning
  • Alert rule: embedding_total == 0 for > 10min while errors increasing → critical
  • After deploy: reset error blocks (UPDATE blocks SET embedding_status = 'pending' WHERE embedding_status = 'error'), verify worker processes them
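
The two alert rules above could be expressed as a PrometheusRule manifest, assuming the cluster runs prometheus-operator (if Prometheus uses raw rule files instead, the same expressions go into a rule group there). Names and namespace are hypothetical, and the stall rule interprets "embedding_total == 0 while errors increasing" as a zero success rate — tune the exact expression against the real metrics:

```hcl
# Hypothetical sketch: embedding alerts as a prometheus-operator PrometheusRule.
# Expressions mirror the acceptance criteria above.
resource "kubernetes_manifest" "embedding_alerts" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "embedding-worker-alerts"
      namespace = "monitoring" # adjust to the real namespace
    }
    spec = {
      groups = [{
        name = "embedding-worker"
        rules = [
          {
            alert  = "EmbeddingErrorsDetected"
            expr   = "rate(embedding_errors_total[5m]) > 0"
            for    = "5m"
            labels = { severity = "warning" }
          },
          {
            # Queue depth reads 0 during failures, so alert on throughput
            # stalling while errors climb, not on queue depth.
            alert  = "EmbeddingPipelineStalled"
            expr   = "rate(embedding_total[10m]) == 0 and rate(embedding_errors_total[10m]) > 0"
            for    = "10m"
            labels = { severity = "critical" }
          },
        ]
      }]
    }
  }
}
```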

Test Expectations

  • tofu validate passes
  • tofu plan -lock=false shows expected changes (PVC → hostPath, new alert rules, new scrape target)
  • After apply: kubectl delete pod -n ollama <pod> → pod restarts → ollama list still shows qwen3-embedding:4b
  • After apply: verify curl localhost:8001/metrics (port-forwarded) returns Prometheus metrics
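
For the scrape target, again assuming prometheus-operator, a ServiceMonitor pointing at the worker's metrics port would look roughly like this (service labels, port name, and namespaces are hypothetical):

```hcl
# Hypothetical sketch: scrape the embedding worker's :8001/metrics endpoint.
resource "kubernetes_manifest" "embedding_worker_scrape" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "ServiceMonitor"
    metadata = {
      name      = "embedding-worker"
      namespace = "monitoring" # adjust
    }
    spec = {
      selector = {
        matchLabels = { app = "embedding-worker" } # must match the worker Service
      }
      namespaceSelector = { matchNames = ["pal-e-docs"] } # worker's namespace, hypothetical
      endpoints = [{
        port     = "metrics" # named Service port exposing 8001
        path     = "/metrics"
        interval = "30s"
      }]
    }
  }
}
```

If Prometheus is configured with static scrape configs rather than the operator, the equivalent is a static_configs entry targeting the worker's pod or service address on port 8001.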

Constraints

  • tofu plan MUST include -lock=false (state lock blocks CI)
  • tofu fmt and tofu validate must pass
  • Ollama pod has nvidia.com/gpu: 1 — it always lands on the GPU node, so hostPath is safe
  • Only keep qwen3-embedding:4b model. No chat models — they cause OOM with 6Gi limit.
  • Memory limit may need adjustment after hostPath change (mmap behavior differs)

Checklist

  • PR opened
  • tofu plan -lock=false output in PR description
  • Tests pass
  • No unrelated changes

Related

  • phase-pal-e-docs-f12-semantic-search-recovery — phase note with full diagnostic data
  • plan-pal-e-platform — Platform Hardening (alerting infrastructure from Phase 16)
  • Blocks: phase-pal-e-docs-f13-context-intelligence (F13b-2 needs durable vectors)