Deploy Ollama + NVIDIA device plugin as platform services #24

Closed
opened 2026-03-08 20:57:14 +00:00 by forgejo_admin · 0 comments

Lineage

plan-2026-02-26-tf-modularize-postgres → Phase 6 → Phase 6a (Deploy Ollama as Platform Service)

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want Ollama running in-cluster with GPU access managed by Terraform
So that the embedding worker (Phase 6b) can generate vectors via http://ollama.ollama.svc.cluster.local:11434

Context

Phase 6 (Vector Search) needs Ollama running in-cluster to generate embeddings via Qwen3-Embedding-4B. This is the first sub-phase — deploying Ollama as a reusable platform service.

Key facts:

  • NVIDIA k8s device plugin is ALREADY deployed (manually as a DaemonSet, not Terraform-managed). Image: nvcr.io/nvidia/k8s-device-plugin:v0.17.0. Must be brought under Helm/Terraform management.
  • Host Ollama (systemd) is running but idle (54 MiB GPU, 0% utilization). Must be disabled to free the GPU for the k8s pod.
  • GPU is visible to k8s: nvidia.com/gpu: 1.
  • No Node Feature Discovery (NFD) deployed — the NVIDIA device plugin Helm chart's default affinity rules (which match NFD labels) must be overridden to {} or the DaemonSet won't schedule.
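Concretely, the override is a one-key values document. A minimal sketch, assuming https://nvidia.github.io/k8s-device-plugin is the chart repository and that the chart exposes a top-level affinity value (both worth verifying before apply):

```hcl
# Sketch: bring the device plugin under Helm/Terraform management.
# The empty affinity replaces the chart's NFD-label-based default,
# which would never match on a cluster without NFD.
resource "helm_release" "nvidia_device_plugin" {
  name       = "nvidia-device-plugin"
  repository = "https://nvidia.github.io/k8s-device-plugin" # assumed upstream chart repo
  chart      = "nvidia-device-plugin"
  version    = "0.17.4"
  namespace  = "kube-system"

  values = [yamlencode({
    affinity = {} # no NFD on this cluster
  })]
}
```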

Decisions made:

  • NVIDIA device plugin Helm chart version: 0.17.4 (closest patch to deployed v0.17.0 image; chart repo doesn't have 0.17.0)
  • Ollama Helm chart: ollama-helm/ollama version 1.49.0 (latest stable, app version 0.17.6) from https://otwld.github.io/ollama-helm/
  • Ollama chart uses ollama.gpu.enabled: true + ollama.gpu.number: 1 (not raw resource limits)
  • Models pulled via ollama.models.pull: ["qwen3-embedding:4b"]
  • Persistence via persistentVolume.enabled: true with local-path storageClass
  • Service: ClusterIP on 11434 (internal only, no Tailscale funnel)
  • SaltStack change: set host ollama service to dead + enable: False
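Put together, the Ollama side of main.tf might look like the sketch below. Resource names follow the File Targets section; the values structure is an assumption to check against the otwld chart's values.yaml:

```hcl
resource "kubernetes_namespace_v1" "ollama" {
  metadata {
    name = "ollama"
  }
}

resource "helm_release" "ollama" {
  name       = "ollama"
  repository = "https://otwld.github.io/ollama-helm/"
  chart      = "ollama"
  version    = "1.49.0"
  namespace  = kubernetes_namespace_v1.ollama.metadata[0].name
  timeout    = 600 # model pull is ~4 GB

  values = [yamlencode({
    ollama = {
      gpu = {
        enabled = true # chart manages the nvidia.com/gpu request, no raw limits
        number  = 1
      }
      models = {
        pull = ["qwen3-embedding:4b"]
      }
    }
    persistentVolume = {
      enabled      = true
      storageClass = "local-path"
    }
    # Assumed service keys; ClusterIP on 11434 is the intent either way.
    service = {
      type = "ClusterIP"
      port = 11434
    }
  })]

  # The GPU must be schedulable before the Ollama pod can start.
  depends_on = [helm_release.nvidia_device_plugin]
}
```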

File Targets

Files to modify:

  • terraform/main.tf — Add 3 resources after the CNPG section (~line 1097): helm_release.nvidia_device_plugin, kubernetes_namespace_v1.ollama, helm_release.ollama
  • salt/states/services/init.sls — Change ollama-service from service.running + enable: True to service.dead + enable: False. Update comments to explain why.
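For the Salt change, a sketch of the intended init.sls state (the ollama-service state ID is assumed to match the existing one):

```sls
# Host Ollama stays down: the GPU now belongs to the in-cluster Ollama
# pod (Phase 6a), and the idle systemd instance would pin ~54 MiB of VRAM.
ollama-service:
  service.dead:
    - name: ollama
    - enable: False
```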

Files NOT to touch:

  • terraform/variables.tf — No new variables needed (no secrets for Ollama)
  • terraform/providers.tf — No new providers needed

Acceptance Criteria

  • kubectl get daemonset -n kube-system | grep nvidia shows the Helm-managed device plugin running
  • kubectl get nodes -o json | jq '.items[0].status.capacity["nvidia.com/gpu"]' returns "1"
  • kubectl get pods -n ollama shows Ollama pod Running
  • kubectl exec -n ollama deployment/ollama -- ollama list shows qwen3-embedding:4b
  • Embedding test returns vector: kubectl exec -n ollama deployment/ollama -- curl -s http://localhost:11434/api/embed -d '{"model":"qwen3-embedding:4b","input":"test"}'
  • systemctl is-active ollama returns inactive (host service disabled)
  • tofu plan shows no drift after apply
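For the embedding criterion, piping through jq makes the pass/fail obvious (jq on the operator machine is assumed; the embeddings field follows Ollama's /api/embed response format):

```sh
kubectl exec -n ollama deployment/ollama -- \
  curl -s http://localhost:11434/api/embed \
  -d '{"model":"qwen3-embedding:4b","input":"test"}' | jq '.embeddings[0][0:4]'
# Expect a short array of floats, e.g. [0.0123, -0.0456, 0.0789, 0.0012]
```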

Test Expectations

  • tofu validate passes
  • tofu fmt -check passes
  • tofu plan shows exactly 3 new resources (namespace + 2 helm releases)

Constraints

  • Follow existing main.tf patterns: yamlencode() for values, depends_on chains, resource naming conventions
  • NVIDIA device plugin: must override affinity: {} (no NFD on this cluster)
  • Ollama chart: use ollama.gpu.enabled mechanism, not raw resource limits
  • Set timeout = 600 on the Ollama release (the model pull is ~4 GB)
  • Manual step required before tofu apply: kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system
  • Manual step required: sudo salt-call --local state.apply services to disable host Ollama
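One plausible ordering of the manual steps around the apply, so the GPU is free and the old DaemonSet is gone before the new resources come up:

```sh
# 1. Remove the hand-deployed DaemonSet so the Helm release can own it
kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system

# 2. Stop and disable host Ollama, releasing the GPU
sudo salt-call --local state.apply services

# 3. Apply the new Terraform resources
tofu apply
```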

Checklist

  • PR opened
  • tofu validate passes
  • tofu fmt -check passes
  • No unrelated changes

Related

  • plan-2026-02-26-tf-modularize-postgres — parent plan (Phase 6: Vector Search)