Fix GPU visibility — add runtimeClassName: nvidia to Helm values #26
Lineage

plan-2026-02-26-tf-modularize-postgres → Phase 6 → Phase 6a (Deploy Ollama as platform service)

Repo

forgejo_admin/pal-e-platform

User Story
As a platform operator
I want the NVIDIA device plugin and Ollama pods to use the nvidia container runtime
So that GPU resources are discoverable and Ollama can use GPU acceleration for embedding generation
Context
PR #25 deployed the NVIDIA device plugin and Ollama Helm releases, but both are missing `runtimeClassName: "nvidia"` in their Helm values. On k3s, the default container runtime is `runc`, which cannot discover GPUs via NVML. The device plugin therefore reports 0 GPU capacity, so Ollama's pod (which requests `nvidia.com/gpu: 1`) stays Pending indefinitely, and the Ollama Helm release is in FAILED state (timed out waiting for the pod).

The fix is adding `runtimeClassName = "nvidia"` to both Helm values blocks in `terraform/main.tf`.

Additionally:

- The FAILED Ollama release may need to be removed (`helm delete`) before re-applying
- `/etc/containerd/conf.d/99-nvidia.toml` should be deleted (host-level cleanup, not in scope for this PR but noted)

File Targets
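The fix described above might look like the following sketch for `terraform/main.tf`. This is illustrative only: the surrounding resource arguments and the existing contents of each values block are assumptions, `runtimeClassName` must be merged into the existing values rather than replacing them, and a RuntimeClass named `nvidia` must already exist on the cluster.

```hcl
# Sketch only — existing arguments and values content are assumed.
resource "helm_release" "nvidia_device_plugin" {
  # ... existing name/repository/chart/namespace arguments ...

  values = [yamlencode({
    # Run the device-plugin pods under the nvidia container runtime
    # so NVML can discover the GPUs.
    runtimeClassName = "nvidia"
  })]
}

resource "helm_release" "ollama" {
  # ... existing arguments ...

  values = [yamlencode({
    runtimeClassName = "nvidia"
  })]
}
```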
Files the agent should modify:

- `terraform/main.tf` — add `runtimeClassName = "nvidia"` to both `helm_release.nvidia_device_plugin` and `helm_release.ollama` values blocks

Files the agent should NOT touch:

- `salt/` — Salt states are correct as-is

Acceptance Criteria
- `helm_release.nvidia_device_plugin` values include `runtimeClassName = "nvidia"`
- `helm_release.ollama` values include `runtimeClassName = "nvidia"`
- `tofu fmt` passes
- `tofu validate` passes

Test Expectations
- `tofu fmt -check` exits 0
- `tofu validate` exits 0
- `kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'` shows `"1"`

Constraints
- Only add `runtimeClassName` to the two values blocks; no other changes

Checklist
- `tofu fmt` clean
- `tofu validate` clean

Related
- `phase-postgres-6-vector-search` — parent phase
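As an offline illustration of the GPU-capacity check in Test Expectations, the sketch below pipes a fabricated single-node document through the same `jq` filter; the JSON is hypothetical and `jq` is assumed to be installed.

```shell
# Hypothetical output of `kubectl get nodes -o json`, reduced to the
# field the check inspects.
sample='{"items":[{"status":{"capacity":{"nvidia.com/gpu":"1"}}}]}'

# Same filter as in Test Expectations; prints "1" (with quotes, since
# Kubernetes capacity values are JSON strings).
echo "$sample" | jq '.items[].status.capacity["nvidia.com/gpu"]'
```

On a real cluster the check passes only after the device plugin runs under the nvidia runtime and advertises the GPU; before the fix it prints `null` or `"0"`.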