Fix GPU visibility — add runtimeClassName: nvidia to Helm values #26

Closed
opened 2026-03-09 02:46:17 +00:00 by forgejo_admin · 0 comments

Lineage

plan-2026-02-26-tf-modularize-postgres → Phase 6 → Phase 6a (Deploy Ollama as platform service)

Repo

forgejo_admin/pal-e-platform

User Story

As a platform operator
I want the NVIDIA device plugin and Ollama pods to use the nvidia container runtime
So that GPU resources are discoverable and Ollama can use GPU acceleration for embedding generation

Context

PR #25 deployed the NVIDIA device plugin and Ollama Helm releases, but both are missing runtimeClassName: "nvidia" in their Helm values. On k3s, the default container runtime is runc, which cannot discover GPUs via NVML. The device plugin reports 0 GPU capacity, so Ollama's pod (which requests nvidia.com/gpu: 1) stays Pending indefinitely. The Ollama Helm release is in FAILED state (timed out waiting for pod).

The fix is to add runtimeClassName = "nvidia" to both Helm values blocks in terraform/main.tf.
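A minimal sketch of the intended change, assuming the releases use yamlencode-style values blocks — the surrounding attributes and resource bodies here are illustrative, not the actual contents of terraform/main.tf:

```hcl
# Sketch only: attribute names besides runtimeClassName are placeholders.
resource "helm_release" "ollama" {
  # ... existing name/chart/version/namespace settings unchanged ...

  values = [yamlencode({
    runtimeClassName = "nvidia" # the one-line addition this issue asks for
    # ... existing values (model list, resources, etc.) unchanged ...
  })]
}
```

The same one-line addition goes into the values block of helm_release.nvidia_device_plugin.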

Additionally:

  • The failed Ollama Helm release may need cleanup before reapply (TF taint or helm delete)
  • Orphan file /etc/containerd/conf.d/99-nvidia.toml should be deleted (host-level cleanup, not in scope for this PR but noted)
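The release cleanup could look roughly like the following — resource address and namespace are assumptions based on this issue, so verify them against the actual state before running:

```shell
# Option A: mark the failed release for recreation on the next apply
tofu taint helm_release.ollama
tofu apply

# Option B: remove the failed release directly, then reapply
helm uninstall ollama -n <namespace>   # <namespace> is a placeholder
tofu apply
```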

File Targets

Files the agent should modify:

  • terraform/main.tf — add runtimeClassName = "nvidia" to both helm_release.nvidia_device_plugin and helm_release.ollama values blocks

Files the agent should NOT touch:

  • salt/ — Salt states are correct as-is
  • Any other Terraform resources

Acceptance Criteria

  • helm_release.nvidia_device_plugin values include runtimeClassName = "nvidia"
  • helm_release.ollama values include runtimeClassName = "nvidia"
  • tofu fmt passes
  • tofu validate passes

Test Expectations

  • tofu fmt -check exits 0
  • tofu validate exits 0
  • Post-apply verification (manual): kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]' shows "1"

Constraints

  • Keep changes minimal — only add runtimeClassName to the two values blocks
  • Do not change chart versions, model lists, or any other values
  • Follow existing Terraform formatting patterns

Checklist

  • PR opened
  • tofu fmt clean
  • tofu validate clean
  • No unrelated changes

Related

  • phase-postgres-6-vector-search — parent phase
  • PR #25 — original deploy that this fixes
  • Forgejo issue #24 — original issue (closed)