NVIDIA Device Plugin

Official Documentation:

Note: This document is provided for reference purposes.

1. Prerequisites

  • NVIDIA CUDA: >= 12.1
  • NVIDIA Drivers: >= 384.81
  • NVIDIA Container Toolkit: nvidia-docker >= 2.0 OR nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 required for integrated GPUs on Tegra-based systems).
  • Low-level Runtime: nvidia-container-runtime must be configured as the default low-level runtime.
  • Kubernetes Version: >= 1.10
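
The version floors above can be checked on each node. A small sketch using sort -V makes the comparison explicit (the version numbers passed in below are illustrative; substitute the output of nvidia-smi --query-gpu=driver_version --format=csv,noheader and nvidia-container-toolkit --version on your nodes):

```shell
# Sketch: version_ok prints "ok" when the installed version is at least the
# required minimum, using sort -V (natural version ordering).
version_ok() {
  min="$1"; ver="$2"
  if [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n1)" = "$min" ]; then
    echo ok
  else
    echo too-old
  fi
}

# Illustrative values; on a real node, take them from e.g.:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
#   nvidia-container-toolkit --version
version_ok 384.81 535.161.08   # driver floor
version_ok 1.7.0 1.14.3        # container toolkit floor
```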

2. Prepare GPU Nodes

Note: Before performing the following steps, ensure that your GPU nodes have joined the Kubernetes cluster and appear in the output of kubectl get nodes.

These operations must be performed on all GPU nodes. This section covers configuration only and does not include NVIDIA driver installation. The primary goal is to set nvidia as the default runtime.

Below is an example configuration for containerd on Debian-based systems:

2.1 Install NVIDIA Container Toolkit

  1. Install prerequisite packages:

    sudo apt-get update && sudo apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    gnupg2
  2. Configure the software repository:

    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  3. Update and install toolkit:

    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
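
The sed expression in step 2 rewrites each repository entry so apt only trusts packages signed with the NVIDIA keyring. Its effect on a single (illustrative) list line:

```shell
# Demonstrate the rewrite performed by the sed expression in step 2: it
# inserts a [signed-by=...] attribute after "deb" in each repository line.
echo 'deb https://nvidia.github.io/libnvidia-container/stable/deb/amd64 /' |
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
```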

2.2 Configure Default Runtime

sudo nvidia-ctk runtime configure --runtime=containerd --config=/etc/containerd/config.toml
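
This command adds an nvidia runtime entry to /etc/containerd/config.toml; the fragment it writes looks roughly like the following (exact contents vary with the toolkit version). Restart containerd afterwards (sudo systemctl restart containerd) so the change takes effect.

```toml
# Approximate fragment written by nvidia-ctk (contents vary by version)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```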

3. Install NVIDIA Device Plugin

3.1 Install nvidia-device-plugin

Option 1: Install via Helm

  1. Add Chart repository:

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin --force-update
    helm repo update
  2. Install Chart:

    helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvdp \
    --create-namespace \
    --version v0.17.0 \
    --set gfd.enabled=true \
    --set runtimeClassName=nvidia \
    --set image.repository=opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/nvidia/k8s-device-plugin \
    --set nfd.image.repository=opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/nfd/node-feature-discovery
  3. [Optional] Adjust Device Discovery Strategy:

    If devices are not being scanned correctly with the default auto strategy, you can manually patch the DaemonSet to use nvml or tegra.

    • NVML Strategy:

      kubectl -n nvdp patch ds nvdp-nvidia-device-plugin --type='json' --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--device-discovery-strategy=nvml"]}]'
    • Tegra Strategy:

      kubectl -n nvdp patch ds nvdp-nvidia-device-plugin --type='json' --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--device-discovery-strategy=tegra"]}]'

Option 2: Install via YAML

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

3.2 Create RuntimeClass

cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

3.3 Set Default Runtime

Edit /etc/containerd/config.toml and modify the following field:

[plugins]
  [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name = "nvidia" # For NVIDIA GPU nodes only

Restart the containerd service after editing (e.g. sudo systemctl restart containerd).
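
Once the plugin is running, a quick end-to-end check is to schedule a pod that requests nvidia.com/gpu (a sketch; the sample image tag is illustrative — check NVIDIA's NGC registry for a current one):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1  # illustrative tag
      resources:
        limits:
          nvidia.com/gpu: 1
```

If the device plugin and runtime are configured correctly, the pod should be scheduled onto a GPU node and its logs should report a successful vector addition.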

4. Manually Add Labels (if required)

Replace the <NODE> placeholder below with each GPU node's name; apply these labels to all GPU nodes.

kubectl label node "<NODE>" nvidia.com/mps.capable=true nvidia.com/gpu=true

Add labels to distinguish GPU models:

Note: Starting from version 1.3.2, the CSGHub Helm Chart will automatically add these labels.

Example: If the GPU model is NVIDIA-A10:

kubectl label node "<NODE>" nvidia.com/nvidia_name=NVIDIA-A10
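
The label value above is the GPU product name with spaces replaced by hyphens. A sketch of deriving it (the gpu_name value here is hard-coded for illustration; on a real node it would come from nvidia-smi):

```shell
# Derive a nvidia.com/nvidia_name label value from a GPU product name by
# replacing spaces with hyphens. Hard-coded here for illustration; on a
# node, use: nvidia-smi --query-gpu=name --format=csv,noheader | head -n1
gpu_name='NVIDIA A10'
label_value=$(printf '%s' "$gpu_name" | tr ' ' '-')
echo "$label_value"   # -> NVIDIA-A10
# Then: kubectl label node "<NODE>" "nvidia.com/nvidia_name=$label_value"
```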