NVIDIA Device Plugin

Official Documentation:

Note: This document is provided for reference purposes.

1. Prerequisites

  • NVIDIA CUDA: >= 12.1
  • NVIDIA Drivers: >= 384.81
  • NVIDIA Container Toolkit: nvidia-docker >= 2.0 OR nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 required for integrated GPUs on Tegra-based systems).
  • Low-level Runtime: nvidia-container-runtime must be configured as the default low-level runtime.
  • Kubernetes Version: >= 1.10
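
The version floors above can be checked on each node. A small sketch using sort -V makes the comparison explicit (the version numbers passed in below are illustrative; substitute the output of nvidia-smi --query-gpu=driver_version --format=csv,noheader and nvidia-container-toolkit --version on your nodes):

```shell
# Sketch: version_ok prints "ok" when the installed version is at least the
# required minimum, using sort -V (natural version ordering).
version_ok() {
  min="$1"; ver="$2"
  if [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n1)" = "$min" ]; then
    echo ok
  else
    echo too-old
  fi
}

# Illustrative values; on a real node, take them from e.g.:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
#   nvidia-container-toolkit --version
version_ok 384.81 535.161.08   # driver floor
version_ok 1.7.0 1.14.3        # container toolkit floor
```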

2. Prepare GPU Nodes

Note: Before performing the following steps, ensure that your GPU nodes have joined the Kubernetes cluster and appear in the output of kubectl get nodes.

These operations must be performed on all GPU nodes. This section covers configuration only and does not include NVIDIA driver installation. The primary goal is to set nvidia as the default runtime.

Below is an example configuration for containerd on Debian-based systems:

2.1 Install NVIDIA Container Toolkit

  1. Install prerequisite packages:

    sudo apt-get update && sudo apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    gnupg2
  2. Configure the software repository:

    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  3. Update and install toolkit:

    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
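
The sed expression in step 2 rewrites each repository entry so apt only trusts packages signed with the NVIDIA keyring. Its effect on a single (illustrative) list line:

```shell
# Demonstrate the rewrite performed by the sed expression in step 2: it
# inserts a [signed-by=...] attribute after "deb" in each repository line.
echo 'deb https://nvidia.github.io/libnvidia-container/stable/deb/amd64 /' |
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
```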

2.2 Configure Default Runtime

sudo nvidia-ctk runtime configure --runtime=containerd --config=/etc/containerd/config.toml
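
This command adds an nvidia runtime entry to /etc/containerd/config.toml; the fragment it writes looks roughly like the following (exact contents vary with the toolkit version). Restart containerd afterwards (sudo systemctl restart containerd) so the change takes effect.

```toml
# Approximate fragment written by nvidia-ctk (contents vary by version)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```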

3. Install NVIDIA Device Plugin

3.1 Install nvidia-device-plugin

Option 1: Install via Helm

  1. Add Chart repository:

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin --force-update
    helm repo update
  2. Install Chart:

    helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvdp \
    --create-namespace \
    --version v0.17.0 \
    --set gfd.enabled=true \
    --set runtimeClassName=nvidia \
    --set image.repository=opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/nvidia/k8s-device-plugin \
    --set nfd.image.repository=opencsg-registry.cn-beijing.cr.aliyuncs.com/opencsghq/nfd/node-feature-discovery
  3. [Optional] Adjust Device Discovery Strategy:

    If devices are not being scanned correctly with the default auto strategy, you can manually patch the DaemonSet to use nvml or tegra.

    • NVML Strategy:

      kubectl -n nvdp patch ds nvdp-nvidia-device-plugin --type='json' --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--device-discovery-strategy=nvml"]}]'
    • Tegra Strategy:

      kubectl -n nvdp patch ds nvdp-nvidia-device-plugin --type='json' --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--device-discovery-strategy=tegra"]}]'

Option 2: Install via YAML

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

3.2 Create RuntimeClass

cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

3.3 Set Default Runtime

Edit /etc/containerd/config.toml and modify the following field:

[plugins]
  [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name = "nvidia" # For NVIDIA GPU nodes only

Restart the containerd service after editing (e.g. sudo systemctl restart containerd).
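
Once the plugin is running, a quick end-to-end check is to schedule a pod that requests nvidia.com/gpu (a sketch; the sample image tag is illustrative — check NVIDIA's NGC registry for a current one):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1  # illustrative tag
      resources:
        limits:
          nvidia.com/gpu: 1
```

If the device plugin and runtime are configured correctly, the pod should be scheduled onto a GPU node and its logs should report a successful vector addition.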

4. Manually Add Labels (if required)

Replace the <NODE> placeholder below with each GPU node's name; apply these labels to all GPU nodes.

kubectl label node "<NODE>" nvidia.com/mps.capable=true nvidia.com/gpu=true

Add labels to distinguish GPU models:

Note: Starting from version 1.3.2, the CSGHub Helm Chart will automatically add these labels.

Example: If the GPU model is NVIDIA-A10:

kubectl label node "<NODE>" nvidia.com/nvidia_name=NVIDIA-A10
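
The label value above is the GPU product name with spaces replaced by hyphens. A sketch of deriving it (the gpu_name value here is hard-coded for illustration; on a real node it would come from nvidia-smi):

```shell
# Derive a nvidia.com/nvidia_name label value from a GPU product name by
# replacing spaces with hyphens. Hard-coded here for illustration; on a
# node, use: nvidia-smi --query-gpu=name --format=csv,noheader | head -n1
gpu_name='NVIDIA A10'
label_value=$(printf '%s' "$gpu_name" | tr ' ' '-')
echo "$label_value"   # -> NVIDIA-A10
# Then: kubectl label node "<NODE>" "nvidia.com/nvidia_name=$label_value"
```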