Overview
This article explains how the NVIDIA container toolkit is configured in Base Command Manager (BCM) clusters and provides solutions for common issues in mixed CPU/GPU environments.
Default Behavior: When setting up Kubernetes with the NVIDIA GPU operator, BCM automatically:
- Configures the GPU operator not to deploy the NVIDIA container toolkit (handled by BCM instead)
- Installs cm-nvidia-container-toolkit packages on standard software images
- Assumes nvidia-container-toolkit is pre-installed on DGX OS software images
- Configures containerd to use the NVIDIA container runtime by default on all nodes
Problem: The NVIDIA container runtime configuration on CPU-only nodes can cause pod failures with the error:
RunContainerError "failed to create containerd task"
Root Cause: Containers with the environment variable NVIDIA_VISIBLE_DEVICES=all (common in CUDA containers) will fail on CPU-only nodes when the NVIDIA runtime is configured but no GPUs are present.
Solution: Remove the NVIDIA container runtime configuration from CPU-only nodes to use the standard runc runtime instead.
Prerequisites
- BCM Version: 10.x through 10.25.03 or 11.x through 11.25.05
- Cluster Type: Mixed CPU and GPU nodes
- Kubernetes: Configured with NVIDIA GPU operator
Verify Your Environment
Check your node categories:
root@headnode:~# cmsh -c 'device list'
Verify GPU availability on nodes:
root@headnode:~# pdsh -w node00[1-6] nvidia-smi -L 2>/dev/null | sort
Example output for both commands:
root@headnode:~# cmsh -c 'device list'
Type Hostname (key) MAC Category IP Network Status
---------------- ---------------- ------------------ ---------------- ---------------- ---------------- --------------------------------
HeadNode headnode FA:16:3E:F1:86:46 10.141.255.254 internalnet [ UP ]
PhysicalNode node001 FA:16:3E:D4:67:D0 cpu 10.141.0.1 internalnet [ UP ], health check failed
PhysicalNode node002 FA:16:3E:0A:B9:56 cpu 10.141.0.2 internalnet [ UP ], health check failed
PhysicalNode node003 FA:16:3E:C7:C9:41 cpu 10.141.0.3 internalnet [ UP ], health check failed
PhysicalNode node004 FA:16:3E:42:A8:D0 gpu 10.141.0.4 internalnet [ UP ], health check failed
PhysicalNode node005 FA:16:3E:94:65:6F gpu 10.141.0.5 internalnet [ UP ], health check failed
PhysicalNode node006 FA:16:3E:CD:C0:29 gpu 10.141.0.6 internalnet [ UP ], health check failed
root@headnode:~# pdsh -w node00[1-6] nvidia-smi -L 2>/dev/null | sort
node001:
node001: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
node002:
node002: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
node003:
node003: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
node004: GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ed8faffe-c04e-35de-f349-38ec0f412813)
node005: GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-637a7ace-ef62-5b8c-c602-966edc8c00a8)
node006: GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-00041523-f2c6-e25b-52bb-d727b9baee60)
Understanding the Default Configuration
GPU Operator Configuration
BCM configures the GPU operator not to deploy the toolkit (BCM manages it instead):
root@headnode:~# helm get values -n gpu-operator gpu-operator | grep toolkit -A 1
toolkit:
enabled: false
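For reference, the same effect can be achieved when installing or upgrading the operator chart manually by passing the toolkit.enabled value yourself. The command below is only a sketch (the nvidia repository alias is an assumption, and BCM normally performs this step for you):
root@headnode:~# helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator --set toolkit.enabled=false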
Containerd Runtime Configuration
The NVIDIA runtime configuration is located at:
/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
This file configures containerd to use the NVIDIA container runtime as the default runtime for all containers.
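You can see the effect of this drop-in by dumping the merged containerd configuration on the nodes (this assumes, as in the default BCM setup, that the main containerd configuration imports the files under conf.d/):
root@headnode:~# pdsh -w node00[1-6] "containerd config dump | grep -m1 default_runtime_name"
Wherever the drop-in is active, the default runtime name is reported as nvidia; without it, containerd falls back to its built-in default, runc.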
Package Installation Control
To control NVIDIA package installation manually during setup, use:
cm-kubernetes-setup --adv-configuration-bcm-nvidia-packages
When using this flag, the setup wizard will prompt you to choose whether to install the NVIDIA Container Toolkit package. This gives manual control over the toolkit package installation regardless of whether the NVIDIA GPU operator has been selected for installation.
Note: While BCM provides the --adv-configuration-bcm-nvidia-packages flag to control NVIDIA package installation, the versions covered in this article do not offer a similar flag to control the drop-in containerd configuration. The NVIDIA runtime configuration is automatically applied to all nodes if either:
- The GPU operator is enabled and Kata containers (an experimental feature) are not being used, or
- The question about installing the NVIDIA Container Toolkit package is answered with 'yes'. This corresponds to the configuration path nvidia.install_toolkit_package in the cm-kubernetes-setup.conf file.
Identifying the Issue
Problem Symptoms
When pods without explicit GPU requests are scheduled on CPU-only nodes, they may fail with:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed:
runc create failed: unable to start container process: error during container init:
error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
Reproducing the Issue
Create a test DaemonSet that simulates the problem:
kubectl apply -f https://support2.brightcomputing.com/kube/daemonset-gpu-with-cpu-fallback.yaml
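If the URL is not reachable from your cluster, a minimal manifest along the following lines reproduces the same behavior. This is a sketch, not the exact file behind the URL: the image tag is only an example, and the explicit NVIDIA_VISIBLE_DEVICES=all is what the NVIDIA runtime reacts to (CUDA base images usually set it themselves):
root@headnode:~# kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cuda-test-daemonset
spec:
  selector:
    matchLabels:
      app: cuda-test
  template:
    metadata:
      labels:
        app: cuda-test
    spec:
      containers:
      - name: cuda
        # example CUDA base image; any CUDA image available to you will do
        image: nvidia/cuda:12.2.0-base-ubuntu22.04
        command: ["sleep", "infinity"]
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        # deliberately no nvidia.com/gpu resource request, so the pods
        # are also scheduled on the CPU-only nodes
EOF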
Apply the manifest (saved locally as cuda-test-daemonset.yaml in this example) and check the pod status:
root@headnode:~# kubectl apply -f cuda-test-daemonset.yaml
root@headnode:~# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cuda-test-daemonset-qwsfz 0/1 CrashLoopBackOff 3 (18s ago) 3m34s 172.29.152.137 node001 <none> <none>
cuda-test-daemonset-spl5d 0/1 CrashLoopBackOff 3 (17s ago) 3m35s 172.29.112.139 node002 <none> <none>
cuda-test-daemonset-rrtxr 0/1 CrashLoopBackOff 3 (30s ago) 3m33s 172.29.67.201 node003 <none> <none>
cuda-test-daemonset-2s89b 1/1 Running 1 (5s ago) 3m35s 172.29.107.143 node004 <none> <none>
cuda-test-daemonset-45xcp 1/1 Running 1 (8s ago) 3m34s 172.29.76.15 node005 <none> <none>
cuda-test-daemonset-j8m5g 1/1 Running 1 (7s ago) 3m34s 172.29.99.78 node006 <none> <none>
Note that pods on CPU nodes (node001-003) are failing while GPU nodes (node004-006) are running.
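The exact error can be confirmed from the events of one of the failing pods (the pod name is taken from the listing above; yours will differ):
root@headnode:~# kubectl describe pod cuda-test-daemonset-qwsfz
The Events section at the end of the output should contain the "failed to create containerd task" error shown earlier.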
Solution: Remove NVIDIA Runtime from CPU Nodes
Step 1: Identify Configuration Overlays
Check which overlays are configured for your CPU and GPU nodes:
root@headnode:~# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% list
Name (key) Priority All head nodes Nodes Categories Roles
-------------------- ---------- -------------- ----------------- ---------------- ---------------------------------------------------------
kube-default-etcd 500 no headnode Etcd::Host
kube-default-master 510 no cpu generic::containerd, Kubernetes::ApiServerProxy, kubelet
kube-default-worker 500 no gpu generic::containerd, Kubernetes::ApiServerProxy, kubelet
Step 2: Locate the NVIDIA Runtime Configuration
Navigate to the containerd configuration in the master overlay:
[headnode->configurationoverlay]% use kube-default-master
[headnode->configurationoverlay[kube-default-master]]% roles
[headnode->configurationoverlay[kube-default-master]->roles]% use generic::containerd
[headnode->configurationoverlay[kube-default-master]->roles[generic::containerd]]% configurations
[headnode->configurationoverlay[kube-default-master]->roles[generic::containerd]->configurations]% list
Type Name (key) Filename
------------ ---------------------- ---------------------------------------------------------------
static containerd-cdi /cm/local/apps/containerd/var/etc/conf.d/cdi.toml
static containerd-cri /cm/local/apps/containerd/var/etc/conf.d/cri.toml
static containerd-nvidia-cri /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
templated containerd-hosts /cm/local/apps/containerd/var/etc/certs.d/docker.io/hosts.toml
Step 3: Remove NVIDIA Runtime Configuration
Remove the configuration from the CPU node overlay:
[headnode->configurationoverlay[kube-default-master]->roles[generic::containerd]->configurations]% remove containerd-nvidia-cri
[headnode->configurationoverlay*[kube-default-master*]->roles*[generic::containerd*]->configurations*]% commit
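To double-check that the entry is gone, the listing from Step 2 can also be run non-interactively as a one-liner (a convenience; the interactive session above works just as well):
root@headnode:~# cmsh -c 'configurationoverlay; use kube-default-master; roles; use generic::containerd; configurations; list'
The containerd-nvidia-cri entry should no longer appear in the list.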
Step 4: Clean Up Existing Files on CPU Nodes
The configuration removal doesn’t delete existing files. Clean them up manually:
# Verify files exist
root@headnode:~# pdsh -w node00[1-3] file /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
node001: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml: ASCII text
node002: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml: ASCII text
node003: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml: ASCII text
# Remove the files
root@headnode:~# pdsh -w node00[1-3] rm -fv /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
node001: removed '/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml'
node002: removed '/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml'
node003: removed '/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml'
# Restart containerd
root@headnode:~# pdsh -w node00[1-3] systemctl restart containerd
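At this point the merged containerd configuration on the CPU nodes should no longer select the NVIDIA runtime. You can verify this with the same check used earlier (again assuming the conf.d drop-ins are imported by the main containerd configuration):
root@headnode:~# pdsh -w node00[1-3] "containerd config dump | grep -m1 default_runtime_name"
The default runtime name should now be reported as runc on node001 through node003.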
Step 5: Verify the Fix
Restart affected workloads to apply the changes:
root@headnode:~# kubectl rollout restart daemonset/cuda-test-daemonset
daemonset.apps/cuda-test-daemonset restarted
root@headnode:~# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cuda-test-daemonset-6pbdl 1/1 Running 0 20s 172.29.76.16 node005 <none> <none>
cuda-test-daemonset-fklcs 1/1 Running 0 20s 172.29.107.144 node004 <none> <none>
cuda-test-daemonset-khh5h 1/1 Running 0 19s 172.29.112.140 node002 <none> <none>
cuda-test-daemonset-wkmqx 1/1 Running 0 20s 172.29.152.138 node001 <none> <none>
cuda-test-daemonset-zhpwd 1/1 Running 0 19s 172.29.67.202 node003 <none> <none>
cuda-test-daemonset-zr2gj 1/1 Running 0 20s 172.29.99.79 node006 <none> <none>
All pods should now be running successfully on both CPU and GPU nodes.
Alternative Solutions
Creating Separate Worker Overlays
If you don’t have a clear CPU/GPU distinction in your overlays, you can:
- Clone the worker overlay:
[headnode->configurationoverlay]% clone kube-default-worker kube-default-worker-gpu
- Move GPU nodes to the new overlay (see the sketch after this list):
[headnode->configurationoverlay]% use kube-default-worker
[headnode->configurationoverlay[kube-default-worker]]% ... remove gpu nodes or category ...
[headnode->configurationoverlay[kube-default-worker]]% commit
[headnode->configurationoverlay]% use kube-default-worker-gpu
[headnode->configurationoverlay[kube-default-worker-gpu]]% ... add them back here ...
[headnode->configurationoverlay[kube-default-worker-gpu]]% commit
- Remove the NVIDIA runtime configuration from the original worker overlay, following steps 3-5 above
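For example, assuming the GPU nodes are grouped in a category named gpu (as in the environment shown earlier) and that your cmsh version supports the usual append and removefrom operations on list-valued properties, the move might look like this (a sketch; adjust the names to your environment):
[headnode->configurationoverlay]% use kube-default-worker
[headnode->configurationoverlay[kube-default-worker]]% removefrom categories gpu
[headnode->configurationoverlay*[kube-default-worker*]]% commit
[headnode->configurationoverlay]% use kube-default-worker-gpu
[headnode->configurationoverlay[kube-default-worker-gpu]]% append categories gpu
[headnode->configurationoverlay*[kube-default-worker-gpu*]]% commit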
Restoring NVIDIA Runtime Configuration
To restore the NVIDIA runtime configuration if needed:
[headnode]% configurationoverlay
[headnode->configurationoverlay]% use kube-default-worker
[headnode->configurationoverlay[kube-default-worker]]% roles
[headnode->configurationoverlay[kube-default-worker]->roles]% use generic::containerd
[headnode->configurationoverlay[kube-default-worker]->roles[generic::containerd]]% configurations
[headnode->configurationoverlay[kube-default-worker]->roles[generic::containerd]->configurations]% add static containerd-nvidia-cri
[headnode->configurationoverlay*[kube-default-worker*]->roles*[generic::containerd*]->configurations*[containerd-nvidia-cri*]]% set content /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
[headnode->configurationoverlay*[kube-default-worker*]->roles*[generic::containerd*]->configurations*[containerd-nvidia-cri*]]% commit
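Afterwards, you can confirm that the drop-in file reappears on the GPU nodes by mirroring the check from Step 4; whether containerd needs an explicit restart to pick it up again is an assumption based on the behavior seen in Step 4:
root@headnode:~# pdsh -w node00[4-6] file /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
root@headnode:~# pdsh -w node00[4-6] systemctl restart containerd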
Summary
The NVIDIA container runtime configuration in BCM works well for GPU nodes but can cause issues on CPU-only nodes when containers specify NVIDIA_VISIBLE_DEVICES=all. By removing the runtime configuration from CPU node overlays, pods can run successfully using the standard runc runtime while GPU nodes continue to use the NVIDIA runtime for GPU workloads.