BCM 10/11 – NVIDIA Container Runtime Configuration for Mixed CPU/GPU Clusters

Overview

This article explains how the NVIDIA container toolkit is configured in Base Command Manager (BCM) clusters and provides solutions for common issues in mixed CPU/GPU environments.

Default Behavior: When setting up Kubernetes with the NVIDIA GPU operator, BCM automatically:

  • Configures the GPU operator not to deploy the NVIDIA container toolkit (handled by BCM instead)
  • Installs cm-nvidia-container-toolkit packages on standard software images
  • Assumes nvidia-container-toolkit is pre-installed on DGX OS software images
  • Configures containerd to use the NVIDIA container runtime by default on all nodes

Problem: The NVIDIA container runtime configuration on CPU-only nodes can cause pod failures with the error:

RunContainerError "failed to create containerd task"

Root Cause: Containers that set the environment variable NVIDIA_VISIBLE_DEVICES=all (common in CUDA containers) fail on CPU-only nodes when the NVIDIA runtime is configured but no GPUs are present: the runtime's prestart hook attempts to inject GPU devices and driver libraries, which cannot succeed on a node without an NVIDIA driver.
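
To confirm that a given workload sets this variable, inspect the environment of one of its containers, for example in a pod that is already running on a GPU node (pod name is a placeholder):

root@headnode:~# kubectl exec <cuda-pod> -- env | grep NVIDIA_VISIBLE_DEVICES

If the variable is set, this prints NVIDIA_VISIBLE_DEVICES=all.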

Solution: Remove the NVIDIA container runtime configuration from CPU-only nodes to use the standard runc runtime instead.

Prerequisites

  • BCM Version: 10.x through 10.25.03 or 11.x through 11.25.05
  • Cluster Type: Mixed CPU and GPU nodes
  • Kubernetes: Configured with NVIDIA GPU operator

Verify Your Environment

Check your node categories:

root@headnode:~# cmsh -c 'device list'

Verify GPU availability on nodes:

root@headnode:~# pdsh -w node00[1-6] nvidia-smi -L 2>/dev/null | sort

Example output for both commands:

root@headnode:~# cmsh -c 'device list'
Type             Hostname (key)   MAC                Category         IP               Network          Status
---------------- ---------------- ------------------ ---------------- ---------------- ---------------- --------------------------------
HeadNode         headnode         FA:16:3E:F1:86:46                   10.141.255.254   internalnet      [   UP   ]
PhysicalNode     node001          FA:16:3E:D4:67:D0  cpu              10.141.0.1       internalnet      [   UP   ], health check failed
PhysicalNode     node002          FA:16:3E:0A:B9:56  cpu              10.141.0.2       internalnet      [   UP   ], health check failed
PhysicalNode     node003          FA:16:3E:C7:C9:41  cpu              10.141.0.3       internalnet      [   UP   ], health check failed
PhysicalNode     node004          FA:16:3E:42:A8:D0  gpu              10.141.0.4       internalnet      [   UP   ], health check failed
PhysicalNode     node005          FA:16:3E:94:65:6F  gpu              10.141.0.5       internalnet      [   UP   ], health check failed
PhysicalNode     node006          FA:16:3E:CD:C0:29  gpu              10.141.0.6       internalnet      [   UP   ], health check failed

root@headnode:~# pdsh -w node00[1-6] nvidia-smi -L 2>/dev/null | sort
node001:
node001: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
node002:
node002: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
node003:
node003: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
node004: GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-ed8faffe-c04e-35de-f349-38ec0f412813)
node005: GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-637a7ace-ef62-5b8c-c602-966edc8c00a8)
node006: GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-00041523-f2c6-e25b-52bb-d727b9baee60)

Understanding the Default Configuration

GPU Operator Configuration

BCM configures the GPU operator not to deploy the toolkit (BCM manages it instead):

root@headnode:~# helm get values -n gpu-operator gpu-operator | grep toolkit -A 1
toolkit:
  enabled: false

Containerd Runtime Configuration

The NVIDIA runtime configuration is located at:

/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml

This file configures containerd to use the NVIDIA container runtime as the default runtime for all containers.
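
The exact contents written by BCM can vary between versions, but a typical drop-in that makes the NVIDIA runtime the containerd default looks roughly like the following sketch (the runtime binary path is an assumption and may differ on your cluster):

# Sketch only; not copied from an actual BCM-generated file
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

Removing this drop-in leaves containerd's built-in default runtime, runc, in effect.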

Package Installation Control

To control NVIDIA package installation manually during setup, use:

cm-kubernetes-setup --adv-configuration-bcm-nvidia-packages

When using this flag, the setup wizard will prompt you to choose whether to install the NVIDIA Container Toolkit package. This gives manual control over the toolkit package installation regardless of whether the NVIDIA GPU operator has been selected for installation.

Note: While BCM provides the --adv-configuration-bcm-nvidia-packages flag to control NVIDIA package installation, the versions covered in this article do not offer a similar flag to control the drop-in containerd configuration. The NVIDIA runtime configuration is automatically applied to all nodes if either of the following is true:

  1. The GPU operator is enabled AND Kata containers are NOT being used
    • Kata containers are an experimental feature
  2. The question about installing the NVIDIA Container Toolkit package is answered with ‘yes’
    • This corresponds to the configuration path nvidia.install_toolkit_package in the cm-kubernetes-setup.conf file (a sketch of this setting follows below).
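
For reference, assuming the setup answers are saved in the usual YAML-style layout of cm-kubernetes-setup.conf, the toolkit choice recorded under that configuration path would look roughly like this (sketch; the exact file layout may differ between BCM versions):

nvidia:
  install_toolkit_package: true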

Identifying the Issue

Problem Symptoms

When pods without explicit GPU requests are scheduled on CPU-only nodes, they may fail with:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: 
runc create failed: unable to start container process: error during container init: 
error running prestart hook #0: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
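
The same message can be retrieved for a specific failing pod from its events (pod name is a placeholder):

root@headnode:~# kubectl describe pod <failing-pod> | grep -A 3 "failed to create containerd task"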

Reproducing the Issue

Create a test DaemonSet that simulates the problem. An example manifest is available at:

https://support2.brightcomputing.com/kube/daemonset-gpu-with-cpu-fallback.yaml

Save it locally, for example as cuda-test-daemonset.yaml.
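
The hosted manifest is not reproduced here; a comparable DaemonSet (image and names are illustrative) would look roughly like the sketch below. The key points are a CUDA-style container that sets NVIDIA_VISIBLE_DEVICES=all and requests no GPU resources, so its pods are scheduled on every node:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cuda-test-daemonset
spec:
  selector:
    matchLabels:
      app: cuda-test
  template:
    metadata:
      labels:
        app: cuda-test
    spec:
      containers:
      - name: cuda-test
        image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative image
        command: ["sleep", "infinity"]
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        # no nvidia.com/gpu resource request, so the pods also land on CPU-only nodes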

Apply and check the pod status:

root@headnode:~# kubectl apply -f cuda-test-daemonset.yaml
root@headnode:~# kubectl get pod -o wide
NAME                        READY   STATUS             RESTARTS      AGE     IP               NODE      NOMINATED NODE   READINESS GATES
cuda-test-daemonset-qwsfz   0/1     CrashLoopBackOff   3 (18s ago)   3m34s   172.29.152.137   node001   <none>           <none>
cuda-test-daemonset-spl5d   0/1     CrashLoopBackOff   3 (17s ago)   3m35s   172.29.112.139   node002   <none>           <none>
cuda-test-daemonset-rrtxr   0/1     CrashLoopBackOff   3 (30s ago)   3m33s   172.29.67.201    node003   <none>           <none>
cuda-test-daemonset-2s89b   1/1     Running            1 (5s ago)    3m35s   172.29.107.143   node004   <none>           <none>
cuda-test-daemonset-45xcp   1/1     Running            1 (8s ago)    3m34s   172.29.76.15     node005   <none>           <none>
cuda-test-daemonset-j8m5g   1/1     Running            1 (7s ago)    3m34s   172.29.99.78     node006   <none>           <none>

Note that the pods on the CPU nodes (node001-003) are failing, while the pods on the GPU nodes (node004-006) are running.

Solution: Remove NVIDIA Runtime from CPU Nodes

Step 1: Identify Configuration Overlays

Check which overlays are configured for your CPU and GPU nodes:

root@headnode:~# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% list
Name (key)           Priority   All head nodes Nodes             Categories       Roles
-------------------- ---------- -------------- ----------------- ---------------- ---------------------------------------------------------
kube-default-etcd    500        no             headnode                           Etcd::Host
kube-default-master  510        no                               cpu              generic::containerd, Kubernetes::ApiServerProxy, kubelet
kube-default-worker  500        no                               gpu              generic::containerd, Kubernetes::ApiServerProxy, kubelet

Step 2: Locate the NVIDIA Runtime Configuration

Navigate to the containerd configuration in the master overlay:

[headnode->configurationoverlay]% use kube-default-master
[headnode->configurationoverlay[kube-default-master]]% roles
[headnode->configurationoverlay[kube-default-master]->roles]% use generic::containerd
[headnode->configurationoverlay[kube-default-master]->roles[generic::containerd]]% configurations
[headnode->configurationoverlay[kube-default-master]->roles[generic::containerd]->configurations]% list
Type         Name (key)             Filename
------------ ---------------------- ---------------------------------------------------------------
static       containerd-cdi         /cm/local/apps/containerd/var/etc/conf.d/cdi.toml
static       containerd-cri         /cm/local/apps/containerd/var/etc/conf.d/cri.toml
static       containerd-nvidia-cri  /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
templated    containerd-hosts       /cm/local/apps/containerd/var/etc/certs.d/docker.io/hosts.toml

Step 3: Remove NVIDIA Runtime Configuration

Remove the configuration from the CPU node overlay:

[headnode->configurationoverlay[kube-default-master]->roles[generic::containerd]->configurations]% remove containerd-nvidia-cri
[headnode->configurationoverlay*[kube-default-master*]->roles*[generic::containerd*]->configurations*]% commit

Step 4: Clean Up Existing Files on CPU Nodes

The configuration removal doesn’t delete existing files. Clean them up manually:

# Verify files exist
root@headnode:~# pdsh -w node00[1-3] file /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
node001: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml: ASCII text
node002: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml: ASCII text
node003: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml: ASCII text

# Remove the files
root@headnode:~# pdsh -w node00[1-3] rm -fv /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
node001: removed '/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml'
node002: removed '/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml'
node003: removed '/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml'

# Restart containerd
root@headnode:~# pdsh -w node00[1-3] systemctl restart containerd
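
After the restart, you can optionally confirm that runc is back as the default runtime on the CPU nodes. One way, assuming crictl is installed and configured on those nodes, is to check containerd's CRI status:

root@headnode:~# pdsh -w node00[1-3] "crictl info | grep -i defaultruntimename"

Each CPU node should now report "defaultRuntimeName": "runc".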

Step 5: Verify the Fix

Restart affected workloads to apply the changes:

root@headnode:~# kubectl rollout restart daemonset/cuda-test-daemonset
daemonset.apps/cuda-test-daemonset restarted

root@headnode:~# kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP               NODE      NOMINATED NODE   READINESS GATES
cuda-test-daemonset-6pbdl   1/1     Running   0          20s   172.29.76.16     node005   <none>           <none>
cuda-test-daemonset-fklcs   1/1     Running   0          20s   172.29.107.144   node004   <none>           <none>
cuda-test-daemonset-khh5h   1/1     Running   0          19s   172.29.112.140   node002   <none>           <none>
cuda-test-daemonset-wkmqx   1/1     Running   0          20s   172.29.152.138   node001   <none>           <none>
cuda-test-daemonset-zhpwd   1/1     Running   0          19s   172.29.67.202    node003   <none>           <none>
cuda-test-daemonset-zr2gj   1/1     Running   0          20s   172.29.99.79     node006   <none>           <none>

All pods should now be running successfully on both CPU and GPU nodes.
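
To double-check that the GPU nodes still use the NVIDIA runtime, you can run nvidia-smi inside one of the pods scheduled there, using a pod name from the listing above (this assumes the test image relies on the runtime to inject the driver utilities, as CUDA images typically do):

root@headnode:~# kubectl exec cuda-test-daemonset-zr2gj -- nvidia-smi -L

This should list the A100 reported earlier for node006.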

Alternative Solutions

Creating Separate Worker Overlays

If you don’t have a clear CPU/GPU distinction in your overlays, you can:

  1. Clone the worker overlay:
    [headnode->configurationoverlay]% clone kube-default-worker kube-default-worker-gpu
  2. Move the GPU nodes (or their category) to the new overlay; a concrete cmsh sketch follows this list:
    [headnode->configurationoverlay]% use kube-default-worker
    [headnode->configurationoverlay[kube-default-worker]]% ... remove gpu nodes or category ...
    [headnode->configurationoverlay[kube-default-worker]]% commit
    [headnode->configurationoverlay]% use kube-default-worker-gpu
    [headnode->configurationoverlay[kube-default-worker-gpu]]% ... add them back here ...
    [headnode->configurationoverlay[kube-default-worker-gpu]]% commit
  3. Remove NVIDIA runtime from the original worker overlay following steps 3-5 above
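
A minimal cmsh sequence for step 2, assuming the GPU nodes are grouped in a category named gpu (adjust the category or node names to your cluster; cmsh edits list properties with append and removefrom), might look like this:

[headnode->configurationoverlay]% use kube-default-worker
[headnode->configurationoverlay[kube-default-worker]]% removefrom categories gpu
[headnode->configurationoverlay*[kube-default-worker*]]% commit
[headnode->configurationoverlay]% use kube-default-worker-gpu
[headnode->configurationoverlay[kube-default-worker-gpu]]% append categories gpu
[headnode->configurationoverlay*[kube-default-worker-gpu*]]% commit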

Restoring NVIDIA Runtime Configuration

To restore the NVIDIA runtime configuration if needed:

[headnode]% configurationoverlay
[headnode->configurationoverlay]% use kube-default-worker
[headnode->configurationoverlay[kube-default-worker]]% roles
[headnode->configurationoverlay[kube-default-worker]->roles]% use generic::containerd
[headnode->configurationoverlay[kube-default-worker]->roles[generic::containerd]]% configurations
[headnode->configurationoverlay[kube-default-worker]->roles[generic::containerd]->configurations]% add static containerd-nvidia-cri
[headnode->configurationoverlay*[kube-default-worker*]->roles*[generic::containerd*]->configurations*[containerd-nvidia-cri*]]% set content /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
[headnode->configurationoverlay*[kube-default-worker*]->roles*[generic::containerd*]->configurations*[containerd-nvidia-cri*]]% commit

Summary

The NVIDIA container runtime configuration in BCM works well for GPU nodes but can cause issues on CPU-only nodes when containers specify NVIDIA_VISIBLE_DEVICES=all. By removing the runtime configuration from CPU node overlays, pods can run successfully using the standard runc runtime while GPU nodes continue to use the NVIDIA runtime for GPU workloads.

Updated on July 17, 2025