A security issue has been found in nvidia-container-toolkit. Since the NVIDIA GPU Operator can take care of installing this toolkit on the Kubernetes hosts that require it for proper integration of GPUs with Kubernetes, this security issue might affect the GPU Operator as well. This feature is controlled via the toolkit.enabled setting in the Helm chart.
Please note that on most BCM clusters this toolkit is managed through a package instead. It is therefore strongly advised to execute the steps from the following KB article first, since it focuses on upgrading the vulnerable package: https://kb.brightcomputing.com/knowledge-base/required-security-upgrade-for-nvidia-container-toolkit/. After that, please come back and follow this KB article to double-check whether the NVIDIA GPU Operator also requires an update on your cluster.
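As a quick sanity check you can query the installed toolkit package version directly on the GPU nodes. This is a minimal sketch, not part of the original procedure; the node names are hypothetical and it assumes pdsh is available, so adjust it to your cluster and package manager:

# Hypothetical node names; adjust to your cluster.
# RPM-based software images:
pdsh -w node[001-004] 'rpm -q cm-nvidia-container-toolkit nvidia-container-toolkit 2>/dev/null'
# dpkg-based software images:
pdsh -w node[001-004] 'dpkg-query -W cm-nvidia-container-toolkit nvidia-container-toolkit 2>/dev/null'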
Affected versions
In the case of BCM we relied on the NVIDIA GPU Operator feature to install the toolkit between the following BCM versions. We introduced support for the NVIDIA GPU Operator in version 9.2-4 (11th of August 2022), and in version 9.2-14 (25th of September 2023) we stopped using this functionality and started to rely on system packages instead. BCM 10.0 (since the first version, 10.23.07) has always deployed using system packages.
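If you are unsure which BCM version your cluster is running, the version of the cmdaemon package on the head node reflects it. This is a minimal sketch; the package name is an assumption based on a typical BCM installation:

rpm -q cmdaemon                  # RPM-based head node; the version string contains the BCM release
dpkg -s cmdaemon | grep Version  # dpkg-based head node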
Check if your GPU Operator is affected
- We can check the installed NVIDIA GPU Operator version first. Determine under which namespace and release name it is installed; the default is “gpu-operator” for both.
root@bcm-cluster:~# helm list -n gpu-operator
NAME          NAMESPACE     REVISION  UPDATED                                   STATUS    CHART                 APP VERSION
gpu-operator  gpu-operator  1         2024-09-30 15:31:30.392818917 +0200 CEST  deployed  gpu-operator-v23.9.1  v23.9.1
- We can see more info on the Helm chart with the history command:
root@bcm-cluster:~# helm history -n gpu-operator gpu-operator
REVISION  UPDATED                   STATUS    CHART                 APP VERSION  DESCRIPTION
1         Mon Sep 30 15:31:30 2024  deployed  gpu-operator-v23.9.1  v23.9.1      Install complete
- Now we can download the “values” for the Helm chart (which is the configuration that was passed to the Helm chart during the last installation or upgrade).
root@bcm-cluster:~# helm get values -n gpu-operator gpu-operator | sed '/USER-SUPPLIED VALUES:/d' > gpu-operator-values.yaml
root@bcm-cluster:~# cat gpu-operator-values.yaml
driver:
  enabled: false
operator:
  defaultRuntime: containerd
toolkit:
  enabled: true
  env:
  - name: CONTAINERD_CONFIG
    value: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
The above values output shows that in this case the NVIDIA GPU Operator is in charge of deploying the toolkit: toolkit.enabled is set to true. This means that updating this Helm chart is important!
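If you only want to check this one setting, a one-liner such as the following can be used (a minimal sketch; it assumes jq is installed on the head node):

# Prints true/false, or null if the setting was never supplied by the user
helm get values -n gpu-operator gpu-operator -o json | jq '.toolkit.enabled'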
Upgrading the Helm chart
- We can look up the version we wish to install next.
root@bcm-cluster:~# module load kubernetes
root@bcm-cluster:~# helm repo list
NAME                  URL
kyverno               https://kyverno.github.io/kyverno/
prometheus-community  https://prometheus-community.github.io/helm-charts
nvidia                https://helm.ngc.nvidia.com/nvidia
Next, we update the repositories and search for the available gpu-operator chart versions.
root@bcm-cluster:~# helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "kyverno" chart repository
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
root@bcm-cluster:~# helm search repo nvidia -l | grep gpu-operator
nvidia/gpu-operator    v24.6.2    v24.6.2    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v24.6.1    v24.6.1    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v24.6.0    v24.6.0    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v24.3.0    v24.3.0    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v23.9.2    v23.9.2    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v23.9.1    v23.9.1    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v23.9.0    v23.9.0    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v23.6.2    v23.6.2    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v23.6.1    v23.6.1    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v23.6.0    v23.6.0    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v23.3.2    v23.3.2    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v23.3.1    v23.3.1    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v23.3.0    v23.3.0    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v22.9.2    v22.9.2    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v22.9.1    v22.9.1    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v22.9.0    v22.9.0    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.11.1    v1.11.1    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.11.0    v1.11.0    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.10.1    v1.10.1    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.10.0    v1.10.0    NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.9.1     v1.9.1     NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.9.0     v1.9.0     NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.8.2     v1.8.2     NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.8.1     v1.8.1     NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.8.0     v1.8.0     NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.7.1     v1.7.1     NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    v1.7.0     v1.7.0     NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    1.6.2      1.6.2      NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    1.6.1      1.6.1      NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    1.6.0      1.6.0      NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    1.5.2      1.5.2      NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    1.5.1      1.5.1      NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    1.5.0      1.5.0      NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    1.4.0      1.4.0      NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    1.3.0      1.3.0      NVIDIA GPU Operator creates/configures/manages ...
nvidia/gpu-operator    1.2.0      1.2.0      NVIDIA GPU Operator creates/configures/manages ...
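Before upgrading, it can also be helpful to inspect the metadata of the target chart version. This is a minimal sketch using a standard Helm command; the version shown is the one we upgrade to below:

helm show chart nvidia/gpu-operator --version v24.6.2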
- You may update the values file we exported earlier if you wish to change the configuration. An older version might be missing certain configuration settings; for clarity we share the values file that we provide with the latest BCM 10.0 version (where we made v24.6.2 the default since 10.24.09):

cdi:
  default: false
  enabled: false
dcgm:
  enabled: false
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: false
devicePlugin:
  enabled: true
  env:
  - name: DEVICE_LIST_STRATEGY
    value: volume-mounts
driver:
  enabled: false
  rdma:
    enabled: false
kataManager:
  enabled: false
  env:
  - name: CONTAINERD_CONFIG
    value: /cm/local/apps/containerd/var/etc/conf.d/yy-kata-containers.toml
mig:
  strategy: single
migManager:
  enabled: true
nfd:
  enabled: true
sandboxWorkloads:
  enabled: false
toolkit:
  enabled: false
  env:
  - name: CONTAINERD_CONFIG
    value: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
  - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
    value: "false"
  - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
    value: "true"
validator:
  driver:
    env:
    - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
      value: "true"
This values file can be used as a replacement for the one we exported, or we can customize it. Please note that this file sets toolkit.enabled to false. If the value was true in your case, then the cm-nvidia-container-toolkit (or nvidia-container-toolkit) package has to be installed on the relevant hosts, or you should change the value back to true.
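Before replacing your exported values, it can be useful to compare the two files so that none of your customizations are lost (a minimal sketch; gpu-operator-values-10.24.09.yaml is a hypothetical name for a copy of the values file shown above):

# gpu-operator-values-10.24.09.yaml: hypothetical filename for a copy of the values shown above
diff -u gpu-operator-values.yaml gpu-operator-values-10.24.09.yaml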
- Next, we can do the upgrade. Please use the correct version in the --version parameter if it differs.

helm upgrade --set operator.upgradeCRD=true --disable-openapi-validation -i --wait \
  -f gpu-operator-values.yaml --version=v24.6.2 --timeout 10m0s \
  --namespace gpu-operator gpu-operator nvidia/gpu-operator
More details on --set operator.upgradeCRD=true and --disable-openapi-validation can be found in the NVIDIA GPU Operator upgrade documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/upgrade.html.
Verify the upgrade
We can check the Helm history to see the status.
root@bcm-cluster:~# helm history -n gpu-operator gpu-operator
REVISION  UPDATED                   STATUS      CHART                 APP VERSION  DESCRIPTION
1         Mon Sep 30 16:29:19 2024  superseded  gpu-operator-v23.3.2  v23.3.2      Install complete
2         Mon Sep 30 16:32:08 2024  deployed    gpu-operator-v24.6.2  v24.6.2      Upgrade complete
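We can also confirm that all operator pods come back up after the upgrade (a minimal sketch using standard kubectl commands; pod names will differ per cluster):

kubectl get pods -n gpu-operator
# Optionally watch until everything reaches Running/Completed:
kubectl get pods -n gpu-operator -w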
On the nodes managed by the GPU operator we can (in case toolkit.enabled is set to true) check the binaries deployed by the GPU operator to verify that the version of the nvidia-container-toolkit is greater than or equal to 1.16.2.
[root@bcm-cluster ~]# ssh node004
...
[root@node004 ~]# /usr/local/nvidia/toolkit/nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.16.2
commit: a5a5833c14a15fd9c86bcece85d5ec6621b65652
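As a final end-to-end check, you can schedule a small test pod that requests a GPU and runs nvidia-smi. This is a minimal sketch, not part of the original procedure; the pod name and CUDA image tag are assumptions, so adjust them to images available in your environment:

# Hypothetical pod name and image tag; any CUDA base image reachable from the cluster will do.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs gpu-smoke-test        # should show the nvidia-smi table once the pod has completed
kubectl delete pod gpu-smoke-test  # clean up afterwards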