
Required security upgrade for NVIDIA GPU Operator

A security issue has been found in nvidia-container-toolkit. Because the NVIDIA GPU Operator can take care of installing this toolkit on the Kubernetes hosts that need it for proper integration of GPUs with Kubernetes, the security issue may affect the GPU Operator as well. This feature is controlled via the toolkit.enabled setting in the Helm chart.

Please note that on most BCM clusters this toolkit is managed through a system package instead. It is therefore strongly advised to execute the steps from the following KB article first, since it focuses on upgrading the vulnerable package: https://kb.brightcomputing.com/knowledge-base/required-security-upgrade-for-nvidia-container-toolkit/. Afterwards, return to this KB article to check whether the NVIDIA GPU Operator also requires an update on your cluster.

Affected versions

BCM relied on this NVIDIA GPU Operator feature to install the toolkit between the following versions. Support for the NVIDIA GPU Operator was introduced in version 9.2-4 (11th of August 2022), and in version 9.2-14 (25th of September 2023) we stopped using this functionality and started relying on system packages instead. BCM 10.0 (since the first version, 10.23.07) has always deployed using system packages.

Check if your GPU Operator is affected
  1. First check the installed NVIDIA GPU Operator version. Determine under which namespace and release name it is installed; the default is “gpu-operator” for both.
    root@bcm-cluster:~# helm list -n gpu-operator
    NAME         NAMESPACE    REVISION UPDATED                                  STATUS   CHART                APP VERSION
    gpu-operator gpu-operator 1        2024-09-30 15:31:30.392818917 +0200 CEST deployed gpu-operator-v23.9.1 v23.9.1   
  2. We can see more information about the Helm release with the history command:
    root@bcm-cluster:~# helm history -n gpu-operator gpu-operator
    REVISION UPDATED                  STATUS   CHART                APP VERSION DESCRIPTION     
    1        Mon Sep 30 15:31:30 2024 deployed gpu-operator-v23.9.1 v23.9.1     Install complete
  3. Now we can download the “values” for the Helm chart (the configuration that was passed to the chart during the last installation or upgrade).
    root@bcm-cluster:~# helm get values -n gpu-operator gpu-operator | sed '/USER-SUPPLIED VALUES:/d' > gpu-operator-values.yaml
    root@bcm-cluster:~# cat gpu-operator-values.yaml 
    driver:
      enabled: false
    operator:
      defaultRuntime: containerd
    toolkit:
      enabled: true
      env:
      - name: CONTAINERD_CONFIG
        value: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml

    The values output above shows that in this case the NVIDIA GPU Operator is in charge of deploying the toolkit: toolkit.enabled is set to true. This means that updating this Helm chart is important!
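    As a quick sanity check, the toolkit setting can also be inspected from the shell. A minimal sketch, assuming the gpu-operator-values.yaml file exported in the previous step:

```shell
# Print the toolkit section of the exported values file; if the output
# shows "enabled: true", the GPU Operator deploys the toolkit and the
# Helm chart should be upgraded.
grep -A1 '^toolkit:' gpu-operator-values.yaml
```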

Upgrading the Helm chart
  1. We can look up the version we wish to install next.
    root@bcm-cluster:~# module load kubernetes
    root@bcm-cluster:~# helm repo list
    NAME                 URL                                               
    kyverno              https://kyverno.github.io/kyverno/                
    prometheus-community https://prometheus-community.github.io/helm-charts
    nvidia               https://helm.ngc.nvidia.com/nvidia               

    Next, we update the repositories and search the available chart versions.

    root@bcm-cluster:~# helm repo update
    Hang tight while we grab the latest from your chart repositories...
    ...Successfully got an update from the "kyverno" chart repository
    ...Successfully got an update from the "prometheus-community" chart repository
    ...Successfully got an update from the "nvidia" chart repository
    Update Complete. ⎈Happy Helming!⎈
    
    root@bcm-cluster:~# helm search repo nvidia -l | grep gpu-operator
    nvidia/gpu-operator              v24.6.2       v24.6.2     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v24.6.1       v24.6.1     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v24.6.0       v24.6.0     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v24.3.0       v24.3.0     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v23.9.2       v23.9.2     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v23.9.1       v23.9.1     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v23.9.0       v23.9.0     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v23.6.2       v23.6.2     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v23.6.1       v23.6.1     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v23.6.0       v23.6.0     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v23.3.2       v23.3.2     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v23.3.1       v23.3.1     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v23.3.0       v23.3.0     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v22.9.2       v22.9.2     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v22.9.1       v22.9.1     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v22.9.0       v22.9.0     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.11.1       v1.11.1     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.11.0       v1.11.0     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.10.1       v1.10.1     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.10.0       v1.10.0     NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.9.1        v1.9.1      NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.9.0        v1.9.0      NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.8.2        v1.8.2      NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.8.1        v1.8.1      NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.8.0        v1.8.0      NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.7.1        v1.7.1      NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              v1.7.0        v1.7.0      NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              1.6.2         1.6.2       NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              1.6.1         1.6.1       NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              1.6.0         1.6.0       NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              1.5.2         1.5.2       NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              1.5.1         1.5.1       NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              1.5.0         1.5.0       NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              1.4.0         1.4.0       NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              1.3.0         1.3.0       NVIDIA GPU Operator creates/configures/manages ...
    nvidia/gpu-operator              1.2.0         1.2.0       NVIDIA GPU Operator creates/configures/manages ...
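    When scripting the choice of chart version, sort -V can order version strings from output like the above. A small sketch (the version list below is illustrative, not the full repository listing):

```shell
# Pick the highest gpu-operator chart version from a list of candidates.
# sort -V understands version-style numbering, so v24.6.2 sorts after v23.9.1.
printf 'v23.9.1\nv24.6.2\nv24.3.0\n' | sort -V | tail -n1
# prints: v24.6.2
```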
  2. You may update the values file we exported earlier if you wish to change the configuration. An older version might be missing certain configuration settings; for clarity, below is the values file that we ship with the latest BCM 10.0 version (where v24.6.2 has been the default since 10.24.09):
    cdi:
      default: false
      enabled: false
    dcgm:
      enabled: false
    dcgmExporter:
      enabled: true
      serviceMonitor:
        enabled: false
    devicePlugin:
      enabled: true
      env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts
    driver:
      enabled: false
      rdma:
        enabled: false
    kataManager:
      enabled: false
      env:
      - name: CONTAINERD_CONFIG
        value: /cm/local/apps/containerd/var/etc/conf.d/yy-kata-containers.toml
    mig:
      strategy: single
    migManager:
      enabled: true
    nfd:
      enabled: true
    sandboxWorkloads:
      enabled: false
    toolkit:
      enabled: false
      env:
      - name: CONTAINERD_CONFIG
        value: /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: "false"
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: "true"
    validator:
      driver:
        env:
        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
          value: "true"

    This values file can be used as a replacement for the one we exported, or it can be customized. Please note that it sets toolkit.enabled to false: if the value was true in your case, either the cm-nvidia-container-toolkit (or nvidia-container-toolkit) package has to be installed on the relevant hosts, or you should change the value back to true.
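    Before reusing this file, it can help to compare your previously exported values against the new defaults to spot settings that differ (such as toolkit.enabled). A sketch using diff; the file names are assumptions:

```shell
# Show line-by-line differences between the exported values and the new
# defaults. diff exits non-zero when the files differ, hence the "|| true"
# so the command also works under "set -e".
diff -u gpu-operator-values.yaml gpu-operator-values-new.yaml || true
```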


  3. Next, we can perform the upgrade. Please adjust the --version parameter if your target version differs.
    helm upgrade --set operator.upgradeCRD=true --disable-openapi-validation -i --wait -f gpu-operator-values.yaml --version=v24.6.2 --timeout 10m0s --namespace gpu-operator gpu-operator nvidia/gpu-operator 

    More details on --set operator.upgradeCRD=true and --disable-openapi-validation can be found in the NVIDIA GPU Operator upgrade documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/upgrade.html.

Verify the upgrade

We can check Helm to see the status.

root@bcm-cluster:~# helm history -n gpu-operator gpu-operator
REVISION UPDATED                  STATUS     CHART                APP VERSION DESCRIPTION     
1        Mon Sep 30 16:29:19 2024 superseded gpu-operator-v23.3.2 v23.3.2     Install complete
2        Mon Sep 30 16:32:08 2024 deployed   gpu-operator-v24.6.2 v24.6.2     Upgrade complete

On the nodes managed by the GPU Operator we can (in case toolkit.enabled is set to true) check the binaries deployed by the GPU Operator to verify that the nvidia-container-toolkit version is greater than or equal to 1.16.2.

[root@bcm-cluster ~]# ssh node004
...

[root@node004 ~]# /usr/local/nvidia/toolkit/nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.16.2
commit: a5a5833c14a15fd9c86bcece85d5ec6621b65652
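To check the reported version against the 1.16.2 minimum in a script, a sort -V comparison can be used. A minimal sketch; the installed version string is an assumption and would normally be parsed from the --version output above:

```shell
# Succeeds when the installed toolkit version is >= the required minimum.
# sort -V places the lowest version first, so if the required version is
# first (or equal), the installed version is new enough.
required="1.16.2"
installed="1.16.2"   # e.g. taken from the nvidia-container-toolkit --version output
lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
  echo "toolkit version OK"
else
  echo "toolkit version too old"
fi
```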


Updated on September 30, 2024