A security issue has been found in nvidia-container-toolkit. All nodes that have this package installed need to have the package updated. On BCM clusters, the package may be called either cm-nvidia-container-toolkit or nvidia-container-toolkit. The (cm-)nvidia-container-toolkit package must be updated on the head nodes and in all relevant software images.
1. Checking for NVIDIA Container Toolkit Packages
Use one of the following commands to check whether either package is installed.
Ubuntu
apt list --installed nvidia-container-toolkit cm-nvidia-container-toolkit
RPM based systems
rpm -q nvidia-container-toolkit cm-nvidia-container-toolkit
2. Updating the NVIDIA Container Toolkit Packages
Now that we know which package to update, we can proceed as follows.
We can start with the Head Nodes. It makes sense to first check (using the commands from Section 1) whether the packages are installed at all; otherwise, updating them won’t be necessary.
Ubuntu
apt update
apt-get install --only-upgrade cm-nvidia-container-toolkit # or nvidia-container-toolkit
RHEL based systems
yum check-update
yum update cm-nvidia-container-toolkit # or nvidia-container-toolkit
SLES
zypper refresh
zypper update cm-nvidia-container-toolkit # or nvidia-container-toolkit
3. Updating Software Images
To check and install inside a software image, execute the commands from Sections 1 and 2 in a chroot environment created by cm-chroot-sw-img. For example:
root@bcm10-cluster:~# cm-chroot-sw-img /cm/images/default-image
mounted /cm/images/default-image/dev
mounted /cm/images/default-image/dev/pts
mounted /cm/images/default-image/proc
mounted /cm/images/default-image/sys
mounted /cm/images/default-image/run
mounted /run/systemd/resolve/stub-resolv.conf -> /cm/images/default-image/run/systemd/resolve/resolv.conf
Using chroot with mounted virtual filesystems to chroot in /cm/images/default-image....
Type 'exit' or ctrl-D to exit from the chroot in the software image.
This also unmounts the above mentioned /dev /dev/pts /proc /sys /run filesystems in the software image.
root@default-image:/# apt list --installed nvidia-container-toolkit cm-nvidia-container-toolkit
Listing... Done
cm-nvidia-container-toolkit/unknown,now 1.14.2-100091-cm-a32a4f4d71 amd64 [installed]
N: There are 2 additional versions. Please use the '-a' switch to see them.
root@default-image:/# apt update
...
root@default-image:/# apt install --only-upgrade cm-nvidia-container-toolkit
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
  cm-nvidia-container-toolkit
1 upgraded, 0 newly installed, 0 to remove and 319 not upgraded.
Need to get 5,164 kB of archives.
After this operation, 11.0 MB of additional disk space will be used.
Get:1 http://dev.brightcomputing.com/bright/apt/cm/amd64/trunk/ubuntu/2004/base ./ cm-nvidia-container-toolkit 1.16.2-100099-cm-7290982b9f [5,164 kB]
Fetched 5,164 kB in 0s (53.7 MB/s)
(Reading database ... 201134 files and directories currently installed.)
Preparing to unpack .../cm-nvidia-container-toolkit_1.16.2-100099-cm-7290982b9f_amd64.deb ...
Unpacking cm-nvidia-container-toolkit (1.16.2-100099-cm-7290982b9f) over (1.14.2-100091-cm-a32a4f4d71) ...
Setting up cm-nvidia-container-toolkit (1.16.2-100099-cm-7290982b9f) ...
Installing new version of config file /etc/nvidia-container-runtime/config.toml ...
Configure library path
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...
root@default-image:/# exit
exit
...
root@bcm10-cluster:~#
Typically, the above has to be repeated for all software images. The example above uses Ubuntu; for non-Ubuntu software images, use the equivalent commands for the respective distribution (see Sections 1 and 2 of this KB article).
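As a quick way to see which images still need attention, the check from Section 1 can be run against every image in one loop. This is a minimal sketch for Ubuntu-based images; the /cm/images location is the default and may differ on your cluster, and RPM-based images would need the rpm -q variant instead:

```shell
# Sketch: report the installed toolkit version in every software image.
# Assumes Ubuntu-based images under the default /cm/images directory.
for img in /cm/images/*/; do
    echo "== ${img} =="
    chroot "${img}" apt list --installed nvidia-container-toolkit cm-nvidia-container-toolkit 2>/dev/null
done
```

For the actual upgrade, it is still safest to enter each image with cm-chroot-sw-img as shown above, so the virtual filesystems are mounted correctly.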
4. Updating the Compute Nodes
Now that the software images have been prepared, we can propagate the update to the compute nodes in one of several ways. Please note that the admin manual also covers various ways to update running nodes here: https://support.brightcomputing.com/manuals/10/admin-manual.pdf#page=290&zoom=100,77,437.
Please note that only ONE of these options is needed to apply the changes to the nodes successfully.
- Reboot the nodes (so they get re-provisioned at boot)
- Keep them running and use imageupdate through BCM's cmsh (instructions below)
- Keep them running and use pdsh to install the package update specifically (least risky)
1. Reboot of the nodes
This can be done by draining the nodes first, waiting for important workloads to finish, and then rebooting. Provisioning will happen when they boot.
Note: for this update, no reboot is required, and nodes can be left running.
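If the reboot route is chosen anyway, the drain/reboot cycle can be sketched as follows from the head node. This is an illustration only: node001 is an example node, cmsh's -c option is used to pass device-mode commands, and the exact drain behaviour depends on the configured workload manager integration:

```shell
# Sketch: drain, reboot, and re-open a node (example node: node001).
cmsh -c "device; drain node001"    # stop new workloads from being scheduled on the node
# ... wait for running workloads on node001 to finish ...
cmsh -c "device; reboot node001"   # the node is re-provisioned when it boots
cmsh -c "device; undrain node001"  # re-open the node once it is back up
```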
2. Execute an image update
The node will not be rebooted. Instead, the filesystem from the software image associated with that node will be synchronized to the running node.
The cmsh command is explained in the manual here: https://support.brightcomputing.com/manuals/10/admin-manual.pdf#page=298&zoom=100,77,380.
Below is an example of how to issue an imageupdate for a node using cmsh, and how to do it for a whole category as well. The -w flag disables the dry-run (which is the default).
root@bcm10-cluster:~# cmsh
[bcm10-cluster]% device
[bcm10-cluster->device]% imageupdate -w node001
Thu Sep 26 14:45:07 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node001:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:21 2024 [notice] bcm10-cluster: Provisioning completed: sent bcm10-cluster:/cm/images/default-image to node001:/, mode UPDATE, dry run = no
imageupdate -w node001 [ COMPLETED ]
[bcm10-cluster->device]% imageupdate -w -c default
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node001:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node002:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node003:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node004:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node005:/, mode UPDATE, dry run = no
...
The dry-run is also helpful: it allows you to execute the command and inspect the synchronization logs in the /var/spool/cmd directory to see what would have been synchronized, deleted, and so on. Keep in mind that an image update can remove local modifications on a node that are not present in its software image. For the above command, the log file for node001 would be: /var/spool/cmd/node001-\\.rsync.
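As a sketch, a dry run for node001 followed by an inspection of the resulting rsync log might look like the following. The log filename pattern below is a hypothetical glob; check /var/spool/cmd for the actual name on your cluster:

```shell
# Sketch: dry-run an image update and look for toolkit-related changes.
cmsh -c "device; imageupdate node001"          # dry run is the default (no -w)
grep -i nvidia /var/spool/cmd/node001-*.rsync  # hypothetical log filename pattern
```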
3. Install package updates using pdsh (safest)
The pdsh command is a parallel execution helper, as discussed in our admin manual here: https://support.brightcomputing.com/manuals/10/admin-manual.pdf#page=727&zoom=100,77,642.
Below are a few examples to update a single node and a category of nodes. Please note that some commands require an additional flag, such as -y, because pdsh does not support interactive prompts.
# Ubuntu examples
pdsh -w node001 "apt update && apt-get install --only-upgrade -y cm-nvidia-container-toolkit"
pdsh -w node00[1-2] "apt update && apt-get install --only-upgrade -y cm-nvidia-container-toolkit"
pdsh -w category=default "apt update && apt-get install --only-upgrade -y cm-nvidia-container-toolkit"
A couple more examples for RHEL and SLES:
# RHEL (note: "yum check-update" exits with status 100 when updates are
# available, so it is chained with ";" rather than "&&")
pdsh -w category=default "yum check-update; yum update -y cm-nvidia-container-toolkit"
# SLES
pdsh -w category=default "zypper refresh && zypper update -y cm-nvidia-container-toolkit"
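After the update, it is worth confirming that every node is now running the fixed version. A minimal sketch for Ubuntu-based nodes (the category name and package name are from the examples above; use rpm -q instead of dpkg-query on RPM-based systems):

```shell
# Sketch: print the installed toolkit version on every node of a category,
# sorted so version mismatches stand out.
pdsh -w category=default "dpkg-query -W cm-nvidia-container-toolkit" | sort
```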