A security issue has been found in nvidia-container-toolkit. All nodes that have this package installed need to have the package updated. On BCM clusters, the package may be called either cm-nvidia-container-toolkit or nvidia-container-toolkit. The (cm-)nvidia-container-toolkit package must be updated on the head nodes and in all relevant software images.
1. Checking for NVIDIA Container Toolkit Packages
Use one of the following commands to check whether either package is installed.
Ubuntu
apt list --installed nvidia-container-toolkit cm-nvidia-container-toolkit
RPM based systems
rpm -q nvidia-container-toolkit cm-nvidia-container-toolkit
2. Updating the NVIDIA Container Toolkit Packages
Now that we know which package to update, we can proceed as follows.
We can start with the Head Nodes. It makes sense to first check (using the commands from Section 1) whether the packages are installed at all; otherwise, updating them won’t be necessary.
Ubuntu
apt update
apt-get install --only-upgrade cm-nvidia-container-toolkit # or nvidia-container-toolkit
RHEL based systems
yum check-update
yum update cm-nvidia-container-toolkit # or nvidia-container-toolkit
SLES
zypper refresh
zypper update cm-nvidia-container-toolkit # or nvidia-container-toolkit
3. Updating Software Images
To check and install inside a software image, execute the commands from Sections 1 and 2 in a chroot environment created by cm-chroot-sw-img. For example:
root@bcm10-cluster:~# cm-chroot-sw-img /cm/images/default-image
mounted /cm/images/default-image/dev
mounted /cm/images/default-image/dev/pts
mounted /cm/images/default-image/proc
mounted /cm/images/default-image/sys
mounted /cm/images/default-image/run
mounted /run/systemd/resolve/stub-resolv.conf -> /cm/images/default-image/run/systemd/resolve/resolv.conf
Using chroot with mounted virtual filesystems to chroot in /cm/images/default-image....
Type 'exit' or ctrl-D to exit from the chroot in the software image.
This also unmounts the above mentioned /dev /dev/pts /proc /sys /run filesystems in the software image.
root@default-image:/# apt list --installed nvidia-container-toolkit cm-nvidia-container-toolkit
Listing... Done
cm-nvidia-container-toolkit/unknown,now 1.14.2-100091-cm-a32a4f4d71 amd64 [installed]
N: There are 2 additional versions. Please use the '-a' switch to see them.
root@default-image:/# apt update
...
root@default-image:/# apt install --only-upgrade cm-nvidia-container-toolkit
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
  cm-nvidia-container-toolkit
1 upgraded, 0 newly installed, 0 to remove and 319 not upgraded.
Need to get 5,164 kB of archives.
After this operation, 11.0 MB of additional disk space will be used.
Get:1 http://dev.brightcomputing.com/bright/apt/cm/amd64/trunk/ubuntu/2004/base ./ cm-nvidia-container-toolkit 1.16.2-100099-cm-7290982b9f [5,164 kB]
Fetched 5,164 kB in 0s (53.7 MB/s)
(Reading database ... 201134 files and directories currently installed.)
Preparing to unpack .../cm-nvidia-container-toolkit_1.16.2-100099-cm-7290982b9f_amd64.deb ...
Unpacking cm-nvidia-container-toolkit (1.16.2-100099-cm-7290982b9f) over (1.14.2-100091-cm-a32a4f4d71) ...
Setting up cm-nvidia-container-toolkit (1.16.2-100099-cm-7290982b9f) ...
Installing new version of config file /etc/nvidia-container-runtime/config.toml ...
Configure library path
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...
root@default-image:/# exit
exit
...
root@bcm10-cluster:~#
Typically, the above has to be repeated for all software images. The example above uses Ubuntu; for non-Ubuntu software images, use the equivalent commands for the respective distribution (see Sections 1 and 2 of this KB article).
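As a quick way to see which images still need attention, the check from Section 1 can be run against every image in one loop. This is a minimal sketch for Ubuntu-based images; the /cm/images location is the default and may differ on your cluster, and RPM-based images would need the rpm -q variant instead:

```shell
# Sketch: report the installed toolkit version in every software image.
# Assumes Ubuntu-based images under the default /cm/images directory.
for img in /cm/images/*/; do
    echo "== ${img} =="
    chroot "${img}" apt list --installed nvidia-container-toolkit cm-nvidia-container-toolkit 2>/dev/null
done
```

For the actual upgrade, it is still safest to enter each image with cm-chroot-sw-img as shown above, so the virtual filesystems are mounted correctly.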
4. Updating the Compute Nodes
Now that the software images have been prepared, we can propagate the update to the compute nodes in one of several ways. Please note that the admin manual also covers various ways to update running nodes here: https://support.brightcomputing.com/manuals/10/admin-manual.pdf#page=290&zoom=100,77,437.
Please note that only ONE of these options is needed to apply the changes to the nodes successfully.
- Reboot the nodes (so they get re-provisioned at boot)
- Keep them running and use imageupdate through BCM's cmsh (instructions below)
- Keep them running and use pdsh to install the package update specifically (least risky)
1. Reboot of the nodes
This can be done by draining the nodes first, waiting for important workloads to finish, and then rebooting. Provisioning will happen when they boot.
Note: for this update, no reboot is required, and nodes can be left running.
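If the reboot route is chosen anyway, the drain/reboot cycle can be sketched as follows from the head node. This is an illustration only: node001 is an example node, cmsh's -c option is used to pass device-mode commands, and the exact drain behaviour depends on the configured workload manager integration:

```shell
# Sketch: drain, reboot, and re-open a node (example node: node001).
cmsh -c "device; drain node001"    # stop new workloads from being scheduled on the node
# ... wait for running workloads on node001 to finish ...
cmsh -c "device; reboot node001"   # the node is re-provisioned when it boots
cmsh -c "device; undrain node001"  # re-open the node once it is back up
```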
2. Execute an image update
The node will not be rebooted. Instead, the filesystem from the software image associated with that node will be synchronized to the running node.
The cmsh command is explained in the manual here: https://support.brightcomputing.com/manuals/10/admin-manual.pdf#page=298&zoom=100,77,380.
Below is an example of how to issue an imageupdate for a node using cmsh, and how to do it for a whole category as well. The -w flag disables the dry-run (which is the default).
root@bcm10-cluster:~# cmsh
[bcm10-cluster]% device
[bcm10-cluster->device]% imageupdate -w node001
Thu Sep 26 14:45:07 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node001:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:21 2024 [notice] bcm10-cluster: Provisioning completed: sent bcm10-cluster:/cm/images/default-image to node001:/, mode UPDATE, dry run = no
imageupdate -w node001 [ COMPLETED ]
[bcm10-cluster->device]% imageupdate -w -c default
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node001:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node002:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node003:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node004:/, mode UPDATE, dry run = no
Thu Sep 26 14:46:52 2024 [notice] bcm10-cluster: Provisioning started: sending bcm10-cluster:/cm/images/default-image to node005:/, mode UPDATE, dry run = no
...
The dry-run is also helpful: it allows you to execute the command and inspect the synchronization logs in the /var/spool/cmd directory to see what would have been synchronized, deleted, and so on. Keep in mind that an image update can remove local modifications on a node that are not present in its software image. For the above command, the log file for node001 would be: /var/spool/cmd/node001-\\.rsync.
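As a sketch, a dry run for node001 followed by an inspection of the resulting rsync log might look like the following. The log filename pattern below is a hypothetical glob; check /var/spool/cmd for the actual name on your cluster:

```shell
# Sketch: dry-run an image update and look for toolkit-related changes.
cmsh -c "device; imageupdate node001"          # dry run is the default (no -w)
grep -i nvidia /var/spool/cmd/node001-*.rsync  # hypothetical log filename pattern
```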
3. Install package updates using pdsh (safest)
The pdsh command is a parallel execution helper, as discussed in our admin manual here: https://support.brightcomputing.com/manuals/10/admin-manual.pdf#page=727&zoom=100,77,642.
Below are a few examples to update a single node and a category of nodes. Please note that some commands require an additional flag, such as -y, because pdsh does not support interactive prompts.
# Ubuntu examples
pdsh -w node001 "apt update && apt-get install --only-upgrade -y cm-nvidia-container-toolkit"
pdsh -w node00[1-2] "apt update && apt-get install --only-upgrade -y cm-nvidia-container-toolkit"
pdsh -w category=default "apt update && apt-get install --only-upgrade -y cm-nvidia-container-toolkit"
A couple more examples for RHEL and SLES:
# RHEL (note: "yum check-update" exits with status 100 when updates are
# available, so it is chained with ";" rather than "&&")
pdsh -w category=default "yum check-update; yum update -y cm-nvidia-container-toolkit"
# SLES
pdsh -w category=default "zypper refresh && zypper update -y cm-nvidia-container-toolkit"
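After the update, it is worth confirming that every node is now running the fixed version. A minimal sketch for Ubuntu-based nodes (the category name and package name are from the examples above; use rpm -q instead of dpkg-query on RPM-based systems):

```shell
# Sketch: print the installed toolkit version on every node of a category,
# sorted so version mismatches stand out.
pdsh -w category=default "dpkg-query -W cm-nvidia-container-toolkit" | sort
```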