How do I upgrade CMDaemon and DCGM packages on head nodes and compute nodes?

August 2021: Recently uncovered security issues require cmdaemon and cuda-dcgm to be updated to mitigate known software defects. Below are instructions for identifying affected systems and updating them to the new packages. We strongly encourage users to update these packages at their earliest convenience.

Am I affected?

CMDaemon

If your system is not using pre-release or nightly packages, you should upgrade your head nodes and software images, and then re-provision your compute nodes, to get the latest cmd packages. You will need to confirm that your cmdaemon package, as well as all of its dependencies, is updated to at least the minimum version below.

If you are unsure if you are running pre-release or nightly builds please use the script below to analyse your system and take appropriate corrective action.

NOTE: To fully apply this fix both the head nodes and software images must be updated and compute resources restarted. We understand this will take time but all copies of cmd must be patched.

NOTE: Before patching please make sure you have a complete backup of systems.

BCM Version	Minimum hotfix version	Release
9.1	147456_cm9.1_ae8dc7d7e1	9.1-7
9.0	144433_cm9.0_d239521cf8	9.0-15
8.2	139020_cm8.2_5651b2d10a	8.2-26
8.1	132235_cm8.1_f416f8b942	8.1-23b
8.0	128612_cm8.0_6a5faf07f2	8.0-21a
8.0 on SLES 11	128610_cm8.0_ff875f0eff	8.0-21a
7.3	123578_cm7.3_78ed8cc7fe	7.3-23b
7.2	122046_cm7.2_262b6a1aba	7.2-34a

Safe cmdaemon patch levels for systems not running pre-release or nightly builds

cuda-dcgm

For cuda-dcgm you will need to update to version 2.2.9 or later. On many systems cuda-dcgm will not be installed, either because you are not supporting NVIDIA-based computational cards or because you are not running a Bright-packaged DCGM; in those cases it is normal not to find the package on your systems. Like cmdaemon, cuda-dcgm is installed locally, so it will need to be updated on every system where it has been installed. You can determine which systems it is running on by using the script or the manual steps included later in this article.

NOTE: cuda-dcgm fixes are unavailable for SLES 11 based systems due to the age of system dependencies. Sites using cuda-dcgm on SLES 11 will need to reinstall on a platform supported by cuda-dcgm 2.

Automatically via script

Bright is providing a Python script to automatically determine whether your cmdaemon needs to be updated. The script confirms whether the installed version of Bright has the hotfix and suggests corrective steps. We have taken many steps to make sure it will run on as many systems as possible, but if it does not operate properly on your head node please use the manual steps or open a ticket for additional assistance. If the md5sum does not match the value shown below, please refresh this page to see if there was an update.
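For sites that prefer to verify the download programmatically, the following is a minimal Python sketch of the same md5sum comparison. The expected value is the checksum published in this article at the time of writing; the function names are illustrative and are not part of any Bright tooling.

```python
import hashlib

# Checksum published in this article for hotfix_check.py (may change if the
# script is updated; refresh the page if it does not match).
EXPECTED_MD5 = "679884b96f8f3fd55df4f6da352f217a"


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected=EXPECTED_MD5):
    """True when the file's MD5 equals the expected (case-insensitive) value."""
    return md5_of_file(path) == expected.lower()
```

Calling checksum_matches("hotfix_check.py") after the download returns True only when the file matches the published checksum.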

# curl -o hotfix_check.py https://support.brightcomputing.com/hotfix/hotfix_check.py                                                                                                                                  
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13643  100 13643    0     0   120k      0 --:--:-- --:--:-- --:--:--  120k
# ls -l hotfix_check.py
-rw-r--r-- 1 root root 13643 Aug 19 14:28 hotfix_check.py
# md5sum hotfix_check.py
679884b96f8f3fd55df4f6da352f217a  hotfix_check.py
# python hotfix_check.py 

        Type             Hostname               CMD Version           CMD Action    DCGM Version DCGM Action         
--------------------------------------------------------------------------------------------------------------------------------------
    HeadNode    ew-b82-c7u5-08-19   137541_cm8.2_075522ab7d     UPGRADE NORMALLY                 NOT INSTALLED       

           Name Path                                                     CMD Version           CMD Action    DCGM Version          DCGM Action Reprovision Nodes
----------------------------------------------------------------------------------------------------------------------------------------------------------------
        compute /cm/images/compute                           137541_cm8.2_075522ab7d     UPGRADE NORMALLY             N/A        NOT INSTALLED UPDATE IMAGE FIRST
  default-image /cm/images/default-image                     137541_cm8.2_075522ab7d     UPGRADE NORMALLY             N/A        NOT INSTALLED UPDATE IMAGE FIRST
            gpu /cm/images/gpu                               137541_cm8.2_075522ab7d     UPGRADE NORMALLY 1.4.6.1-59_cm8.     UPGRADE NORMALLY UPDATE IMAGE FIRST

This shows a typical CentOS 7 head node running Bright Cluster Manager 8.2. There is one head node and 3 software images, one of which carries the GPU tools. As the example above shows, the CMD Action for the HeadNode is UPGRADE NORMALLY.

# yum update cmdaemon
[ output removed ]
# python hotfix_check.py 

        Type             Hostname               CMD Version           CMD Action    DCGM Version DCGM Action         
--------------------------------------------------------------------------------------------------------------------------------------
    HeadNode    ew-b82-c7u5-08-19   139020_cm8.2_5651b2d10a    NO UPGRADE NEEDED                 NOT INSTALLED       

           Name Path                                                     CMD Version           CMD Action    DCGM Version          DCGM Action Reprovision Nodes
----------------------------------------------------------------------------------------------------------------------------------------------------------------
        compute /cm/images/compute                           137541_cm8.2_075522ab7d     UPGRADE NORMALLY             N/A        NOT INSTALLED UPDATE IMAGE FIRST
  default-image /cm/images/default-image                     137541_cm8.2_075522ab7d     UPGRADE NORMALLY             N/A        NOT INSTALLED UPDATE IMAGE FIRST
            gpu /cm/images/gpu                               137541_cm8.2_075522ab7d     UPGRADE NORMALLY 1.4.6.1-59_cm8.     UPGRADE NORMALLY UPDATE IMAGE FIRST

Now the head node is up-to-date and you can see that cmdaemon lists NO UPGRADE NEEDED. Next we will move on to updating the software images. You will see UPDATE IMAGE FIRST listed under Reprovision Nodes because the script understands that the software image needs to be updated first.

We need to update cmdaemon on default-image and compute, and cmdaemon + cuda-dcgm on gpu. The example below is for RHEL and CentOS; for other platforms please refer to the Administration Manual for instructions on managing software images. Please note that the latest cm-libpam update should also be installed for clusters that do not allow a user to log into compute nodes unless that user has a job running on the node.

# yum --installroot /cm/images/default-image update cmdaemon cm-libpam
[ output removed ]
# yum --installroot /cm/images/compute update cmdaemon cm-libpam
[ output removed ]
# yum --installroot /cm/images/gpu update cmdaemon cuda-dcgm cm-libpam
[ output removed ]
# python hotfix_check.py

        Type             Hostname               CMD Version           CMD Action    DCGM Version DCGM Action         
--------------------------------------------------------------------------------------------------------------------------------------
    HeadNode    ew-b82-c7u5-08-19   139020_cm8.2_5651b2d10a    NO UPGRADE NEEDED                 NOT INSTALLED       

           Name Path                                                     CMD Version           CMD Action    DCGM Version          DCGM Action Reprovision Nodes
----------------------------------------------------------------------------------------------------------------------------------------------------------------
        compute /cm/images/compute                           139020_cm8.2_5651b2d10a    NO UPGRADE NEEDED             N/A        NOT INSTALLED node[002-004]
  default-image /cm/images/default-image                     139020_cm8.2_5651b2d10a    NO UPGRADE NEEDED             N/A        NOT INSTALLED node001
            gpu /cm/images/gpu                               139020_cm8.2_5651b2d10a    NO UPGRADE NEEDED 2.2.9.1-114_cm8    NO UPGRADE NEEDED node005

Here we see that for each software image the script now instructs us to re-provision the nodes assigned to that image, because it has detected that cmdaemon is still vulnerable on the running nodes. In this example we reboot the nodes so that they pick up the updated software image. Sites should use their preferred methods to remove nodes from operation and re-image them.

# cmsh -c 'device; foreach -n node001..node005 ( reboot )'
node001: Reboot in progress ...
node002: Reboot in progress ...
node003: Reboot in progress ...
node004: Reboot in progress ...
node005: Reboot in progress ...

After the nodes are re-imaged they will no longer show as needing reprovision.

# python hotfix_check.py 

        Type             Hostname               CMD Version           CMD Action    DCGM Version DCGM Action         
--------------------------------------------------------------------------------------------------------------------------------------
    HeadNode    ew-b82-c7u5-08-19   139020_cm8.2_5651b2d10a    NO UPGRADE NEEDED                 NOT INSTALLED       

           Name Path                                                     CMD Version           CMD Action    DCGM Version          DCGM Action Reprovision Nodes
----------------------------------------------------------------------------------------------------------------------------------------------------------------
        compute /cm/images/compute                           139020_cm8.2_5651b2d10a    NO UPGRADE NEEDED             N/A        NOT INSTALLED 
  default-image /cm/images/default-image                     139020_cm8.2_5651b2d10a    NO UPGRADE NEEDED             N/A        NOT INSTALLED 
            gpu /cm/images/gpu                               139020_cm8.2_5651b2d10a    NO UPGRADE NEEDED 2.2.9.1-114_cm8    NO UPGRADE NEEDED

At this point your cluster is patched. All nodes and software images are NO UPGRADE NEEDED and/or NOT INSTALLED.
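As a rough illustration of how the report can be read mechanically, the sketch below scans the text output of hotfix_check.py and lists rows that still need work. It assumes only the action strings that appear in the examples in this article (UPGRADE NORMALLY, NO UPGRADE NEEDED, NOT INSTALLED) and treats any text after the last action on a row as the Reprovision Nodes column; the script may emit other actions that this sketch does not know about, and the function name is hypothetical.

```python
import re

# Action strings seen in the example reports in this article (an assumption:
# the real script may print others, e.g. for pre-release builds).
_ACTION = re.compile(r"NO UPGRADE NEEDED|UPGRADE NORMALLY|NOT INSTALLED")


def pending_items(report):
    """Return (name, reason) tuples for rows that still need attention."""
    pending = []
    for line in report.splitlines():
        matches = list(_ACTION.finditer(line))
        if not matches:
            continue  # header, separator, or blank line
        fields = line.split()
        # Head node rows start with the literal type "HeadNode";
        # software image rows start with the image name.
        name = fields[1] if fields[0] == "HeadNode" else fields[0]
        if any(m.group() == "UPGRADE NORMALLY" for m in matches):
            pending.append((name, "upgrade packages"))
            continue
        # Anything after the last action is the Reprovision Nodes column.
        leftover = line[matches[-1].end():].strip()
        if leftover:
            pending.append((name, "reprovision " + leftover))
    return pending
```

An empty list corresponds to the final report above, where every row shows NO UPGRADE NEEDED and/or NOT INSTALLED and the Reprovision Nodes column is empty.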

Checking nodes via PDSH on RPM based system

This works across all versions of Bright Cluster Manager presuming there is ssh access. This command should work on RHEL, CentOS, and SLES based systems. You can use the same command but substitute cuda-dcgm to confirm the other package.

# pdsh -a -- rpm -q cmdaemon 2> /dev/null | dshbak -c
----------------
node001,ew-bright80
----------------
cmdaemon-8.0-128606_cm8.0_f97a574a9d.x86_64

Here all systems are on version 8.0, release 128606, with hash f97a574a9d. Comparing against the table above, this is below the fixed release for 8.0 (128612, or 128610 on SLES 11), so these systems need to be updated.

Checking nodes via PDSH on Ubuntu based systems

This works across all versions of Bright Cluster Manager presuming there is ssh access. This command should work on Ubuntu based systems.

# pdsh -a dpkg-query --show cmdaemon 2> /dev/null | dshbak -c
----------------
node001,ew-82ubuntu
----------------
cmdaemon        8.2-139020-cm8.2-5651b2d10a

Here we find version 8.2, release 139020, with hash 5651b2d10a. This version is up-to-date and does not need to be patched, since both the release (139020) and the hash (5651b2d10a) match the fixed version for 8.2 in the table above.
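The manual comparison against the table can be sketched in Python as follows. This is an illustration rather than the logic used by hotfix_check.py: it assumes that release numbers increase within a Bright version, that any release at or above the table's minimum contains the fix, and that the system is not running pre-release or nightly builds. The function names and the sles11 parameter are hypothetical.

```python
import re

# Minimum fixed releases, copied from the table earlier in this article.
MIN_FIXED_RELEASE = {
    "9.1": 147456,
    "9.0": 144433,
    "8.2": 139020,
    "8.1": 132235,
    "8.0": 128612,
    "7.3": 123578,
    "7.2": 122046,
}
MIN_FIXED_RELEASE_80_SLES11 = 128610

# Matches both "cmdaemon-8.0-128606_cm8.0_f97a574a9d.x86_64" (rpm -q)
# and "cmdaemon  8.2-139020-cm8.2-5651b2d10a" (dpkg-query --show).
_VERSION = re.compile(r"(\d+\.\d+)-(\d+)[_-]cm\d+\.\d+[_-]([0-9a-f]+)")


def parse_cmdaemon_version(text):
    """Return (bright_version, release_number, commit_hash) from package output."""
    match = _VERSION.search(text)
    if match is None:
        raise ValueError("unrecognised cmdaemon version string: %r" % text)
    version, release, commit = match.groups()
    return version, int(release), commit


def needs_upgrade(text, sles11=False):
    """True when the release is below the minimum fixed release for its version."""
    version, release, _ = parse_cmdaemon_version(text)
    minimum = MIN_FIXED_RELEASE[version]
    if sles11 and version == "8.0":
        minimum = MIN_FIXED_RELEASE_80_SLES11
    return release < minimum
```

Applied to the two examples above, the 8.0 package at release 128606 is reported as needing an upgrade, while the 8.2 package at release 139020 is not.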

Checking software images via chroot

This example should work for both rpm and deb based images.

# cmsh -c 'softwareimage; list -f path:0' | ( while read imagepath; do echo -n "$imagepath : ";chroot $imagepath rpm -q cmdaemon 2> /dev/null || chroot $imagepath dpkg-query --show cmdaemon; done )
/cm/images/default-image : cmdaemon-9.1-147456_cm9.1_ae8dc7d7e1.x86_64
/cm/images/default-image-centos8-x86_64 : cmdaemon-9.1-147161_cm9.1_13cd06648a.x86_64
/cm/images/default-image-ubuntu1804-x86_64 : cmdaemon   9.1-147161-cm9.1-13cd06648a

Updating Packages

This section summarizes the official documentation in the Bright Administration Manual, which contains detailed instructions covering software management. If you have questions, please refer back to the Administration Manual.

Version	Administration Manual chapter
9.1	Chapter 12 – Post-Installation Software Management
9.0	Chapter 11 – Post-Installation Software Management
8.2	Chapter 11 – Post-Installation Software Management
8.1	Chapter 11 – Post-Installation Software Management
8.0	Chapter 11 – Post-Installation Software Management
7.3	Chapter 9 – Post-Installation Software Management
7.2	Chapter 9 – Post-Installation Software Management

There are 3 major steps that need to occur when updating these packages.

1. Update the head node(s).
2. Update the software images.
3. Re-provision the nodes to receive the new software packages.

NOTE: For data nodes that cannot be re-provisioned, the software packages will need to be installed on the host directly, following the same procedure as for the head node.

Updating the head node

For each head node please use the package manager to install cmdaemon, cuda-dcgm ( if applicable ) and all dependencies. This process should be repeated for the passive head node if there is one in your environment. It is not necessary to reboot the head node after update.

# For RHEL based systems
# yum update cmdaemon  ( and cuda-dcgm if needed on head node )

# For Ubuntu based systems
# apt update
# apt install cmdaemon  ( and cuda-dcgm if needed on head node )

# For SLES based systems
# zypper update cmdaemon ( and cuda-dcgm if needed on head node )
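To make the optional cuda-dcgm part of the commands above explicit, here is a small Python sketch that builds the command lines per distribution family. The family keys ("rhel", "ubuntu", "sles") and the function name are illustrative assumptions; the commands themselves are the ones listed above.

```python
def head_node_update_commands(family, with_dcgm=False):
    """Return the list of commands (as argument lists) for one head node.

    `family` is one of "rhel", "ubuntu" or "sles" (hypothetical labels).
    Set `with_dcgm` when cuda-dcgm is installed on the head node.
    """
    packages = ["cmdaemon"]
    if with_dcgm:
        packages.append("cuda-dcgm")

    if family == "rhel":
        return [["yum", "update"] + packages]
    if family == "ubuntu":
        # Refresh the package index first, then install the new versions.
        return [["apt", "update"], ["apt", "install"] + packages]
    if family == "sles":
        return [["zypper", "update"] + packages]
    raise ValueError("unknown distribution family: %r" % family)


# Example: print the commands instead of running them.
for command in head_node_update_commands("ubuntu", with_dcgm=True):
    print(" ".join(command))
```

The commands are returned as argument lists so that they can be reviewed or printed before being run as root on each head node, including the passive head node if there is one.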

Updating software images

For each software image you will need to update the installed cmdaemon package and then reprovision the affected nodes. There are simpler ways to update systems, but the method below should work in a wide range of situations. Remember that updating the image alone is not enough: you have to redeploy the image to update the cmdaemon packages on the running nodes. Please note that the latest cm-libpam update should also be installed for clusters that do not allow a user to log into compute nodes unless that user has a job running on the node.

# Enter the image ( 8.2 and higher )
# cm-chroot-sw-img /cm/images/images

# Enter the image ( 7.2 - 8.1 )
# chroot /cm/images/images /bin/bash

# For RHEL based systems
# yum update cmdaemon cm-libpam ( and cuda-dcgm if needed on software image )

# For Ubuntu based systems
# apt update
# apt install cmdaemon cm-libpam ( and cuda-dcgm if needed on software image )

# For SLES based systems
# zypper update cmdaemon cm-libpam ( and cuda-dcgm if needed on software image )

When the software image has been updated, the changes must then be provisioned out to the compute nodes in the cluster. This is most easily achieved in one of these two ways:

Reboot the node; it should automatically get the new software on sync.
Use the imageupdate -w command in cmsh to manually start a sync, and then restart the cmd and cuda-dcgm services on the node.

Data nodes

For nodes with persistent data where a full sync is impossible you should update the software local to the node using the steps in the head node section. This will only survive until the node is reinstalled so please do not forget to update the software image as well.

Notes and Known issues

When updating older releases you may run into issues that are resolved by newer packages. Some of the issues you may encounter are listed below.

Version 8.2: Transaction error from node-installer-nfsroot

Transaction check error:
file /cm/node-installer/usr/share/redhat-release from install of node-installer-nfsroot-8.2-824_cm8.2.x86_64 conflicts with file from package node-installer-nfsroot-8.2-697_cm8.2.x86_64

If you get this error please complete the following steps and then retry the cmdaemon upgrade.

# yum remove node-installer-nfsroot
# yum install node-installer-nfsroot

Version 8.2: cmd fails to start after update

If your cmdaemon will not start please check the following.

# ldd `which cmd` | grep 'not found'
        libcrypto.so.1.0.0 => not found

If you get the above results please execute the following command on all head nodes.

# yum update net-snmp-recent
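The ldd check above can also be expressed as a short Python helper that extracts the names of unresolved libraries from ldd output. This is only a sketch; the function name is hypothetical, and the ldd output must be captured separately (for example by running ldd against the cmd binary on the head node).

```python
def missing_libraries(ldd_output):
    """Return the library names that ldd reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Lines look like "libname.so => /path (0x...)" or
        # "libname.so => not found".
        name, arrow, target = line.partition("=>")
        if arrow and target.strip() == "not found":
            missing.append(name.strip())
    return missing
```

If the returned list contains libcrypto.so.1.0.0, the net-snmp-recent update described above applies.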

Version 8.x: cmdaemon scriptlet failure / db upgrade failure

If you receive the following scriptlet error or similar messages from systemctl status cmd please open a support ticket so that we may assist you in repairing the database. There was a defect in MariaDB which caused this test___ table to end up in a state where it can neither be created nor destroyed. More recent versions of Bright Cluster Manager no longer use this table.

Error: Could not update cmdaemon

Debug: FAILED: CREATE TABLE cloned_table_880 LIKE test___

Updated on December 3, 2021