August 2021: Due to recently uncovered security issues, cmdaemon and cuda-dcgm must be updated to mitigate known software defects. Below are instructions for identifying affected systems and updating them to the new packages. We strongly encourage users to update these packages at their earliest convenience.
Am I affected?
CMDaemon
If your system is not using pre-release or nightly packages, you should upgrade your head nodes and software images, and re-provision your compute nodes to get the latest cmd packages. You will need to confirm that your cmdaemon package, along with all of its dependencies, is updated to at least the minimum version listed in the table below (an example check follows the table).
If you are unsure whether you are running pre-release or nightly builds, please use the script below to analyse your system and take the appropriate corrective action.
NOTE: To fully apply this fix, both the head nodes and the software images must be updated and the compute resources restarted. We understand this will take time, but all copies of cmd must be patched.
NOTE: Before patching please make sure you have a complete backup of systems.
BCM Version | Minimum hotfix version | Release
---|---|---
9.1 | 147456_cm9.1_ae8dc7d7e1 | 9.1-7
9.0 | 144433_cm9.0_d239521cf8 | 9.0-15
8.2 | 139020_cm8.2_5651b2d10a | 8.2-26
8.1 | 132235_cm8.1_f416f8b942 | 8.1-23b
8.0 | 128612_cm8.0_6a5faf07f2 | 8.0-21a
8.0 on SLES 11 | 128610_cm8.0_ff875f0eff | 8.0-21a
7.3 | 123578_cm7.3_78ed8cc7fe | 7.3-23b
7.2 | 122046_cm7.2_262b6a1aba | 7.2-34a
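As a quick manual check, you can compare the cmdaemon package installed on the head node against the table above. A minimal example for an RPM based head node is shown below; Ubuntu systems can use dpkg-query as shown later in this article.
# rpm -q cmdaemon
Compare the release number and hash in the output against the minimum hotfix version for your BCM version in the table.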
cuda-dcgm
For cuda-dcgm you will need to update to version 2.2.9 or later. On many systems cuda-dcgm will not be installed: if you are not supporting NVIDIA based computational cards, or you are not running a Bright packaged DCGM, it is normal not to find the package installed on your systems. Like cmdaemon, cuda-dcgm is installed locally, so it will need to be updated on every system where it has been installed. You can easily determine which systems it is running on by running the script, or with the manual steps included later in this article.
NOTE: cuda-dcgm fixes are unavailable for SLES 11 based systems due to the age of system dependencies. Sites using cuda-dcgm on SLES 11 will need to reinstall with a platform supported by cuda-dcgm 2.
Automatically via script
Bright is providing a Python script to automatically determine whether your cmdaemon needs to be updated. The script will confirm whether the installed version of Bright Cluster Manager has the hotfix and will suggest corrective steps. We have taken many steps to make sure this will run on as many systems as possible, but if it does not operate properly on your head node, please use the manual steps or open a ticket for additional assistance. If the md5sum does not match, please refresh this page to see if there was an update.
# curl -o hotfix_check.py https://support.brightcomputing.com/hotfix/hotfix_check.py
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 13643 100 13643 0 0 120k 0 --:--:-- --:--:-- --:--:-- 120k
# ls -l hotfix_check.py
-rw-r--r-- 1 root root 13643 Aug 19 14:28 hotfix_check.py
# md5sum hotfix_check.py
679884b96f8f3fd55df4f6da352f217a hotfix_check.py
# python hotfix_check.py
Type Hostname CMD Version CMD Action DCGM Version DCGM Action
--------------------------------------------------------------------------------------------------------------------------------------
HeadNode ew-b82-c7u5-08-19 137541_cm8.2_075522ab7d UPGRADE NORMALLY NOT INSTALLED
Name Path CMD Version CMD Action DCGM Version DCGM Action Reprovision Nodes
----------------------------------------------------------------------------------------------------------------------------------------------------------------
compute /cm/images/compute 137541_cm8.2_075522ab7d UPGRADE NORMALLY N/A NOT INSTALLED UPDATE IMAGE FIRST
default-image /cm/images/default-image 137541_cm8.2_075522ab7d UPGRADE NORMALLY N/A NOT INSTALLED UPDATE IMAGE FIRST
gpu /cm/images/gpu 137541_cm8.2_075522ab7d UPGRADE NORMALLY 1.4.6.1-59_cm8. UPGRADE NORMALLY UPDATE IMAGE FIRST
This shows a typical CentOS 7 head node running Bright Cluster Manager 8.2. There is one head node and three software images, one of them running GPU tools. As we can see in the example above, the ACTION for the HeadNode is UPGRADE NORMALLY.
# yum update cmdaemon
[ output removed ]
# python hotfix_check.py
Type Hostname CMD Version CMD Action DCGM Version DCGM Action
--------------------------------------------------------------------------------------------------------------------------------------
HeadNode ew-b82-c7u5-08-19 139020_cm8.2_5651b2d10a NO UPGRADE NEEDED NOT INSTALLED
Name Path CMD Version CMD Action DCGM Version DCGM Action Reprovision Nodes
----------------------------------------------------------------------------------------------------------------------------------------------------------------
compute /cm/images/compute 137541_cm8.2_075522ab7d UPGRADE NORMALLY N/A NOT INSTALLED UPDATE IMAGE FIRST
default-image /cm/images/default-image 137541_cm8.2_075522ab7d UPGRADE NORMALLY N/A NOT INSTALLED UPDATE IMAGE FIRST
gpu /cm/images/gpu 137541_cm8.2_075522ab7d UPGRADE NORMALLY 1.4.6.1-59_cm8. UPGRADE NORMALLY UPDATE IMAGE FIRST
Now the head node is up to date and you can see cmdaemon lists NO UPGRADE NEEDED. Next we will move on to updating the software images. You will see UPDATE IMAGE FIRST listed under Reprovision Nodes because the script understands that the software image needs to be updated first.
We need to update cmdaemon on default-image and compute, and cmdaemon plus cuda-dcgm on gpu. The example below is for RHEL and CentOS; for other platforms please refer to the Administration Manual for instructions on managing software images. Please note that the latest cm-libpam update should also be installed for clusters that do not allow a user to log into compute nodes unless that user has a job running on the node.
# yum --installroot /cm/images/default-image update cmdaemon cm-libpam
[ output removed ]
# yum --installroot /cm/images/compute update cmdaemon cm-libpam
[ output removed ]
# yum --installroot /cm/images/gpu update cmdaemon cuda-dcgm cm-libpam
[ output removed ]
# python hotfix_check.py
Type Hostname CMD Version CMD Action DCGM Version DCGM Action
--------------------------------------------------------------------------------------------------------------------------------------
HeadNode ew-b82-c7u5-08-19 139020_cm8.2_5651b2d10a NO UPGRADE NEEDED NOT INSTALLED
Name Path CMD Version CMD Action DCGM Version DCGM Action Reprovision Nodes
----------------------------------------------------------------------------------------------------------------------------------------------------------------
compute /cm/images/compute 139020_cm8.2_5651b2d10a NO UPGRADE NEEDED N/A NOT INSTALLED node[002-004]
default-image /cm/images/default-image 139020_cm8.2_5651b2d10a NO UPGRADE NEEDED N/A NOT INSTALLED node001
gpu /cm/images/gpu 139020_cm8.2_5651b2d10a NO UPGRADE NEEDED 2.2.9.1-114_cm8 NO UPGRADE NEEDED node005
Here we see that for each software image the script is now instructing us to re-provision the nodes assigned to that image, because it has detected that cmdaemon is still vulnerable on the running nodes. We will reboot the nodes to deploy the updated software image; sites should use their preferred methods to remove nodes from operation and re-image them.
# cmsh -c 'device; foreach -n node001..node005 ( reboot )'
node001: Reboot in progress ...
node002: Reboot in progress ...
node003: Reboot in progress ...
node004: Reboot in progress ...
node005: Reboot in progress ...
After the nodes are re-imaged they will no longer show as needing to be reprovisioned.
# python hotfix_check.py
Type Hostname CMD Version CMD Action DCGM Version DCGM Action
--------------------------------------------------------------------------------------------------------------------------------------
HeadNode ew-b82-c7u5-08-19 139020_cm8.2_5651b2d10a NO UPGRADE NEEDED NOT INSTALLED
Name Path CMD Version CMD Action DCGM Version DCGM Action Reprovision Nodes
----------------------------------------------------------------------------------------------------------------------------------------------------------------
compute /cm/images/compute 139020_cm8.2_5651b2d10a NO UPGRADE NEEDED N/A NOT INSTALLED
default-image /cm/images/default-image 139020_cm8.2_5651b2d10a NO UPGRADE NEEDED N/A NOT INSTALLED
gpu /cm/images/gpu 139020_cm8.2_5651b2d10a NO UPGRADE NEEDED 2.2.9.1-114_cm8 NO UPGRADE NEEDED
At this point your cluster is patched. All nodes and software images show NO UPGRADE NEEDED and/or NOT INSTALLED.
Checking nodes via PDSH on RPM based systems
This works across all versions of Bright Cluster Manager, presuming there is ssh access. This command should work on RHEL, CentOS, and SLES based systems. You can use the same command with cuda-dcgm substituted to check the other package (see the example after the output below).
# pdsh -a -- rpm -q cmdaemon 2> /dev/null | dshbak -c
----------------
node001,ew-bright80
----------------
cmdaemon-8.0-128606_cm8.0_f97a574a9d.x86_64
Here all systems are on version 8.0, release 128606 with hash f97a574a9d. Looking at the table above we can see that this is below the minimum hotfix version and will need to be updated.
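To check cuda-dcgm in the same way, substitute the package name in the command above. The line below is simply the same pdsh check with the package name swapped:
# pdsh -a -- rpm -q cuda-dcgm 2> /dev/null | dshbak -c
Any node reporting a version below 2.2.9 will need the update; nodes where the package is not installed require no action.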
Checking nodes via PDSH on Ubuntu based systems
This works across all versions of Bright Cluster Manager, presuming there is ssh access. This command should work on Ubuntu based systems.
# pdsh -a dpkg-query --show cmdaemon 2> /dev/null | dshbak -c
----------------
node001,ew-82ubuntu
----------------
cmdaemon 8.2-139020-cm8.2-5651b2d10a
Here we find version 8.2, release 139020 with hash 5651b2d10a. This version is up to date and does not need to be patched, since the release (139020) and hash (5651b2d10a) match the minimum hotfix version in the table above.
Checking software images via chroot
This example should work for both rpm and deb based images.
# cmsh -c 'softwareimage; list -f path:0' | ( while read imagepath; do echo -n "$imagepath : "; chroot $imagepath rpm -q cmdaemon 2> /dev/null || chroot $imagepath dpkg-query --show cmdaemon; done )
/cm/images/default-image : cmdaemon-9.1-147456_cm9.1_ae8dc7d7e1.x86_64
/cm/images/default-image-centos8-x86_64 : cmdaemon-9.1-147161_cm9.1_13cd06648a.x86_64
/cm/images/default-image-ubuntu1804-x86_64 : cmdaemon 9.1-147161-cm9.1-13cd06648a
Updating Packages
This is a summary of the official documentation in the Bright Administration Manual. The manual contains detailed instructions covering software management; if you have questions, please refer back to the Administration Manual.
There are three major steps that need to occur when updating these packages.
- Update the head node(s)
- Update the software images
- Re-provision the nodes so they receive the new software packages
NOTE: For data nodes that cannot be re-provisioned, the software packages will need to be installed on the host directly, similar to the procedure for the head node.
Updating the head node
For each head node please use the package manager to update cmdaemon, cuda-dcgm (if applicable) and all dependencies. This process should be repeated for the passive head node if there is one in your environment. It is not necessary to reboot the head node after the update.
# For RHEL based systems
# yum update cmdaemon ( and cuda-dcgm if needed on head node )
# For Ubuntu based systems
# apt update
# apt install cmdaemon ( and cuda-dcgm if needed on head node )
# For SLES based systems
# zypper update cmdaemon ( and cuda-dcgm if needed on head node )
Updating software images
For each software image you will need to update the installed cmdaemon package and then reprovision the affected nodes. There are simpler ways to update systems, but the method below should work in a wide range of situations. Remember that updating the image alone is not enough; you have to redeploy the image to update the running cmdaemon packages. Please note that the latest cm-libpam update should also be installed for clusters that do not allow a user to log into compute nodes unless that user has a job running on the node.
# Enter the image ( 8.2 and higher )
# cm-chroot-sw-img /cm/images/images
# Enter the image ( 7.2 - 8.1 )
# chroot /cm/images/images /bin/bash
# For RHEL based systems
# yum update cmdaemon cm-libpam ( and cuda-dcgm if needed on software image )
# For Ubuntu based systems
# apt update
# apt install cmdaemon cm-libpam ( and cuda-dcgm if needed on software image )
# For SLES based systems
# zypper update cmdaemon cm-libpam ( and cuda-dcgm if needed on software image )
When the software image has been updated, it is then necessary to provision those changes out to the compute nodes in the cluster. This can most easily be achieved with one of these two steps.
- Reboot the node; it should automatically get the new software on sync
- Use the imageupdate -w command in cmsh to manually start a sync, and then restart the cmd and cuda-dcgm services on the node (see the sketch after this list)
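As a sketch, using the same cmsh foreach syntax shown earlier, the sync and service restarts could look like the commands below. The node names are illustrative, and the service names (cmd and cuda-dcgm) are taken from the text above; adjust them to your own node list and only restart cuda-dcgm where it is installed.
# cmsh -c 'device; foreach -n node001..node005 ( imageupdate -w )'
# pdsh -w node[001-005] 'systemctl restart cmd'
# pdsh -w node005 'systemctl restart cuda-dcgm'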
Data nodes
For nodes with persistent data, where a full sync is impossible, you should update the packages locally on the node using the steps in the head node section. This will only survive until the node is reinstalled, so please do not forget to update the software image as well.
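For example, assuming a hypothetical RHEL based data node named datanode01 that cannot be re-provisioned, the local update could be run over pdsh (add cuda-dcgm if it is installed on that node):
# pdsh -w datanode01 'yum update -y cmdaemon cm-libpam'
# pdsh -w datanode01 'systemctl restart cmd'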
Notes and Known issues
Sometimes when updating older releases you may run into issues that are resolved with newer packages. Below are some of the issues you may encounter when updating.
Version 8.2: Transaction error from node-installer-nfsroot
Transaction check error: file /cm/node-installer/usr/share/redhat-release from install of node-installer-nfsroot-8.2-824_cm8.2.x86_64 conflicts with file from package node-installer-nfsroot-8.2-697_cm8.2.x86_64
If you get this error please complete the following steps and then retry the cmdaemon upgrade.
# yum remove node-installer-nfsroot
# yum install node-installer-nfsroot
Version 8.2: cmd fails to start after update
If your cmdaemon will not start please check the following.
# ldd `which cmd` | grep 'not found'
libcrypto.so.1.0.0 => not found
If you get the above results please execute the following command on all head nodes.
# yum update net-snmp-recent
Version 8.x: cmdaemon scriptlet failure / db upgrade failure
If you receive the following scriptlet error, or similar messages from systemctl status cmd, please open a support ticket so that we may assist you in repairing the database. There was a defect in MariaDB which caused this test___ table to end up in a state where it can neither be created nor destroyed. More recent versions of Bright Cluster Manager no longer use this table.
Error: Could not update cmdaemon Debug: FAILED: CREATE TABLE cloned_table_880 LIKE test___