Upgrading Slurm

This article covers the steps needed to upgrade the Bright-provided Slurm packages to a newer major version, along with the caveats and warnings relevant to a successful migration. These instructions have been tested on Bright versions 9.0, 9.1, 9.2, and BCM10. They retain all Bright settings, queues, roles, and configurations, so the upgrade should require minimal, if any, configuration changes.

NOTE: Slurm under BCM11 is restructured and the upgrade procedure is streamlined. Do not use these instructions for BCM11.

This article is not applicable if the site already has the correct major version of Slurm and only needs a minor release update. To upgrade from slurm20.11.7 to slurm20.11.8, it is enough to run yum update slurm20.11\* on the head node and in the software images.
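Whether a planned change is only a minor release update can be checked by comparing the major part of the two version strings. The following is a minimal sketch, assuming versions are written as YY.MM.patch as in the example above. The function name is illustrative and is not part of Slurm or Bright.

```shell
# Succeeds (exit status 0) when both versions belong to the same major
# release, e.g. 20.11.7 and 20.11.8.
# Assumption: both arguments have the three-field form YY.MM.patch.
is_minor_update() {
    local current=$1 candidate=$2
    # Strip the trailing ".patch" field to obtain the major release (YY.MM).
    [ "${current%.*}" = "${candidate%.*}" ]
}
```

For example, `is_minor_update 20.11.7 20.11.8 && echo "a yum update is sufficient"` prints the message, while comparing 20.11.8 with 21.08.8 does not.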

This article presumes that the slurm controller and slurm database services are provided by the head node(s). If either the slurm controller or slurm database services are run on nodes controlled by the software images then these procedures may need to be altered.

IMPORTANT NOTE: With the recent release of serious CVEs against SLURM ( CVE-2022-29500, 29501, 2950 ), SchedMD will only be publishing fixed releases of SLURM 20.11 and 21.08. Users are strongly encouraged to update to 21.08.8, 20.11.9 or newer. Bright is working to provide those packages as quickly as possible.

IMPORTANT NOTE: If Slurm is using Pyxis, then upgrading the Slurm version means that Pyxis needs to be reinstalled after the upgrade using cm-wlm-setup. The reinstallation run for Pyxis compiles Pyxis and recreates a plugin directory for the new Slurm version under /cm/shared/apps/slurm/.

Pre-requisites

The base operating system and all packages are up-to-date. With the complexity of Bright Cluster Manager, testing is performed against up-to-date systems.
Bright Cluster Manager packages have been updated
The current version of SLURM is known ( see below )
The current release of cmdaemon is known ( see below )
There is sufficient space in /tmp for slurm to alter the accounting tables.
Appropriate access to the desired slurm packages.
Knowledge of whether slurm.conf and/or other Slurm settings are “Frozen”

## update all
# yum update
## update Bright Packages
# yum update --disablerepo='*' --enablerepo='cm*'
## Get Slurm Version 
# scontrol version
slurm 20.11.8
## Get Bright Version
# cmsh -c 'main; versioninfo' | grep Cluster
Cluster Manager          9.0
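The free-space prerequisite for /tmp can be checked in a similar way. The sketch below uses the standard df utility. The threshold passed to has_free_kb is a site-specific assumption, since this article does not state how much space the accounting table changes need, and both function names are illustrative.

```shell
# Print the space available on the filesystem holding DIR, in kilobytes.
# "df -P" prints one POSIX-format record per filesystem; field 4 is the
# available space, and -k selects 1024-byte units.
free_kb() {
    df -Pk "${1:-/tmp}" | awk 'NR == 2 { print $4 }'
}

# Succeed when DIR has at least REQUIRED_KB kilobytes available.
# REQUIRED_KB is a hypothetical, site-chosen value.
has_free_kb() {
    local dir=$1 required_kb=$2
    [ "$(free_kb "$dir")" -ge "$required_kb" ]
}
```

A call such as `has_free_kb /tmp 1048576 || echo "less than 1 GiB free in /tmp"` could be run on the node that hosts slurmdbd before the upgrade starts.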

Slurm Release Schedules and Stepping up

Slurm allows sites to skip one release when upgrading. Skipping more than one release risks the loss of Slurm state and the Slurm job accounting database. To reach a target version without loss of accounting or job information, a cluster may therefore need to upgrade to intermediate versions first.

NOTE: Due to a SLURM bug you cannot upgrade more than 2 versions if there are pending or running jobs.

Slurm can be upgraded from version 20.02 or 20.11 to version 21.08 without loss of jobs or other state information. Upgrading directly from an earlier version of Slurm will result in loss of state information.
https://slurm.schedmd.com/news.html

The releases that Bright supports in version 8.1 and higher are as follows. Not every version is available for older releases of BCM; please check the package dashboard for released versions.

Slurm 17.11 — must upgrade to 18.08 before upgrading further
Slurm 18.08 — must upgrade to 20.02 first before upgrading further
Slurm 19.05 — must upgrade to 20.11 before upgrading further
Slurm 20.02 — must upgrade to 21.08 before upgrading further
Slurm 20.11 — must upgrade to 22.05 before upgrading further
Slurm 21.08 — must upgrade to 23.02 before upgrading further
Slurm 22.05 — must upgrade to 23.11 before upgrading further
Slurm 23.02 — Bright 9.0-20, 9.1-17, 9.2-11 or higher only
Slurm 23.11 — as of December 2024, this branch is supported by SchedMD – Bright 9.0-21, 9.1-18, 9.2-16, and 10.24.03 and higher only.
Slurm 24.05 — as of December 2024 this branch is supported by SchedMD — Bright 9.2-17, BCM 10.24.07 and higher only.

If the cluster is running Slurm 20.02 or older, pay close attention to which version may be upgraded directly. Skipping too many versions may result in the loss of stored accounting data and state information of the cluster.
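The stepping rules listed above can be turned into a small helper that prints the sequence of major versions to pass through. This is a sketch under stated assumptions: it encodes only the "must upgrade to" entries from the list above, it takes major versions in YY.MM form, it does not check whether a package for a given version exists for the installed BCM release, and the function names are illustrative.

```shell
# Furthest release reachable in one step, taken from the list above.
max_hop() {
    case "$1" in
        17.11) echo 18.08 ;;
        18.08) echo 20.02 ;;
        19.05) echo 20.11 ;;
        20.02) echo 21.08 ;;
        20.11) echo 22.05 ;;
        21.08) echo 23.02 ;;
        22.05) echo 23.11 ;;
        *) return 1 ;;
    esac
}

# Turn "YY.MM" into an integer that sorts in release order (20.02 -> 2002).
ver_num() {
    printf '%s\n' "$1" | tr -d '.'
}

# Print the major versions to step through, from CURRENT to TARGET.
upgrade_path() {
    local cur=$1 target=$2 path=$1 hop
    while [ "$(ver_num "$cur")" -lt "$(ver_num "$target")" ]; do
        if ! hop=$(max_hop "$cur"); then
            echo "no upgrade rule listed for $cur" >&2
            return 1
        fi
        # Go straight to the target once it is within reach of one step.
        if [ "$(ver_num "$target")" -le "$(ver_num "$hop")" ]; then
            hop=$target
        fi
        path="$path $hop"
        cur=$hop
    done
    echo "$path"
}
```

For example, `upgrade_path 18.08 21.08` prints `18.08 20.02 21.08`, a two-step upgrade. The helper always picks the furthest listed release as the intermediate step; other intermediate versions may be equally valid.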

Upgrade Procedure

This procedure is documented in the Administration manual for 9.0 and higher. The steps are as follows with full examples enumerated below.

It is recommended that no jobs are running. Draining nodes is one way to accomplish this. If not doing a direct upgrade ( 1 step ) then all pending jobs must be canceled.
Slurm server processes ( slurmctld, slurmdbd ) should be stopped.
The old Slurm packages should then be removed.
The new packages can then be installed.
The new Slurm version is then set in cmsh or Bright View, in the Slurm WLM cluster configuration.
Slurm server services slurmctld and slurmdbd should then be started again using cmsh or Bright View.
The compute resources should then be updated with the new Slurm software.
The cluster can then be undrained.

NOTE: If this upgrade is only an intermediate step toward a newer version, the steps for installing the clients/software images may be skipped (for example, if the cluster is being upgraded from 18.08 to 20.02 and an upgrade to 21.08 is planned immediately afterwards). It is critical that slurmdbd and slurmctld are allowed to run so that all database and state information is updated before the next upgrade is started.

1. Quiet Cluster – Draining cluster and stopping jobs

The cluster should not have jobs running during the upgrade as the shared binaries have to be removed and upgraded. Draining all slurm clients can be achieved with the following cmsh command.

# cmsh
% device
% drain -l slurmclient

Confirm that there are no running or completing jobs. Queued jobs may be left to wait.

# squeue -t R,CG
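For scripted checks, the squeue output can be reduced to a job count. The sketch below assumes squeue is run with -h (no header), so that each output line is one job. The function name is illustrative.

```shell
# Count the non-empty lines read from standard input.
# Intended for "squeue -h" output, where every line describes one job.
count_jobs() {
    awk 'NF { n++ } END { print n + 0 }'
}
```

With this helper, `squeue -h -t R,CG | count_jobs` prints 0 once no running or completing jobs remain.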

Alternatively, all running jobs may be canceled with the following Slurm command. For a multi-step upgrade this is required, and pending jobs must be canceled as well.

# scancel -t R  ## This will cancel all running jobs

2. Stopping the slurm controller and account server

The Slurm controller and accounting server should be stopped. Use the following cmsh commands to shut down all controllers and accounting daemons.

# cmsh
% device
% foreach -l slurmserver ( services; stop slurmctld )
% foreach -l slurmaccounting ( services; stop slurmdbd )

3. Removing Slurm packages

The old Slurm packages should then be removed. Only one version of Slurm can be installed at a time; the package manager will not allow two versions to be installed simultaneously.

Removal can be carried out from the primary head node on RHEL-based systems with, for example:

[root@bright91 ~]# yum remove 'slurm19*'

If you have multiple head nodes, please remove the slurm packages from both before proceeding to install new packages. Removing the packages from the second head node may throw some error messages as some files may already be removed from /cm/shared.

The old packages must also be removed from each software image that uses them:

[root@bright91 ~]# cm-chroot-sw-img /cm/images/<software image>
...
[root@<software image> /]# yum remove 'slurm19*'
...removal takes place...
[root@<software image> /]# exit

4. Install new Slurm packages

The new packages can then be installed. On a RHEL-based system, the installation might be carried out as follows on the primary head node:

[root@bright91 ~]# yum install slurm20.11 slurm20.11-client slurm20.11-contribs slurm20.11-slurmdbd

The client package can be installed in each software image with, for example:

[root@bright91 ~]# cm-chroot-sw-img /cm/images/<software image>
...
[root@<software image> /]# yum install slurm20.11-client
...installation takes place
[root@<software image> /]# exit
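Because the package names embed the major version, the lists used above can be generated from a single version argument. This sketch assumes the naming pattern shown in the examples above (slurm20.11, slurm20.11-client, and so on) also holds for the target version. The function name and the role keywords are illustrative.

```shell
# Print the package names for a role and a Slurm major version.
#   head  : the four packages installed on the head node in the example
#   image : the client package installed in a software image
slurm_pkgs() {
    local role=$1 ver=$2
    case "$role" in
        head)
            echo "slurm${ver} slurm${ver}-client slurm${ver}-contribs slurm${ver}-slurmdbd"
            ;;
        image)
            echo "slurm${ver}-client"
            ;;
        *)
            echo "unknown role: $role" >&2
            return 1
            ;;
    esac
}
```

On the head node this could be used as `yum install $(slurm_pkgs head 20.11)`, with the command substitution left unquoted so that each package name becomes a separate argument.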

5. Update the WLM version in Bright

NOTE: This section does not apply to Bright Cluster Manager 8.2

The new Slurm version is then set in cmsh or Bright View, in the Slurm WLM cluster configuration:

[root@bright91 ~]# cmsh
[bright91]% wlm use slurm
[bright91->wlm[slurm]]% set version 20.11; commit
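Note that the version set here is the major version only (20.11, not 20.11.8). As a sketch, the value can be derived from the output of scontrol version, which has the form "slurm 20.11.8" as shown in the pre-requisites. The function name is illustrative, and the generated string assumes the WLM cluster is named slurm, as in the example above.

```shell
# Build the cmsh command string that sets the WLM version.
# Input: the line printed by "scontrol version", e.g. "slurm 20.11.8".
wlm_version_cmd() {
    local full major
    full=$(printf '%s\n' "$1" | awk '{ print $2 }')   # e.g. 20.11.8
    major=${full%.*}                                   # e.g. 20.11
    echo "wlm use slurm; set version ${major}; commit"
}
```

The resulting string could be passed to cmsh -c, for example `cmsh -c "$(wlm_version_cmd "$(scontrol version)")"`, assuming scontrol already reports the newly installed version. Running the commands interactively as shown above is equivalent.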

6. Optional: Reinstalling Pyxis enroot.

If Slurm is using Pyxis, then upgrading the Slurm version requires that Pyxis be reinstalled using cm-wlm-setup. The reinstallation run for Pyxis compiles Pyxis and recreates a plugin directory for the new Slurm version under /cm/shared/apps/slurm/

[root@bright91 ~]# cm-wlm-setup --install-pyxis

7. Restart slurm controller and accounting daemon

Slurm server services slurmctld and slurmdbd should then be started again using cmsh or Bright View.

# cmsh
% device
% foreach -l slurmaccounting ( services; start slurmdbd )
% foreach -l slurmserver ( services; start slurmctld )

Please confirm that these services have started successfully before continuing. The slurm.conf may need to be edited if settings have been deprecated and removed from the configuration.
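The start order used here is the reverse of the stop order in step 2: the accounting daemon is started before the controller, and the controller is stopped before the accounting daemon. The sketch below only prints the cmsh command lines in the matching order, so that they can be reviewed before being entered in cmsh device mode. The function name is illustrative.

```shell
# Print the cmsh device-mode commands for stopping or starting the
# Slurm server services, in the order used by this article:
#   stop  : slurmctld first, then slurmdbd
#   start : slurmdbd first, then slurmctld
slurm_service_cmds() {
    case "$1" in
        stop)
            echo "foreach -l slurmserver ( services; stop slurmctld )"
            echo "foreach -l slurmaccounting ( services; stop slurmdbd )"
            ;;
        start)
            echo "foreach -l slurmaccounting ( services; start slurmdbd )"
            echo "foreach -l slurmserver ( services; start slurmctld )"
            ;;
        *)
            echo "usage: slurm_service_cmds stop|start" >&2
            return 1
            ;;
    esac
}
```

Running `slurm_service_cmds start` prints the two lines shown in this step.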

8. Update the compute resources with the new slurm software

The nodes can then have their new image placed on them, and the new Slurm configuration can then be taken up. This can be done in either of the following two ways:

The regular nodes can be restarted so that they come up with the new image and the new Slurm configuration.
The imageupdate command may be used to update the running image, after which executing systemctl daemon-reload will force a re-read of the slurm client. More information about this option can be found in the Administration manual.

9. Undrain the cluster

Presuming the nodes were drained before the process began, they can be undrained now. The operation of the cluster should then be confirmed.

# cmsh
% device
% undrain -l slurmclient

Updated on December 13, 2024