Suppose the BCM cluster already has Slurm deployed and has a single head node serving as the lone Slurm server (controller). Perhaps you do not have additional hardware for a secondary head node that could be configured for head node high availability and act as a redundant Slurm controller. In that case, you can follow these procedures to make a compute node a Slurm controller.
First, install the Slurm package into the software image that the compute node will use. For example, suppose the cluster is currently running Slurm 23.02 and that the software image is called “slurmctld-image”. You would run these commands on the active head node:
For RHEL/Rocky Linux…
# cm-chroot-sw-img /cm/images/slurmctld-image
% dnf install slurm23.02
% exit
For Ubuntu…
# cm-chroot-sw-img /cm/images/slurmctld-image
% apt update
% apt install slurm23.02
% exit
For SLES…
# cm-chroot-sw-img /cm/images/slurmctld-image
% zypper in slurm23.02
% exit
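Optionally, while still inside the chroot (before the exit step above), you can confirm that the package was installed into the image. A minimal check, using the package name from the commands above:
For RHEL/Rocky Linux or SLES…
% rpm -q slurm23.02
For Ubuntu…
% dpkg -s slurm23.02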
Next, reprovision the node with that image so that the Slurm controller software provided by the package is present on the node. The easiest way to do this is to reboot the node so that it provisions over the network, as sketched below.
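For example, here is a minimal sketch of doing this from the active head node with cmsh, assuming the compute node is node001 (as in the example that follows) and that it has been placed in its own category, named “slurmctld-nodes” here purely for illustration:
# cmsh
% category use slurmctld-nodes   (hypothetical category name; use the category that node001 belongs to)
% set softwareimage slurmctld-image
% commit
% device
% reboot node001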
After the node comes online, assign the slurmserver role to it. For example, suppose the node is named node001 and that an existing configuration overlay called “slurm-server”, which assigns the slurmserver role, is already defined in BCM. The following commands can be run on the active head node:
# cmsh
% configurationoverlay use slurm-server
% set nodes node001
% commit
You should soon see that slurmctld has been started on the new node; one way to verify this is shown below.
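For example, from the active head node you can check the service over SSH (passwordless root SSH to compute nodes is typically already set up on a BCM cluster), or through the services submode in cmsh:
# ssh node001 systemctl status slurmctld
or
# cmsh
% device use node001
% services
% status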