This article is being updated. Please be aware that the content herein, including but not limited to version numbers and slight syntax changes, may not match the output from the most recent versions of Bright. This notation will be removed when the content has been updated.
When enabling shared resources in Slurm as per the article here, you may see the following error in /var/log/slurmctld on the head node:
we don't have select plugin type 102
Checking through the logs you may also see:
error: Incomplete job record
fatal: Incomplete job state save file, start with '-i' to ignore this
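To confirm quickly whether a log contains these messages, a small shell helper along the following lines could be used. This is only a sketch: the function name find_state_errors is made up for illustration and is not part of Bright or Slurm, and the default log path is simply the one mentioned above.

```shell
# Hypothetical helper (not a Bright or Slurm tool): print the log lines
# that match the errors described in this article.
# $1 is the log file; it defaults to /var/log/slurmctld as used above.
find_state_errors() {
    log="${1:-/var/log/slurmctld}"
    grep -E "select plugin type|Incomplete job (record|state save file)" "$log"
}
```

Running find_state_errors with no argument on the head node scans the default log. If it prints nothing, the log does not contain any of the three messages.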
Occasionally, when enabling shared resources in Slurm, the job state save file becomes incomplete. To work around this issue, perform the following steps.
First, stop slurmctld in Bright:
# cmsh
% device use master
% services
% stop slurm
% quit
Next, make sure that SelectType and SelectTypeParameters are set in slurm.conf the way you want them to be configured.
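One way to review the values currently in effect is to print the uncommented SelectType lines from the configuration file. The helper below is a sketch under stated assumptions: the name show_select_settings is hypothetical, and the path to slurm.conf has to be supplied by you because its location depends on the installation.

```shell
# Hypothetical helper: show the active (uncommented) SelectType and
# SelectTypeParameters lines from a slurm.conf file.
# $1 is the path to slurm.conf; no default is assumed here.
show_select_settings() {
    grep -iE '^[[:space:]]*SelectType(Parameters)?[[:space:]]*=' "$1"
}
```

For a cluster that has been switched to consumable resources, the output might contain lines such as SelectType=select/cons_res and SelectTypeParameters=CR_Core. These values are illustrative only; use the settings that fit your site.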
Then, start slurmctld by running the following command on your head node:
# /cm/shared/apps/slurm/current/sbin/slurmctld -i
That will tell slurmctld to start while ignoring the incomplete job state save file error.
After that, kill the process for slurmctld:
# killall slurmctld
Then, start slurmctld from Bright again:
# cmsh
% device use master
% services
% start slurm
slurmctld should now start properly using your desired slurm.conf settings.
You may also need to run the scontrol reconfigure command once slurmctld is started to notify the compute nodes.
# scontrol reconfigure
This KB article helped us get our cluster back online after switching SLURM from select/linear to select/cons_res. Thanks for posting.