This article describes additional validation tests that can be performed on an HA cluster to confirm proper operation. It should be seen as a supplement to the documentation in the Administration Manual.
Presumptions
For this article, we presume that you have an HA cluster set up, optionally with a WLM configured. Our example cluster uses SLURM. Before testing, confirm that the cluster is healthy:
root@ew-haverify-a:~# cmha status
Node Status: running in active mode
ew-haverify-a* -> ew-haverify-b
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
ew-haverify-b -> ew-haverify-a*
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
root@ew-haverify-a:~# sinfo -la
Mon Oct 21 17:33:28 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
defq* up infinite 1-infinite no NO all 5 idle node[001-005]
root@ew-haverify-a:~# srun hostname
node001
root@ew-haverify-a:~#
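If you want to confirm that every compute node accepts work rather than just the first free one, you can request one task per node. This is a minimal sketch assuming the five-node defq partition shown above; adjust the node count and partition name to match your cluster.

# Run hostname on all five nodes of the example partition;
# each node should appear exactly once in the sorted output.
srun -N5 -p defq hostname | sort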
Testing Manual Failover
Manual failover is the process of forcing the passive head node to become the active head node. This confirms that BCM is appropriately configured and that IP failover works, but it does not confirm that power control or STONITH works. It is initiated with the following command on the passive head node:
cmha makeactive
You can see an example of it in action below.
root@ew-haverify-b:~# cmha makeactive
===========================================================================
This is the passive head node. Please confirm that this node should become
the active head node. After this operation is complete, the HA status of
the head nodes will be as follows:
ew-haverify-b will become active head node (current state: passive)
ew-haverify-a will become passive head node (current state: active)
===========================================================================
Continue(c)/Exit(e)? c
Initiating failover.............................. [ OK ]
ew-haverify-b is now active head node, makeactive successful
root@ew-haverify-b:~# cmha status
Node Status: running in active mode
ew-haverify-b* -> ew-haverify-a
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
ew-haverify-a -> ew-haverify-b*
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
root@ew-haverify-b:~#
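As an additional check, you can verify that the shared IP address has moved to the newly active head node. The address below is a hypothetical placeholder; substitute the failover (shared) IP configured for your cluster.

# 10.141.255.254 is a hypothetical placeholder for your cluster's shared IP;
# after the failover it should be bound to an interface on ew-haverify-b.
ip -br addr | grep 10.141.255.254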
Rebooting and Power Cycling from the HA Head Node
While the alternate head node is now active, you should test both the reboot and power cycle commands from within cmsh to make sure that this head node can power cycle and reboot compute nodes, for example:
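The commands below are a minimal sketch of the kind of cmsh calls meant here, using node001 from the example cluster; adjust the node name to match your own compute nodes.

# Reboot a compute node from the newly active head node.
cmsh -c "device; reboot node001"
# Power cycle the same node; this exercises the configured power control
# (for example BMC/IPMI), which the reboot command alone does not prove.
cmsh -c "device; power reset node001"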
Testing Automated Failover
Automated failover is where we simulate a crash of the active head node and confirm that the passive head node becomes active. This validates that power control and IP takeover are working. To simulate the crash, run the following on the active head node:
echo c > /proc/sysrq-trigger
After executing this command the active head node will crash and should be automatically rebooted. The passive head node should take over the active role and the shared IP address. After making sure the other head node is powered back on, cmha status should return to a normal state within a few minutes.
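While the cluster recovers, you can poll the HA state until it settles; watch is a convenient way to do that. Once both directions report OK, the output should match the healthy state shown below.

# Re-check the HA state every 15 seconds until both head nodes report OK.
watch -n 15 cmha status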
root@ew-haverify-b:~# cmha status
Node Status: running in active mode
ew-haverify-b* -> ew-haverify-a
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
ew-haverify-a -> ew-haverify-b*
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
root@ew-haverify-b:~#
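If the crashed head node does not come back on its own, you can check and drive its power state from the new active head node. This is a sketch using the cmsh power commands and the example head node name; it assumes power control is configured for the head nodes.

# Check whether power control can reach the crashed head node ...
cmsh -c "device; power status ew-haverify-a"
# ... and power it back on if needed.
cmsh -c "device; power on ew-haverify-a"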
WLM Testing
For the following tests we will use the cmsupport user and the following dummy job to load the job scheduler.
#!/bin/bash
# Dummy job: prints a message, holds the node for 10 minutes, then exits.
echo "starting job"
sleep 600
echo "ending job"
We can then load up enough jobs to create a set of running jobs as well as a backlog.
cmsupport@ew-haverify-a:~$ for x in {1..100}; do sbatch test_job.sh ; done
We can now see 100 jobs either running or queued.
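A quick way to confirm the count, assuming the jobs were submitted as the cmsupport user:

# Count this user's jobs; -h suppresses the squeue header line.
squeue -h -u cmsupport | wc -l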
Manual Failover
We can then execute cmha makeactive on the secondary and confirm that jobs are still running and queued.
root@ew-haverify-b:~# cmha makeactive
===========================================================================
This is the passive head node. Please confirm that this node should become
the active head node. After this operation is complete, the HA status of
the head nodes will be as follows:
ew-haverify-b will become active head node (current state: passive)
ew-haverify-a will become passive head node (current state: active)
===========================================================================
Continue(c)/Exit(e)? c
Initiating failover.............................. [ OK ]
ew-haverify-b is now active head node, makeactive successful
root@ew-haverify-b:~# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8 defq test_job cmsuppor PD 0:00 1 (Resources)
9 defq test_job cmsuppor PD 0:00 1 (Priority)
10 defq test_job cmsuppor PD 0:00 1 (Priority)
11 defq test_job cmsuppor PD 0:00 1 (Priority)
...
102 defq test_job cmsuppor PD 0:00 1 (Priority)
103 defq test_job cmsuppor PD 0:00 1 (Priority)
3 defq test_job cmsuppor R 5:52 1 node001
4 defq test_job cmsuppor R 5:14 1 node002
5 defq test_job cmsuppor R 5:14 1 node003
6 defq test_job cmsuppor R 5:14 1 node004
7 defq test_job cmsuppor R 5:14 1 node005
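To confirm that the running jobs survived the failover rather than being restarted, you can inspect one of them; job ID 3 here refers to the example output above.

# JobState should still be RUNNING and RunTime should keep counting up
# from before the failover.
scontrol show job 3 | grep -E "JobState|RunTime"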
Automated Failover
For the next test we will execute echo c > /proc/sysrq-trigger on the active head node while there are jobs running and confirm that jobs continue to run and queue.
NOTE: Because the host crashes, it may take a few minutes for the system to stabilize and for SLURM to show the running and queued jobs, but they should return.
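Once the HA state is back to normal, re-checking the scheduler as the cmsupport user should show the same picture as before the crash. A minimal sketch:

# Partition and nodes should be back up ...
sinfo
# ... and the jobs submitted earlier should still be running or queued.
squeue -u cmsupport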