This article describes additional validation tests that can be performed on an HA cluster to confirm proper operation. It should be seen as a supplement to the documentation in the Administration Manual.
Presumptions
For this article, we presume that you have an HA cluster set up, optionally with a WLM configured. Our example cluster uses SLURM. Before testing, confirm that the cluster is healthy:
root@ew-haverify-a:~# cmha status
Node Status: running in active mode
ew-haverify-a* -> ew-haverify-b
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
ew-haverify-b -> ew-haverify-a*
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
root@ew-haverify-a:~# sinfo -la
Mon Oct 21 17:33:28 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
defq* up infinite 1-infinite no NO all 5 idle node[001-005]
root@ew-haverify-a:~# srun hostname
node001
root@ew-haverify-a:~#
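If you want to confirm that every compute node accepts work rather than just the first free one, you can request one task per node. This is a minimal sketch assuming the five-node defq partition shown above; adjust the node count and partition name to match your cluster.

# Run hostname on all five nodes of the example partition;
# each node should appear exactly once in the sorted output.
srun -N5 -p defq hostname | sort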
Testing Manual Failover
Manual failover is the process of forcing the passive head node to become the active head node. This confirms that BCM is appropriately configured and that IP failover works, but it does not confirm that power control or STONITH works. It is initiated with the following command on the passive head node:
cmha makeactive
You can see an example of it in action below.
root@ew-haverify-b:~# cmha makeactive
===========================================================================
This is the passive head node. Please confirm that this node should become
the active head node. After this operation is complete, the HA status of
the head nodes will be as follows:
ew-haverify-b will become active head node (current state: passive)
ew-haverify-a will become passive head node (current state: active)
===========================================================================
Continue(c)/Exit(e)? c
Initiating failover.............................. [ OK ]
ew-haverify-b is now active head node, makeactive successful
root@ew-haverify-b:~# cmha status
Node Status: running in active mode
ew-haverify-b* -> ew-haverify-a
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
ew-haverify-a -> ew-haverify-b*
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
root@ew-haverify-b:~#
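As an additional check, you can verify that the shared IP address has moved to the newly active head node. The address below is a hypothetical placeholder; substitute the failover (shared) IP configured for your cluster.

# 10.141.255.254 is a hypothetical placeholder for your cluster's shared IP;
# after the failover it should be bound to an interface on ew-haverify-b.
ip -br addr | grep 10.141.255.254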
Rebooting and Power Cycling from the HA Head Node
While the alternate head node is now active, you should test both the reboot and power cycle commands from within cmsh to make sure that this head node can power cycle and reboot compute nodes, for example:
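The commands below are a minimal sketch of the kind of cmsh calls meant here, using node001 from the example cluster; adjust the node name to match your own compute nodes.

# Reboot a compute node from the newly active head node.
cmsh -c "device; reboot node001"
# Power cycle the same node; this exercises the configured power control
# (for example BMC/IPMI), which the reboot command alone does not prove.
cmsh -c "device; power reset node001"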
Testing Automated Failover
Automated failover is where we simulate a crash of the active head node and confirm that the passive head node becomes active. This validates that power control and IP takeover are working. To simulate the crash, run the following on the active head node:
echo c > /proc/sysrq-trigger
After executing this command the active head node will crash and should be automatically rebooted. The passive head node should take over the active role and the shared IP address. After making sure the other head node is powered back on, cmha status should return to a normal state within a few minutes.
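While the cluster recovers, you can poll the HA state until it settles; watch is a convenient way to do that. Once both directions report OK, the output should match the healthy state shown below.

# Re-check the HA state every 15 seconds until both head nodes report OK.
watch -n 15 cmha status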
root@ew-haverify-b:~# cmha status
Node Status: running in active mode
ew-haverify-b* -> ew-haverify-a
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
ew-haverify-a -> ew-haverify-b*
failoverping [ OK ]
mysql [ OK ]
ping [ OK ]
status [ OK ]
root@ew-haverify-b:~#
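If the crashed head node does not come back on its own, you can check and drive its power state from the new active head node. This is a sketch using the cmsh power commands and the example head node name; it assumes power control is configured for the head nodes.

# Check whether power control can reach the crashed head node ...
cmsh -c "device; power status ew-haverify-a"
# ... and power it back on if needed.
cmsh -c "device; power on ew-haverify-a"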
WLM Testing
For the following tests we will use the cmsupport user and the following dummy job to load the job scheduler.
#!/bin/bash
# Dummy job: prints a message, holds the node for 10 minutes, then exits.
echo "starting job"
sleep 600
echo "ending job"
We can then load up enough jobs to create a set of running jobs as well as a backlog.
cmsupport@ew-haverify-a:~$ for x in {1..100}; do sbatch test_job.sh ; done
We can now see 100 jobs either running or queued.
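A quick way to confirm the count, assuming the jobs were submitted as the cmsupport user:

# Count this user's jobs; -h suppresses the squeue header line.
squeue -h -u cmsupport | wc -l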
Manual Failover
We can then execute cmha makeactive on the secondary and confirm that jobs are still running and queued.
root@ew-haverify-b:~# cmha makeactive
===========================================================================
This is the passive head node. Please confirm that this node should become
the active head node. After this operation is complete, the HA status of
the head nodes will be as follows:
ew-haverify-b will become active head node (current state: passive)
ew-haverify-a will become passive head node (current state: active)
===========================================================================
Continue(c)/Exit(e)? c
Initiating failover.............................. [ OK ]
ew-haverify-b is now active head node, makeactive successful
root@ew-haverify-b:~# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8 defq test_job cmsuppor PD 0:00 1 (Resources)
9 defq test_job cmsuppor PD 0:00 1 (Priority)
10 defq test_job cmsuppor PD 0:00 1 (Priority)
11 defq test_job cmsuppor PD 0:00 1 (Priority)
...
102 defq test_job cmsuppor PD 0:00 1 (Priority)
103 defq test_job cmsuppor PD 0:00 1 (Priority)
3 defq test_job cmsuppor R 5:52 1 node001
4 defq test_job cmsuppor R 5:14 1 node002
5 defq test_job cmsuppor R 5:14 1 node003
6 defq test_job cmsuppor R 5:14 1 node004
7 defq test_job cmsuppor R 5:14 1 node005
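To confirm that the running jobs survived the failover rather than being restarted, you can inspect one of them; job ID 3 here refers to the example output above.

# JobState should still be RUNNING and RunTime should keep counting up
# from before the failover.
scontrol show job 3 | grep -E "JobState|RunTime"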
Automated Failover
For the next test we will execute echo c > /proc/sysrq-trigger on the active head node while there are jobs running and confirm that jobs continue to run and queue.
NOTE: Because the host crashes, it may take a few minutes for the system to stabilize and for SLURM to show the running and queued jobs, but they should return.
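Once the HA state is back to normal, re-checking the scheduler as the cmsupport user should show the same picture as before the crash. A minimal sketch:

# Partition and nodes should be back up ...
sinfo
# ... and the jobs submitted earlier should still be running or queued.
squeue -u cmsupport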