Purpose
Setting a node to the CLOSED state is typically done to remove an unhealthy node from the cluster management system. The node itself can still be up, in which case its status is displayed as UP/CLOSED.
However, a node in this state can continue running workload jobs, since workload managers run independently of CMDaemon.
If the workload manager is still running on the node, it continues to handle the jobs there, even though CMDaemon is not aware of the node state until the node is reopened. For this reason, a node is often drained before it is closed.
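The ordering described above (drain first, close afterwards) can be sketched as a small shell helper that only prints the commands for review and does not run them. This is a sketch under assumptions: the function name maintenance_plan is hypothetical, and the close and open commands in cmsh device mode are assumed here, since this article only shows drain and undrain. Check the command forms against your cmsh version before using them.

```shell
#!/bin/sh
# Print, in order, the commands for a drain-then-close maintenance window.
# Nothing is executed; the output is meant to be reviewed by an administrator.
# ASSUMPTION: "close" and "open" are the device-mode commands that set and
# clear the CLOSED state. They are not shown elsewhere in this article.
maintenance_plan() {
    node=$1
    echo "cmsh -c \"device use $node; drain\""    # keep new jobs off the node
    echo "# wait for running jobs to finish"
    echo "cmsh -c \"device use $node; close\""    # assumed command
    echo "# perform the maintenance work"
    echo "cmsh -c \"device use $node; open\""     # assumed command
    echo "cmsh -c \"device use $node; undrain\""  # accept jobs again
}

maintenance_plan node001
```

Printing the plan instead of running it keeps the sketch safe to try on any machine, including one without cmsh installed.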
Other common purposes for draining include:
- Planned maintenance
- Hardware troubleshooting
- Preventing new jobs during system changes
- Isolating problematic nodes
Steps
- Enter device mode in cmsh.
# cmsh
% device
- Select the node that you want to drain via the use command:
% use <node>
% drain
- Alternatively, rather than selecting an individual node, you can drain a group of nodes:
% drain -n <nodes>
- You can also drain a node category if you need to drain a set of nodes:
% drain -c <category>
- And you can drain a configuration overlay, which will drain all nodes in that overlay:
% drain -e <overlay>
- After work on the node or nodes is completed, they can be undrained by running the command:
% undrain
- This command uses the same options as the drain command:
% undrain -n <nodes>
% undrain -c <category>
% undrain -e <overlay>
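For scripted use, the interactive steps above can be wrapped in a non-interactive call. The following is a minimal sketch, not a supported tool: the function names cmsh_drain_args and run_drain and the DRY_RUN variable are hypothetical, and it assumes that cmsh -c accepts semicolon-separated commands such as "device; drain -n <nodes>". Only the -n, -c, and -e selectors from the steps above are accepted.

```shell
#!/bin/sh
# Build the cmsh command string for a drain or undrain request.
#   $1 action:   drain | undrain
#   $2 selector: -n (nodes), -c (category), -e (overlay)
#   $3 target:   node list, category name, or overlay name
cmsh_drain_args() {
    action=$1; selector=$2; target=$3
    case "$action" in
        drain|undrain) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    case "$selector" in
        -n|-c|-e) ;;
        *) echo "unknown selector: $selector" >&2; return 1 ;;
    esac
    printf 'device; %s %s %s\n' "$action" "$selector" "$target"
}

# Run the request through cmsh, or only print it when DRY_RUN=1.
# ASSUMPTION: cmsh is on the PATH of the active head node.
run_drain() {
    args=$(cmsh_drain_args "$@") || return 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "cmsh -c \"$args\""
    else
        cmsh -c "$args"
    fi
}
```

For example, running DRY_RUN=1 and then run_drain drain -c <category> prints the command without contacting CMDaemon, which is a convenient way to check a maintenance script before it touches the cluster.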
Additional Details
You can see a complete list of available options for draining nodes by running the following command on the active head node:
# cmsh -c "device help drain"
For example (output abridged):
Name: drain - Drain jobs (not data) on a set of nodes
Options:
  -n, --nodes <node>
  -g, --group <group>
      Include all nodes that belong to the node group, e.g. testnodes or
      test01,test03
  -c, --category <category>
      Include all nodes that belong to the category, e.g. default or default,gpu
Examples:
  drain                         Drain the current node
  drain node001                 Drain node001
  drain -r rack01               Drain all nodes in rack01
  drain --setactions reboot     Drain the current node, and append reboot when all jobs are completed
  drain --appendactions reboot  Append reboot to existing drain actions for the current node