In certain scenarios it can be necessary to disable a GPU on a node, for example when a GPU becomes faulty and a replacement is on its way. In this article we will show two possible ways of disabling an NVIDIA GPU on a compute node.
Method 1
* Collect the UUID of the target GPU by running the following nvidia-smi command on the compute node:
# nvidia-smi -L
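The output should look something like the following; the GPU model and UUID shown here are only an illustration, use the UUID reported on your node:
GPU 0: Tesla V100-PCIE-32GB (UUID: GPU-5d3b0e6a-1c2f-4a8e-9b7d-0123456789ab)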
* Create /etc/modprobe.d/nvidia.conf in the software image assigned to the node and add the following line:
options nvidia NVreg_ExcludedGpus=<the collected GPU UUID>
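Software images typically reside under /cm/images/<image name> on the head node, so assuming an image called default-image (the name and path are just an example), the file to edit from the head node would be /cm/images/default-image/etc/modprobe.d/nvidia.conf, and with the example UUID from above the line would read:
options nvidia NVreg_ExcludedGpus=GPU-5d3b0e6a-1c2f-4a8e-9b7d-0123456789ab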
If the software image is also used to provision other nodes, the image should be cloned first, the change should be made in the cloned image, and the cloned image should then be assigned to the target node (the one whose GPU needs to be disabled); a possible cmsh sequence is sketched below. Cloning software images is covered in the admin manual.
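The following is only a sketch of such a cmsh sequence; the image and node names are examples, and the exact steps may differ between versions, so the admin manual remains the authoritative reference:
# cmsh -c "softwareimage; clone default-image default-image-gpu-off; commit"
# cmsh -c "device; use node001; set softwareimage default-image-gpu-off; commit"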
* Reboot the node. Once the node is back up, the GPU shouldn’t be listed in the nvidia-smi output.
Method 2
This procedure can be followed if rebooting the node is not desired. Note that the changes made this way do not persist: they are undone when the node is rebooted.
* Collect the PCI bus ID of the GPU:
# lspci | grep controller | grep NVIDIA
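The output will contain a line similar to the following; the bus ID (3b:00.0 in this example) is the part needed for the next step:
3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)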
* Remove the device from the PCI bus:
# echo 1 > /sys/bus/pci/devices/0000:<collected BUS ID>/remove
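Note that the full PCI address is required, i.e. the domain prefix together with the bus, device and function numbers. With the example bus ID from above, the command would be:
# echo 1 > /sys/bus/pci/devices/0000:3b:00.0/remove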
* The GPU should no longer be listed in the nvidia-smi output.
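Since the removal does not persist across reboots, the device can also be brought back without rebooting by triggering a PCI rescan:
# echo 1 > /sys/bus/pci/rescan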
Important note
If the GPU being removed was used to run Workload Manager jobs, then the Workload Manager configuration also needs to be updated. To update the Workload Manager configuration for a single node, the relevant WLM role (e.g. slurmclient) should be assigned to that node, and the assigned role should be configured according to the node and cluster specifications. Role assignment and GPU-related Workload Manager configuration are discussed in the admin manual.
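As an illustration only, assuming Slurm is the workload manager, the GPU count on node001 drops from two to one, and the Slurm files are edited directly rather than generated from the role settings: the node's GRES count in slurm.conf would be lowered, for example
NodeName=node001 Gres=gpu:1 ...
the line for the disabled GPU would be removed from gres.conf, for example
NodeName=node001 Name=gpu File=/dev/nvidia1
and the Slurm services would be restarted afterwards. The node name, GPU counts and device file above are examples; when the configuration is generated from the slurmclient role, the equivalent change is made in the role's GPU settings as described in the admin manual.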