Loading the correct kernel modules
If you are going to use the built-in gigabit Ethernet interface as your internal cluster network between the head node(s) and the DGX nodes, there is nothing special that needs to be done in terms of loading kernel modules. This is because the igb module is already present by default in the software image’s list of kernel modules that are included in the initrd.
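If in doubt, this can be verified by listing the kernel modules configured for the software image (the image name dgxa100-image matches the examples used throughout this document):
[root@mycluster ~]# cmsh
[mycluster]% softwareimage use dgxa100-image
[mycluster->softwareimage[dgxa100-image]]% kernelmodules
[mycluster->softwareimage[dgxa100-image]->kernelmodules]% list | grep igb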
If you are going to use one of the Mellanox interfaces for the internal cluster network, it is important to add the mlx5_core kernel module to your software image. Without this kernel module, the Mellanox interface will not be visible during the node’s PXE booting process. This can be done as follows:
[root@mycluster ~]# cmsh
[mycluster]% softwareimage
[mycluster->softwareimage]% use dgxa100-image
[mycluster->softwareimage[dgxa100-image]]% kernelmodules
[mycluster->softwareimage[dgxa100-image]->kernelmodules]% add mlx5_core
[mycluster->softwareimage[dgxa100-image*]->kernelmodules]% commit
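Once a DGX node has been provisioned with the updated image, a quick sanity check (using standard Linux tools, nothing Bright-specific) is to confirm on the node that the module is loaded and that the Mellanox interfaces are visible:
[root@dgx-node ~]# lsmod | grep mlx5_core
[root@dgx-node ~]# ip link show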
Fixing predictable device names
If you are using one of the Mellanox interfaces on the DGX nodes for the internal cluster network, then depending on which Linux distribution and kernel you are using, the network interface names seen during the node installation phase may deviate from the network interface names seen once the OS is fully booted. In particular, an interface such as enp225s0f0 may initially come up as enp225s0f0np0. It is currently unknown why this happens, but there is a simple workaround, which is to set the following finalize script for your node category:
#!/bin/bash
#
# Finalize script: rename interface configuration files that come up with a
# trailing "npX" suffix (e.g. enp225s0f0np0) back to the expected name
# (e.g. enp225s0f0), and fix the interface name inside each file.

# Default to Red Hat style network scripts; switch to Debian/Ubuntu style
# if /etc/network/interfaces.d exists.
scriptsdir=/etc/sysconfig/network-scripts
if [ -d /etc/network/interfaces.d ]; then
    scriptsdir=/etc/network/interfaces.d
fi

# The freshly installed node filesystem is mounted under /localdisk.
for f in /localdisk$scriptsdir/*np?; do
    [ -e "$f" ] || continue

    # Strip the "npX" suffix from the configuration file name.
    newname=${f%np?}
    mv "$f" "$newname"

    # Strip the "npX" suffix from the interface name inside the file.
    filename=$(basename "$f")
    ifacename=${filename#ifcfg-}
    newifacename=${ifacename%np?}
    sed -i "s/$ifacename/$newifacename/g" "$newname"
done
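One way to attach this finalize script to the node category is through cmsh. The example below assumes the script has been saved on the head node as /tmp/fix-interface-names.sh (a path chosen for illustration) and that the category is named dgxa100, as elsewhere in this document. cmsh reads the file contents in, so these steps must be repeated whenever the script changes:
[root@mycluster ~]# cmsh
[mycluster]% category use dgxa100
[mycluster->category[dgxa100]]% set finalizescript /tmp/fix-interface-names.sh
[mycluster->category[dgxa100*]]% commit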
Preventing false positive ipmihealth failures
After the cluster has been installed, it is fairly common to see healthcheck failures such as this:
Thu Mar 11 09:50:05 2021 [warning] dgx-03: The trigger 'Failing health checks' is active because the measurable 'ipmihealth' is FAIL (37984)
This is not necessarily a problem, and the ipmihealth healthcheck can simply be disabled as follows:
[root@mycluster ~]# cmsh
[mycluster]% monitoring
[mycluster->monitoring]% measurable
[mycluster->monitoring->measurable]% use ipmihealth
[mycluster->monitoring->measurable[ipmihealth]]% set disabled yes
[mycluster->monitoring->measurable*[ipmihealth*]]% commit
[mycluster->monitoring->measurable[ipmihealth]]%
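If desired, the change can be double-checked with show, which should now list the disabled setting as yes:
[mycluster->monitoring->measurable[ipmihealth]]% show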
Installing CUDA
If you will be using containerized workloads exclusively on the DGX cluster, it is not necessary to install CUDA. If you intend to run GPU workloads natively on the nodes (i.e. without using containers), you will have to install the Bright CUDA packages on the head node. CUDA will be installed in the /cm/shared tree, which is available on all of the nodes in the cluster.
At the time of writing, installing the DGX OS software stack installs version 450 of the NVIDIA driver. This version is not compatible with the latest version of CUDA, so it is recommended to install an older version of CUDA.
If you intend to use Bright’s ML packages on your DGX cluster, it is a good idea to check which CUDA versions the packages that you would like to use are available for, keeping in mind that this CUDA version must be compatible with the NVIDIA driver that is installed as part of the DGX OS software stack.
At the time of writing, it is recommended to install CUDA 10.2:
[root@mycluster ~]# yum install cuda10.2-sdk cuda10.2-toolkit
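After installation, the CUDA toolkit lands under the /cm/shared tree and is exposed through environment modules on all nodes. A quick sanity check from a DGX node could look like the following (the exact module name may differ depending on the package version that was installed):
[root@dgx-node ~]# module avail cuda
[root@dgx-node ~]# module load cuda10.2/toolkit
[root@dgx-node ~]# nvcc --version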
Setting an appropriate disk setup
DGX nodes typically have 2 smaller NVME drives and 4 larger NVME drives. The 2 smaller NVME drives tend to be used as the OS drive in RAID1 and the 4 larger NVME drives are typically configured in RAID0 as a scratch drive.
The default Bright disk layout only uses the first NVME drive, so in order to be able to use all 6 drives, the disk needs to be changed. Disk layouts can be set for individual nodes, but it is recommended to set it for a category of nodes.
The disk layout below assumes that the 2 smaller drives are /dev/nvme1n1 and /dev/nvme2n1, and that the 4 larger drives are /dev/nvme0n1, /dev/nvme3n1, /dev/nvme4n1 and /dev/nvme5n1. This can be verified as follows:
[root@dgx-node ~]# fdisk -l | grep Disk | grep /dev/nvme
Disk /dev/nvme4n1: 3.5 TiB, 3840755982336 bytes, 7501476528 sectors
Disk /dev/nvme5n1: 3.5 TiB, 3840755982336 bytes, 7501476528 sectors
Disk /dev/nvme1n1: 1.8 TiB, 1920383410176 bytes, 3750748848 sectors
Disk /dev/nvme2n1: 1.8 TiB, 1920383410176 bytes, 3750748848 sectors
Disk /dev/nvme3n1: 3.5 TiB, 3840755982336 bytes, 7501476528 sectors
Disk /dev/nvme0n1: 3.5 TiB, 3840755982336 bytes, 7501476528 sectors
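Alternatively, lsblk gives a more compact view of the drive names and sizes:
[root@dgx-node ~]# lsblk -d -o NAME,SIZE /dev/nvme?n1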
The disk setup below can be set by saving it to a file (e.g. /tmp/mydisksetup.xml) and then using cmsh to load it. Note that cmsh reads the contents of the file and stores it as an XML value. Therefore, when the file is updated, the following steps will have to be repeated to update the disk setup that will be used for nodes as they boot.
[root@mycluster ~]# cmsh
[mycluster]% category use dgxa100
[mycluster->category[dgxa100]]% set disksetup /tmp/mydisksetup.xml
[mycluster->category[dgxa100*]]% commit
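The stored XML can be reviewed at any time with get:
[mycluster->category[dgxa100]]% get disksetup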
The following disk setup uses RAID1 for the two smaller OS NVMe drives, and RAID0 to create a filesystem that spans the 4 larger NVMe drives and is mounted under /project.
<diskSetup>
  <device>
    <blockdev>/dev/nvme1n1</blockdev>
    <partition id="boot1" partitiontype="esp">
      <size>512M</size>
      <type>linux raid</type>
    </partition>
    <partition id="swap1" partitiontype="esp">
      <size>16G</size>
      <type>linux raid</type>
    </partition>
    <partition id="os1" partitiontype="esp">
      <size>max</size>
      <type>linux raid</type>
    </partition>
  </device>
  <device>
    <blockdev>/dev/nvme2n1</blockdev>
    <partition id="boot2" partitiontype="esp">
      <size>512M</size>
      <type>linux raid</type>
    </partition>
    <partition id="swap2" partitiontype="esp">
      <size>16G</size>
      <type>linux raid</type>
    </partition>
    <partition id="os2" partitiontype="esp">
      <size>max</size>
      <type>linux raid</type>
    </partition>
  </device>
  <device>
    <blockdev>/dev/nvme0n1</blockdev>
    <partition id="project1" partitiontype="esp">
      <size>max</size>
      <type>linux raid</type>
    </partition>
  </device>
  <device>
    <blockdev>/dev/nvme3n1</blockdev>
    <partition id="project2" partitiontype="esp">
      <size>max</size>
      <type>linux raid</type>
    </partition>
  </device>
  <device>
    <blockdev>/dev/nvme4n1</blockdev>
    <partition id="project3" partitiontype="esp">
      <size>max</size>
      <type>linux raid</type>
    </partition>
  </device>
  <device>
    <blockdev>/dev/nvme5n1</blockdev>
    <partition id="project4" partitiontype="esp">
      <size>max</size>
      <type>linux raid</type>
    </partition>
  </device>
  <raid id="boot">
    <member>boot1</member>
    <member>boot2</member>
    <level>1</level>
    <filesystem>ext2</filesystem>
    <mountPoint>/boot</mountPoint>
    <mountOptions>defaults,noatime,nodiratime</mountOptions>
  </raid>
  <raid id="swap">
    <member>swap1</member>
    <member>swap2</member>
    <level>1</level>
    <swap/>
  </raid>
  <raid id="os">
    <member>os1</member>
    <member>os2</member>
    <level>1</level>
    <filesystem>xfs</filesystem>
    <mountPoint>/</mountPoint>
    <mountOptions>defaults,noatime,nodiratime</mountOptions>
  </raid>
  <raid id="project">
    <member>project1</member>
    <member>project2</member>
    <member>project3</member>
    <member>project4</member>
    <level>0</level>
    <filesystem>xfs</filesystem>
    <mountPoint>/project</mountPoint>
    <mountOptions>defaults,noatime,nodiratime</mountOptions>
  </raid>
</diskSetup>
Because this disk layout uses RAID1 and RAID0, it is necessary to schedule the relevant kernel modules to be loaded for the software image. This will cause the kernel modules to be included in the initrd that is loaded when the nodes are booted.
[root@mycluster ~]# cmsh
[mycluster]% softwareimage use dgxa100-image
[mycluster->softwareimage[dgxa100-image]]% kernelmodules
[mycluster->softwareimage[dgxa100-image]->kernelmodules]% add raid0
[mycluster->softwareimage[dgxa100-image*]->kernelmodules]% add raid1
[mycluster->softwareimage[dgxa100-image*]->kernelmodules]% commit
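The configured module list can be reviewed afterwards with list, which should now include raid0 and raid1:
[mycluster->softwareimage[dgxa100-image]->kernelmodules]% list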
Lastly, you will want to make sure that the /project directory exists in the software image.
[root@mycluster ~]# mkdir -p /cm/images/dgxa100-image/project
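Once a DGX node has been reprovisioned with this disk layout, the RAID arrays and mount points can be verified on the node with standard tools:
[root@dgx-node ~]# cat /proc/mdstat
[root@dgx-node ~]# df -h / /boot /project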