This document describes the procedure for installing the official Nvidia DGX software stack in a Bright Ubuntu 20.04 software image. The instructions in this document target the DGX A100, but the same procedure can be used for other DGX systems such as DGX-1 and DGX-2.
Note: We now publish a pre-made DGX software image, which eliminates the need for these steps. The procedure below is only needed if that image cannot be used. The pre-made image can be obtained from support.
Step 1: Prepare a copy of the software image
This example uses the default software image as the base image. Any other software image can be used for this purpose.
Clone the base software image to an image called dgxa100-image.
# Create a clone of the default Bright software image
$ cmsh
softwareimage
clone default-image dgxa100-image
commit
# Important: Remove the Bright cuda-dcgm/cuda-driver packages from the software image (if installed)
cm-chroot-sw-img /cm/images/dgxa100-image
apt remove cuda-dcgm cuda-driver
exit
Step 2: Assign new image to the node(s) or category
Define a new node, say dgx-a100, and assign the prepared dgxa100-image to it. After the initial ramdisk has been generated, provision the DGX node.
In the example below we will set the software image for an individual node, but if you have many DGX nodes it makes more sense to create a node category, make your DGX nodes part of this category, and set the software image for the category (which will then be inherited by all nodes in the category).
$ cmsh
device use dgx-a100
set softwareimage dgxa100-image
commit
# Wait for ramdisk to be generated
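If you prefer the category-based approach mentioned above, the software image can instead be set on a node category. A minimal sketch, assuming a category named dgx-a100-cat (the name is only an example):
$ cmsh
category
clone default dgx-a100-cat
set softwareimage dgxa100-image
commit
Nodes are then made members of the category by setting their category property (see Step 9).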
Step 3: Provision DGX node
Provision the DGX node dgx-a100.
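One way to do this, assuming BMC power control (for example IPMI) has been configured for the node, is to power it on from cmsh and let it PXE boot and provision:
$ cmsh
device
power on -n dgx-a100
Alternatively, power the node on manually; it will PXE boot and be provisioned with the dgxa100-image software image.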
Step 4: Install DGX software stack
The steps in this section must be performed on the DGX node dgx-a100 provisioned in Step 3. The URLs, names of the repositories, and driver versions in this section are subject to change. Please refer to the official DGX Software Stack Install Guide for the latest information on DGX repositories when required.
Install DGX tools and Nvidia drivers
# Enable NVIDIA repositories
curl https://repo.download.nvidia.com/baseos/ubuntu/focal/dgx-repo-files.tgz | sudo tar xzf - -C /
# Update internal APT database
apt update
# Recommended: Upgrade all software packages with the latest versions
apt upgrade
# Install DGX system tools and configurations
apt install -y dgx-a100-system-configurations dgx-a100-system-tools dgx-a100-system-tools-extra
# Disable the ondemand governor to set the governor to performance mode
systemctl disable ondemand
# Recommended: Disable unattended upgrades
apt purge -y unattended-upgrades
# Install latest kernel
apt install -y linux-generic
# Install NVIDIA CUDA driver
apt install -y nvidia-driver-470-server linux-modules-nvidia-470-server-generic libnvidia-nscq-470 nvidia-modprobe nvidia-fabricmanager-470 datacenter-gpu-manager nv-persistence-mode
# Enable required services
systemctl enable nvidia-fabricmanager nvidia-persistenced nvidia-dcgm
# Install Serial over LAN and NVIDIA System Management tool packages:
apt install -y nvidia-ipmisol nvsm
Step 5: Install Mellanox OFED
# Important: Remove ibsim-utils, ibutils and libumad2sim0 packages if installed
apt remove ibsim-utils ibutils libumad2sim0
# Install the MOFED Driver
apt install -y mlnx-ofed-all nvidia-mlnx-ofed-misc
# Enable and start the openibd service
systemctl enable --now openibd
If the systemctl command above fails, please check the KB article The Mellanox IB kernel modules fail to load with mlnx-ofed-49, mlnx-ofed-50, mlnx-ofed-51, mlnx-ofed-52 for resolution.
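Whether the openibd service and the Mellanox IB kernel modules came up correctly can be checked with standard commands, for example:
# Check the openibd service state and the Mellanox IB kernel modules
systemctl status openibd
lsmod | grep mlx5_ib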
Step 6: Grab image from DGX node
Sync the disk image from the running DGX node onto the software image dgxa100-image on the head node.
$ cmsh
device use dgx-a100
grabimage -w
To prevent the system from being suspended after inactivity, the sleep, suspend, and hibernate systemd targets in the software image must be masked as follows:
$ cm-chroot-sw-img /cm/images/dgxa100-image
systemctl mask sleep.target suspend.target hibernate.target
exit
Step 7: Add nvidia_peermem module
The nvidia_peermem module gets installed as part of Step 5. In order to use GPUDirect over InfiniBand, the module must be configured to be loaded automatically when the system boots. Doing this by adding it to the kernel modules list for the software image is not recommended, because the nvidia_peermem kernel module relies on other kernel modules that may not be loadable from the initrd. Write the following file in the software image on the head node:
echo "nvidia_peermem" > /cm/images/dgxa100-image/etc/modules-load.d/nvidia_peermem.conf
Step 8: Set kernel version and create initial ramdisk
Installation of OS updates in Step 4 can result in a new kernel getting installed (as part of installing the linux-generic package), and the required kernel modules are built against the new kernel. In such a scenario, the kernelversion setting of the software image must be updated as follows:
$ cmsh
softwareimage use dgxa100-image
set kernelversion <TAB> (choose the correct kernel version)
commit
# Wait for ramdisk to be generated
If no additional kernel was installed, then the ramdisk re-creation must be triggered manually as follows:
$ cmsh
softwareimage createramdisk dgxa100-image
Step 9: Provision node(s) with new image
The image is now ready to boot any number of DGX nodes. If you have created a Bright node category, you can configure any node to be part of that category by setting its category property. This will make the nodes inherit the softwareimage setting that you have defined for the category. Alternatively, you can configure the softwareimage property for individual nodes.
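For example, assuming a category named dgx-a100-cat has been created (as sketched in Step 2), a node can be made a member of it as follows:
$ cmsh
device use dgx-a100
set category dgx-a100-cat
commit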
When nodes are powered on, they will be imaged using the Bright software image to which we have added the DGX software stack. The following command can be used to verify that the Nvidia drivers and services are working as expected:
[root@dgx-a100 ~]# dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:07:00.0 |
| | Device UUID: GPU-8cda93c7-c7b3-5bb1-c5ae-d18f14ec21b5 |
+--------+----------------------------------------------------------------------+
| 1 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:0F:00.0 |
| | Device UUID: GPU-4b1d7230-8739-451f-4143-19e35cc34e3b |
+--------+----------------------------------------------------------------------+
| 2 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:47:00.0 |
| | Device UUID: GPU-7d4ef5ea-b2f2-d509-a507-a20f2e656dc7 |
+--------+----------------------------------------------------------------------+
| 3 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:4E:00.0 |
| | Device UUID: GPU-a41e5efb-3df9-9846-8ab7-4adb39f0467f |
+--------+----------------------------------------------------------------------+
| 4 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:87:00.0 |
| | Device UUID: GPU-900945ac-42c9-b6df-e263-9f1477fab578 |
+--------+----------------------------------------------------------------------+
| 5 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:90:00.0 |
| | Device UUID: GPU-5f5041fd-791b-de04-d7fd-510e4c78de3f |
+--------+----------------------------------------------------------------------+
| 6 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:B7:00.0 |
| | Device UUID: GPU-a558bb3a-8f95-0487-2b73-46ec419410bf |
+--------+----------------------------------------------------------------------+
| 7 | Name: A100-SXM4-40GB |
| | PCI Bus ID: 00000000:BD:00.0 |
| | Device UUID: GPU-d2839297-b553-cea3-e2bf-c76eff99279b |
+--------+----------------------------------------------------------------------+
6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 9 |
| 13 |
| 11 |
| 10 |
| 8 |
| 12 |
+-----------+
The node should come up with the correct Mellanox OFED kernel modules and nvidia_peermem kernel module loaded. This can be verified as follows:
root@dgx-a100:~# lsmod | grep nvidia_peermem
nvidia_peermem 16384 0
nvidia 39116800 382 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_uverbs 131072 2 nvidia_peermem,mlx5_ib
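In addition, the NVIDIA services enabled in Step 4 can be checked; each of them should report active:
# Each of the services enabled in Step 4 should report "active"
systemctl is-active nvidia-fabricmanager nvidia-persistenced nvidia-dcgm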