Installing NVIDIA DGX software stack in Bright Ubuntu 20.04 software images

This document describes the procedure for installing the official Nvidia DGX software stack in a Bright Ubuntu 20.04 software image. The instructions in this document have been tested on DGX A100 systems.

Step 1: Prepare a copy of the software image

This example uses the default software image as the base image. Any other software image can be used for this purpose.

Clone the base software image to an image called dgxa100-image.

# Create a clone of the default Bright software image
$ cmsh
  softwareimage
  clone default-image dgxa100-image
  commit

# Important: Remove the Bright cuda-dcgm/cuda-driver packages from the software image (if installed)
apt --root /cm/images/dgxa100-image remove cuda-dcgm cuda-driver

Step 2: Assign new image to the node(s) or category

Define a new node, say dgx-a100, and assign the prepared dgxa100-image to it. After the initial ramdisk has been generated, provision the DGX node.

In the example below we will set the software image for an individual node, but if you have many DGX nodes it makes more sense to create a node category, make your DGX nodes part of this category, and set the software image for the category (which will then be inherited by all nodes in the category).

$ cmsh
  device use dgx-a100;
  set softwareimage dgxa100-image;
  commit

  # Wait for ramdisk to be generated
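
If you go with the category approach, a minimal cmsh sketch could look like the following (the category name dgx is only an example; adjust it to your setup):

$ cmsh
  # Create a category for the DGX nodes and assign the image to it
  category
  clone default dgx
  commit
  use dgx
  set softwareimage dgxa100-image
  commit

  # Make the DGX node(s) part of the category
  device use dgx-a100
  set category dgx
  commit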

Step 3: Provision DGX node

Provision the DGX node dgx-a100 by booting it so that it is imaged with dgxa100-image.
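
How the first boot is triggered depends on your environment. As a sketch, assuming the node's BMC/IPMI power control has been configured in Bright (otherwise, power the node on manually and let it PXE boot):

$ cmsh
  device use dgx-a100
  # Power on via the configured BMC (run "help power" in device mode for options)
  power on
  # The node status changes from INSTALLING to UP once provisioning completes
  status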

Step 4: Install DGX software stack

Install Ansible and Git

# On an Ubuntu 20.04 host (can be a Bright head node)
apt-add-repository ppa:ansible/ansible
apt install ansible git
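
A quick sanity check that both tools are available before continuing:

# Verify that Ansible and Git were installed
ansible --version
git --version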

Install DGX tools and Nvidia drivers

Clone the DeepOps repository and set up the environment. Add the hostname of the DGX node (dgx-a100 in the example below) to the Ansible inventory and run the playbook to deploy the DGX software stack on the node.

The instructions below use DeepOps release 21.09, which was the latest stable release supporting Ubuntu 20.04 at the time of writing (November 12th, 2021).

# Clone repository and prepare environment
git clone https://github.com/NVIDIA/deepops.git
cd deepops
git checkout 21.09
scripts/setup.sh

# Update inventory and config
cp -a config.example config
cp config/inventory{,.bk}
echo dgx-a100 > config/inventory
sed -i.bak 's/nvidia_driver_skip_reboot: false/nvidia_driver_skip_reboot: true/g' roles/nvidia-dgx/defaults/main.yml

# Run nvidia-dgx playbook
ansible-playbook -vvv playbooks/nvidia-dgx/nvidia-dgx.yml
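
After the playbook finishes, a simple way to confirm that the NVIDIA driver packages ended up on the node (using the example host name dgx-a100) is to query the node's package database:

# Optional sanity check on the freshly deployed node
ssh dgx-a100 "dpkg -l | grep -i nvidia-driver"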

Step 5: Grab image from DGX node

Sync the disk image from the running DGX node onto the software image dgxa100-image on the head node.

$ cmsh
  device use dgx-a100
  grabimage -w

To prevent the system from being suspended after inactivity, the sleep, suspend and hibernate systemd targets in the software image must be masked as follows:

$ cm-chroot-sw-img /cm/images/dgxa100-image
  systemctl mask sleep.target suspend.target hibernate.target
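  # Optional check (still inside the chroot): masked units report "masked"
  systemctl is-enabled sleep.target suspend.target hibernate.target
  # Leave the chroot when done
  exit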

Step 6: Add nv_peer_mem module

The nv_peer_mem module is installed as part of the DGX OS installation. In order to use GPUDirect over InfiniBand, the module must be configured to load automatically when the system boots. Adding it to the kernel modules list of the software image is not recommended, because the nv_peer_mem kernel module depends on other kernel modules that may not be loadable from the initrd. Instead, write the following file in the software image on the head node:

echo "nv_peer_mem" > /cm/images/dgxa100-image/etc/modules-load.d/nv_peer_mem.conf

Step 7: Set kernel version and create initial ramdisk

Installation of DGX OS can result in a new kernel getting installed, and the required kernel modules are built against the new kernel. In such a scenario, the kernelversion setting of the software image must be updated as follows:

$ cmsh
  softwareimage use dgxa100-image
  set kernelversion <TAB> (choose the correct kernel version)
  commit

  # Wait for ramdisk to be generated
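
If you are unsure which kernel versions are present in the image, listing the kernel module directories gives a quick overview (the path assumes the example image name dgxa100-image):

# Kernel versions for which modules are installed in the image
ls /cm/images/dgxa100-image/lib/modules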

If no additional kernel was installed, then the ramdisk re-creation must be triggered manually as follows:

$ cmsh
  softwareimage createramdisk dgxa100-image

Step 8: Provision node(s) with new image

The image is now ready to boot any number of DGX nodes. If you have created a Bright node category, you can configure any node to be part of that category by setting its category property. This will make the nodes inherit the softwareimage setting that you have defined for the category. Alternatively, you can configure the softwareimage property for individual nodes.

When nodes are powered on, they will be imaged using the Bright software image to which we have added the DGX software stack. The following command can be used to verify that the Nvidia drivers and services are working as expected:

[root@dgx-a100 ~]# dcgmi discovery -l
8 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:07:00.0                                         |
|        | Device UUID: GPU-8cda93c7-c7b3-5bb1-c5ae-d18f14ec21b5                |
+--------+----------------------------------------------------------------------+
| 1      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:0F:00.0                                         |
|        | Device UUID: GPU-4b1d7230-8739-451f-4143-19e35cc34e3b                |
+--------+----------------------------------------------------------------------+
| 2      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:47:00.0                                         |
|        | Device UUID: GPU-7d4ef5ea-b2f2-d509-a507-a20f2e656dc7                |
+--------+----------------------------------------------------------------------+
| 3      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:4E:00.0                                         |
|        | Device UUID: GPU-a41e5efb-3df9-9846-8ab7-4adb39f0467f                |
+--------+----------------------------------------------------------------------+
| 4      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:87:00.0                                         |
|        | Device UUID: GPU-900945ac-42c9-b6df-e263-9f1477fab578                |
+--------+----------------------------------------------------------------------+
| 5      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:90:00.0                                         |
|        | Device UUID: GPU-5f5041fd-791b-de04-d7fd-510e4c78de3f                |
+--------+----------------------------------------------------------------------+
| 6      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:B7:00.0                                         |
|        | Device UUID: GPU-a558bb3a-8f95-0487-2b73-46ec419410bf                |
+--------+----------------------------------------------------------------------+
| 7      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:BD:00.0                                         |
|        | Device UUID: GPU-d2839297-b553-cea3-e2bf-c76eff99279b                |
+--------+----------------------------------------------------------------------+
6 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
| 9         |
| 13        |
| 11        |
| 10        |
| 8         |
| 12        |
+-----------+

The node should come up with the correct Mellanox OFED kernel modules and nv_peer_mem kernel module loaded. This can be verified as follows:

[root@dgx-a100 ~]# lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
nvidia              19378176  449 nvidia_uvm,nv_peer_mem,nvidia_modeset
ib_core               425984  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm

Step 9: Disable cuda-dcgm service monitoring

By default, Bright will try to monitor the cuda-dcgm service on nodes where Nvidia GPUs have been detected. On DGX A100 nodes the Bright cuda-dcgm package is not installed (it was removed in Step 1), so the service will not be available. To prevent CMDaemon from automatically monitoring the service, do one of the following:

To apply the setting cluster wide, the following advanced config flag must be added to the AdvancedConfig section in /cm/local/apps/cmd/etc/cmd.conf on the active head node:

RoleService.cuda-dcgm=0

After adding the advanced config flag, the CMDaemon service on the head node must be restarted:

systemctl restart cmd

Alternatively, the cuda-dcgm service for each node can be configured not to be monitored. This can be done from cmsh as follows:

$ cmsh
  device services dgx-a100;
  set cuda-dcgm monitored no;
  set cuda-dcgm autostart no;
  commit