Installing NVIDIA DGX software stack in Bright RHEL7 software images

This document describes how to install the official NVIDIA DGX software stack in a Bright RHEL7 software image. The instructions target the DGX A100, but the same procedure can be used for other DGX systems such as DGX-1, DGX-2, and DGX Station.

Step 1: Prepare a copy of the software image

This example uses the default software image as the base image. Any other software image can be used for this purpose.

Clone the base software image to an image called dgxa100-image.

# Create a clone of the default Bright software image; add RAID modules to the new image if necessary
$ cmsh
  softwareimage
  clone default-image dgxa100-image
  commit
  kernelmodules
  list | grep raid
  add raid0
  add raid1
  commit

# Important: Remove the Bright cuda-dcgm and cuda-driver packages (if installed)
yum --installroot /cm/images/dgxa100-image remove cuda-dcgm cuda-driver
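
To confirm that the removal worked, the package list of the image can be checked from the head node. The following is a minimal sketch, not part of the official procedure: the helper name leftover_bright_cuda_pkgs is hypothetical, and it assumes the image lives under /cm/images/dgxa100-image as in this article.

```shell
# Read package names (one per line) on stdin and print any Bright
# cuda-dcgm or cuda-driver package that is still present.
# grep -x matches whole lines only, so the NVIDIA cuda-drivers package
# installed later in Step 4 is not reported by mistake.
leftover_bright_cuda_pkgs() {
  grep -Fx -e cuda-dcgm -e cuda-driver
}

# Usage on the head node (not executed here):
#   rpm --root /cm/images/dgxa100-image -qa --qf '%{NAME}\n' | leftover_bright_cuda_pkgs
```

An empty result (and a non-zero exit status from grep) means neither package is left in the image.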

Step 2: Assign new image to the node(s) or category

Define a new node, for example dgx-a100, and assign the prepared dgxa100-image to it. After the initial ramdisk has been generated, provision the DGX node.

In the example below we will set the software image for an individual node, but if you have many DGX nodes it makes more sense to create a node category, make your DGX nodes part of this category, and set the software image for the category (which will then be inherited by all nodes in the category).

$ cmsh
  device use dgx-a100
  set softwareimage dgxa100-image
  commit

  # Wait for ramdisk to be generated
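
For the category-based alternative mentioned above, the cmsh commands could look roughly like the sketch below. It only builds and prints the command string, so it can be reviewed before anything is changed. The category name dgx and the helper name category_cmds are hypothetical, and the sketch assumes that the stock category named default exists and can be cloned.

```shell
# Hypothetical names: the category "dgx" does not appear in the original article.
CATEGORY="dgx"
IMAGE="dgxa100-image"

# Print the cmsh command sequence that clones the default category,
# points it at the DGX image, and moves the given node into it.
category_cmds() {
  node="$1"
  echo "category; clone default $CATEGORY; set softwareimage $IMAGE; commit; device use $node; set category $CATEGORY; commit"
}

# To apply on the head node after reviewing the output (not executed here):
#   cmsh -c "$(category_cmds dgx-a100)"
```

Printing the commands first keeps the step a dry run; the same string can also be typed interactively in cmsh, one command per line.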

Step 3: Provision DGX node

Provision the DGX node dgx-a100, that is, boot it so that it is imaged with dgxa100-image.

Step 4: Install DGX software stack

The steps in this section must be performed on the DGX node dgx-a100 provisioned in Step 3. The URLs, repository names, and package versions in this section are subject to change. Refer to the official DGX RHEL7 Install Guide for the latest information on DGX repositories and on additional RHEL7 repositories that must be enabled.

Enable repositories

# Enable additional RHEL7 repositories
subscription-manager repos --enable=rhel-7-server-extras-rpms
subscription-manager repos --enable=rhel-7-server-optional-rpms

Perform OS update

yum update

Install DGX tools and drivers

# Install DGX repository package
yum install -y https://international.download.nvidia.com/dgx/repos/rhel-files/dgx-repo-setup-21.11-1.el7.x86_64.rpm

# Enable nvidia-dgx-7-r450-cuda11-0 and nvidia-dgx-7-r470-cuda11-4 repositories

yum-config-manager --enable nvidia-dgx-7-r450-cuda11-0
yum-config-manager --enable nvidia-dgx-7-r470-cuda11-4

# Install DGX configuration packages (the glob is quoted so the shell does not expand it)
yum groupinstall --disablerepo='cm*' -y 'DGX A100 Configurations'
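
If a group or package cannot be found, a likely cause is a repository that was not enabled. The sketch below reports which of the expected repositories are missing from a yum repolist listing. The helper name missing_repos is hypothetical, and the parsing assumes the usual RHEL7 listing layout, where the repository ID is the first column and may carry a prefix such as ! or a suffix such as /x86_64.

```shell
# Read "yum repolist enabled" output on stdin and print every repository ID
# given as an argument that does not appear in the listing.
missing_repos() {
  # Keep only the first column, then strip a leading "!" or "*" and any
  # "/arch" suffix so that the bare repository ID remains.
  listing=$(awk '{print $1}' | sed 's/^[!*]*//; s,/.*$,,')
  for repo in "$@"; do
    printf '%s\n' "$listing" | grep -Fx -- "$repo" >/dev/null || echo "$repo"
  done
}

# Usage on the DGX node (not executed here):
#   yum repolist enabled | missing_repos \
#     nvidia-dgx-7-r450-cuda11-0 nvidia-dgx-7-r470-cuda11-4 \
#     rhel-7-server-extras-rpms rhel-7-server-optional-rpms
```

No output means that all listed repositories are enabled.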

Install CUDA drivers

Install the cuda-drivers and dgx-persistence-mode packages.

yum install --disablerepo='cm*' -y cuda-drivers dgx-persistence-mode

On DGX Station A100, the nvidia-conf-xconfig package must also be installed.

yum install -y nvidia-conf-xconfig

Install diagnostic tools

Enable rhel-server-rhscl-7-rpms repository to install the diagnostic tools.

subscription-manager repos --enable=rhel-server-rhscl-7-rpms

Install rh-python36.

yum install -y rh-python36

Now install the DGX System Management package group.

yum groupinstall -y 'DGX System Management'

Optional package installation

Installation of the CUDA toolkit, the NVIDIA Collective Communications Library (NCCL) runtime, the CUDA Deep Neural Network (cuDNN) library runtime, or TensorRT is optional. These can be installed by following the relevant sections of the official DGX Software installation guide for RHEL 7.

Step 5: Grab image from DGX node and reboot

Sync the disk image from the running DGX node onto the software image dgxa100-image on the head node.

$ cmsh
  device use dgx-a100
  grabimage -w

The OS update performed in Step 4 may have installed a newer kernel. If so, the software image must be pointed to the new kernel.

$ cmsh
  softwareimage use dgxa100-image
  set kernelversion <TAB> # select the latest kernel version
  commit

# Wait for ramdisk to be generated
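
When several kernels are offered by tab completion, the most recently installed one can be looked up in the RPM database of the grabbed image. This is a sketch under assumptions: the helper name newest_installed_kernel is hypothetical, and it relies on rpm -q --last listing packages newest first by install time, which after the update in Step 4 should correspond to the updated kernel.

```shell
# Read "rpm -q --last kernel" output on stdin (newest install first) and
# print the version string of the first entry without the "kernel-" prefix.
newest_installed_kernel() {
  awk 'NR == 1 { sub(/^kernel-/, "", $1); print $1 }'
}

# Usage on the head node (not executed here):
#   rpm --root /cm/images/dgxa100-image -q --last kernel | newest_installed_kernel
```

The install time is used instead of sorting the version strings because a plain version sort does not follow RPM version ordering for names such as el7 kernels.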

Finally reboot the node.

$ cmsh
  device reboot dgx-a100

Step 6: Provision node(s) with new image

The image is now ready to boot any number of DGX nodes. If you have created a Bright node category, you can configure any node to be part of that category by setting its category property. This will make the nodes inherit the softwareimage setting that you have defined for the category. Alternatively, you can configure the softwareimage property for individual nodes.

When nodes are powered on, they will be imaged using the Bright software image to which the DGX software stack has been added. The following command can be used to verify that the NVIDIA drivers and services are working as expected:

[root@dgx-a100 ~]# dcgmi discovery -l
8 GPUs found.
| GPU ID | Device Information                                                   |
| 0      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:07:00.0                                         |
|        | Device UUID: GPU-8cda93c7-c7b3-5bb1-c5ae-d18f14ec21b5                |
| 1      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:0F:00.0                                         |
|        | Device UUID: GPU-4b1d7230-8739-451f-4143-19e35cc34e3b                |
| 2      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:47:00.0                                         |
|        | Device UUID: GPU-7d4ef5ea-b2f2-d509-a507-a20f2e656dc7                |
| 3      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:4E:00.0                                         |
|        | Device UUID: GPU-a41e5efb-3df9-9846-8ab7-4adb39f0467f                |
| 4      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:87:00.0                                         |
|        | Device UUID: GPU-900945ac-42c9-b6df-e263-9f1477fab578                |
| 5      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:90:00.0                                         |
|        | Device UUID: GPU-5f5041fd-791b-de04-d7fd-510e4c78de3f                |
| 6      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:B7:00.0                                         |
|        | Device UUID: GPU-a558bb3a-8f95-0487-2b73-46ec419410bf                |
| 7      | Name: A100-SXM4-40GB                                                 |
|        | PCI Bus ID: 00000000:BD:00.0                                         |
|        | Device UUID: GPU-d2839297-b553-cea3-e2bf-c76eff99279b                |
6 NvSwitches found.
| Switch ID |
| 9         |
| 13        |
| 11        |
| 10        |
| 8         |
| 12        |
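
With many nodes, reading this output by eye becomes tedious. The sketch below compares the counts reported by dcgmi discovery -l with expected values. The helper name check_discovery is hypothetical, and the parsing assumes the summary lines look like the ones in the output above ("8 GPUs found." and "6 NvSwitches found.").

```shell
# Read "dcgmi discovery -l" output on stdin and compare the reported GPU and
# NvSwitch counts with the expected values given as arguments.
# Prints OK and returns 0 on a match, prints the counts and returns 1 otherwise.
check_discovery() {
  awk -v gpus="$1" -v switches="$2" '
    / GPUs? found/      { g = $1 }
    / NvSwitches found/ { s = $1 }
    END {
      if (g == gpus && s == switches) { print "OK"; exit 0 }
      printf "MISMATCH: %s GPUs, %s NvSwitches\n", g + 0, s + 0
      exit 1
    }'
}

# Usage on a DGX A100 node (not executed here):
#   dcgmi discovery -l | check_discovery 8 6
```

The expected values 8 and 6 are taken from the DGX A100 output shown above; other DGX models need different numbers.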

Step 7: Install Mellanox OFED (Optional)

This can be performed on any one of the DGX nodes that was provisioned in Step 6. Follow the instructions in the official DGX RHEL7 Install Guide for installing the Mellanox OFED software stack on a DGX server. If the NVIDIA peer memory module package (nvidia-peer-memory-dkms) is installed as part of the OFED software stack installation, do not load the module (nv_peer_mem) at this step; loading the module is handled in the next step.

Step 8: Finalize setup and grab image from DGX node

If Step 7 was performed, the changes need to be synchronized back to the software image and the initial ramdisk must be recreated. This can be done with cmsh using:

$ cmsh
  device use dgx-a100
  grabimage -w

If the nv_peer_mem module was installed, schedule it to be loaded automatically when the system boots. Adding it to the kernel modules list of the software image is not recommended, because the nv_peer_mem kernel module relies on other kernel modules that may not be loadable from the initrd. Instead, write the following file in the software image on the head node:

echo "nv_peer_mem" > /cm/images/dgxa100-image/etc/modules-load.d/nv_peer_mem.conf

Re-create the initial ramdisk for the software image dgxa100-image.

$ cmsh
  softwareimage use dgxa100-image
  createramdisk
  # Wait for ramdisk to be generated

When the DGX nodes that are set to use the software image are rebooted, they should come up with the correct Mellanox OFED kernel modules and the nv_peer_mem kernel module loaded. This can be verified as follows:

[root@dgx-a100 ~]# lsmod | grep nv_peer_mem
nv_peer_mem            13163  0 
nvidia              35378917  370 nv_peer_mem,nvidia_modeset,nvidia_uvm
ib_core               379808  6 ib_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_uverbs,ib_ipoib
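
The same check can be scripted so that it yields an exit status instead of output to read. This is a minimal sketch; the helper name module_loaded is hypothetical, and it only looks at the first column of the lsmod output, so a module that merely appears in another module's "used by" list does not count as loaded.

```shell
# Read lsmod output on stdin and return 0 if the named module is loaded,
# that is, if it appears in the first column of some line; return 1 otherwise.
module_loaded() {
  awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Usage on a DGX node (not executed here):
#   lsmod | module_loaded nv_peer_mem && echo "nv_peer_mem is loaded"
```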
Updated on September 27, 2022
