Installing NVIDIA DGX software stack in Bright software images

This document describes the procedure for installing the official NVIDIA DGX software stack in a Bright software image. The instructions target the DGX A100, but the same procedure can be used for other DGX systems such as DGX-1, DGX-2, and DGX Station. The procedure has been validated for the following Linux distributions:

  • Ubuntu 18.04
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 8

Step 1: Prepare a copy of the software image

This example uses the default software image as the base image. Any other software image can be used for this purpose.

Clone the base software image to an image called dgxa100-image.

$ cmsh
  softwareimage
  clone default-image dgxa100-image
  commit

Step 2: Assign new image to the node(s) or category

Define a new node, say dgx-a100, and assign the prepared dgxa100-image to it. After the initial ramdisk has been generated, provision the DGX node.

In the example below we will set the software image for an individual node, but if you have many DGX nodes it makes more sense to create a node category, make your DGX nodes part of this category, and set the software image for the category (which will then be inherited by all nodes in the category).

$ cmsh
  device use dgx-a100;
  set softwareimage dgxa100-image;
  commit

  # Wait for ramdisk to be generated
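
For clusters with many DGX nodes, the category-based approach described above can be scripted. The following is a minimal sketch, not a verified procedure: the category name dgx-a100-cat is hypothetical, and it assumes that a category named default exists to clone from and that cmsh accepts a semicolon-separated command string through its -c option. The helper only builds and prints the command string so that it can be reviewed before it is run.

```shell
# Build the cmsh command string that creates a category carrying the DGX
# image and moves the given nodes into it.
# $1: category name (hypothetical), $2: software image, remaining: node names
dgx_category_cmds() {
    category=$1; image=$2; shift 2
    cmds="category; clone default $category; set softwareimage $image; commit"
    for node in "$@"; do
        cmds="$cmds; device use $node; set category $category; commit"
    done
    printf '%s\n' "$cmds"
}

# Print the commands for review.
dgx_category_cmds dgx-a100-cat dgxa100-image dgx-a100

# To apply them on the head node (assumption: run as root):
#   cmsh -c "$(dgx_category_cmds dgx-a100-cat dgxa100-image dgx-a100)"
```

Printing the string first keeps the helper free of side effects; nodes added to the category later inherit the softwareimage setting without further per-node changes.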

Step 3: Install DGX software stack

In the RHEL7 and RHEL8 examples below we will install the DGX stack into the software image on the head node, and hence all operations are performed on the head node. However, the same steps can also be performed directly on the DGX node if the node has already been provisioned with the image created in Step 1. When installing directly on a DGX node, the --installroot option should not be used, and Step 4 (grab image) must be completed to capture the changes on the node back in the image.
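
The difference between the two install targets comes down to whether --installroot is passed. A small sketch of that choice follows; the helper name dgx_pkg_prefix is made up for illustration and is not part of Bright or the DGX tooling.

```shell
# Print the package-manager prefix for a given install target.
# $1: package manager (yum or dnf)
# $2: software image root on the head node, or empty when running
#     directly on a provisioned DGX node
dgx_pkg_prefix() {
    if [ -n "$2" ]; then
        printf '%s --installroot %s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

dgx_pkg_prefix yum /cm/images/dgxa100-image   # installing into the image
dgx_pkg_prefix yum ""                         # installing on the node itself
```

When the empty form is used on the node, the changes exist only on that node until Step 4 has been run.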


RHEL 7

Enable repositories

# Enable additional RHEL7 repositories
subscription-manager repos --enable=rhel-7-server-extras-rpms
subscription-manager repos --enable=rhel-7-server-optional-rpms

# Enable DGX repositories in software image
yum --installroot /cm/images/dgxa100-image install -y https://international.download.nvidia.com/dgx/repos/rhel-files/dgx-repo-setup-20.03-1.el7.x86_64.rpm

# Enable the DGX driver/CUDA repository in the software image
yum-config-manager --installroot /cm/images/dgxa100-image -y --enable nvidia-dgx-7-r450-cuda11-0

Note: The URLs, names of the repositories and versions in the above example are subject to change. Please refer to the official DGX RHEL7 Install Guide for latest information on DGX repositories.

Install DGX tools and Nvidia drivers

# Remove conflicting Bright cuda-dcgm/cuda-driver packages (if installed)
yum --installroot /cm/images/dgxa100-image remove cuda-dcgm cuda-driver

# Install DGX tools
yum --installroot /cm/images/dgxa100-image groupinstall -y 'DGX A100 Configurations'

# Install Nvidia drivers
yum --installroot /cm/images/dgxa100-image install -y cuda-drivers dgx-persistence-mode


RHEL 8

Enable repositories

# Enable additional RHEL8 repositories
subscription-manager repos --enable=rhel-8-for-x86_64-appstream-rpms
subscription-manager repos --enable=rhel-8-for-x86_64-baseos-rpms
subscription-manager repos --enable=codeready-builder-for-rhel-8-x86_64-

# Enable DGX repositories in software image
dnf --installroot /cm/images/dgxa100-image install -y https://repo.download.nvidia.com/baseos/el/el-files/8/nvidia-repo-setup-20.11-0.el8.x86_64.rpm

# Install Nvidia RPM GPG keys on head node
cp /cm/images/dgxa100-image/etc/pki/rpm-gpg/RPM-GPG-KEY-cuda /etc/pki/rpm-gpg/
cp /cm/images/dgxa100-image/etc/pki/rpm-gpg/RPM-GPG-KEY-dgx-cosmos-support /etc/pki/rpm-gpg/

Note: The URLs, names of the repositories and versions in the above example are subject to change. Please refer to the official DGX RHEL8 Install Guide for latest information on DGX repositories.

Install DGX tools and Nvidia drivers

# Remove conflicting Bright cuda-dcgm/cuda-driver packages (if installed)
dnf --installroot /cm/images/dgxa100-image remove cuda-dcgm cuda-driver

# Install DGX tools
dnf --installroot /cm/images/dgxa100-image groupinstall -y 'DGX A100 Configurations'

# Install Nvidia drivers
dnf --installroot /cm/images/dgxa100-image module install -y nvidia-driver:450/fm
dnf --installroot /cm/images/dgxa100-image install -y nv-persistence-mode nvidia-fm-enable

UBUNTU 18.04

Uninstall conflicting distro packages from cloned software image

cm-chroot-sw-img /cm/images/dgxa100-image
apt remove libumad2sim0 ibsim-utils ibutils

Install Ansible and Git

# On an Ubuntu 18.04 host (can be a Bright head node)
apt-add-repository ppa:ansible/ansible
apt install ansible git

Run Ansible playbook

Clone the DeepOps repository and set up the environment. Add the hostname of the DGX node (dgx-a100 in the example below) to the Ansible inventory, then run the playbook to deploy the DGX software stack on the node.

# Clone repository and prepare environment
git clone https://github.com/NVIDIA/deepops.git
cd deepops
git checkout df188eb7083a89ebce8fa8bc2bf24a7a9dcb6acd

# Update inventory and config
cp config/inventory{,.bk}
echo dgx-a100 > config/inventory
sed -i.bak 's/nvidia_driver_skip_reboot: false/nvidia_driver_skip_reboot: true/g' roles/nvidia-dgx/defaults/main.yml

# Run nvidia-dgx playbook
ansible-playbook -vvv playbooks/nvidia-dgx.yml
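
When more than one DGX node is targeted, the inventory can list one hostname per line. The helper below is a sketch of that step; the function name write_inventory is made up for illustration, and it assumes a plain ungrouped inventory is sufficient for the playbook, as in the single-node example above.

```shell
# Write one hostname per line to an Ansible inventory file, keeping a
# backup (<file>.bk) of any existing file.
# $1: inventory path, remaining: hostnames
write_inventory() {
    inv=$1; shift
    # Refuse to write an empty inventory.
    [ "$#" -gt 0 ] || return 1
    if [ -f "$inv" ]; then
        cp "$inv" "$inv.bk"
    fi
    printf '%s\n' "$@" > "$inv"
}

# Usage from the deepops checkout (the second hostname is hypothetical):
#   write_inventory config/inventory dgx-a100 dgx-a100-02
```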

Step 4: Grab image from DGX node

Important: This step must be performed only when the DGX software stack was installed on the node directly in the previous step.

Sync the disk image from the running DGX node onto the software image dgxa100-image on the head node.

$ cmsh
  device use dgx-a100
  grabimage -w
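
Because grabimage -w overwrites content in the software image, it can be worth previewing the operation first. The sketch below assumes that grabimage without -w performs a dry run, which is how the Bright administrator manual describes it; verify this against the manual for your Bright version. The helper name grab_cmds is made up for illustration and only prints the command string.

```shell
# Build the cmsh command string for grabbing an image from a node.
# $1: node name
# $2: "write" to include -w; anything else yields the assumed dry-run form
grab_cmds() {
    if [ "$2" = "write" ]; then
        printf 'device use %s; grabimage -w\n' "$1"
    else
        printf 'device use %s; grabimage\n' "$1"
    fi
}

grab_cmds dgx-a100          # preview form
grab_cmds dgx-a100 write    # form that writes to the image

# To run on the head node:
#   cmsh -c "$(grab_cmds dgx-a100 write)"
```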

Step 5: Check DCGM compatibility

The version of DCGM provided by the Nvidia datacenter-gpu-manager package can differ from the one provided by Bright. For example, at the time of writing, Bright 9.1 GPU monitoring and configuration integrates with DCGM 2.0.10. If the version provided by Nvidia is not API compatible with this version, it must be replaced with the Bright package cuda-dcgm.

cm-chroot-sw-img /cm/images/dgxa100-image
apt remove datacenter-gpu-manager
apt install cuda-dcgm

If the version of DCGM provided by Nvidia datacenter-gpu-manager is compatible with Bright GPU monitoring, the Bright package does not need to be installed. It is sufficient to disable cuda-dcgm service management in Bright (to suppress error notifications).

$ cmsh
  device services dgx-a100
  use cuda-dcgm
  set monitored no
  set autostart no
  commit
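
As a first screening step when choosing between the two options above, the installed DCGM version can be compared with the version Bright integrates with. This article does not define what counts as API compatible, so the sketch below assumes, purely for illustration, that a matching major version is worth checking first. It is not a substitute for the compatibility information published by Bright or NVIDIA.

```shell
# Print the major version of a dotted version string, ignoring a
# Debian-style epoch prefix such as "1:" if one is present.
dcgm_major() {
    v=${1#*:}
    printf '%s\n' "${v%%.*}"
}

# Succeed if both version strings share the same major version.
# ASSUMPTION: "same major version" is only a rough screening heuristic.
dcgm_same_major() {
    [ "$(dcgm_major "$1")" = "$(dcgm_major "$2")" ]
}

# 2.0.10 is the version named in this article; the second value is a
# made-up example of what a package query might report.
if dcgm_same_major 2.0.10 "1:2.0.13"; then
    echo "major versions match"
else
    echo "major versions differ"
fi
```

On Ubuntu, the installed package version could be read with dpkg-query -W -f='${Version}' datacenter-gpu-manager inside the image chroot.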

Step 6: Check CUDA driver compatibility

The nvidia-driver package and the dependencies that get installed from the NVIDIA repositories can be outdated. As a result, utilities such as nvidia-smi may not work on DGX A100 systems.

Solution: Replace the nvidia-driver package and dependencies with the Bright package cuda-driver.

cm-chroot-sw-img /cm/images/dgxa100-image
yum remove nvidia-driver
yum install cuda-driver
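
Whether the replacement is needed can be judged on a provisioned DGX node by checking that nvidia-smi runs successfully. The wrapper below is a minimal sketch; the function name driver_check is made up for illustration, and it treats any non-zero exit status, including a missing binary, as a failure.

```shell
# Run a driver query command and report whether it succeeded.
# $@: the command to run, for example nvidia-smi on a DGX node
driver_check() {
    if "$@" >/dev/null 2>&1; then
        echo "driver OK"
    else
        echo "driver check failed"
        return 1
    fi
}

# Usage on the DGX node:
#   driver_check nvidia-smi || echo "consider the Bright cuda-driver package"
```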

Step 7: Provision node(s) with new image

The image is now ready to boot any number of DGX nodes. If you have created a Bright node category, you can configure any node to be part of that category by setting its category property. This will make the nodes inherit the softwareimage setting that you have defined for the category. Alternatively, you can configure the softwareimage property for individual nodes.

When nodes are powered on, they will be imaged using the Bright software image to which we have added the DGX software stack.

Step 8: Install Mellanox OFED (Optional)

Bright provides a package that lets you conveniently deploy the Mellanox OFED stack to your software images. At the time of writing, Mellanox OFED 5.1 is the recommended version to be used with the DGX software stack (although newer versions may be available).

The mlnx-ofed51 Bright package should be installed to the head node of a cluster (even though this machine may not even have any IB interface). Afterwards, an installation script should be invoked to add the Mellanox OFED stack to the software image.

yum install mlnx-ofed51
/cm/local/apps/mlnx-ofed51/current/bin/mlnx-ofed51-install.sh -s dgxa100-image

Step 9: Install nv_peer_mem kernel module (Optional)

To use GPUDirect over InfiniBand, the nv_peer_mem kernel module must be installed. After installing the Mellanox OFED stack (see the previous step), the nv_peer_mem module can be built on one of the DGX nodes and the resulting kernel modules can be grabbed back to the image.

First, make sure the dkms package is part of the software image:

yum --installroot=/cm/images/dgxa100-image install dkms

Then boot a DGX node using this software image and issue the following command on the DGX node:

yum install nvidia-peer-memory-dkms

The post-install scriptlet that is part of the RPM will take care of building the kernel module with dkms against the running kernel. The changes now need to be synchronized back to the software image. This can be done with cmsh using:

$ cmsh
  device use dgx-a100;
  grabimage -w

Lastly, schedule the nv_peer_mem kernel module to be loaded automatically when the system boots. Adding it to the kernel modules list for the software image is not recommended, because nv_peer_mem relies on other kernel modules that may not be loadable from the initrd. Instead, list it in a modules-load.d configuration file inside the image:

echo "nv_peer_mem" > /cm/images/dgxa100-image/etc/modules-load.d/nv_peer_mem.conf
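
The same step can be wrapped in a small helper that also creates the modules-load.d directory if it is missing and can be run repeatedly with the same result. This is a sketch; the function name enable_module_at_boot is made up for illustration.

```shell
# Register a kernel module for loading at boot inside a software image.
# $1: image root directory, $2: module name
enable_module_at_boot() {
    dir="$1/etc/modules-load.d"
    # Overwrites any existing <module>.conf, so repeated runs are harmless.
    mkdir -p "$dir" && printf '%s\n' "$2" > "$dir/$2.conf"
}

# Usage on the head node:
#   enable_module_at_boot /cm/images/dgxa100-image nv_peer_mem
```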

When the DGX nodes that are set to use the software image are rebooted, they should come up with the nv_peer_mem kernel module loaded. This can be verified as follows:

[root@dgx-01 ~]# lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
nvidia              19378176  449 nvidia_uvm,nv_peer_mem,nvidia_modeset
ib_core               425984  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
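
Note that grep also matches the nvidia and ib_core lines, because nv_peer_mem appears in their lists of dependent modules. For a scripted check, matching only the first column is stricter. The following sketch reads lsmod-style output from standard input; the function name module_loaded is made up for illustration.

```shell
# Succeed only if the named module appears in the first column of
# lsmod-style output read from standard input.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Demonstration with two lines taken from the output shown above.
sample='nv_peer_mem            16384  0
nvidia              19378176  449 nvidia_uvm,nv_peer_mem,nvidia_modeset'
printf '%s\n' "$sample" | module_loaded nv_peer_mem && echo "nv_peer_mem is loaded"

# Usage on a DGX node:
#   lsmod | module_loaded nv_peer_mem && echo "nv_peer_mem is loaded"
```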

Updated on March 3, 2021
