Installing NVIDIA DGX software stack in Bright software images

This document describes how to install the official NVIDIA DGX software stack in a Bright software image using the Ansible playbook provided by NVIDIA. The procedure has been validated for the following Linux distributions:

  • Ubuntu 18.04
  • Red Hat Enterprise Linux 7

Install Ansible and Git

# On an Ubuntu 18.04 Bright head node
$ apt-add-repository ppa:ansible/ansible
$ apt install ansible git

# On a RedHat 7 based Bright head node
$ yum install ansible git
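
Before continuing, it can help to confirm that both tools are on the PATH of the head node. The following is a minimal sketch; the helper name missing_tools is hypothetical and is not part of Bright or DeepOps.

```shell
# Print the name of every given command that cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of Bright or DeepOps.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the command cannot be resolved
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools ansible git` prints nothing when both are installed, and otherwise prints the name of each missing tool on its own line.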

Prepare a copy of the default software image

Clone the default software image to an image called dgx-image.

$ cmsh
  softwareimage
  clone default-image dgx-image
  commit

(UBUNTU 18.04) Remove conflicting packages

Uninstall conflicting packages that would interfere with the installation of the DGX software stack.

$ cm-chroot-sw-img /cm/images/dgx-image
$ apt remove libumad2sim0 ibsim-utils ibutils
$ exit

Provision the DGX node

Define a new node, say dgx-a100, and assign the prepared dgx-image to it. After the initial ramdisk has been generated, provision the DGX node.

In the example below we will set the software image for an individual node, but if you have many DGX nodes it makes more sense to create a node category, make your DGX nodes part of this category, and set the software image for the category (which will then be inherited by all nodes in the category).

$ cmsh
  device use dgx-a100
  set softwareimage dgx-image
  commit

  # Wait for ramdisk to be generated
  # Boot DGX node
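
For the category-based approach mentioned above, the following sketch prints the cmsh commands that would create a category and move nodes into it, so they can be reviewed before anything is changed. The helper name dgx_category_cmds and the category name dgx are assumptions made for illustration, and the sketch assumes a category named default exists to clone from.

```shell
# Print the cmsh commands that would create a node category from the
# "default" category, assign a software image to it, and move nodes into it.
# dgx_category_cmds is a hypothetical helper; nothing is executed here.
dgx_category_cmds() {
    category=$1
    image=$2
    shift 2
    # One line that clones the category and sets its software image
    printf 'category; clone default %s; set softwareimage %s; commit\n' "$category" "$image"
    # One line per node that assigns the node to the new category
    for node in "$@"; do
        printf 'device use %s; set category %s; commit\n' "$node" "$category"
    done
}
```

Running `dgx_category_cmds dgx dgx-image dgx-a100` prints two lines. After review, each line can be passed to cmsh with `cmsh -c "<line>"`.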

Deploy DGX software stack

Clone the DeepOps repository and set up the environment. Add the hostname of the DGX node (dgx-a100 in the example below) to the Ansible inventory, then run the playbook to deploy the DGX software stack on the node.

# Clone repository and prepare environment
$ git clone https://github.com/NVIDIA/deepops.git
$ cd deepops
$ git checkout df188eb7083a89ebce8fa8bc2bf24a7a9dcb6acd
$ scripts/setup.sh

# Update inventory and config
$ cp config/inventory{,.bk}
$ echo dgx-a100 > config/inventory
$ sed -i.bak 's/nvidia_driver_skip_reboot: false/nvidia_driver_skip_reboot: true/g' roles/nvidia-dgx/defaults/main.yml

# Run nvidia-dgx playbook
$ ansible-playbook -vvv playbooks/nvidia-dgx.yml
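
With several DGX nodes, the inventory can list one hostname per line. It is also worth confirming that the sed substitution actually changed the setting before running the playbook, because sed exits successfully even when nothing matches. The following is a minimal sketch; the helper names write_inventory and skip_reboot_enabled are hypothetical, and the flat one-host-per-line inventory simply mirrors the single-host example above.

```shell
# Write one hostname per line to an inventory file, replacing its contents.
# write_inventory is a hypothetical helper, not part of DeepOps.
write_inventory() {
    inventory=$1
    shift
    printf '%s\n' "$@" > "$inventory"
}

# Succeed only if the given defaults file sets nvidia_driver_skip_reboot
# to true. skip_reboot_enabled is a hypothetical helper.
skip_reboot_enabled() {
    grep -q 'nvidia_driver_skip_reboot: true' "$1"
}
```

For example, `skip_reboot_enabled roles/nvidia-dgx/defaults/main.yml && ansible-playbook -vvv playbooks/nvidia-dgx.yml` runs the playbook only when the setting is in place.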

(UBUNTU 18.04) Install Bright cuda-dcgm

The version of DCGM provided by the NVIDIA datacenter-gpu-manager package on Ubuntu 18.04 is not API-compatible with Bright GPU monitoring, so replace it with the Bright cuda-dcgm package.

$ ssh dgx-a100
$ apt remove datacenter-gpu-manager
$ apt install cuda-dcgm

(RHEL 7) Disable Bright cuda-dcgm monitoring & autostart

The version of DCGM provided by the NVIDIA datacenter-gpu-manager package on RHEL 7 is compatible with Bright GPU monitoring, so installing the Bright package is not required. Disable cuda-dcgm service management in Bright to suppress error notifications.

$ cmsh
  device services dgx-a100
  use cuda-dcgm
  set monitored no
  set autostart no
  commit

Grab image from DGX node

Sync the disk image from the running DGX node onto the software image dgx-image on the head node.

$ cmsh
  device use dgx-a100
  grabimage -w

Re-provision node with grabbed image

$ cmsh
  device use dgx-a100

Updated on October 27, 2020