Installing NVIDIA DGX software stack in Bright software images

This document describes how to install the official NVIDIA DGX software stack in a Bright software image using the Ansible playbook provided by NVIDIA. The procedure has been validated on the following Linux distributions:

  • Ubuntu 18.04
  • Red Hat Enterprise Linux 7

Install Ansible and Git

# On Ubuntu 18.04
$ sudo apt-add-repository ppa:ansible/ansible
$ sudo apt install ansible git

# On Red Hat 7 based distributions
$ yum install ansible git
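
A quick pre-flight check can confirm that both tools are on the PATH before continuing. The following is a minimal sketch; the helper name require_cmd is hypothetical and is not part of Bright or DeepOps.

```shell
#!/bin/sh
# require_cmd (hypothetical helper): succeed only if every named command
# can be found on the PATH; report the first missing one on stderr.
require_cmd() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd" >&2
            return 1
        fi
    done
}

# Example: require_cmd ansible git && echo "ansible and git are installed"
```

Running the example line on the head node prints the confirmation only when both commands are found.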

Prepare a copy of the default software image

Clone the default software image to an image called dgx-image.

$ cmsh
  softwareimage
  clone default-image dgx-image
  commit

Remove conflicting packages

Uninstall conflicting packages that would interfere with the installation of the DGX software stack.

$ chroot /cm/images/dgx-image
$ apt remove libumad2sim0 ibsim-utils
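
Before removing anything, it can be useful to see which of the conflicting packages are actually present in the image. The sketch below is one way to do that on an Ubuntu image; the function name find_conflicts is hypothetical, and the package list is taken from the step above.

```shell
#!/bin/sh
# Packages named in this article as conflicting with the DGX software stack.
CONFLICTS="libumad2sim0 ibsim-utils"

# find_conflicts (hypothetical helper): read package names, one per line,
# on stdin and print those that appear in CONFLICTS.
find_conflicts() {
    while IFS= read -r pkg; do
        for c in $CONFLICTS; do
            if [ "$pkg" = "$c" ]; then
                echo "$pkg"
            fi
        done
    done
    return 0
}

# Example (Ubuntu image):
#   chroot /cm/images/dgx-image dpkg-query -W -f='${Package}\n' | find_conflicts
```

Note that dpkg-query -W also lists packages that were removed but not purged, so a name may still appear after apt remove has run.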

Provision the DGX node

Define a new node, say dgx-a100, and assign the prepared dgx-image to it. After the initial ramdisk has been generated, provision the DGX node.

$ cmsh
  device use dgx-a100
  set softwareimage dgx-image
  set installbootrecord yes
  commit

  # Wait for ramdisk to be generated
  # Boot DGX node

Deploy DGX software stack

Clone the DeepOps repository and set up the environment. Add the hostname of the DGX node to the Ansible inventory, then run the playbook to deploy the DGX software stack on the node.

# Clone repository and prepare environment
$ git clone https://github.com/NVIDIA/deepops.git
$ cd deepops
$ git checkout df188eb7083a89ebce8fa8bc2bf24a7a9dcb6acd
$ scripts/setup.sh

# Update inventory and config
$ cp config/inventory{,.bk}
$ echo dgx-a100 > config/inventory
$ sed -i 's/nvidia_driver_skip_reboot: false/nvidia_driver_skip_reboot: true/g' roles/nvidia-dgx/defaults/main.yml

# Run nvidia-dgx playbook
$ ansible-playbook -vvv playbooks/nvidia-dgx.yml
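
The sed command above makes no change and reports no error if the expected line is absent, for example when a different DeepOps commit is checked out. As a hedged sketch, the edit can be wrapped in a small function that verifies the result before the playbook is run. The function name set_skip_reboot is hypothetical, and GNU sed (for the -i option) is assumed.

```shell
#!/bin/sh
# set_skip_reboot (hypothetical helper): switch nvidia_driver_skip_reboot
# from false to true in the given file, then confirm the setting is present.
# Assumes GNU sed for in-place editing with -i.
set_skip_reboot() {
    file="$1"
    sed -i 's/nvidia_driver_skip_reboot: false/nvidia_driver_skip_reboot: true/g' "$file" || return 1
    if grep -q 'nvidia_driver_skip_reboot: true' "$file"; then
        echo "ok: nvidia_driver_skip_reboot is true in $file"
    else
        echo "error: nvidia_driver_skip_reboot: true not found in $file" >&2
        return 1
    fi
}

# Example, run from the deepops directory:
#   set_skip_reboot roles/nvidia-dgx/defaults/main.yml
```

A non-zero return status indicates that the file should be inspected by hand before deploying.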

(Ubuntu 18.04) Install Bright cuda-dcgm

The version of DCGM provided by the NVIDIA datacenter-gpu-manager package on Ubuntu 18.04 is not API compatible with Bright GPU monitoring, so replace it with the Bright cuda-dcgm package.

$ ssh dgx-a100
$ apt remove datacenter-gpu-manager
$ apt install cuda-dcgm

(RHEL 7) Disable Bright cuda-dcgm monitoring & autostart

The version of DCGM provided by the NVIDIA datacenter-gpu-manager package on RHEL 7 is compatible with Bright GPU monitoring, so the Bright package does not need to be installed. Disable cuda-dcgm service management in Bright to suppress error notifications.

$ cmsh
  device services dgx-a100
  use cuda-dcgm
  set monitored no
  set autostart no
  commit

Grab image from DGX node

Sync the disk image from the running DGX node onto the software image dgx-image on the head node.

$ cmsh
  device use dgx-a100
  grabimage -w

Re-provision node with grabbed image

$ cmsh
  device use dgx-a100

Updated on August 25, 2020
