
ID #1473

How do I add NVIDIA DGX nodes to a Bright cluster using the official Ubuntu DGX software stack?

Bright provides a software image for the official DGX software stack. In addition to the official DGX software for Ubuntu 18.04 and a minimal selection of Bright’s software packages, the software image contains the following additional packages:


  • ntp
  • nfs-kernel-server
  • ifupdown
  • slurm-client
  • munge
  • libssh2-1
  • libcublas9.1
  • build-essential
  • python-pip
  • thin-provisioning-tools
  • lvm2


All commands in the following sections should be executed on the head node of the cluster, unless specified otherwise.


Contents

Preparation

Using the DGX software image

Using Slurm

Running a GPU benchmark application through Slurm

Setting up Kubernetes

Running NGC Containers in Kubernetes

Running NGC Containers in Slurm

 

 


Preparation


  1. Install a Bright head node using the standard installation procedure as described in the Bright Installation Manual.

  2. Log into the head node after installation, and download the Bright DGX software image.

# wget https://dgxdownloads.nvidia.com/custhelp/DGX_OS/bright-8.2-dgx-latest.tar.gz

  3. Decompress and untar it on the head node, inside the images directory:

# cd /cm/images
# tar xzvf /path/to/bright-8.2-dgx-latest.tar.gz --acls --xattrs

  4. Copy the default SSH keys to the image.

# cp -a /cm/images/default-image/root/.ssh /cm/images/dgx-image/root/



Using the DGX software image


  1. Add the software image in Bright Cluster Manager using cmsh.

% softwareimage add dgx-image
% set path /cm/images/dgx-image
% set kernelversion <Tab><Tab>

(Pressing <Tab> twice will bring up a list of kernels.)


% commit

(Warnings about a missing kernel module dm-mod can be safely ignored.)


  2. Apply the software image to an existing or new category using cmsh.

% category clone default dgx
% set softwareimage dgx-image
% commit

  3. Set the dgx category for all DGX nodes (e.g. node001 below) using cmsh.

% device set node001 category dgx
% device commit

  4. Configure all DGX nodes to always PXE boot from the interface that connects them to the cluster’s internal network. Make sure to set the following options in the BIOS setup utility:


Under Advanced > CSM Configuration: (set the options shown in the corresponding BIOS setup screenshot)

Under Boot: (set the options shown in the corresponding BIOS setup screenshot)

To get into the BIOS setup utility, press <F2> while the machine is booting. To use Console Redirection, it may be necessary to add the BMC URL (e.g. http://10.168.129.253) to the security exceptions list of the Java installation on the workstation; otherwise the connection may fail.
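
If the DGX BMC is reachable over IPMI, the boot device can also be set remotely instead of through the BIOS screens. This is only a hedged sketch: the BMC address, user name, and password below are placeholders, ipmitool must be installed on the head node, and IPMI-over-LAN must be enabled on the BMC.

# ipmitool -I lanplus -H 10.168.129.253 -U <bmc-user> -P <bmc-password> chassis bootdev pxe options=persistent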


  5. Boot the nodes. If everything was set up correctly, they will be provisioned with the software image after node identification has completed.


Using Slurm

 

In order to achieve optimal utilization on a multi-user system, it is often beneficial to run workloads through a job scheduler. We will assume that Slurm is used, but there are other options such as PBS Pro, LSF, or Univa. For setting up a job scheduler other than Slurm, please refer to the Bright Administrator Manual for instructions.


If you plan on using Slurm to schedule workloads, carry out the following steps to configure the DGX nodes under Slurm. We will assume that a basic Slurm setup was already created because Slurm was selected as the workload management system during the head node installation. If this is not the case, issue the following command to set up Slurm:

 

# wlm-setup -w slurm -s


To configure Slurm on the nodes in the dgx category:


  1. Assign the Slurm client role to the category.

# wlm-setup -w slurm -c dgx

  2. To be able to schedule GPUs properly with Slurm, append a few custom parameters to the Slurm configuration at /etc/slurm/slurm.conf on the head node. These lines should be added after the autogenerated section of the configuration file:

JobAcctGatherType=jobacct_gather/cgroup
AccountingStorageTRES=gres/gpu
SelectType=select/cons_res
SelectTypeParameters=CR_Core

  3. Restart the Slurm service using cmsh.

% device services master
% restart slurm
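
Optionally, confirm that the custom parameters from the previous step are now active. This is just a sanity check; load the slurm environment module first if scontrol is not already in your PATH, and note that the exact output formatting varies between Slurm versions.

# module load slurm
# scontrol show config | grep -E 'SelectType|JobAcctGatherType|AccountingStorageTRES'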

  4. Add a generic resource for each GPU using cmsh.

% category roles dgx
% use slurmclient
% genericresources
% add gpu0
% set name gpu
% set file /dev/nvidia0
% exit
% clone gpu0 gpu1
% clone gpu0 gpu2
% clone gpu0 gpu3
% clone gpu0 gpu4
% clone gpu0 gpu5
% clone gpu0 gpu6
% clone gpu0 gpu7
% set gpu1 file /dev/nvidia1
% set gpu2 file /dev/nvidia2
% set gpu3 file /dev/nvidia3
% set gpu4 file /dev/nvidia4
% set gpu5 file /dev/nvidia5
% set gpu6 file /dev/nvidia6
% set gpu7 file /dev/nvidia7
% exit
% commit

You should eventually end up with:


% list
Alias (key)  Name  Type      Count     File
------------ ----- --------- --------- ----------------
gpu0         gpu                       /dev/nvidia0
gpu1         gpu                       /dev/nvidia1
gpu2         gpu                       /dev/nvidia2
gpu3         gpu                       /dev/nvidia3
gpu4         gpu                       /dev/nvidia4
gpu5         gpu                       /dev/nvidia5
gpu6         gpu                       /dev/nvidia6
gpu7         gpu                       /dev/nvidia7
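
Bright generates the node-side Slurm GRES configuration from these role settings, so nothing needs to be edited by hand. For reference only, the settings above correspond roughly to the following standard Slurm gres.conf entry (a sketch, not something to create manually):

NodeName=node001 Name=gpu File=/dev/nvidia[0-7]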

  5. Enable device constraints in the Slurm server role so that GPUs are only accessible to jobs that have been allocated them.

% device roles master
% use slurmserver
% cgroups
% set constraindevices yes
% commit
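
This role setting corresponds to the standard ConstrainDevices option in Slurm’s cgroup.conf, which Bright manages for you; it is shown here only as a sketch of the underlying Slurm setting:

ConstrainDevices=yes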

  6. By default, Slurm allows only a single job to run per node. To change this behavior, it is necessary to allow oversubscription, e.g. to 8 jobs per node.

% jobqueue use slurm defq
% set oversubscribe YES:8
% commit
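
To confirm the change, inspect the partition from a shell on the head node (depending on the Slurm version, the field is reported as OverSubscribe or Shared):

# scontrol show partition defq | grep -iE 'oversubscribe|shared'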


Running a GPU benchmark application through Slurm


  1. Create an account (user add in cmsh) or use the cmsupport account.

  2. Install the cmake utility and other prerequisites on the head node.

# apt update
# apt-get install cuda10.1-toolkit cuda10.1-sdk

Install cmake.

# apt-get install python-pip
# pip install cmake --upgrade

  3. Switch to your account.

# su - cmsupport

  4. Load the CUDA environment modules.

$ module load cuda10.1/toolkit

  5. Clone the mgbench Git repo.

$ git clone https://github.com/tbennun/mgbench.git

  6. Build it.

$ cd mgbench
$ sh build.sh

  7. Optionally, verify manually on a DGX node that mgbench will run.

$ cd mgbench
$ module load cuda10.1/toolkit
$ sh run.sh

  8. Create a file, mgbench.slurm, in your home directory with the following content:

#!/bin/bash

# Request 1 CPU core and 2 GPUs
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:2

# Assign name to this job
#SBATCH -J mgbench
# Allow this job to run non-exclusively on a node
#SBATCH --share
dir=~/mgbench/job-$SLURM_JOB_ID
mkdir -p $dir
cd $dir
ln -s ../build
module load cuda10.1/toolkit
sh ../run.sh

  9. Submit a number of jobs.

$ for i in `seq 1 4`; do sbatch mgbench.slurm; done


Setting up Kubernetes


  1. Install cm-docker on the head node.

# apt-get install cm-docker -y

  2. Set the disk partitioning.

% category use dgx
% set disksetup

Remove the swap partition.

    <partition id="a4">
     <size>12G</size>
     <type>linux swap</type>
   </partition>

Add the logical volume for the Docker thin pool on the second hard drive (/dev/sdb).

 <device>
   <blockdev>/dev/sdb</blockdev>
   <partition id="sdb-docker_thin_pool">
     <size>max</size>
     <type>linux lvm</type>
   </partition>
 </device>
 <volumeGroup>
   <name>docker_thin_pool</name>
   <extentSize>4M</extentSize>
   <physicalVolumes>
     <member>sdb-docker_thin_pool</member>
   </physicalVolumes>
   <logicalVolumes>
     <volume metadatasize="2G" thinpool="1">
       <name>docker_data</name>
       <size>max</size>
       <filesystem>ext4</filesystem>
       <mountPoint>/var/lib/docker</mountPoint>
       <mountOptions>defaults,noatime,nodiratime</mountOptions>
     </volume>
   </logicalVolumes>
 </volumeGroup>

Save (<Esc> :wq) and commit.


  3. Use Nvidia as the default Docker runtime for the DGX nodes.

Edit /cm/images/dgx-image/etc/docker/daemon.json, and add the “default-runtime” directive shown below.

{
   "default-runtime": "nvidia",
   "runtimes": {
       "nvidia": {
           "path": "nvidia-container-runtime",
           "runtimeArgs": []
       }
   }
}
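
Once a DGX node has been rebooted with the modified image (this happens during the Kubernetes setup below), the change can be verified on that node. A simple check, run on the DGX node rather than the head node:

# docker info | grep -i 'default runtime'
 Default Runtime: nvidia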

  4. Add the following entries to the dgx category’s update exclude list (category set dgx excludelistupdate in cmsh):

- /etc/systemd/system/multi-user.target.wants/docker.service
- /var/lib/dockershim
- /etc/docker/key.json
- /var/lib/docker

Add the following entries to the dgx category’s sync exclude list (category set dgx excludelistsyncinstall in cmsh); a cmsh sketch for editing both lists follows below:

- /var/lib/dockershim
- /etc/docker/key.json
- /var/lib/docker
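
A cmsh sketch for editing both lists (as with the disk setup earlier, set opens the list in an editor):

% category use dgx
% set excludelistupdate

(Append the update exclude entries listed above, save, and exit the editor.)

% set excludelistsyncinstall

(Append the sync exclude entries listed above, save, and exit the editor.)

% commit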

  5. Install Bright Kubernetes. This assumes that Docker is already installed as part of the official DGX software stack.

# cm-kubernetes-setup --skip-docker

Select the defaults, use e.g. the head node as the Kubernetes master, and assign the DGX nodes as the Kubernetes nodes and etcd servers. Select the Nvidia runtime plugin. Wait for the DGX nodes to reboot.


  6. Add a non-root user to manage Kubernetes.

# cm-kubernetes-setup --add-user cmsupport --role cluster-admin

Wait a few minutes for the certificates to be issued and propagated, then test:

# su - cmsupport
$ module load kubernetes/default
$ kubectl describe nodes | grep -B6 gpu

  7. Test scheduling workloads on the GPUs.

Create a YAML file that defines a pod.

apiVersion: v1
kind: Pod
metadata:
 name: cuda-vector-add
spec:
 restartPolicy: OnFailure
 containers:
   - name: cuda-vector-add
     # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
     image: "k8s.gcr.io/cuda-vector-add:v0.1"
     resources:
       limits:
         nvidia.com/gpu: 3

Create and run the pod.

$ kubectl apply -f yaml-file-above.yml
$ kubectl logs -f cuda-vector-add

The output should be:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

 


Running NGC Containers in Kubernetes

 

Assuming that your head node is called bright-hn01, the following steps will allow an NGC container to be scheduled in Kubernetes:


  1. Install a local Docker registry using cm-docker-registry-setup. Accept the defaults and set up the registry on the head node. The Docker registry will be available at e.g. bright-hn01.cm.cluster:5000.
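
For example (the setup runs interactively; accept the defaults when prompted):

# cm-docker-registry-setup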

  2. Synchronize the nodes against the software image to distribute the registry certificates.

% device imageupdate -w -m dgx-image

  3. Pull e.g. the PyTorch container and push it to the local Docker registry.

$ module load docker
$ docker pull nvcr.io/nvidia/pytorch:19.05-py3
$ docker image tag nvcr.io/nvidia/pytorch:19.05-py3 bright-hn01.cm.cluster:5000/nvidia-pytorch:19.05-py3
$ docker push bright-hn01.cm.cluster:5000/nvidia-pytorch:19.05-py3

  4. Create a YAML file named nvidia-pytorch.yaml:

apiVersion: v1
kind: Pod
metadata:
 name: nvidia-pytorch
spec:
 restartPolicy: Never
 containers:
 - name: pytorch
   image: bright-hn01.cm.cluster:5000/nvidia-pytorch:19.05-py3
   args: ["python", "/workspace/examples/upstream/mnist/main.py"]
   resources:
     limits:
       nvidia.com/gpu: 8

  5. Deploy the pod in the Kubernetes cluster.

$ module load kubernetes
$ kubectl apply -f nvidia-pytorch.yaml
$ kubectl get pods -o wide -w

  6. Monitor the output.

$ kubectl logs pod/nvidia-pytorch

The output should be:

=============
== PyTorch ==
=============
NVIDIA Release 19.05 (build 6411784)
PyTorch Version 1.1.0a0+828a6a3

Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.



Test set: Average loss: 0.0313, Accuracy: 9891/10000 (99%)


Running NGC Containers in Slurm

 

You can also run NGC containers in Slurm using Singularity.


  1. Install Singularity on the head node.

# apt-get install cm-singularity

  2. Install Singularity in the software image.

# chroot /cm/images/dgx-image
# apt-get install cm-singularity
# exit

  3. Distribute the changes using cmsh.

% device imageupdate -m dgx-image -w

  4. Prepare job files as a user (e.g. cmsupport).

# su - cmsupport
$ mkdir -p ~/pytorch
$ cd ~/pytorch
$ wget https://pytorch.org/tutorials/_downloads/two_layer_net_tensor.py

  5. Tweak two_layer_net_tensor.py by uncommenting the following line:

#device = torch.device("cuda:0")
# Uncomment this to run on GPU

To make the job run a bit longer, change 500 in the following line to some larger value:

for t in range(500):
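
After these edits, the relevant lines should look roughly as follows (a sketch; 5000 is merely an example of a larger loop bound):

device = torch.device("cuda:0")  # Uncomment this to run on GPU

# ... further down in the script:
for t in range(5000):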

  6. Store the following job script in e.g. pytorch.slurm:

#!/bin/bash
# Request 1 CPU core and 1 GPU
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH -J pytorch
#SBATCH --share
module load singularity
module load cuda10.1/toolkit
export SINGULARITY_TMPDIR=/local
singularity run --nv docker://nvcr.io/nvidia/pytorch:19.05-py3 python /home/cmsupport/pytorch/two_layer_net_tensor.py

  7. Submit a number of jobs and observe how each job is allocated its own GPU.

$ for i in `seq 1 16`; do sbatch pytorch.slurm; done
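
The allocation can be observed with standard tools, for example from the head node (node001 is just an example DGX node; output formatting varies):

$ squeue -u cmsupport
$ ssh node001 nvidia-smi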

  8. After a job has finished running, its output file should look like:

=============
== PyTorch ==
=============
NVIDIA Release 19.05 (build 6411784)
PyTorch Version 1.1.0a0+828a6a3
Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.

498 5.751993376179598e-05
499 5.676814544131048e-05

Tags: DGX, NVIDIA
