How can I run a simple test to stress test my GPUs?

Make sure CUDA, git and cmake are installed on the head node of the cluster:

# yum install cuda11.0-toolkit git cmake
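
To confirm that the packages are in place before continuing (using the same package names as in the yum command above), they can be queried with rpm:

# rpm -q cuda11.0-toolkit git cmake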

Clone the Multi-GPU Benchmark (mgbench) repository under a user account (e.g. cmsupport):

su - cmsupport
git clone https://github.com/tbennun/mgbench.git

Load the CUDA environment module:

module initadd cuda11.0/toolkit
module load cuda11.0/toolkit
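
To confirm that the module is active and that nvcc and CUDA_ROOT (used in the build step below) are available, a quick check is:

module list
which nvcc
echo $CUDA_ROOT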

Build it:

cd mgbench
export CUDA_BIN_PATH=$CUDA_ROOT
sh build.sh
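
build.sh compiles the benchmarks into the build/ directory, which the Slurm script below links to. If the build succeeded, the directory should contain the benchmark executables (the exact file names may vary between mgbench versions):

ls -l build/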

Create a file mgbench.slurm with the following contents:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH -J mgbench

# Remove one # from the line below to request 8 GPUs if GPUs are configured as consumable resources in your Slurm configuration
##SBATCH --gres=gpu:8

dir=~/mgbench/job-$SLURM_JOB_ID
mkdir -p $dir
cd $dir
# make the compiled benchmarks available inside the job directory
ln -s ../build
# run the full mgbench suite; results are written to the current directory
sh ../run.sh

Submit a number of jobs, for example 10:

for i in `seq 1 10`; do sbatch mgbench.slurm; done
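
The state of the submitted jobs can be followed with the usual Slurm tools, for example:

squeue -u cmsupport
sacct -u cmsupport --format=JobID,JobName,State,Elapsed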

Each job will take approximately 20 minutes and will run on a single node. Output can be found in ~/mgbench/job-X (where X is the job ID).
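
The results are plain-text files written by mgbench's run.sh; once a job has finished they can be inspected directly (the exact file names depend on the mgbench version):

ls ~/mgbench/job-X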

To verify that the GPUs are being utilized for a particular job, you can use Bright View’s job monitoring capabilities. Open the Monitoring view, and select Workload -> cmsupport -> current date -> job ID -> GPU. Then visualize the gpu_fb_used and gpu_utilization metrics.
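
Alternatively, GPU activity can be checked from the command line by running nvidia-smi on the node where the job is executing (node001 below is only a placeholder for that node); the utilization.gpu and memory.used fields correspond roughly to the gpu_utilization and gpu_fb_used metrics mentioned above:

ssh node001 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5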

Updated on March 4, 2021
