
How to install GPUDirect Storage (GDS) on Bright 9.2 (DGX – BaseOS 5.4)

This document has been verified on Bright 9.2 with Ubuntu 20.04, GDS 11.8, and MOFED 5.4.

Preparation

  1. Clone the default software image “default-image” to dgx54-gds-image:
# cmsh
% softwareimage
% clone default-image dgx54-gds-image
% commit
(wait until the initrd is generated)
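
Optionally, confirm the clone is present before continuing (a quick sanity check; cmsh -c runs the quoted commands non-interactively):

# cmsh -c "softwareimage; list"
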
  2. Convert the dgx54-gds-image into a DGX OS image by following these steps:

a. Pick a DGX compute node and set its software image to the cloned dgx54-gds-image:

# cmsh
% device use dgx001
% set softwareimage dgx54-gds-image
% commit
% reboot
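
While the node reboots, its state can be followed from the head node (status is a standard cmsh device command):

# cmsh -c "device; status dgx001"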

b. After the node boots up, run the following commands on the node itself:

# curl https://repo.download.nvidia.com/baseos/ubuntu/focal/dgx-repo-files.tgz | sudo tar xzf - -C /
# apt update
# apt upgrade
# apt install dgx-a100-system-configurations dgx-a100-system-tools dgx-a100-system-tools-extra
# apt install -y nvidia-driver-515-server linux-modules-nvidia-515-server-generic libnvidia-nscq-515 nvidia-modprobe nvidia-fabricmanager-515 datacenter-gpu-manager nv-persistence-mode
# systemctl disable ondemand
# apt install linux-generic
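
To confirm the driver packages landed, query dpkg; nvidia-smi will report the driver version once the module is loaded (after the reboot in step e):

# dpkg -l | grep nvidia-driver-515-server
# nvidia-smi --query-gpu=driver_version --format=csv,noheader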

c. Grab the software image from dgx001 to capture the updates:

# cmsh
% device use dgx001
% grabimage -w

d. Set the kernel version of the image to the latest kernel grabbed from the compute node:

# cmsh
% softwareimage use dgx54-gds-image
% set kernelversion 5.4.0-131-generic
% append kernelparameters " iommu=off"
% commit
(wait until the initrd is generated)
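
If in doubt about which kernels the grabbed image contains, cmsh can list them (kernelversions is a softwareimage-mode command):

# cmsh
% softwareimage use dgx54-gds-image
% kernelversions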

e. Reboot the compute node to boot into the new kernel
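
Once the node is back up, confirm it booted the kernel set in step d:

# uname -r
5.4.0-131-generic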

f. Add nvme-rdma, nvmet, and rpcrdma to the list of modules to be loaded at boot:

# cat /etc/modules-load.d/modules.conf
ipmi_devintf
nvme-rdma
nvmet
rpcrdma
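
To load the modules right away rather than waiting for the next reboot (modprobe -a loads every module listed):

# modprobe -a nvme-rdma nvmet rpcrdma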

g. Grab the changes to the image (repeat step c) and reboot the node

Installation 

  1. Install MOFED 5.4 with NVMe support and GDS 11.8:
# NVIDIA_DRV_VERSION=$(cat /proc/driver/nvidia/version | grep Module | awk '{print $8}' | cut -d '.' -f 1)
# apt-get install mlnx-ofed54 mlnx-nfsrdma-dkms srp-dkms isert-dkms iser-dkms mlnx-nvme-dkms nvidia-gds-11-8 gds-tools-11-8 nvidia-dkms-${NVIDIA_DRV_VERSION}-server
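
Before moving on, it is worth confirming the installed OFED release (ofed_info -s prints the MOFED version string):

# ofed_info -s
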
  2. Grab the software image from dgx001 to capture the updates:
# cmsh
% device use dgx001
% grabimage -w
  3. Recreate the ramdisk:
# cmsh
% softwareimage use dgx54-gds-image
% createramdisk
(wait until the initrd is generated)
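
The rebuilt initrd can be verified from the head node by checking its timestamp inside the image tree (/cm/images is the default Bright image location; adjust if your images live elsewhere):

# ls -l /cm/images/dgx54-gds-image/boot/initrd-*
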
  4. Reboot dgx001 to reload all kernel modules
  5. After the node boots up, verify that the NVIDIA driver and NVMe modules are loaded:
# lsmod | grep -E "nvme|nvidia|rpcrdma"
nvidia_uvm           1036288  0
nvidia_drm             57344  0
nvidia_modeset       1200128  1 nvidia_drm
nvidia_fs             249856  0
rpcrdma                69632  0
nvidia_peermem         16384  0
nvmet                 106496  0
nvme_rdma              40960  0
rdma_cm               106496  3 rpcrdma,nvme_rdma,rdma_ucm
ib_core               319488  11 rdma_cm,ib_ipoib,rpcrdma,nvidia_peermem,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvme_fabrics           24576  1 nvme_rdma
sunrpc                397312  12 rpcrdma,nfsv4,auth_rpcgss,lockd,nfsv3,rpcsec_gss_krb5,nfs_acl,nfs
drm_kms_helper        184320  4 ast,nvidia_drm
nvidia              35422208  255 nvidia_uvm,nvidia_peermem,nvidia_modeset
nvme                   49152  13 nvmet
drm                   495616  7 drm_kms_helper,drm_vram_helper,ast,nvidia,nvidia_drm,ttm
nvme_core              98304  25 nvme,nvme_rdma,nvme_fabrics
mlx_compat             65536  18 rdma_cm,ib_ipoib,mlxdevm,nvmet,rpcrdma,nvme,nvme_rdma,iw_cm,nvme_core,auxiliary,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
  6. Verify that GDS is working with NVMe and NVMeOF support:
# /usr/local/cuda-11.8/gds/tools/gdscheck.py -p
GDS release version: 1.4.0.31
nvidia_fs version:  2.13 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe               : Supported
NVMeOF             : Supported
SCSI               : Unsupported
ScaleFlux CSD      : Unsupported
NVMesh             : Unsupported
DDN EXAScaler      : Unsupported
IBM Spectrum Scale : Unsupported
NFS                : Supported
BeeGFS             : Unsupported
WekaFS             : Unsupported
Userspace RDMA     : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library        : Not Loaded (libcufile_rdma.so)
--rdma devices        : Not configured
--rdma_device_status  : Up: 0 Down: 0
# cat /proc/driver/nvidia-fs/stats
GDS Version: 1.4.0.29
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.13.5)
Mellanox PeerDirect Supported: True
IO stats: Disabled, peer IO stats: Disabled
Logging level: info

Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads                           : err=0 io_state_err=0
Sparse Reads                    : n=0 io=0 holes=0 pages=0
Writes                          : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap                            : n=0 ok=0 err=0 munmap=0
Bar1-map                        : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error                           : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops                             : Read=0 Write=0 BatchIO=0
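
As a final end-to-end test, the gds-tools package ships gdsio, which drives I/O through the GDS path. A minimal sketch, assuming a GDS-capable file system is mounted at /mnt/nvme (the mount point is an example; -x 0 selects the GPUDirect transfer mode, -d 0 targets GPU 0, -w 4 uses four worker threads, -s and -i set the file and I/O sizes, and -I 1 performs sequential writes):

# /usr/local/cuda-11.8/gds/tools/gdsio -f /mnt/nvme/gdstest -d 0 -w 4 -s 1G -i 1M -x 0 -I 1
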
Updated on January 24, 2023
