How to install GPUDirect Storage (GDS) on BCM 10 (DGX – BaseOS 6)

This procedure was verified on BCM 10 with Ubuntu 22.04 (kernel 5.15), GDS 12.3, and MOFED 23.10 on DGX A100 hardware.

Preparation

1. Clone the default software image “dgx-os-6.1-a100-image” (or “dgx-os-6.1-h100-image”) to dgxos61-gds-image:

# cmsh
% softwareimage
% clone dgx-os-6.1-a100-image dgxos61-gds-image
% commit
(wait until the initrd is generated)

 

2. Make sure that the image is up to date:

  a. Pick a DGX compute node and set its software image to the cloned dgxos61-gds-image:

# cmsh
% device use dgx001
% set softwareimage dgxos61-gds-image
% commit
% reboot

  b. After the node boots up, run the following commands on the node itself:

# apt update
# apt upgrade

  c. Grab the software image from dgx001 to capture the updates:

# cmsh
% device use dgx001
% grabimage -w

  d. Set the kernel version of the image to the newest kernel captured from the compute node (5.15.0-1041-nvidia in this example):

# cmsh
% softwareimage use dgxos61-gds-image
% set kernelversion 5.15.0-1041-nvidia
% append kernelparameters " iommu=off" # only if not present
% commit
(wait until the initrd is generated)
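The version string passed to "set kernelversion" has to match a kernel that is actually present after the upgrade. A minimal sketch for finding it, assuming the installed kernels appear as directory names under /lib/modules on the node and that the DGX kernels end in "-nvidia"; the function name latest_nvidia_kernel is hypothetical:

```shell
# Print the highest "-nvidia" kernel version from a list of version
# strings read on stdin (one per line). Prints nothing if none is listed.
latest_nvidia_kernel() {
    grep -- '-nvidia$' | sort -V | tail -n 1
}

# Example use on the node (not run here):
#   ls /lib/modules | latest_nvidia_kernel
```

The version sort (sort -V) is used because a plain lexical sort would not order build numbers such as 1029 and 1041 reliably once they differ in length.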

  e. Reboot the compute node to boot into the new kernel

  f. Add the nvme-rdma, nvmet, rpcrdma, and nvidia-peermem modules to the list of modules loaded at boot (/etc/modules-load.d/modules.conf on the node):

# cat /etc/modules-load.d/modules.conf
ipmi_devintf
nvme-rdma
nvmet
nvidia-peermem
rpcrdma
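
The file can be edited by hand; the following is a sketch of doing it idempotently, so that repeating the step does not duplicate entries. The function name ensure_modules is hypothetical, and the sketch assumes the file, if it exists, ends with a newline:

```shell
# Append each module name to a modules-load.d file unless it is already
# present on a line of its own. Usage: ensure_modules FILE MODULE...
ensure_modules() {
    conf=$1
    shift
    touch "$conf"
    for mod in "$@"; do
        # -x: match the whole line, -F: fixed string, -q: no output
        grep -qxF -- "$mod" "$conf" || printf '%s\n' "$mod" >> "$conf"
    done
}

# Example use on the node, as root (not run here):
#   ensure_modules /etc/modules-load.d/modules.conf \
#       nvme-rdma nvmet nvidia-peermem rpcrdma
```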

  g. Grab the changes into the image (grabimage -w, as in step c) and reboot the node.

Installation 

1. On dgx001, install MOFED 23.10 with NVMe-oF and NFS over RDMA support:

# wget https://content.mellanox.com/ofed/MLNX_OFED-23.10-0.5.5.0/MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64.tgz
# tar -xzvf MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64.tgz 
# cd MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64/
# ./mlnxofedinstall --with-nfsrdma --with-nvmf --without-fw-update
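
Before continuing, it can be worth confirming which MOFED release is active. A sketch, assuming "ofed_info -s" prints a single line of the form "MLNX_OFED_LINUX-<release>:" (the same string gdscheck reports later in this document); the function name is_mofed_release is hypothetical:

```shell
# Succeed if the version line on stdin names the expected MOFED release.
# Usage: ofed_info -s | is_mofed_release 23.10-0.5.5.0
is_mofed_release() {
    grep -qF -- "MLNX_OFED_LINUX-$1"
}

# Example use on the node (not run here):
#   ofed_info -s | is_mofed_release 23.10-0.5.5.0 || echo "unexpected MOFED release"
```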

2. Install the NVIDIA driver (545 branch), the nvidia-fs kernel module, Fabric Manager, and GDS 12.3:

# apt install dgx-a100-system-configurations dgx-a100-system-tools dgx-a100-system-tools-extra
# apt install nvidia-driver-545 
# apt install linux-modules-nvidia-fs-5.15.0-1041-nvidia
# apt install nvidia-fabricmanager-545 libnvidia-nscq-545
# systemctl unmask nvidia-fabricmanager.service 
# systemctl enable nvidia-fabricmanager.service 
# apt install nvidia-gds nvidia-gds-12-3 gds-tools-12-3

NOTE: Make sure that the linux-modules-nvidia-fs package is still installed after installing the fabricmanager and nvidia-gds packages. Reinstall it if it was removed.
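
One way to check the package state from a script is sketched below. It assumes dpkg-query's ${Status} field reads "install ok installed" for a fully installed package, and that the package name follows the running kernel (linux-modules-nvidia-fs-$(uname -r)), which holds for the versions used in this document. The function name is_dpkg_installed is hypothetical:

```shell
# Succeed only if a dpkg status string, as printed by
#   dpkg-query -W -f='${Status}' PACKAGE
# reports a fully installed package.
is_dpkg_installed() {
    [ "$1" = "install ok installed" ]
}

# Example use on the node (not run here):
#   pkg="linux-modules-nvidia-fs-$(uname -r)"
#   is_dpkg_installed "$(dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null)" \
#       || echo "$pkg is missing; reinstall it before grabbing the image"
```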

3. Grab the software image from dgx001 to capture the updates:

# cmsh
% device use dgx001
% grabimage -w

4. Recreate the ramdisk:

# cmsh
% softwareimage use dgxos61-gds-image
% createramdisk
(wait until the initrd is generated)

NOTE: Also run createramdisk for dgx001 itself if the software image is assigned directly to the node rather than inherited from its category.

5. Reboot dgx001 to reload all kernel modules.

6. After the node boots up, verify that the NVIDIA driver, NVMe, and RDMA modules are loaded:

root@dgx001:~# lsmod | grep -E "nvme|nvidia|rpcrdma"
nvidia_fs             262144  0
nvidia_uvm           1515520  4
rpcrdma                81920  0
nvidia_peermem         16384  0
nvmet                 151552  0
nvme_rdma              45056  0
nvme_fabrics           32768  1 nvme_rdma
rdma_cm               122880  3 rpcrdma,nvme_rdma,rdma_ucm
ib_core               434176  11 rdma_cm,ib_ipoib,rpcrdma,nvidia_peermem,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
sunrpc                585728  13 rpcrdma,nfsv4,auth_rpcgss,lockd,nfsv3,rpcsec_gss_krb5,nfs_acl,nfs
nvidia_drm             94208  0
nvidia_modeset       1327104  1 nvidia_drm
nvidia              56172544  217 nvidia_uvm,nvidia_peermem,nvidia_modeset
drm_kms_helper        315392  5 drm_vram_helper,ast,nvidia_drm
nvme                   57344  24 nvmet
drm                   622592  8 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,nvidia_drm,ttm
nvme_core             143360  27 nvmet,nvme,nvme_rdma,nvme_fabrics
mlx_compat             69632  17 rdma_cm,ib_ipoib,mlxdevm,nvmet,rpcrdma,nvme,nvme_rdma,iw_cm,nvme_core,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
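
Rather than reading the listing by eye, the check can be scripted. A sketch that reads lsmod output on stdin and reports any required module that is absent; the function name missing_modules is hypothetical, and the module list in the usage line is taken from the steps above:

```shell
# Read "lsmod" output on stdin and print every required module that is
# not listed. Returns non-zero if any module is missing.
# Usage: lsmod | missing_modules MODULE...
missing_modules() {
    loaded=$(awk '{print $1}')
    rc=0
    for mod in "$@"; do
        # lsmod shows underscores even for modules named with dashes.
        name=$(printf '%s' "$mod" | sed 's/-/_/g')
        if ! printf '%s\n' "$loaded" | grep -qxF -- "$name"; then
            printf '%s\n' "$name"
            rc=1
        fi
    done
    return "$rc"
}

# Example use on the node (not run here):
#   lsmod | missing_modules nvidia-fs nvidia-peermem nvme-rdma nvmet rpcrdma
```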

7. Verify that GDS is working with NVMe and NVMe-oF support:

root@dgx001:~# /usr/local/cuda-12.3/gds/tools/gdscheck.py -v
 GDS release version: 1.8.1.2
 nvidia_fs version:  2.17 libcufile version: 2.12
 Platform: x86_64
root@dgx001:~# /usr/local/cuda-12.3/gds/tools/gdscheck.py -V
FILESYSTEM VERSION CHECK:
Pre-requisite:
nvidia_peermem is loaded as required
nvme module is loaded
nvme module is correctly patched
nvme-rdma module is loaded
nvme-rdma module is correctly patched
ScaleFlux module is not loaded
NVMesh module is not loaded
Lustre module is not loaded
BeeGFS module is not loaded
GPFS module is not loaded
rpcrdma module is loaded
rpcrdma module is correctly patched
Lustre:
current version: Unknown
min version supported: 2.12.3_ddn28
ofed_info:
current version: MLNX_OFED_LINUX-23.10-0.5.5.0: (Supported)
min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
root@dgx001:~# /usr/local/cuda-12.3/gds/tools/gdscheck.py -p
 GDS release version: 1.8.1.2
 nvidia_fs version:  2.17 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Supported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 2 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 3 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 4 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 5 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 6 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 7 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Unsupported(Nvidia Open Driver Not Installed)
 Cuda Driver Version Installed:  12030
 Platform: DGXA100 920-23687-2530-000, Arch: x86_64(Linux 5.15.0-1041-nvidia)
 Platform verification succeeded
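
The rows that matter for this setup are NVMe, NVMeOF, and NFS in the DRIVER CONFIGURATION table. A sketch that checks them from a script, assuming the "name : state" row layout shown in the output above; the function name gds_drivers_supported is hypothetical:

```shell
# Read "gdscheck.py -p" output on stdin and succeed only if every named
# driver row reads "Supported". On failure, print the first offending row.
# Usage: gdscheck.py -p | gds_drivers_supported NVMe NVMeOF NFS
gds_drivers_supported() {
    report=$(cat)
    for drv in "$@"; do
        # Rows look like " NVMe               : Supported". The name is
        # compared exactly, so NVMe does not match the NVMeOF row.
        state=$(printf '%s\n' "$report" |
            awk -F ':' -v d="$drv" '
                { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
                k == d { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); print v; exit }')
        if [ "$state" != "Supported" ]; then
            printf '%s: %s\n' "$drv" "${state:-not found}"
            return 1
        fi
    done
    return 0
}

# Example use on the node (not run here):
#   /usr/local/cuda-12.3/gds/tools/gdscheck.py -p | gds_drivers_supported NVMe NVMeOF NFS
```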

 

Updated on January 3, 2024