This procedure was verified on BCM 10 with Ubuntu 22.04 (kernel 5.15), GDS 12.3, and MOFED 23.10 on DGX A100 hardware.
Preparation
1. Clone the default software image “dgx-os-6.1-a100-image” (or “dgx-os-6.1-h100-image”) to dgxos61-gds-image:
    # cmsh
    % softwareimage
    % clone dgx-os-6.1-a100-image dgxos61-gds-image
    % commit
    (wait until the initrd is generated)
2. Make sure that the image is up to date:
a. Pick a DGX compute node and set its software image to the cloned dgxos61-gds-image:
    # cmsh
    % device use dgx001
    % set softwareimage dgxos61-gds-image
    % commit
    % reboot
b. After the node boots up, run the following commands on the node itself:
    # apt update
    # apt upgrade
c. Grab the software image from dgx001 to capture the updates:
    # cmsh
    % device use dgx001
    % grabimage -w
d. Set the image's kernel version to the latest kernel grabbed from the compute node (5.15.0-1041-nvidia in this example):
    # cmsh
    % softwareimage use dgxos61-gds-image
    % set kernelversion 5.15.0-1041-nvidia
    % append kernelparameters " iommu=off"
    (only if iommu=off is not already present)
    % commit
    (wait until the initrd is generated)
e. Reboot the compute node to boot into the new kernel.
f. Add the nvme-rdma, nvmet, rpcrdma, and nvidia-peermem modules to the list of modules loaded at boot:
    # cat /etc/modules-load.d/modules.conf
    ipmi_devintf
    nvme-rdma
    nvmet
    nvidia-peermem
    rpcrdma
g. Grab the changes to the image and reboot the node.
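The preparation steps above can be spot-checked on the node before moving on to the installation. The following is a minimal sketch, not part of the verified procedure: has_kernel_param and missing_boot_modules are hypothetical helper names introduced here for illustration, and the sketch assumes the module list lives in /etc/modules-load.d/modules.conf with one module name per line, as shown in step f.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for spot-checking the preparation steps.
# They only read text; nothing on the node is modified.

# Succeed if the kernel command line ($1) contains the exact parameter ($2).
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print each required module (arguments after the first) that does not
# appear as an exact line in the modules-load file ($1).
# If the file is missing, every module is reported.
missing_boot_modules() {
    local conf_file="$1" mod
    shift
    for mod in "$@"; do
        grep -qxF -- "$mod" "$conf_file" 2>/dev/null || echo "$mod"
    done
}

# Example usage on the compute node (assumed paths):
#   has_kernel_param "$(cat /proc/cmdline)" iommu=off || echo "iommu=off is not set"
#   missing_boot_modules /etc/modules-load.d/modules.conf \
#       nvme-rdma nvmet nvidia-peermem rpcrdma
```

An empty result from missing_boot_modules means all four modules are listed. The check only inspects the configuration file; whether the modules actually load is verified after the installation.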
Installation
1. Install MOFED 23.10 with NVMe over Fabrics and NFS over RDMA support, as needed for GDS 12.3:
    # wget https://content.mellanox.com/ofed/MLNX_OFED-23.10-0.5.5.0/MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64.tgz
    # tar -xzvf MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64.tgz
    # cd MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64/
    # ./mlnxofedinstall --with-nfsrdma --with-nvmf --without-fw-update
2. Install the NVIDIA drivers (version 545 here) together with GDS 12.3:
    # apt install dgx-a100-system-configurations dgx-a100-system-tools dgx-a100-system-tools-extra
    # apt install nvidia-driver-545
    # apt install linux-modules-nvidia-fs-5.15.0-1041-nvidia
    # apt install nvidia-fabricmanager-545 libnvidia-nscq-545
    # systemctl unmask nvidia-fabricmanager.service
    # systemctl enable nvidia-fabricmanager.service
    # apt install nvidia-gds nvidia-gds-12-3 gds-tools-12-3
NOTE: Make sure that the linux-modules-nvidia-fs package is still installed, and was not removed, after installing the fabricmanager or nvidia-gds packages.
3. Grab the software image from dgx001 to capture the updates:
    # cmsh
    % device use dgx001
    % grabimage -w
4. Recreate the ramdisk:
    # cmsh
    % softwareimage use dgxos61-gds-image
    % createramdisk
    (wait until the initrd is generated)
NOTE: Also run createramdisk for dgx001 itself if the software image is assigned directly to the node rather than inherited from its category.
5. Reboot dgx001 to reload all kernel modules.
6. After the node boots up, verify that the NVIDIA drivers and NVMe modules are loaded:
    root@dgx001:~# lsmod | grep -E "nvme|nvidia|rpcrdma"
    nvidia_fs        262144    0
    nvidia_uvm       1515520   4
    rpcrdma          81920     0
    nvidia_peermem   16384     0
    nvmet            151552    0
    nvme_rdma        45056     0
    nvme_fabrics     32768     1 nvme_rdma
    rdma_cm          122880    3 rpcrdma,nvme_rdma,rdma_ucm
    ib_core          434176    11 rdma_cm,ib_ipoib,rpcrdma,nvidia_peermem,nvme_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    sunrpc           585728    13 rpcrdma,nfsv4,auth_rpcgss,lockd,nfsv3,rpcsec_gss_krb5,nfs_acl,nfs
    nvidia_drm       94208     0
    nvidia_modeset   1327104   1 nvidia_drm
    nvidia           56172544  217 nvidia_uvm,nvidia_peermem,nvidia_modeset
    drm_kms_helper   315392    5 drm_vram_helper,ast,nvidia_drm
    nvme             57344     24 nvmet
    drm              622592    8 drm_kms_helper,drm_vram_helper,ast,nvidia,drm_ttm_helper,nvidia_drm,ttm
    nvme_core        143360    27 nvmet,nvme,nvme_rdma,nvme_fabrics
    mlx_compat       69632     17 rdma_cm,ib_ipoib,mlxdevm,nvmet,rpcrdma,nvme,nvme_rdma,iw_cm,nvme_core,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
7. Verify that GDS is working with NVMe and NVMeOF support:
    root@dgx001:~# /usr/local/cuda-12.3/gds/tools/gdscheck.py -v
    GDS release version: 1.8.1.2
    nvidia_fs version: 2.17 libcufile version: 2.12
    Platform: x86_64
    root@dgx001:~# /usr/local/cuda-12.3/gds/tools/gdscheck.py -V
    FILESYSTEM VERSION CHECK:
    Pre-requisite:
    nvidia_peermem is loaded as required
    nvme module is loaded
    nvme module is correctly patched
    nvme-rdma module is loaded
    nvme-rdma module is correctly patched
    ScaleFlux module is not loaded
    NVMesh module is not loaded
    Lustre module is not loaded
    BeeGFS module is not loaded
    GPFS module is not loaded
    rpcrdma module is loaded
    rpcrdma module is correctly patched
    Lustre:
    current version: Unknown
    min version supported: 2.12.3_ddn28
    ofed_info:
    current version: MLNX_OFED_LINUX-23.10-0.5.5.0: (Supported)
    min version supported: MLNX_OFED_LINUX-4.6-1.0.1.1
    root@dgx001:~# /usr/local/cuda-12.3/gds/tools/gdscheck.py -p
    GDS release version: 1.8.1.2
    nvidia_fs version: 2.17 libcufile version: 2.12
    Platform: x86_64
    ============
    ENVIRONMENT:
    ============
    =====================
    DRIVER CONFIGURATION:
    =====================
    NVMe : Supported
    NVMeOF : Supported
    SCSI : Unsupported
    ScaleFlux CSD : Unsupported
    NVMesh : Unsupported
    DDN EXAScaler : Unsupported
    IBM Spectrum Scale : Unsupported
    NFS : Supported
    BeeGFS : Unsupported
    WekaFS : Unsupported
    Userspace RDMA : Unsupported
    --Mellanox PeerDirect : Enabled
    --rdma library : Not Loaded (libcufile_rdma.so)
    --rdma devices : Not configured
    --rdma_device_status : Up: 0 Down: 0
    =====================
    CUFILE CONFIGURATION:
    =====================
    properties.use_compat_mode : true
    properties.force_compat_mode : false
    properties.gds_rdma_write_support : true
    properties.use_poll_mode : false
    properties.poll_mode_max_size_kb : 4
    properties.max_batch_io_size : 128
    properties.max_batch_io_timeout_msecs : 5
    properties.max_direct_io_size_kb : 16384
    properties.max_device_cache_size_kb : 131072
    properties.max_device_pinned_mem_size_kb : 33554432
    properties.posix_pool_slab_size_kb : 4 1024 16384
    properties.posix_pool_slab_count : 128 64 32
    properties.rdma_peer_affinity_policy : RoundRobin
    properties.rdma_dynamic_routing : 0
    fs.generic.posix_unaligned_writes : false
    fs.lustre.posix_gds_min_kb: 0
    fs.beegfs.posix_gds_min_kb: 0
    fs.weka.rdma_write_support: false
    fs.gpfs.gds_write_support: false
    profile.nvtx : false
    profile.cufile_stats : 0
    miscellaneous.api_check_aggressive : false
    execution.max_io_threads : 4
    execution.max_io_queue_depth : 128
    execution.parallel_io : true
    execution.min_io_threshold_size_kb : 8192
    execution.max_request_parallelism : 4
    properties.force_odirect_mode : false
    properties.prefer_iouring : false
    =========
    GPU INFO:
    =========
    GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
    GPU index 1 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
    GPU index 2 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
    GPU index 3 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
    GPU index 4 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
    GPU index 5 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
    GPU index 6 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
    GPU index 7 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
    ==============
    PLATFORM INFO:
    ==============
    IOMMU: disabled
    Nvidia Driver Info Status: Unsupported(Nvidia Open Driver Not Installed)
    Cuda Driver Version Installed: 12030
    Platform: DGXA100 920-23687-2530-000, Arch: x86_64(Linux 5.15.0-1041-nvidia)
    Platform verification succeeded
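The manual checks in steps 6 and 7 can also be scripted, which helps when more than one node uses the image. The following is a minimal sketch under stated assumptions: missing_loaded_modules and unsupported_drivers are hypothetical helper names, both read command output on standard input, and the driver check assumes the "name : status" line layout shown in the gdscheck.py -p output above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that read command output on stdin and print
# whatever is missing. No output means the check passed.

# Print each requested module that is absent from lsmod output (stdin).
# lsmod reports names with underscores, so dashes are converted first.
missing_loaded_modules() {
    local loaded mod want
    loaded="$(awk '{ print $1 }')"
    for mod in "$@"; do
        want="$(printf '%s' "$mod" | tr '-' '_')"
        printf '%s\n' "$loaded" | grep -qxF -- "$want" || echo "$mod"
    done
}

# Print each requested driver that "gdscheck.py -p" output (stdin)
# does not list as exactly "Supported".
# Names are used inside a regular expression, so keep them plain
# (for example NVMe, NVMeOF, NFS).
unsupported_drivers() {
    local report name
    report="$(cat)"
    for name in "$@"; do
        printf '%s\n' "$report" |
            grep -Eq "^[[:space:]]*${name}[[:space:]]*:[[:space:]]*Supported[[:space:]]*$" ||
            echo "$name"
    done
}

# Example usage on dgx001 (paths as used in this document):
#   lsmod | missing_loaded_modules nvidia-fs nvidia-peermem nvme-rdma nvmet rpcrdma
#   /usr/local/cuda-12.3/gds/tools/gdscheck.py -p | unsupported_drivers NVMe NVMeOF NFS
```

Both helpers print nothing when every requested module is loaded and every requested driver is reported as Supported, so a non-empty result can be treated as a failed check.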