This article is being updated. Please be aware the content herein, not limited to version numbers and slight syntax changes, may not match the output from the most recent versions of Bright. This notation will be removed when the content has been updated.
Prepare the Azure RDMA cloud instances:
1. Create an availability set in Azure Portal
- Click on “All Services” (top left corner)
- Choose “Availability Set”
- Click on +Add to add a new availability set as follows:
- Set an arbitrary name
- Choose the correct subscription
- Use the existing resource group to which the cluster extension belongs. You can check the cloudsettings of the cloud director to see which resource group should be used.
- In Bright 8.0, only “classic” unmanaged disks are allowed.
- Get the Resource ID of the created Availability Set. This is used later when creating the cloud instances in Bright
- Set the availability set for the cloud nodes in Bright:
[root@ma-c-02-06-b80-c7u2 ~]# cmsh
[ma-c-02-06-b80-c7u2]% device use westeurope-cnode002
[ma-c-02-06-b80-c7u2->device[westeurope-cnode002]]% cloudsettings
[ma-c-02-06-b80-c7u2->device[westeurope-cnode002]->cloudsettings]% set availabilitysetid "/subscriptions/2b8fad2b-aaf1-425a-bf45-36cfd495107e/resourceGroups/ma-c-02-06-b80-c7u2-westeurope-bcm/providers/Microsoft.Compute/availabilitySets/azure-rdma-test"
[ma-c-02-06-b80-c7u2->device*[westeurope-cnode002*]->cloudsettings*]% commit
- Set the VM size to one that supports RDMA:
[root@ma-c-02-06-b80-c7u2 ~]# cmsh
[ma-c-02-06-b80-c7u2]% device use westeurope-cnode002
[ma-c-02-06-b80-c7u2->device[westeurope-cnode002]]% cloudsettings
[ma-c-02-06-b80-c7u2->device[westeurope-cnode002]->cloudsettings]% set vmsize standard_h16m
[ma-c-02-06-b80-c7u2->device*[westeurope-cnode002*]->cloudsettings*]% commit
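The interactive cmsh sessions above can also be run non-interactively, because cmsh accepts a semicolon-separated command string through its -c option. The following is a minimal sketch under that assumption. The helper names build_cloudsettings_cmd and is_availability_set_id are hypothetical, the resource ID pattern is only a rough shape check modelled on the ID used in this article, and quoting every value in the set command is an assumption carried over from the availabilitysetid example. The script only prints the commands (a dry run); it does not call cmsh.

```shell
#!/bin/sh
# Sketch only: build_cloudsettings_cmd and is_availability_set_id are
# hypothetical helpers, not part of Bright.

# Print a cmsh command string that sets one cloudsettings property on a
# node and commits the change.
build_cloudsettings_cmd() {
    node=$1
    property=$2
    value=$3
    printf 'device use %s; cloudsettings; set %s "%s"; commit\n' \
        "$node" "$property" "$value"
}

# Rough shape check for an availability set resource ID. The pattern is an
# assumption modelled on the ID used in this article.
is_availability_set_id() {
    printf '%s\n' "$1" | grep -Eiq \
        '^/subscriptions/[0-9a-f-]{36}/resourceGroups/[^/]+/providers/Microsoft\.Compute/availabilitySets/[^/]+$'
}

# Dry run: print the commands for two nodes instead of executing them.
avset_id="/subscriptions/2b8fad2b-aaf1-425a-bf45-36cfd495107e/resourceGroups/ma-c-02-06-b80-c7u2-westeurope-bcm/providers/Microsoft.Compute/availabilitySets/azure-rdma-test"
if is_availability_set_id "$avset_id"; then
    for node in westeurope-cnode001 westeurope-cnode002; do
        build_cloudsettings_cmd "$node" availabilitysetid "$avset_id"
        build_cloudsettings_cmd "$node" vmsize standard_h16m
    done
fi
```

To apply a printed command on the head node, it could be passed to cmsh directly, for example: cmsh -c "$(build_cloudsettings_cmd westeurope-cnode002 vmsize standard_h16m)"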
Prepare the Azure RDMA image:
- Install, enable, and configure WALinuxAgent in the software image, and change /etc/waagent.conf to support RDMA. WALinuxAgent is responsible for bringing up the IB interfaces on the cloud nodes:
[root@ma-c-02-06-b80-c7u2 ~]# yum install WALinuxAgent --installroot=/cm/images/cloud-image
[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/
[root@ma-c-02-06-b80-c7u2 /]# systemctl enable waagent
[root@ma-c-02-06-b80-c7u2 /]# grep -vE "^#|^$" /etc/waagent.conf
Provisioning.Enabled=n
Provisioning.UseCloudInit=n
Provisioning.DeleteRootPassword=n
Provisioning.RegenerateSshHostKeyPair=n
Provisioning.SshHostKeyPairType=rsa
Provisioning.MonitorHostName=n
Provisioning.DecodeCustomData=n
Provisioning.ExecuteCustomData=n
Provisioning.AllowResetSysUser=n
ResourceDisk.Format=n
ResourceDisk.Filesystem=ext4
ResourceDisk.MountPoint=/mnt/resource
ResourceDisk.EnableSwap=n
ResourceDisk.SwapSizeMB=0
ResourceDisk.MountOptions=None
Logs.Verbose=y
OS.RootDeviceScsiTimeout=300
OS.OpensslPath=None
OS.SshDir=/etc/ssh
OS.EnableRDMA=y
AutoUpdate.Enabled=n
OS.EnableFirewall=n
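The settings listed above can be verified in the image before the nodes are rebooted. The following is a minimal sketch: check_waagent_conf is a hypothetical helper, and the three settings it checks are an illustrative selection taken from the listing above, not an official list of requirements.

```shell
#!/bin/sh
# Sketch only: check_waagent_conf is a hypothetical helper. It reports
# settings in a waagent.conf file that differ from the values shown in
# this article and returns non-zero if any differ.
check_waagent_conf() {
    conf=$1
    status=0
    for want in OS.EnableRDMA=y Provisioning.Enabled=n AutoUpdate.Enabled=n; do
        key=${want%%=*}
        # Take the last uncommented assignment of this key, if any.
        have=$(awk -F= -v k="$key" '$1 == k { v = $0 } END { print v }' "$conf")
        if [ "$have" != "$want" ]; then
            echo "mismatch: want '$want', found '${have:-<unset>}'"
            status=1
        fi
    done
    return $status
}

# Usage from the head node (path as used in this article):
#   check_waagent_conf /cm/images/cloud-image/etc/waagent.conf
```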
- Download and install the msft-rdma-drivers provided by Microsoft in the software image (note that the actual URL of the msft-rdma-drivers package will need to be changed in the commands below)
[root@ma-c-02-06-b80-c7u2 ~]# wget http://download.microsoft.com/download/6/8/F/68FE11B8-FAA4-4F8D-8C7D-74DA7F2CFC8C/msft-rdma-drivers-4.2.3.1-20180209.x86_64.rpm
[root@ma-c-02-06-b80-c7u2 ~]# wget http://download.microsoft.com/download/6/8/F/68FE11B8-FAA4-4F8D-8C7D-74DA7F2CFC8C/msft-rdma-drivers-4.2.3.1-20180209.src.rpm
[root@ma-c-02-06-b80-c7u2 ~]# rpm -ivh msft-rdma-drivers-4.2.3.1-20180209.x86_64.rpm --root=/cm/images/cloud-image
- Check the kernel version supported by the msft-rdma-drivers:
[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/
[root@ma-c-02-06-b80-c7u2 /]# cd /opt/microsoft/rdma/rhel74/
[root@ma-c-02-06-b80-c7u2 rhel74]# rpm -qlp kmod-microsoft-hyper-v-rdma-4.2.3.1.144-20180209.x86_64.rpm
/etc/depmod.d/hyperv.conf
/lib/modules/3.10.0-693.17.1.el7.x86_64
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hid-hyperv.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_balloon.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_netvsc.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_network_direct.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_sock.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_storvsc.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_utils.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_vmbus.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hyperv-keyboard.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hyperv_fb.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/pci-hyperv.ko
/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/uio_hv_generic.ko
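The kernel version can be read out of the package file listing above rather than picked out by eye. The following is a minimal sketch: kmod_kernel_versions is a hypothetical helper that expects the output of "rpm -qlp" on standard input and prints each distinct directory name found under /lib/modules.

```shell
#!/bin/sh
# Sketch only: kmod_kernel_versions is a hypothetical helper. It reads an
# "rpm -qlp <package>" file listing on standard input and prints the
# kernel version(s) the modules were built for.
kmod_kernel_versions() {
    sed -n 's|^/lib/modules/\([^/]*\).*|\1|p' | sort -u
}

# Usage inside the image:
#   rpm -qlp kmod-microsoft-hyper-v-rdma-4.2.3.1.144-20180209.x86_64.rpm | kmod_kernel_versions
```

The printed version is the value to use for kernelversion in the software image.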
- Update to a kernel version that matches the one supported by the Microsoft drivers:
[root@ma-c-02-06-b80-c7u2 ~]# yum update --installroot=/cm/images/cloud-image
[root@ma-c-02-06-b80-c7u2 ~]# cmsh
[ma-c-02-06-b80-c7u2]% softwareimage use cloud-image
[ma-c-02-06-b80-c7u2->softwareimage[cloud-image]]% set kernelversion 3.10.0-693.17.1.el7.x86_64
[ma-c-02-06-b80-c7u2->softwareimage*[cloud-image*]]% commit
- Install the “Infiniband Support” package group in the software image:
[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/
[root@ma-c-02-06-b80-c7u2 /]# yum groupinstall "Infiniband Support"
- Install kmod-microsoft-hyper-v-rdma and microsoft-hyper-v-rdma in the software image:
[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/
[root@ma-c-02-06-b80-c7u2 /]# rpm -ivh /opt/microsoft/rdma/rhel74/kmod-microsoft-hyper-v-rdma-4.2.3.1.144-20180209.x86_64.rpm
[root@ma-c-02-06-b80-c7u2 /]# rpm -ivh --noscripts /opt/microsoft/rdma/rhel74/microsoft-hyper-v-rdma-4.2.3.1.144-20180209.x86_64.rpm
- Install hypervkvpd, which is required by waagent to bring up the RDMA interface:
[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/
[root@ma-c-02-06-b80-c7u2 /]# yum install hypervkvpd
- Enable the OpenLogic repository in the software image:
[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/
[root@ma-c-02-06-b80-c7u2 /]# cat > /etc/yum.repos.d/openlogic.repo
[openlogic]
name=CentOS-$releasever - openlogic packages for $basearch
baseurl=http://olcentgbl.trafficmanager.net/openlogic/$releasever/openlogic/$basearch/
enabled=1
gpgcheck=0
(ctrl+d)
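The repository file can also be written without an interactive cat session. The following is a minimal sketch using a here-document: write_openlogic_repo is a hypothetical helper, and the quoted 'EOF' delimiter keeps $releasever and $basearch literal so that yum, not the shell, expands them.

```shell
#!/bin/sh
# Sketch only: write_openlogic_repo is a hypothetical helper. It writes
# the repository definition shown in this article to the given path.
write_openlogic_repo() {
    cat > "$1" <<'EOF'
[openlogic]
name=CentOS-$releasever - openlogic packages for $basearch
baseurl=http://olcentgbl.trafficmanager.net/openlogic/$releasever/openlogic/$basearch/
enabled=1
gpgcheck=0
EOF
}

# Usage from the head node (path as used in this article):
#   write_openlogic_repo /cm/images/cloud-image/etc/yum.repos.d/openlogic.repo
```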
- Reboot the cloud nodes and make sure that the kernel modules are loaded properly and the extra interface is up:
[root@westeurope-cnode002 ~]# lsmod | grep hv_
hv_network_direct 100138 0
hv_balloon 22073 0
ib_core 211874 14 rdma_cm,ib_cm,iw_cm,rpcrdma,ib_srp,ib_ucm,ib_iser,ib_srpt,ib_umad,hv_network_direct,ib_uverbs,rdma_ucm,ib_ipoib,ib_isert
hv_storvsc 22716 2
hv_utils 25798 2
scsi_transport_fc 64007 1 hv_storvsc
ptp 19231 6 igb,tg3,bnx2x,ixgbe,hv_utils,e1000e
hv_netvsc 45611 0
hv_vmbus 72582 8 hv_balloon,hyperv_keyboard,hv_netvsc,hid_hyperv,hv_utils,hyperv_fb,hv_storvsc,hv_network_direct
[root@westeurope-cnode002 ~]# lsmod | grep rdma
rpcrdma 86152 0
rdma_ucm 26841 0
ib_uverbs 64636 2 ib_ucm,rdma_ucm
rdma_cm 54426 4 rpcrdma,ib_iser,rdma_ucm,ib_isert
ib_cm 47287 5 rdma_cm,ib_srp,ib_ucm,ib_srpt,ib_ipoib
iw_cm 46260 1 rdma_cm
ib_core 211874 14 rdma_cm,ib_cm,iw_cm,rpcrdma,ib_srp,ib_ucm,ib_iser,ib_srpt,ib_umad,hv_network_direct,ib_uverbs,rdma_ucm,ib_ipoib,ib_isert
sunrpc 348674 23 nfs,nfsd,auth_rpcgss,lockd,nfsv3,rpcrdma,nfs_acl
[root@westeurope-cnode001 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:0d:3a:38:fa:84 brd ff:ff:ff:ff:ff:ff
inet 10.42.0.5/16 brd 10.42.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::20d:3aff:fe38:fa84/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:15:5d:33:ff:34 brd ff:ff:ff:ff:ff:ff
inet 172.16.1.43/16 brd 172.16.255.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::215:5dff:fe33:ff34/64 scope link
valid_lft forever preferred_lft forever
5: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1024
link/none
inet 172.31.0.1/16 brd 172.31.255.255 scope global tun0
valid_lft forever preferred_lft forever
inet6 fe80::ed66:b3fb:a329:ef95/64 scope link flags 800
valid_lft forever preferred_lft forever
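The module check above can be scripted so that a missing module is reported explicitly. The following is a minimal sketch: missing_modules is a hypothetical helper that reads lsmod output on standard input, and the list of expected modules is an illustrative selection taken from the output shown above.

```shell
#!/bin/sh
# Sketch only: missing_modules is a hypothetical helper. It reads "lsmod"
# output on standard input and prints every expected module that is not
# loaded. The module list is taken from the output shown in this article.
missing_modules() {
    # First column of every line except the "Module Size Used by" header.
    loaded=$(awk '$1 != "Module" { print $1 }')
    for mod in hv_vmbus hv_netvsc hv_network_direct ib_core ib_uverbs rdma_ucm; do
        printf '%s\n' "$loaded" | grep -Fqx "$mod" || echo "$mod"
    done
}

# Usage on a cloud node; no output means all listed modules are loaded:
#   lsmod | missing_modules
```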
Test running MPI jobs:
- Run a simple MPI job:
[cmsupport@westeurope-cnode001 ~]$ module load intel/mpi/mic/5.1.3/2016.4.258
[cmsupport@westeurope-cnode001 ~]$ which mpirun
/cm/shared/apps/intel/compilers_and_libraries/2016.4.258/linux/mpi/intel64/bin/mpirun
[cmsupport@westeurope-cnode001 2017]$ mpirun -hosts westeurope-cnode001,westeurope-cnode002 -n 2 -ppn 1 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 hostname
westeurope-cnode001
westeurope-cnode002
- Run a PingPong IMB test:
[cmsupport@westeurope-cnode001 ~]$ module load intel/mpi/64/5.1.3/2016.4.258
[cmsupport@westeurope-cnode001 ~]$ which mpirun
/cm/shared/apps/intel/compilers_and_libraries/2016.4.258/linux/mpi/intel64/bin/mpirun
[cmsupport@westeurope-cnode001 ~]$ mpirun -hosts westeurope-cnode001,westeurope-cnode002 -ppn 1 -n 2 -env I_MPI_DEBUG 5 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 /cm/shared/apps/intel/compilers_and_libraries/2016.4.258/linux/mpi/intel64/bin/IMB-MPI1 pingpong
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-ib0
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-ib0
[1] MPI startup(): DAPL provider ofa-v2-ib0
[1] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): DAPL provider ofa-v2-ib0
[0] MPI startup(): shm and dapl data transfer modes
[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPI startup(): Rank    Pid      Node name            Pin cpu
[0] MPI startup(): 0       5882     westeurope-cnode001  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
[0] MPI startup(): 1       3563     westeurope-cnode002  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000
[0] MPI startup(): I_MPI_DAPL_PROVIDER=ofa-v2-ib0
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_DYNAMIC_CONNECTION=0
[0] MPI startup(): I_MPI_FABRICS=shm:dapl
[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,20,20,10
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx4_0:-1
[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2
[0] MPI startup(): I_MPI_PIN_MAPPING=1:0 0
#------------------------------------------------------------
# Intel (R) MPI Benchmarks 4.1 Update 1, MPI-1 part
#------------------------------------------------------------
# Date : Mon Mar 26 11:05:25 2018
# Machine : x86_64
# System : Linux
# Release : 3.10.0-693.17.1.el7.x86_64
# Version : #1 SMP Thu Jan 25 20:13:58 UTC 2018
# MPI Version : 3.0
# MPI Thread Environment:
# New default behavior from Version 3.2 on:
# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time
# Calling sequence was:
# /cm/shared/apps/intel/compilers_and_libraries/2016.4.258/linux/mpi/intel64/bin/IMB-MPI1 pingpong
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
      #bytes #repetitions     t[usec]   Mbytes/sec
           0         1000        3.29         0.00
           1         1000        3.39         0.28
           2         1000        3.30         0.58
           4         1000        3.30         1.16
           8         1000        3.30         2.31
          16         1000        3.31         4.61
          32         1000        2.65        11.51
          64         1000        2.64        23.12
         128         1000        2.70        45.23
         256         1000        3.12        78.22
         512         1000        3.11       157.23
        1024         1000        3.25       300.84
        2048         1000        3.88       503.84
        4096         1000        5.13       760.86
        8192         1000        6.36      1227.52
       16384         1000        8.41      1858.89
       32768         1000       11.25      2778.86
       65536          640       17.37      3597.44
      131072          320       30.12      4149.39
      262144          160       58.17      4297.39
      524288           80      102.09      4897.46
     1048576           40      182.74      5472.38
     2097152           20      349.38      5724.45
     4194304           10      687.24      5820.37
# All processes entering MPI_Finaliz
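As a sanity check on the PingPong results, the Mbytes/sec column can be approximately recomputed from the other two columns as bytes / t[usec] * 10^6 / 2^20. The following is a minimal sketch of that arithmetic: pingpong_mbytes_per_sec is a hypothetical helper, and the use of 2^20 bytes per Mbyte is inferred from the numbers in the table rather than taken from the benchmark documentation. Small differences from the table are expected, presumably because the benchmark computes the bandwidth from the unrounded time.

```shell
#!/bin/sh
# Sketch only: pingpong_mbytes_per_sec is a hypothetical helper. It
# recomputes the bandwidth from a message size in bytes and a one-way time
# in microseconds, assuming 1 Mbyte = 2^20 bytes.
pingpong_mbytes_per_sec() {
    awk -v bytes="$1" -v usec="$2" \
        'BEGIN { printf "%.2f\n", bytes / usec * 1000000 / 1048576 }'
}

# Last row of the table: 4194304 bytes in 687.24 usec.
# Prints 5820.38; the table shows 5820.37.
pingpong_mbytes_per_sec 4194304 687.24
```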