
ID #1426

How can I create an RDMA-ready cloud instance and software image?


Prepare the Azure RDMA cloud instances:


1. Create an availability set in the Azure Portal:

    1. Click on “All Services” (top left corner)

    2. Choose “Availability sets”


    3. Click on +Add to add a new availability set as follows:

        • Set an arbitrary name
        • Choose the correct subscription
        • Use the existing resource group to which the cluster extension belongs. You can check the cloudsettings of the cloud director to see which resource group should be used.
        • In Bright 8.0, only “classic” unmanaged disks are allowed.
        • Get the Resource ID of the created availability set. This is used later when creating the cloud instances in Bright (an Azure CLI alternative to the portal steps is sketched below).
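
If you prefer the command line over the Azure Portal, the same availability set can be created and its Resource ID retrieved with the Azure CLI. This is a minimal sketch: the name, resource group and region are taken from this article's example environment, and the --unmanaged flag (matching the classic/unmanaged-disk note above) should be adjusted to your situation:

[root@ma-c-02-06-b80-c7u2 ~]# az vm availability-set create --name azure-rdma-test \
    --resource-group ma-c-02-06-b80-c7u2-westeurope-bcm --location westeurope --unmanaged

[root@ma-c-02-06-b80-c7u2 ~]# az vm availability-set show --name azure-rdma-test \
    --resource-group ma-c-02-06-b80-c7u2-westeurope-bcm --query id --output tsv
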
  2. Set the availability set for the cloud nodes in Bright:

[root@ma-c-02-06-b80-c7u2 ~]# cmsh

[ma-c-02-06-b80-c7u2]% device use westeurope-cnode002  

[ma-c-02-06-b80-c7u2->device[westeurope-cnode002]]% cloudsettings

[ma-c-02-06-b80-c7u2->device[westeurope-cnode002]->cloudsettings]% set availabilitysetid "/subscriptions/2b8fad2b-aaf1-425a-bf45-36cfd495107e/resourceGroups/ma-c-02-06-b80-c7u2-westeurope-bcm/providers/Microsoft.Compute/availabilitySets/azure-rdma-test"

[ma-c-02-06-b80-c7u2->device*[westeurope-cnode002*]->cloudsettings*]% commit
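
If several cloud nodes need the same availability set, cmsh's foreach can apply the setting to a range of nodes in one pass. This is only a sketch: the node range westeurope-cnode001..westeurope-cnode004 is an example, and it assumes your cmsh version accepts entering the cloudsettings submode inside foreach (otherwise repeat the per-node commands above):

[ma-c-02-06-b80-c7u2->device]% foreach -n westeurope-cnode001..westeurope-cnode004 (cloudsettings; set availabilitysetid "/subscriptions/2b8fad2b-aaf1-425a-bf45-36cfd495107e/resourceGroups/ma-c-02-06-b80-c7u2-westeurope-bcm/providers/Microsoft.Compute/availabilitySets/azure-rdma-test")

[ma-c-02-06-b80-c7u2->device*]% commit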


  3. Set the VM size to one that supports RDMA:

[root@ma-c-02-06-b80-c7u2 ~]# cmsh

[ma-c-02-06-b80-c7u2]% device use westeurope-cnode002  

[ma-c-02-06-b80-c7u2->device[westeurope-cnode002]]% cloudsettings

[ma-c-02-06-b80-c7u2->device[westeurope-cnode002]->cloudsettings]% set vmsize standard_h16m

[ma-c-02-06-b80-c7u2->device*[westeurope-cnode002*]->cloudsettings*]% commit
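
To check which VM sizes are offered in the target region before choosing one, the Azure CLI can list them. A quick sketch (the region and the H16 filter below are only examples; adjust them to the sizes you are interested in):

[root@ma-c-02-06-b80-c7u2 ~]# az vm list-sizes --location westeurope --output table | grep -i "_H16"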




Prepare the Azure RDMA image:


  1. Install/enable and configure WALinuxAgent** in the software image and change /etc/waagent.conf to support RDMA:

** WALinuxAgent is responsible for bringing up the IB interfaces on the cloud nodes.

[root@ma-c-02-06-b80-c7u2 ~]# yum install WALinuxAgent --installroot=/cm/images/cloud-image

[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/

[root@ma-c-02-06-b80-c7u2 /]# systemctl enable waagent

[root@ma-c-02-06-b80-c7u2 /]# grep -vE "^#|^$" /etc/waagent.conf

Provisioning.Enabled=n

Provisioning.UseCloudInit=n

Provisioning.DeleteRootPassword=n

Provisioning.RegenerateSshHostKeyPair=n

Provisioning.SshHostKeyPairType=rsa

Provisioning.MonitorHostName=n

Provisioning.DecodeCustomData=n

Provisioning.ExecuteCustomData=n

Provisioning.AllowResetSysUser=n

ResourceDisk.Format=n

ResourceDisk.Filesystem=ext4

ResourceDisk.MountPoint=/mnt/resource

ResourceDisk.EnableSwap=n

ResourceDisk.SwapSizeMB=0

ResourceDisk.MountOptions=None

Logs.Verbose=y

OS.RootDeviceScsiTimeout=300

OS.OpensslPath=None

OS.SshDir=/etc/ssh

OS.EnableRDMA=y

AutoUpdate.Enabled=n

OS.EnableFirewall=n
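
If you prefer to set the RDMA-relevant keys non-interactively instead of editing /etc/waagent.conf by hand, a sed sketch such as the following can be run inside the chroot. It assumes the keys are already present in the stock file, possibly commented out; compare the result against the listing above:

[root@ma-c-02-06-b80-c7u2 /]# sed -i \
    -e 's/^#\? *OS.EnableRDMA=.*/OS.EnableRDMA=y/' \
    -e 's/^#\? *OS.EnableFirewall=.*/OS.EnableFirewall=n/' \
    -e 's/^#\? *AutoUpdate.Enabled=.*/AutoUpdate.Enabled=n/' /etc/waagent.conf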


  2. Download and install the msft-rdma-drivers package provided by Microsoft in the software image:

[root@ma-c-02-06-b80-c7u2 ~]#  wget http://download.microsoft.com/download/6/8/F/68FE11B8-FAA4-4F8D-8C7D-74DA7F2CFC8C/msft-rdma-drivers-4.2.3.1-20180209.x86_64.rpm

[root@ma-c-02-06-b80-c7u2 ~]#  wget http://download.microsoft.com/download/6/8/F/68FE11B8-FAA4-4F8D-8C7D-74DA7F2CFC8C/msft-rdma-drivers-4.2.3.1-20180209.src.rpm

[root@ma-c-02-06-b80-c7u2 ~]#  rpm -ivh msft-rdma-drivers-4.2.3.1-20180209.x86_64.rpm --root=/cm/images/cloud-image


  3. Check the version of the kernel supported by the msft-rdma-drivers:

[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/

[root@ma-c-02-06-b80-c7u2 ~]# cd /opt/microsoft/rdma/rhel74/

[root@ma-c-02-06-b80-c7u2 rhel74]# rpm -qlp kmod-microsoft-hyper-v-rdma-4.2.3.1.144-20180209.x86_64.rpm

/etc/depmod.d/hyperv.conf

/lib/modules/3.10.0-693.17.1.el7.x86_64

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hid-hyperv.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_balloon.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_netvsc.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_network_direct.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_sock.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_storvsc.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_utils.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_vmbus.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hyperv-keyboard.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hyperv_fb.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/pci-hyperv.ko

/lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/uio_hv_generic.ko

 

  4. Update to a kernel version which matches what is available from Microsoft:

[root@ma-c-02-06-b80-c7u2 ~]# yum update --installroot=/cm/images/cloud-image

[root@ma-c-02-06-b80-c7u2 ~]# cmsh

[ma-c-02-06-b80-c7u2]% softwareimage use cloud-image  

[ma-c-02-06-b80-c7u2->softwareimage[cloud-image]]% set kernelversion 3.10.0-693.17.1.el7.x86_64  

[ma-c-02-06-b80-c7u2->softwareimage*[cloud-image*]]% commit
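
To confirm that the software image now actually contains the kernel version that was just set, a quick sanity check can be run against the image (the image path is the one used throughout this article; the query may also list older kernels):

[root@ma-c-02-06-b80-c7u2 ~]# rpm -q kernel --root=/cm/images/cloud-image | grep 693.17.1
kernel-3.10.0-693.17.1.el7.x86_64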


  5. Install the "Infiniband Support" package group in the software image:

[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/

[root@ma-c-02-06-b80-c7u2 /]# yum groupinstall "Infiniband Support"


  6. Install kmod-microsoft-hyper-v-rdma and microsoft-hyper-v-rdma in the software image:

[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/

[root@ma-c-02-06-b80-c7u2 /]# rpm -ivh /opt/microsoft/rdma/rhel74/kmod-microsoft-hyper-v-rdma-4.2.3.1.144-20180209.x86_64.rpm

[root@ma-c-02-06-b80-c7u2 /]# rpm -ivh --noscripts /opt/microsoft/rdma/rhel74/microsoft-hyper-v-rdma-4.2.3.1.144-20180209.x86_64.rpm
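
Still inside the chroot, it is worth verifying that the Hyper-V RDMA modules were installed for the kernel version set earlier. A brief sketch; hv_network_direct.ko is one of the modules from the package listing above:

[root@ma-c-02-06-b80-c7u2 /]# modinfo /lib/modules/3.10.0-693.17.1.el7.x86_64/extra/microsoft-hyper-v-rdma/hv_network_direct.ko | head -3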


  7. Install hypervkvpd, which is required by the waagent to bring up the RDMA interface:

[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/

[root@ma-c-02-06-b80-c7u2 /]# yum install hypervkvpd


  8. Enable the OpenLogic repository in the software image:

[root@ma-c-02-06-b80-c7u2 ~]# chroot /cm/images/cloud-image/

[root@ma-c-02-06-b80-c7u2 /]# cat > /etc/yum.repos.d/openlogic.repo

[openlogic]

name=CentOS-$releasever - openlogic packages for $basearch

baseurl=http://olcentgbl.trafficmanager.net/openlogic/$releasever/openlogic/$basearch/

enabled=1

gpgcheck=0

(ctrl+d)
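
For a scripted, non-interactive variant, the same repository file can be written with a here-document (equivalent to the cat/Ctrl+D approach above; the quoted EOF keeps $releasever and $basearch literal in the file):

[root@ma-c-02-06-b80-c7u2 /]# cat > /etc/yum.repos.d/openlogic.repo <<'EOF'
[openlogic]
name=CentOS-$releasever - openlogic packages for $basearch
baseurl=http://olcentgbl.trafficmanager.net/openlogic/$releasever/openlogic/$basearch/
enabled=1
gpgcheck=0
EOF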


  9. Reboot the cloud nodes and make sure that the kernel modules are loaded properly and that the extra interface is up:

[root@westeurope-cnode002 ~]# lsmod | grep hv_

hv_network_direct     100138  0

hv_balloon             22073  0

ib_core               211874  14 rdma_cm,ib_cm,iw_cm,rpcrdma,ib_srp,ib_ucm,ib_iser,ib_srpt,ib_umad,hv_network_direct,ib_uverbs,rdma_ucm,ib_ipoib,ib_isert

hv_storvsc             22716  2

hv_utils               25798  2

scsi_transport_fc      64007  1 hv_storvsc

ptp                    19231  6 igb,tg3,bnx2x,ixgbe,hv_utils,e1000e

hv_netvsc              45611  0

hv_vmbus               72582  8 hv_balloon,hyperv_keyboard,hv_netvsc,hid_hyperv,hv_utils,hyperv_fb,hv_storvsc,hv_network_direct

[root@westeurope-cnode002 ~]# lsmod | grep rdma

rpcrdma                86152  0

rdma_ucm               26841  0

ib_uverbs              64636  2 ib_ucm,rdma_ucm

rdma_cm                54426  4 rpcrdma,ib_iser,rdma_ucm,ib_isert

ib_cm                  47287  5 rdma_cm,ib_srp,ib_ucm,ib_srpt,ib_ipoib

iw_cm                  46260  1 rdma_cm

ib_core               211874  14 rdma_cm,ib_cm,iw_cm,rpcrdma,ib_srp,ib_ucm,ib_iser,ib_srpt,ib_umad,hv_network_direct,ib_uverbs,rdma_ucm,ib_ipoib,ib_isert

sunrpc                348674  23 nfs,nfsd,auth_rpcgss,lockd,nfsv3,rpcrdma,nfs_acl



[root@westeurope-cnode001 ~]# ip a

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1

  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

  inet 127.0.0.1/8 scope host lo

     valid_lft forever preferred_lft forever

  inet6 ::1/128 scope host  

     valid_lft forever preferred_lft forever

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

  link/ether 00:0d:3a:38:fa:84 brd ff:ff:ff:ff:ff:ff

  inet 10.42.0.5/16 brd 10.42.255.255 scope global eth0

     valid_lft forever preferred_lft forever

  inet6 fe80::20d:3aff:fe38:fa84/64 scope link  

     valid_lft forever preferred_lft forever

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000

  link/ether 00:15:5d:33:ff:34 brd ff:ff:ff:ff:ff:ff

  inet 172.16.1.43/16 brd 172.16.255.255 scope global eth1

     valid_lft forever preferred_lft forever

  inet6 fe80::215:5dff:fe33:ff34/64 scope link  

     valid_lft forever preferred_lft forever

5: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1024

  link/none  

  inet 172.31.0.1/16 brd 172.31.255.255 scope global tun0

     valid_lft forever preferred_lft forever

  inet6 fe80::ed66:b3fb:a329:ef95/64 scope link flags 800  

     valid_lft forever preferred_lft forever
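
As an additional check that the RDMA device itself is usable, and not only that eth1 is up, the verbs-level tools installed with the "Infiniband Support" group can be queried on a cloud node. A brief sketch (in this environment the device reported is mlx4_0, which also appears later in the Intel MPI debug output):

[root@westeurope-cnode002 ~]# ibv_devinfo | grep -E "hca_id|state"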


Test running MPI jobs:


  1. Run a simple MPI job:

[cmsupport@westeurope-cnode001 ~]$ module load intel/mpi/mic/5.1.3/2016.4.258

[cmsupport@westeurope-cnode001 ~]$ which mpirun

/cm/shared/apps/intel/compilers_and_libraries/2016.4.258/linux/mpi/intel64/bin/mpirun

[cmsupport@westeurope-cnode001 2017]$ mpirun -hosts westeurope-cnode001,westeurope-cnode002 -n 2 -ppn 1 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 hostname

westeurope-cnode001

westeurope-cnode002


  2. Run a PingPong IMB test:

[cmsupport@westeurope-cnode001 ~]$ module load intel/mpi/64/5.1.3/2016.4.258

[cmsupport@westeurope-cnode001 ~]$ which mpirun

/cm/shared/apps/intel/compilers_and_libraries/2016.4.258/linux/mpi/intel64/bin/mpirun

[cmsupport@westeurope-cnode001 ~]$ mpirun -hosts westeurope-cnode001,westeurope-cnode002 -ppn 1 -n 2 -env I_MPI_DEBUG 5 -env I_MPI_FABRICS=shm:dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 /cm/shared/apps/intel/compilers_and_libraries/2016.4.258/linux/mpi/intel64/bin/IMB-MPI1 pingpong




[0] MPI startup(): Multi-threaded optimized library

[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-ib0

[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-ib0

[1] MPI startup(): DAPL provider ofa-v2-ib0

[1] MPI startup(): shm and dapl data transfer modes

[0] MPI startup(): DAPL provider ofa-v2-ib0

[0] MPI startup(): shm and dapl data transfer modes

[0] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000

[0] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000

[0] MPI startup(): Rank    Pid      Node name            Pin cpu

[0] MPI startup(): 0       5882     westeurope-cnode001  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}

[0] MPI startup(): 1       3563     westeurope-cnode002  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}

[1] MPID_nem_init_dapl_coll_fns(): User set DAPL collective mask = 0000

[0] MPI startup(): I_MPI_DAPL_PROVIDER=ofa-v2-ib0

[0] MPI startup(): I_MPI_DEBUG=5

[0] MPI startup(): I_MPI_DYNAMIC_CONNECTION=0

[0] MPI startup(): I_MPI_FABRICS=shm:dapl

[1] MPID_nem_init_dapl_coll_fns(): Effective DAPL collective mask = 0000

[0] MPI startup(): I_MPI_INFO_NUMA_NODE_DIST=10,20,20,10

[0] MPI startup(): I_MPI_INFO_NUMA_NODE_MAP=mlx4_0:-1

[0] MPI startup(): I_MPI_INFO_NUMA_NODE_NUM=2

[0] MPI startup(): I_MPI_PIN_MAPPING=1:0 0

#------------------------------------------------------------

#    Intel (R) MPI Benchmarks 4.1 Update 1, MPI-1 part

#------------------------------------------------------------

# Date                  : Mon Mar 26 11:05:25 2018

# Machine               : x86_64

# System                : Linux

# Release               : 3.10.0-693.17.1.el7.x86_64

# Version               : #1 SMP Thu Jan 25 20:13:58 UTC 2018

# MPI Version           : 3.0

# MPI Thread Environment:


# New default behavior from Version 3.2 on:


# the number of iterations per message size is cut down

# dynamically when a certain run time (per message size sample)

# is expected to be exceeded. Time limit is defined by variable

# "SECS_PER_SAMPLE" (=> IMB_settings.h)

# or through the flag => -time




# Calling sequence was:


# /cm/shared/apps/intel/compilers_and_libraries/2016.4.258/linux/mpi/intel64/bin/IMB-MPI1 pingpong


# Minimum message length in bytes:   0

# Maximum message length in bytes:   4194304

#

# MPI_Datatype                   : MPI_BYTE

# MPI_Datatype for reductions    : MPI_FLOAT

# MPI_Op                         : MPI_SUM

#

#


# List of Benchmarks to run:


# PingPong


#---------------------------------------------------

# Benchmarking PingPong

# #processes = 2

#---------------------------------------------------

     #bytes #repetitions      t[usec] Mbytes/sec

          0         1000         3.29         0.00

          1         1000         3.39         0.28

          2         1000         3.30         0.58

          4         1000         3.30         1.16

          8         1000         3.30         2.31

         16         1000         3.31         4.61

         32         1000         2.65        11.51

         64         1000         2.64        23.12

        128         1000         2.70        45.23

        256         1000         3.12        78.22

        512         1000         3.11       157.23

       1024         1000         3.25       300.84

       2048         1000         3.88       503.84

       4096         1000         5.13       760.86

        8192         1000         6.36      1227.52

      16384         1000         8.41      1858.89

      32768         1000        11.25      2778.86

      65536          640        17.37      3597.44

     131072          320        30.12      4149.39

     262144          160        58.17      4297.39

     524288           80       102.09      4897.46

    1048576           40       182.74      5472.38

    2097152           20       349.38      5724.45

    4194304           10       687.24      5820.37



# All processes entering MPI_Finalize





