
How do I get SR-IOV working with OpenStack?

Note: this document will not work with IB interfaces.

Abstract

This article describes how to use the Mellanox Neutron ML2 Mechanism Driver to add Mellanox InfiniBand support to an existing standard OpenStack Liberty or Mitaka cloud deployment. The cloud deployment is assumed to be running with ML2 + LinuxBridge Mechanism Driver + VLAN-based network isolation, and is managed with Bright Cluster Manager 7.2.

The end result is:

Extending the capabilities of the OpenStack private cloud with the ability to create OpenStack networks backed by isolated segments of the InfiniBand fabric, and the ability to then spawn OpenStack VMs which have direct access to those networks/segments via a dedicated (passthrough) virtual IB device (exposed via SR-IOV).

Such a VM will have a (logical) IPoIB network device giving it direct access to the IB fabric segment (e.g. for running MPI jobs with other machines attached to that segment), and, optionally, regular virtual Ethernet devices connected to VLAN-backed OpenStack networks.

Users will be able to pick whether they want to create VMs attached to:
InfiniBand-backed networks (IPoIB),
VLAN/VXLAN-backed isolated Ethernet networks,
flat (shared) cluster-internal networks,
or any combination of those.

This article focuses on enabling IB functionality alongside the pre-existing regular VLAN-based network isolation. However, it’s also possible to follow most of this document as a guide to configuring IB functionality for OpenStack deployments running VXLAN-based network isolation. Some tips on that are included in the text.

Mellanox ML2 Mechanism Driver Introduction

The Mellanox ML2 Mechanism Driver supports Mellanox embedded switch functionality as part of the VPI (Ethernet/InfiniBand) HCA. The Mellanox ML2 Mechanism Driver provides functional parity with the Mellanox Neutron plugin.

It supports DIRECT (PCI passthrough) and MACVTAP (virtual interface with a tap-like software interface) vNIC types. For vNIC type configuration API details, one can refer to the configuration reference guide at http://docs.openstack.org/api/openstack-network/2.0/content/binding_ext_ports.html
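
For illustration, a DIRECT vNIC can be requested when a Neutron port is created; a minimal sketch using the Liberty-era neutron CLI, where ib-net is a placeholder network name:

# Create a port with an SR-IOV passthrough (DIRECT) vNIC on the network "ib-net"
neutron port-create ib-net --binding:vnic_type direct

The resulting port can then be passed to an instance at boot time, e.g. with nova boot --nic port-id=<port-id>.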

Hardware vNICs mapped to the guest VMs allow higher performance and advanced features such as RDMA (remote direct memory access). The driver supports the VLAN network type, so that virtual networks can be supported on Ethernet or on InfiniBand fabrics. In such configurations:
The Mellanox OpenStack Neutron Agent (L2 Agent) runs on each compute node.

The L2 Agent applies VIF connectivity based on the mapping between a VIF (VM vNIC) and an embedded switch port.

Source: https://wiki.openstack.org/wiki/Mellanox-Neutron-ML2

Prerequisites

Mellanox ConnectX-4 Lx network cards

A functioning OpenStack Liberty private cloud deployed and managed with Bright Cluster Manager 7.2. The deployment should have been set up with support for Bright-managed instances (i.e. there should be at least one virtual node, typically named “vnode001”, in the cluster’s configuration).

The administrator should have basic proficiency in using Bright’s cmsh.

VT-d and SR-IOV should be enabled in the BIOS (VT-d may appear under “Virtualization Technology” in the BIOS processor settings, which also covers enabling VT-x for KVM, etc.).

For administrators who are not yet managing their private clouds with Bright, the article illustrates how Bright brings structure and ease-of-use to OpenStack management.

Environment

In the following sections the configuration of the hardware, CMDaemon, and the Mellanox driver is carried out.

The example environment consists of a single head node and three regular nodes, in the following configuration:

1 head-node (controller):
This is where neutron-server and opensm will be running.

1 network-node (node001):
This is where the neutron agents will be running (i.e. DHCP, metadata, L3, and linuxbridge).

2 compute hosts (node002..node003):
These run openstack-nova-compute, libvirtd, mlnx-agent, and eswitchd.

Each of these servers has a ConnectX-4 InfiniBand card.

Bright Cluster Manager 7.2 has been deployed and OpenStack has been set up. The nodes are running RHEL 7.2 / CentOS 7.2.

In this example, deploying OpenStack resulted in the following OpenStack-specific configuration:
The node category openstack-network-nodes. This contained node001.
The node category openstack-compute-hosts. This contained node002 and node003.
The software image openstack-image. This was used by the compute hosts and also by the network node (node001..node003).
Some OpenStack deployments can have the network node configured as part of the “openstack-compute-hosts” category (and instead have an additional OpenStackNetwork role attached to the node itself). Such deployments can also be used while following the instructions in this article.

Configuration overview

The remainder of this article carries out the following steps:
Enabling SR-IOV capabilities on the ConnectX-4 Lx cards
Configuring node categories and a software image for the hosts in cmdaemon/cmsh
Configuring the software images (for the compute hosts, network node and virtual nodes)
Installing the Mellanox ML2 Driver and configuring its services
Creating the InfiniBand-backed network in OpenStack Neutron
Booting Bright-managed instances with access to the created IB network

Each of these steps is covered in its own section below.

Creating The Mellanox-specific OpenStack-enabled Software Image

In this section a new software image is created from an OpenStack node image — in this example we call the original software image openstack-image. The new image will contain the changes needed for the IB-enabled OpenStack nodes.

On the head node, run these commands in cmsh:
% softwareimage
% clone openstack-image openstack-image-mellanox
% set kernelversion 3.10.0-327.13.1.el7.x86_64
% set kernelparameters "intel_iommu=on"
% commit

The intel_iommu=on kernel parameter enables VT-d (Intel Virtualization Technology for Directed I/O). This allows SR-IOV devices to be attached to VM instances.

It is advisable to set kernelversion to the newest kernel.
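
The kernel versions available inside the image can be listed from cmsh before picking one; a short sketch, assuming the kernelversions command of softwareimage mode:

% softwareimage
% use openstack-image-mellanox
% kernelversions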

Creating The Mellanox-specific OpenStack compute-hosts Category

In this section a new category for the Mellanox compute hosts is created, and those nodes are assigned to the category. In this example the nodes are node002 and node003.

On the head node, run these commands in cmsh:
% category
% clone openstack-compute-hosts openstack-compute-hosts-mellanox
% set softwareimage openstack-image-mellanox
% commit
% device
% foreach -n node002..node003 (set category openstack-compute-hosts-mellanox)
% commit

Ensure the OpenStack/Mellanox Software Image is Assigned to the Network Node

In this example the network node belongs to the openstack-network-nodes category, so the image is assigned to that category.

On the head node, these commands are run in cmsh:
% category
% use openstack-network-nodes
% set softwareimage openstack-image-mellanox
% commit

Update the firmware of the Mellanox ConnectX-4 Lx cards

This procedure must be followed on each physical host with Mellanox ConnectX-4 Lx cards.

The current firmware version should be checked:
[root@dell1 ~]# ibstat | grep "Firmware version"

Firmware version: 14.14.1100

The latest Mellanox OFED package for CentOS/RHEL 7.2 should be downloaded from:

http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers

The package name looks like this: MLNX_OFED_LINUX-3.2-2.0.0.0-rhel7.2-x86_64.tgz. It is extracted, and the installer is run with the firmware-update-only option:
[root@dell1 ~]# tar -xzf MLNX_OFED_LINUX-3.2-2.0.0.0-rhel7.2-x86_64.tgz
[root@dell1 ~]# cd MLNX_OFED_LINUX-3.2-2.0.0.0-rhel7.2-x86_64/
[root@dell1 MLNX_OFED_LINUX-3.2-2.0.0.0-rhel7.2-x86_64]# ./mlnxofedinstall --fw-update-only

After running the updater, the host must be rebooted.

The new firmware version should be seen:
[root@dell1 ~]# ibstat | grep "Firmware version"

Firmware version: 14.14.2036

Configure the software image

These steps have to be run on the head node.

Install the Neutron SR-IOV agent into the image:
yum -y --installroot /cm/images/openstack-image-mellanox install openstack-neutron-sriov-nic-agent

Install OFED into the image:
yum update mlnx-ofed32
/cm/local/apps/mlnx-ofed32/current/bin/mlnx-ofed32-install.sh -s openstack-image-mellanox
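
To check that the agent package actually landed in the image, a quick query can be run from the head node; a sketch using chroot and rpm:

chroot /cm/images/openstack-image-mellanox rpm -q openstack-neutron-sriov-nic-agent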

Reboot the nodes with Mellanox cards so they boot with the new software image

On the head node, these commands should be run from cmsh:
% device
% foreach -n node001..node003 (reboot)

Enable SR-IOV on the network cards.

This procedure must be followed on each physical host with Mellanox InfiniBand cards.

The kernel messages should be checked to confirm that the IOMMU (VT-d, enabled in the BIOS) is active:
[root@dell1 ~]# dmesg | grep -e IOMMU
[    0.000000] Intel-IOMMU: enabled
The mst service should be started:
[root@dell1 ~]# mst start

The Mellanox PCI devices in the system are listed (with lspci) to get the PCI slots of the devices to be configured:
[root@dell1 ~]# lspci -D | grep -i mellanox

0000:03:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

In this case, the device is the network card connected at 03:00.0

The device name for the network card can be obtained from the output of mst status. In this example the PCI slot of the device is known to be 0000:03:00.0, so a grep like this will find it:
[root@dell1 ~]# mst status | grep -b1 "0000:03:00.0"
126-/dev/mst/mt4117_pciconf0         - PCI configuration cycles access.
194:                                   domain:bus:dev.fn=0000:03:00.0 addr.reg=88 data.reg=92
284-                                   Chip revision is: 00

In this case the device name is /dev/mst/mt4117_pciconf0.

SR-IOV is enabled for the network card, and the number of virtual functions (4 here) set:
[root@dell1 ~]# mlxconfig -d /dev/mst/mt4117_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=4

The node must be rebooted to apply the changes.
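
Before rebooting, the pending settings can be double-checked; a sketch using mlxconfig's query mode on the example device:

[root@dell1 ~]# mlxconfig -d /dev/mst/mt4117_pciconf0 query | grep -E "SRIOV_EN|NUM_OF_VFS"

The output should show SRIOV_EN enabled and NUM_OF_VFS set to 4.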

Enable VFs on the network cards.

Enabling VFs manually (this is not persistent; a sketch for making the setting persistent follows at the end of this section)

The Linux device that was assigned to each interface can be identified by running ibdev2netdev:
[root@dell1 ~]# ibdev2netdev
mlx5_0 port 1 ==> enp3s0 (Up)

In this case the IB device name is mlx5_0, and it has been assigned the Linux network device enp3s0.

To verify that there are no VFs enabled on the interface, the ip link command can be run:
[root@dell1 ~]# ip link show enp3s0

5: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT qlen 1000
link/ether 7c:fe:90:2f:ce:a0 brd ff:ff:ff:ff:ff:ff

4 VFs can be enabled on the interface with:
[root@dell1 ~]# echo 4 > /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs

To verify the VFs are enabled, the ip link command can be run again:
[root@dell1 ~]# ip link show

5: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT qlen 1000
link/ether 7c:fe:90:2f:ce:a0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto

The Linux devices assigned to the VFs can be seen by running the ibdev2netdev command again:
[root@dell1 ~]# ibdev2netdev
mlx5_0 port 1 ==> enp3s0 (Up)
mlx5_1 port 1 ==> em1_0 (Down)
mlx5_2 port 1 ==> em1_1 (Down)
mlx5_3 port 1 ==> em1_2 (Down)
mlx5_4 port 1 ==> em1_3 (Down)
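
The echo into sysfs above is lost on reboot. One possible way to make it persistent is to re-apply it at boot time from rc.local inside the software image; this is a minimal sketch, assuming the IB device name mlx5_0 and the VF count of 4 from the example above:

[root@dell1 ~]# echo 'echo 4 > /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs' >> /etc/rc.d/rc.local
[root@dell1 ~]# chmod +x /etc/rc.d/rc.local

On RHEL/CentOS 7, rc.local is only run at boot when it is executable, hence the chmod.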

Configure MAC address of the VF

This procedure must be done for every virtual function. This example uses the device enp3s0 with 4 VFs.

Before doing this step it should be ensured that the VFs are detected as being of type Ethernet instead of InfiniBand. The lspci command should show this:
# lspci -D | grep Mellanox

If the type is not Ethernet, then a session such as the following should be run first:
[root@node002 ~]# mst status

MST modules:
------------
   MST PCI module is not loaded
   MST PCI configuration module loaded

MST devices:
------------
/dev/mst/mt4115_pciconf0         - PCI configuration cycles access.
                                  domain:bus:dev.fn=0000:05:00.0 addr.reg=88 data.reg=92
                                  Chip revision is: 00

In this case the device path is /dev/mst/mt4115_pciconf0

Then the following should be executed:

mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=2

P1 refers to the card's first port. If the card has two ports, LINK_TYPE_P2 should be set as well, and so on.

The machine should then be rebooted.

Now the SR-IOV VFs should be enabled once more:
# echo 4 > /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs

The interface types should be checked:
# lspci -D | grep Mellanox

If all went well, the session will have set the interface type to Ethernet.

A MAC address, for example 7c:fe:90:11:11:00, can be assigned to VF 0:
[root@dell1 ~]# ip link set dev enp3s0 vf 0 mac 7c:fe:90:11:11:00

Running ip link show verifies this:
[root@dell1 ~]# ip link show enp3s0

5: enp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT qlen 1000
  link/ether 7c:fe:90:2f:ce:a0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 7c:fe:90:11:11:00, spoof checking off, link-state auto
    vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
    vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto
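
Rather than assigning each VF MAC by hand, the assignments can be scripted; a minimal sketch, assuming the example device enp3s0, 4 VFs, and the arbitrary locally chosen MAC prefix 7c:fe:90:11:11 used above:

[root@dell1 ~]# for i in 0 1 2 3; do ip link set dev enp3s0 vf $i mac 7c:fe:90:11:11:0$i; done

This assigns 7c:fe:90:11:11:00 through 7c:fe:90:11:11:03 to VFs 0 through 3.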

Get the vendor and product ID of the Mellanox network cards

This procedure must be done on all compute hosts with Mellanox network cards.

Each host should be logged into to get the vendor and product IDs of the Mellanox cards. Care should be taken not to include the virtual functions:
[root@dell1 ~]# lspci -D -nn | grep Mellanox | grep -v "Virtual Function"
0000:03:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]

In this example the vendor ID is 15b3 and the product ID is 1015.

Configure PCI passthrough whitelist in Nova

In the openstack-image-mellanox software image, and in each of the compute hosts, the file /etc/nova/nova.conf should be edited. In the [DEFAULT] section a line like this should be added:
pci_passthrough_whitelist = {"vendor_id": "15b3", "product_id": "1015", "physical_network": "physnet1"}

The above list must include the vendor and product IDs of all the cards. The interfaces have the tag physical_network set to physnet1 because that’s the name of the physical network that is defined by the administrator in Neutron.

In each of the compute hosts, the nova compute service is restarted by running:
systemctl restart openstack-nova-compute.service
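
Alternatively, instead of logging in to each compute host, the restart can be issued from the head node for all of them at once; a sketch using cmsh's pexec with the node names of this example:

% device
% pexec -n node002..node003 "systemctl restart openstack-nova-compute.service"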

Configure the SR-IOV Neutron agent in the compute nodes

In the openstack-image-mellanox software image and in each of the compute hosts, the file /etc/neutron/plugins/ml2/sriov_agent.ini should be edited as follows:

In the [sriov_nic] section this line should be added:

physical_device_mappings = physnet1:enp3s0

physnet1 is the physical network the administrator defines in Neutron, and enp3s0 is the Linux device name of the Mellanox card.

In the [securitygroup] section this line is added:

firewall_driver = neutron.agent.firewall.NoopFirewallDriver

The Neutron agent service is enabled and started. From cmsh this can be done with:
% category; use openstack-compute-hosts-mellanox
% services; add neutron-sriov-nic-agent; set autostart yes; set monitored yes; commit

Configure the Nova Scheduler in the controller nodes

The PciPassthroughFilter setting is appended to the schedulerfilters list. From cmsh:
% openstack
% settings
% compute
% append schedulerfilters PciPassthroughFilter
% commit
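
The resulting filter list can be verified from the same submode; a quick sketch:

% get schedulerfilters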

Configure the Neutron Server in the controller nodes

Configure the ML2 Neutron plugin

From cmsh:
% openstack
% settings
% networking
% set ml2mechanismdrivers sriovnicswitch
% commit

In the openstack-image software image and in each of the controller nodes, the file /etc/neutron/plugins/ml2/ml2_conf_sriov.ini is edited, so that in the [ml2_sriov] section this property is set:
agent_required = True
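
Depending on the exact release, the same [ml2_sriov] section may also need the vendor/product pairs gathered earlier; a hedged sketch using the IDs from this example (the supported_pci_vendor_devs option exists in Liberty/Mitaka, but was deprecated in later releases):

[ml2_sriov]
agent_required = True
supported_pci_vendor_devs = 15b3:1015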

Modify the neutron service to include the ml2_conf_sriov.ini file

In the openstack-image software image and in each of the controller nodes, the file /etc/systemd/system/neutron-server.service is edited, so that the [Service] section has this:
ExecStart=/usr/bin/neutron-server --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-server --config-file /etc/neutron/plugins/ml2/ml2_conf_sriov.ini --log-file /var/log/neutron/server.log

Restart the neutron service with new settings in the controller nodes

From cmsh:
% device
% pexec -n controller1 systemctl daemon-reload
% foreach -n controller1 (services; restart neutron-server)
