PCI passthrough on OpenStack
This article describes how PCI passthrough can be used in Bright OpenStack. Two different models of GPGPU are covered, but the same configuration method can be used for any type of PCI device. A few options for reserving resources in the cloud are also discussed, so that instances requiring the PCI devices are able to access them.
Basic setup
In this article two GPUs are assumed: an NVIDIA K40c and an NVIDIA P100. The K40c is installed in a hypervisor node called hyper01, and the P100 is in hyper02. The cloud has more hypervisor nodes, but those don’t have any GPUs. It is assumed that all hypervisor nodes, as well as the controller nodes, use the default-image software image.
The procedure starts with configuring the hypervisors. First, Intel VT-d must be enabled in the BIOS. In addition, intel_iommu=on must be added to the kernel parameters for the hypervisor nodes. The following cmsh commands are used for this. Note that they override any existing kernel parameters:
[cluster]% device
[cluster->device]% foreach -n hyper[01-02] (set kernelparameters intel_iommu=on)
[cluster->device*]% commit
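After the hypervisors have been rebooted, it can be useful to confirm that the parameter actually reached the running kernel. The following is a minimal sketch of such a check. The helper name has_kernel_parameter and the sample command line are illustrative assumptions; on a real node the string would be read from /proc/cmdline.

```python
def has_kernel_parameter(cmdline, parameter):
    """Return True if `parameter` appears as a whole token in a kernel command line."""
    # Kernel parameters are separated by whitespace, so a token comparison
    # avoids false positives from substrings such as "intel_iommu=only_example".
    return parameter in cmdline.split()


# Hypothetical command line, standing in for the contents of /proc/cmdline.
sample_cmdline = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro intel_iommu=on"
print(has_kernel_parameter(sample_cmdline, "intel_iommu=on"))
```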
Configure the compute service
Next, the vendor ID and product ID of the GPU cards must be found. To retrieve them, ssh can be used to log in to the hypervisor node, where lspci -nn | grep NVIDIA can then be run. For example, on hyper01 the following output is displayed:
# lspci -nn | grep NVIDIA
86:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40c] [10de:1024] (rev a1)
This means that for this card, the vendor ID is 10de, and the product ID is 1024. For the P100 card the vendor ID and product ID turn out to be 10de and 15f8.
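When many hypervisors have to be inspected, the IDs can be extracted from the lspci output with a script. The following is a small sketch that assumes the [vendor:product] pair is printed as four hexadecimal digits on each side of a colon, as in the output above. The function name parse_pci_ids is an illustrative choice.

```python
import re

# Matches a "[vvvv:dddd]" pair as printed by `lspci -nn`.
# The class code, e.g. "[0302]", has no colon and is therefore not matched.
PCI_ID = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def parse_pci_ids(lspci_line):
    """Return (vendor_id, product_id) from one line of `lspci -nn` output."""
    matches = PCI_ID.findall(lspci_line)
    if not matches:
        raise ValueError("no [vendor:product] pair found in line")
    # The device ID pair is the last bracketed pair on the line.
    return matches[-1]


line = ("86:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL "
        "[Tesla K40c] [10de:1024] (rev a1)")
vendor_id, product_id = parse_pci_ids(line)
print(vendor_id, product_id)
```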
The file /etc/nova/nova.conf can now be modified. Specifically, the pci_alias and the pci_passthrough_whitelist fields are to be updated. Strictly, the former is only needed on the controller nodes (for the nova scheduler), while the latter is only needed on the hypervisors. But since both types of node use the same software image, and the non-relevant settings are simply ignored by the scheduler and compute processes, these changes can just be applied as they are to the /etc/nova/nova.conf file in the default-image. In the [DEFAULT] section of the configuration file /cm/images/default-image/etc/nova/nova.conf, the following fields are then set:
pci_alias={"vendor_id": "10de", "product_id":"1024", "device_type":"type-PCI", "name":"gpu-k40"}
pci_alias={"vendor_id": "10de", "product_id":"15f8", "device_type":"type-PCI", "name":"gpu-p100"}
pci_passthrough_whitelist=[{"vendor_id": "10de", "product_id":"1024"},{"vendor_id": "10de", "product_id":"15f8"}]
The appropriate values for the vendor_id and product_id fields, as obtained from the lspci output earlier, should be inserted. The name fields in the pci_alias definitions are arbitrary names, which will be referred to when setting up matching flavors. After modifying nova.conf, either the controller and hypervisor nodes must be rebooted, or some other means of deploying the changes to the running nodes must be used.
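Because the values of these fields are JSON, a misplaced quote or brace is easy to introduce when editing by hand. One way to reduce that risk is to generate the lines with a script. The sketch below does this with the IDs and alias names used in this article. The helper names pci_alias_line and whitelist_line are illustrative assumptions.

```python
import json


def pci_alias_line(name, vendor_id, product_id):
    """Build one pci_alias line for nova.conf."""
    alias = {
        "vendor_id": vendor_id,
        "product_id": product_id,
        "device_type": "type-PCI",
        "name": name,
    }
    return "pci_alias=" + json.dumps(alias)


def whitelist_line(devices):
    """Build the pci_passthrough_whitelist line from (vendor_id, product_id) pairs."""
    entries = [{"vendor_id": v, "product_id": p} for v, p in devices]
    return "pci_passthrough_whitelist=" + json.dumps(entries)


print(pci_alias_line("gpu-k40", "10de", "1024"))
print(pci_alias_line("gpu-p100", "10de", "15f8"))
print(whitelist_line([("10de", "1024"), ("10de", "15f8")]))
```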
Create new flavors
To create the flavors, an existing flavor is cloned using cmsh. New flavors could also be created from scratch. The key is in setting the pci_passthrough:alias value in the flavor's extraspecs. The alias name used in this value must match a name used in the pci_alias definitions in nova.conf. For example, here is how a g1.medium.p100 flavor was created:
[cluster]% openstack flavors
[cluster->openstack[default]->flavors]% clone m1.medium g1.medium.p100
[cluster->openstack[default]->flavors*[g1.medium.p100*]]% set extraspecs {\"pci_passthrough:alias\":\"gpu-p100:1\"}
[cluster->openstack[default]->flavors*[g1.medium.p100*]]% commit
The double quotes needed to be escaped with a backslash here. That is specific to cmsh, and is not needed when using the OpenStack native clients. The trailing :1 indicates that this flavor requires a single P100 GPU. In the case of hypervisors with more than one GPU, this value could be increased to have the flavor require any number of GPUs.
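To make the name:count format concrete, the following sketch splits a pci_passthrough:alias value into alias names and device counts. It assumes that several requests can be combined in one value by separating them with commas. The function name parse_alias_spec is an illustrative choice, and this is not the parsing code used by nova itself.

```python
def parse_alias_spec(spec):
    """Turn a value such as "gpu-p100:1" into {"gpu-p100": 1}."""
    requests = {}
    for item in spec.split(","):
        # Each item has the form "<alias name>:<number of devices>".
        name, _, count = item.strip().partition(":")
        # int() raises ValueError if the count is missing or not a number.
        requests[name] = int(count)
    return requests


print(parse_alias_spec("gpu-p100:1"))
print(parse_alias_spec("gpu-p100:2,gpu-k40:1"))
```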
Configure the compute scheduler
Next, the PciPassthroughFilter must be added to the nova scheduler filters. This filter will ensure that flavors which require PCI devices get scheduled on hosts that provide those devices. Using cmsh, the filter can be added as follows:
[cluster]% openstack settings
[cluster->openstack[default]->settings]% compute
[cluster->openstack[default]->settings->compute]% append schedulerfilters PciPassthroughFilter
[cluster->openstack*[default*]->settings*->compute*]% commit
Reserving resources
At this point basic GPU passthrough is working and an instance can be requested using one of the new flavors. Running lspci from inside the instance should then list the GPU.
However, as yet there is nothing to prevent the compute scheduler from filling up a GPU host with non-GPU instances. If that happens, then any request for a GPU instance will be denied. The Newton release of OpenStack, which is used by Bright OpenStack 8.0, doesn’t provide a good solution for this problem. Upstream development has recently introduced a so-called “PCI Affinity Weigher”, which will be available in the Pike release. The PCI Affinity Weigher will allow the scheduler to be more aware of PCI devices, and make it “prefer” to schedule non-GPU instances to non-GPU hosts.
But for this specific use case a more solid guarantee was needed, to ensure that a GPU instance could always be scheduled. In other words a hard reservation needed to be made. To achieve this, a modified version of the existing (Aggregate)RamFilter was made. It works by reserving a certain amount of memory on the GPU nodes, whenever the scheduler is trying to allocate non-GPU instances.
To deploy the modified filter, the source code is obtained from the attached file, reserve_ram_filter.py, and installed in this location:
/cm/images/default-image/usr/lib/python2.7/site-packages/modifiedfilters/reserve_ram_filter.py
An empty file is also created here:
/cm/images/default-image/usr/lib/python2.7/site-packages/modifiedfilters/__init__.py
Then /cm/images/default-image/etc/nova/nova.conf must be edited again and deployed to the controllers. As in the preceding text, the image can be modified since the setting doesn’t cause problems on hypervisor nodes.
The following is set, in the [DEFAULT] section:
scheduler_available_filters=nova.scheduler.filters.all_filters
scheduler_available_filters=modifiedfilters.reserve_ram_filter.ReserveAggregateRamFilter
Now cmsh is used to modify the schedulerfilters setting again, as in the preceding example. This time, (Aggregate)RamFilter must be replaced with Reserve(Aggregate)RamFilter.
Next, a host aggregate must be created for the two GPU nodes. Two metadata properties which are specific to the custom filter must be set. The reserve_ram_filter_mb property defines how much RAM, in MiB, is to be reserved for GPU instances on each hypervisor node. The value 4096 is used here as an example. The reserve_ram_filter_flavor_extra_spec property names a flavor extraspec, gpu in this case. When scheduling any flavor that does not have the gpu extraspec set to true, the reserved memory is taken into account. This is how to create the aggregate using cmsh:
[cluster]% openstack hostaggregates
[cluster->openstack[default]->hostaggregates]% add gpu
[cluster->openstack[default]->hostaggregates*[gpu*]]% set nodes hyper01 hyper02
[cluster->openstack[default]->hostaggregates*[gpu*]]% set metadata {\"reserve_ram_filter_flavor_extra_spec\":\"gpu\",\"reserve_ram_filter_mb\":\"4096\"}
[cluster->openstack[default]->hostaggregates*[gpu*]]% commit
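The effect of the two metadata properties can be pictured with a short sketch. This is not the attached reserve_ram_filter.py, which is not reproduced in this article. It is a simplified model of the behavior described in the preceding text, the function name host_passes and its arguments are assumptions, and RAM overcommit ratios are ignored.

```python
def host_passes(free_ram_mb, requested_ram_mb, flavor_extra_specs, aggregate_metadata):
    """Decide whether a host has enough RAM for a request, honoring the reservation."""
    # Hosts outside the aggregate have no metadata, so nothing is reserved on them.
    reserve_mb = int(aggregate_metadata.get("reserve_ram_filter_mb", 0))
    exempt_key = aggregate_metadata.get("reserve_ram_filter_flavor_extra_spec")

    # A flavor may use the reserved memory only if it sets the named extraspec to "true".
    is_exempt = exempt_key is not None and flavor_extra_specs.get(exempt_key) == "true"

    usable_mb = free_ram_mb if is_exempt else free_ram_mb - reserve_mb
    return requested_ram_mb <= usable_mb


gpu_aggregate = {"reserve_ram_filter_flavor_extra_spec": "gpu",
                 "reserve_ram_filter_mb": "4096"}
gpu_flavor = {"pci_passthrough:alias": "gpu-p100:1", "gpu": "true"}

# With 6144 MiB free on a GPU host, a 4096 MiB GPU flavor fits,
# but a 4096 MiB non-GPU flavor only sees 6144 - 4096 = 2048 MiB.
print(host_passes(6144, 4096, gpu_flavor, gpu_aggregate))
print(host_passes(6144, 4096, {}, gpu_aggregate))
```

In this model a non-GPU instance can never consume the last 4096 MiB of a GPU host, which is what keeps room available for a GPU instance.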
Now that the aggregate exists, all that is left to do is to modify the extraspecs of the GPU flavor(s). For example, as follows:
[cluster]% openstack flavors
[cluster->openstack[default]->flavors]% set g1.medium.p100 extraspecs {\"pci_passthrough:alias\":\"gpu-p100:1\",\"gpu\":\"true\"}
[cluster->openstack[default]->flavors*]% commit
With all of these changes, a GPU instance can now always be allocated, unless the GPU is already in use. This is a solution for a specific use case, but it works well for that case.