If you have a BittWare FPGA card (e.g. XUP-VVH) installed in a PCI/PCIe slot of a Bright-managed compute node, you can follow these procedures to allow Bright to monitor the card’s sensor data.
Installing the toolkit
To gather sensor data from the card, the BittWorks II Toolkit must first be installed on the compute node that hosts the FPGA card. The toolkit can be downloaded from the BittWare Developer Site:
https://developer.bittware.com/
Please note that access to the BittWare Developer Site and toolkit software requires registering for an account with Molex Electronic Solutions; Molex does charge for the toolkit.
The toolkit should be installed into the software image used by the compute node where the FPGA card is installed. For example, if the node runs RHEL or CentOS, the RPM has been copied to /root on the head node, and the image is called fpga-image, then the following commands, run on the head node, install the toolkit into the image:
# cd /root
# yum --installroot=/cm/images/fpga-image localinstall bw2tk*.rpm
Replace “fpga-image” with the actual name of that node’s software image on your cluster.
The image can then be synced to the compute node so that the toolkit is installed on the node itself. For example, for a node called gpu029:
# cmsh
% device use gpu029
% imageupdate -w
Making the card visible to the toolkit and node’s OS
Once you have installed the BittWorks II Toolkit onto the node, you should make sure that the appropriate driver for the BittWare FPGA card has been loaded:
# modprobe bwpcidrv
# lsmod | grep bwpci
bwpcidrv 40960 0
You should also now see the respective device under /dev:
# ls /dev/bwpci
/dev/bwpci
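The two checks above can also be scripted, for example as a pre-flight test before any metric collection is attempted. The following is a minimal sketch and not part of the toolkit: the function names are hypothetical, and it assumes only that loaded modules are listed in /proc/modules (the file that lsmod reads) and that the device node is /dev/bwpci, as shown above.

```python
import os


def module_loaded(name, proc_modules="/proc/modules"):
    """Return True if a kernel module called `name` is listed in proc_modules."""
    try:
        with open(proc_modules) as handle:
            # Each line starts with the module name, followed by its size,
            # use count and further fields.
            return any(line.split()[:1] == [name] for line in handle)
    except OSError:
        # File missing or unreadable: treat the module as not loaded
        return False


def card_ready(module="bwpcidrv", device="/dev/bwpci"):
    """Return True if the driver is loaded and the device node exists."""
    return module_loaded(module) and os.path.exists(device)


if __name__ == "__main__":
    print("BittWare card ready:", card_ready())
```

The module name is compared as a whole word, so a partial name such as bwpci does not count as a match here, unlike the grep in the example above.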
In order to interact with the card using the toolkit, you will first need to make sure that the toolkit’s binaries have been added to your PATH variable and that the BWTK variable points to the location of the toolkit software:
# source /etc/profile.d/bwtk.sh
# echo $BWTK
/opt/bwtk/2020.1
Next, use the toolkit’s bwconfig utility to scan for the FPGA card:
# bwconfig --scan=pci --dev-id=0x56
Scanning for devices
[result]: (bus,slot,func) VendorID, DeviceID (Name)
[0]: (175,0,0) 0x12ba 0x0056 (FPGA BMC-PCIe)
After that, use the same utility to add the FPGA card:
# bwconfig --add=pci --dev-id=0x56
Scanning for devices
[result]: (bus,slot,func) VendorID, DeviceID (Name)
[0]: (175,0,0) 0x12ba 0x0056 (FPGA BMC-PCIe)
Device added as device "0".
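If the scan result needs to be consumed by a script, for instance to confirm that the expected device ID is present before the card is added, the result lines can be parsed. The sketch below assumes that the output format is exactly as in the examples above; parse_scan is a hypothetical helper and not a toolkit function.

```python
import re

# Matches a result line such as:
#   [0]: (175,0,0) 0x12ba 0x0056 (FPGA BMC-PCIe)
SCAN_LINE = re.compile(
    r"^\[(\d+)\]:\s*\((\d+),(\d+),(\d+)\)\s+"
    r"(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+\((.*)\)\s*$"
)


def parse_scan(output):
    """Return one dict per device listed in the scan output."""
    devices = []
    for line in output.splitlines():
        match = SCAN_LINE.match(line.strip())
        if not match:
            continue  # skips headers such as "Scanning for devices"
        index, bus, slot, func, vendor, device, name = match.groups()
        devices.append({
            "index": int(index),
            "bus": int(bus),
            "slot": int(slot),
            "func": int(func),
            "vendor_id": int(vendor, 16),
            "device_id": int(device, 16),
            "name": name,
        })
    return devices


# Output copied from the example above
SAMPLE = """Scanning for devices
[result]: (bus,slot,func) VendorID, DeviceID (Name)
[0]: (175,0,0) 0x12ba 0x0056 (FPGA BMC-PCIe)
"""

if __name__ == "__main__":
    for dev in parse_scan(SAMPLE):
        print(dev)
```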
Now you should be able to use the toolkit’s bwmonitor utility to read the card’s sensor data:
# bwmonitor --device=0 --read --type=Sensor
12-11-2020 11:26:47
SDR Sensors XUPVVH dev 0
(0) Board Power OK 26 Watts
(1) 12v Cable Current OK 1.00 Amps
(2) 12v Cable Voltage OK 12.12 Volts
(3) 12v PCIe Current OK 1.09 Amps
(4) 12v PCIe Voltage OK 12.01 Volts
(5) 3.3v MP Voltage OK 3.30 Volts
(6) 3.3v MP Current OK 2.40 Amps
(7) 3.3v MP2 Voltage OK 3.30 Volts
(8) 3.3v MP2 Current OK 1.54 Amps
(9) DIMM12 Voltage OK 1.19 Volts
(10) DIMM12 Current OK 0.06 Amps
(11) HBM Voltage OK 1.19 Volts
(12) HBM Current OK 0.13 Amps
(13) FPGA Core Voltage OK 0.85 Volts
(14) FPGA Core Current 0 OK 10.29 Amps
(16) FPGA Supply Die Temp OK 47 degrees C
(17) FPGA Supply Inductor Temp 0 OK 34 degrees C
(18) FPGA Supply Inductor Temp 1 OK 34 degrees C
(19) FPGA Slave Supply Temp 0 OK 44 degrees C
(20) FPGA Slave Supply Temp 1 OK 46 degrees C
(21) FPGA Core Temperature OK 37 degrees C
(22) Board Temperature OK 31 degrees C
(23) Vcc AUX Voltage OK 1.76 Volts
(24) Vcc AUX Current OK 0.91 Amps
(36) Vcc VBRAM Current OK 0.11 Amps
(37) QSFP-0 Temperature Unavailable
(38) QSFP-1 Temperature Unavailable
(39) QSFP-2 Temperature Unavailable
(40) QSFP-3 Temperature Unavailable
(41) DIMM-1 Temperature OK 26 degrees C
(42) DIMM-2 Temperature Unavailable
(43) DIMM-12 Vpp Voltage OK 2.46 Volts
(44) DIMM-12 Vtt Voltage OK 0.59 Volts
(45) HBM Vpp Voltage OK 2.46 Volts
(46) HBM Vtt Voltage OK 0.59 Volts
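Several sensors in this output report Unavailable instead of a value, and those sensors are not included in the configuration file shown in the next section. As a rough illustration, the sketch below splits bwmonitor output into readable and unavailable sensors. It assumes that each sensor line has the form "(index) name status value unit" as in the output above; split_sensors is a hypothetical helper, not a toolkit function.

```python
import re

# "(21) FPGA Core Temperature OK 37 degrees C" -> index, name, value, unit
OK_LINE = re.compile(r"^\((\d+)\)\s+(.+?)\s+OK\s+(-?\d+(?:\.\d+)?)\s+(.+)$")
# "(37) QSFP-0 Temperature Unavailable" -> index, name
UNAVAILABLE_LINE = re.compile(r"^\((\d+)\)\s+(.+?)\s+Unavailable\s*$")


def split_sensors(output):
    """Return ({name: (value, unit)}, [names of unavailable sensors])."""
    readable, unavailable = {}, []
    for line in output.splitlines():
        line = line.strip()
        ok = OK_LINE.match(line)
        if ok:
            _, name, value, unit = ok.groups()
            readable[name] = (float(value), unit.strip())
            continue
        missing = UNAVAILABLE_LINE.match(line)
        if missing:
            unavailable.append(missing.group(2))
        # Anything else (date line, "SDR Sensors" header) is ignored
    return readable, unavailable


# A few lines copied from the output above
SAMPLE = """12-11-2020 11:26:47
SDR Sensors XUPVVH dev 0
(0) Board Power OK 26 Watts
(1) 12v Cable Current OK 1.00 Amps
(21) FPGA Core Temperature OK 37 degrees C
(37) QSFP-0 Temperature Unavailable
(42) DIMM-2 Temperature Unavailable
"""

if __name__ == "__main__":
    values, missing = split_sensors(SAMPLE)
    print(values)
    print("Unavailable:", missing)
```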
Adding a metric collection script to Bright for gathering the card’s sensor data
The JSON configuration file for the sensor data should be placed under /cm/local/apps/cmd/scripts/metrics/configfiles/ locally on the node where the FPGA card is installed. It can be set up as follows:
# cat sample_fpga_bittware.json
{
    "classdefs": {
        "Board Power": {
            "description": "Total board power reading",
            "unit": "W"
        },
        "12v Cable Current": {
            "description": "Current reading for 12v cable",
            "unit": "A"
        },
        "12v Cable Voltage": {
            "description": "Voltage reading for 12v cable",
            "unit": "V"
        },
        "12v PCIe Current": {
            "description": "Current reading for 12v PCIe",
            "unit": "A"
        },
        "12v PCIe Voltage": {
            "description": "Voltage reading for 12v PCIe",
            "unit": "V"
        },
        "3.3v MP Voltage": {
            "description": "Voltage reading for 3.3v MP",
            "unit": "V"
        },
        "3.3v MP Current": {
            "description": "Current reading for 3.3v MP",
            "unit": "A"
        },
        "3.3v MP2 Voltage": {
            "description": "Voltage reading for 3.3v MP2",
            "unit": "V"
        },
        "3.3v MP2 Current": {
            "description": "Current reading for 3.3v MP2",
            "unit": "A"
        },
        "DIMM12 Voltage": {
            "description": "Voltage reading for DIMM12",
            "unit": "V"
        },
        "DIMM12 Current": {
            "description": "Current reading for DIMM12",
            "unit": "A"
        },
        "HBM Voltage": {
            "description": "Voltage reading for HBM",
            "unit": "V"
        },
        "HBM Current": {
            "description": "Current reading for HBM",
            "unit": "A"
        },
        "FPGA Core Voltage": {
            "description": "Voltage reading for FPGA Core",
            "unit": "V"
        },
        "FPGA Core Current 0": {
            "description": "Current reading for FPGA Core",
            "unit": "A"
        },
        "FPGA Supply Die Temp": {
            "description": "Temperature reading for Supply Die",
            "unit": "C"
        },
        "FPGA Supply Inductor Temp 0": {
            "description": "Temperature reading for Supply Inductor 0",
            "unit": "C"
        },
        "FPGA Supply Inductor Temp 1": {
            "description": "Temperature reading for Supply Inductor 1",
            "unit": "C"
        },
        "FPGA Slave Supply Temp 0": {
            "description": "Temperature reading for Slave Supply 0",
            "unit": "C"
        },
        "FPGA Slave Supply Temp 1": {
            "description": "Temperature reading for Slave Supply 1",
            "unit": "C"
        },
        "FPGA Core Temperature": {
            "description": "Temperature reading for FPGA Core",
            "unit": "C"
        },
        "Board Temperature": {
            "description": "Temperature reading for Board",
            "unit": "C"
        },
        "Vcc AUX Voltage": {
            "description": "Voltage reading for Vcc AUX",
            "unit": "V"
        },
        "Vcc AUX Current": {
            "description": "Current reading for Vcc AUX",
            "unit": "A"
        },
        "Vcc VBRAM Current": {
            "description": "Current reading for Vcc VBRAM",
            "unit": "A"
        },
        "DIMM-1 Temperature": {
            "description": "Temperature reading for DIMM-1",
            "unit": "C"
        },
        "DIMM-12 Vpp Voltage": {
            "description": "Voltage reading for DIMM-12 Vpp",
            "unit": "V"
        },
        "DIMM-12 Vtt Voltage": {
            "description": "Voltage reading for DIMM-12 Vtt",
            "unit": "V"
        },
        "HBM Vpp Voltage": {
            "description": "Voltage reading for HBM Vpp",
            "unit": "V"
        },
        "HBM Vtt Voltage": {
            "description": "Voltage reading for HBM Vtt",
            "unit": "V"
        }
    },
    "removeclass": [
        "FPGA"
    ]
}
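Because a file of this size is easy to break with a stray comma or a missing field, it can be worth checking it before the collection script relies on it. The sketch below is one way to do that. The checks mirror the layout of the example file above (a "classdefs" object whose entries carry "description" and "unit", plus a "removeclass" list); they are an assumption based on that example, not a schema defined by Bright, and the function names are hypothetical.

```python
import json

REQUIRED_FIELDS = ("description", "unit")


def check_config(config):
    """Return a list of problems found in a parsed configuration (empty if none)."""
    problems = []
    classdefs = config.get("classdefs")
    if not isinstance(classdefs, dict) or not classdefs:
        problems.append("'classdefs' must be a non-empty object")
        classdefs = {}
    for sensor, fields in classdefs.items():
        if not isinstance(fields, dict):
            problems.append("'{}' must map to an object".format(sensor))
            continue
        for field in REQUIRED_FIELDS:
            if field not in fields:
                problems.append("'{}' is missing '{}'".format(sensor, field))
    if not isinstance(config.get("removeclass"), list):
        problems.append("'removeclass' must be a list")
    return problems


def check_config_file(path):
    """Load a JSON file and check it; a syntax error raises json.JSONDecodeError."""
    with open(path) as handle:
        return check_config(json.load(handle))


if __name__ == "__main__":
    example = {
        "classdefs": {
            "Board Power": {"description": "Total board power reading", "unit": "W"}
        },
        "removeclass": ["FPGA"],
    }
    print(check_config(example) or "no problems found")
```

To check the real file, check_config_file can be called with the path of sample_fpga_bittware.json on the node.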
The script that gathers the sensor data from the FPGA card should be placed under /cm/local/apps/cmd/scripts/metrics/ locally on the node where the FPGA card is installed. It can be set up as follows:
# cat sample_fpga_bittware.py
#!/cm/local/apps/python3/bin/python
import sys
import json
import subprocess
import re

# Matches "(N) Sensor Name"; group 1 captures the sensor name
LINE_MATCH_PATTERN = r'\([0-9]*[0-9]*\) ([a-zA-Z0-9\. ]+).*'
# Matches the date at the start of the first line of the bwmonitor output
DATE_PATTERN = r'^(3[01]|[12][0-9]|0[1-9])-(1[0-2]|0[1-9])-[0-9]{4}'
CONFIGFILE = "/cm/local/apps/cmd/scripts/metrics/configfiles/sample_fpga_bittware.json"
BWMONITOR_CMD = ["/opt/bwtk/2020.1/bin/bwmonitor", "--dev=0", "--read", "--type=Sensor"]


def parse_line(line):
    """Return (title, value, unit) for a sensor line, else (False, False, False)."""
    # Do not parse the date line, the "SDR Sensors" header, or unavailable sensors
    datematch = re.match(DATE_PATTERN, line)
    if not datematch and "sdr sensors" not in line.lower() and "available" not in line.lower():
        title, value = line.replace('OK', ':').split(':', 1)
        value, unit = value.strip().replace('degrees C', 'C').split(" ", 1)
        # Replace hyphens with spaces and collapse runs of whitespace
        title = ' '.join(title.replace('-', ' ').split())
        regexmatch = re.match(LINE_MATCH_PATTERN, title)
        if regexmatch:
            title = regexmatch.groups()[0].strip()
            return title, value, unit
    return False, False, False


def is_defined(line, definelist):
    """Return (True, pattern) if a configured sensor name occurs in the line."""
    match = False
    matchingpattern = ""
    for pattern in definelist:
        if re.search(pattern, line):
            match = True
            matchingpattern = pattern
    return match, matchingpattern


def parse_output(fpga_out, configuration):
    data = {}
    for line in fpga_out.split('\n'):
        match, pattern = is_defined(line, configuration["classdefs"])
        if match:
            try:
                title, value, unit = parse_line(line)
                if title:
                    data[title] = {"pattern": pattern, "title": title, "value": float(value), "unit": unit}
                    data[title].update(configuration["classdefs"][pattern])
            except ValueError:
                # Line does not have the expected "name OK value unit" layout
                continue
    return data


def parse_metrics(output, configuration):
    metrics_data = parse_output(output, configuration)
    metrics = []
    for metric in metrics_data:
        metricname = "FPGA.Bittware.{}".format(metrics_data[metric]["title"].replace(" ", "."))
        # Drop redundant name components, e.g. ".FPGA." after the prefix
        for rmclass in configuration["removeclass"]:
            metricname = metricname.replace(".{}.".format(rmclass), ".")
        metrics_data[metric]["metric"] = metricname
        metrics_data[metric]["class"] = "FPGA/Bittware"
        del metrics_data[metric]["pattern"]
        del metrics_data[metric]["title"]
        metrics.append(metrics_data[metric])
    return metrics


def run_info_command():
    command_out = subprocess.check_output(BWMONITOR_CMD, universal_newlines=True)
    return command_out


def initialize(configuration):
    # Metric definitions only: everything except the sampled value
    initialized_metrics = []
    infocmd_out = run_info_command()
    metrics = parse_metrics(infocmd_out, configuration)
    for metric in metrics:
        data = metric
        del data["value"]
        initialized_metrics.append(data)
    return initialized_metrics


def sample(configuration):
    # Sampled values only: metric name and current value
    sampled_metrics = []
    infocmd_out = run_info_command()
    metrics = parse_metrics(infocmd_out, configuration)
    for metric in metrics:
        data = {}
        data["metric"] = metric["metric"]
        data["value"] = metric["value"]
        sampled_metrics.append(data)
    return sampled_metrics


def main():
    metrics = []
    # open() is what raises FileNotFoundError, so it must be inside the try
    try:
        with open(CONFIGFILE) as configfile:
            configuration = json.load(configfile)
    except FileNotFoundError:
        print("Configfile \"{}\" not found.".format(CONFIGFILE))
        sys.exit(1)
    if "--initialize" in sys.argv:
        metrics = initialize(configuration)
    else:
        metrics = sample(configuration)
    print(json.dumps(metrics, indent=4))


if __name__ == '__main__':
    main()
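To make the naming step in parse_metrics concrete: the sensor title has its hyphens and spaces turned into dots, is prefixed with FPGA.Bittware, and then has each "removeclass" entry stripped wherever it appears as a dot-delimited component. The standalone sketch below reproduces only that step, so that the resulting metric names can be predicted without access to the card. metric_name is a hypothetical helper written for this illustration and is not part of the script.

```python
def metric_name(title, removeclass=("FPGA",), prefix="FPGA.Bittware"):
    """Derive the metric name the collection script would produce for a sensor title."""
    # The script replaces hyphens with spaces while parsing the line,
    # then replaces spaces with dots when building the metric name.
    name = "{}.{}".format(prefix, title.replace("-", " ").replace(" ", "."))
    for rmclass in removeclass:
        # Only a component surrounded by dots is removed, so the leading
        # "FPGA" of the prefix itself is kept.
        name = name.replace(".{}.".format(rmclass), ".")
    return name


if __name__ == "__main__":
    for title in ("FPGA Core Voltage", "DIMM-1 Temperature", "3.3v MP Voltage"):
        print(title, "->", metric_name(title))
```

Stripping the FPGA component avoids names such as FPGA.Bittware.FPGA.Core.Voltage, in which the class would appear twice.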
The collection may be added to Bright using cmsh on the cluster’s active head node as follows:
# cmsh
% monitoring setup
% add collection Bittware\ FPGA
% set description Samples\ metrics\ of\ Bittware\ FPGAs
% set script /cm/local/apps/cmd/scripts/metrics/sample_fpga_bittware.py
% set timeout 30s
% nodeexecutionfilters
% add node bittware_fpga_nodes
% set nodes gpu029
% ..
% ..
% commit
A minute or two later, you should be able to view the results as follows:
# cmsh
% device use gpu029
% latestmetricdata | grep FPGA
FPGA.Bittware.12v.Cable.Current FPGA/Bittware 1 A 2m 35s
FPGA.Bittware.12v.Cable.Voltage FPGA/Bittware 12.12 V 2m 35s
FPGA.Bittware.12v.PCIe.Current FPGA/Bittware 1.09 A 2m 35s
FPGA.Bittware.12v.PCIe.Voltage FPGA/Bittware 12.01 V 2m 35s
FPGA.Bittware.3.3v.MP.Current FPGA/Bittware 2.4 A 2m 35s
FPGA.Bittware.3.3v.MP.Voltage FPGA/Bittware 3.3 V 2m 35s
FPGA.Bittware.3.3v.MP2.Current FPGA/Bittware 1.54 A 2m 35s
FPGA.Bittware.3.3v.MP2.Voltage FPGA/Bittware 3.3 V 2m 35s
FPGA.Bittware.Board.Power FPGA/Bittware 26 W 2m 35s
FPGA.Bittware.Board.Temperature FPGA/Bittware 30 C 2m 35s
FPGA.Bittware.Core.Current.0 FPGA/Bittware 10.29 A 2m 35s
FPGA.Bittware.Core.Temperature FPGA/Bittware 37 C 2m 35s
FPGA.Bittware.Core.Voltage FPGA/Bittware 0.85 V 2m 35s
FPGA.Bittware.DIMM.1.Temperature FPGA/Bittware 25 C 2m 35s
FPGA.Bittware.DIMM.12.Vpp.Voltage FPGA/Bittware 2.46 V 2m 35s
FPGA.Bittware.DIMM.12.Vtt.Voltage FPGA/Bittware 0.59 V 2m 35s
FPGA.Bittware.DIMM12.Current FPGA/Bittware 0.06 A 2m 35s
FPGA.Bittware.DIMM12.Voltage FPGA/Bittware 1.19 V 2m 35s
FPGA.Bittware.HBM.Current FPGA/Bittware 0.13 A 2m 35s
FPGA.Bittware.HBM.Voltage FPGA/Bittware 1.19 V 2m 35s
FPGA.Bittware.HBM.Vpp.Voltage FPGA/Bittware 2.46 V 2m 35s
FPGA.Bittware.HBM.Vtt.Voltage FPGA/Bittware 0.59 V 2m 35s
FPGA.Bittware.Slave.Supply.Temp.0 FPGA/Bittware 44 C 2m 35s
FPGA.Bittware.Slave.Supply.Temp.1 FPGA/Bittware 46 C 2m 35s
FPGA.Bittware.Supply.Die.Temp FPGA/Bittware 45 C 2m 35s
FPGA.Bittware.Supply.Inductor.Temp.0 FPGA/Bittware 33 C 2m 35s
FPGA.Bittware.Supply.Inductor.Temp.1 FPGA/Bittware 33 C 2m 35s
FPGA.Bittware.Vcc.AUX.Current FPGA/Bittware 0.91 A 2m 35s
FPGA.Bittware.Vcc.AUX.Voltage FPGA/Bittware 1.76 V 2m 35s
FPGA.Bittware.Vcc.VBRAM.Current FPGA/Bittware 0.07 A 2m 35s
For persistence between node reboots, it is recommended that the metric collection script and JSON configuration file be stored in the node’s software image.