How to have Bright monitor a BittWare FPGA card

If you have a BittWare FPGA card (e.g. XUP-VVH) installed in a PCI/PCIe slot of a Bright-managed compute node, you can follow the procedure below to have Bright monitor the card’s sensor data.

Installing the toolkit

To gather sensor data from the card, the BittWorks II Toolkit must be installed on the compute node that hosts the FPGA card. The toolkit can be downloaded from the BittWare Developer Site:

https://developer.bittware.com/

Please note that access to the BittWare Developer Site and toolkit software requires registering for an account with Molex Electronic Solutions; Molex does charge for the toolkit.

The toolkit should be installed into the software image used by the compute node that hosts the FPGA card. For example, if the node runs RHEL or CentOS, the RPM has been copied to /root on the head node, and the image is called fpga-image, you can run the following commands on the head node to install the toolkit into the image:

# cd /root
# yum --installroot=/cm/images/fpga-image localinstall bw2tk*.rpm

Replace “fpga-image” with the actual name of that node’s software image on your cluster.

The image can then be synced to the compute node so that the toolkit is installed on the node itself. For example, for a node named gpu029:

# cmsh
% device use gpu029
% imageupdate -w

Making the card visible to the toolkit and node’s OS

Once you have installed the BittWorks II Toolkit onto the node, you should make sure that the appropriate driver for the BittWare FPGA card has been loaded:

# modprobe bwpcidrv
# lsmod | grep bwpci
bwpcidrv               40960  0
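
The same check can be scripted, for example as a pre-flight test before the metric script runs. The following is a minimal sketch, assuming the driver name bwpcidrv shown above. The function name module_loaded is hypothetical, and the function parses text in the format of /proc/modules rather than calling lsmod.

```python
def module_loaded(proc_modules_text, name="bwpcidrv"):
    """Return True if `name` is listed in text formatted like /proc/modules."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of every /proc/modules line is the module name.
        if fields and fields[0] == name:
            return True
    return False


# Sample line in /proc/modules format (illustrative values, not real output).
sample = "bwpcidrv 40960 0 - Live 0x0000000000000000\n"
print(module_loaded(sample))
```

On the compute node itself, the text can be obtained with open("/proc/modules").read().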

You should also now see the respective device under /dev:

# ls /dev/bwpci
/dev/bwpci

To interact with the card using the toolkit, first make sure that the toolkit’s binaries are on your PATH and that the BWTK variable points to the location of the toolkit software:

# source /etc/profile.d/bwtk.sh
# echo $BWTK
/opt/bwtk/2020.1
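
If the environment needs to be verified from a script, a check along the following lines may help. It is only a sketch: the function name toolkit_problems is hypothetical, and the only assumptions taken from this article are the BWTK variable and the bwmonitor binary name.

```python
import os
import shutil


def toolkit_problems(env=None):
    """Return a list of human-readable problems with the toolkit environment."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("BWTK"):
        problems.append("BWTK is not set")
    # shutil.which searches the given PATH string for an executable file.
    if shutil.which("bwmonitor", path=env.get("PATH", "")) is None:
        problems.append("bwmonitor not found on PATH")
    return problems


# With no argument, the current process environment is inspected.
print(toolkit_problems())
```

An empty list means both checks passed in the environment that was inspected.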

Next, use the toolkit’s bwconfig utility to scan for the FPGA card:

# bwconfig --scan=pci --dev-id=0x56
Scanning for devices

[result]: (bus,slot,func) VendorID, DeviceID (Name)
[0]:      (175,0,0)       0x12ba    0x0056   (FPGA BMC-PCIe)

After that, use the same utility to add the FPGA card:

# bwconfig --add=pci --dev-id=0x56
Scanning for devices

[result]: (bus,slot,func) VendorID, DeviceID (Name)
[0]:      (175,0,0)       0x12ba    0x0056   (FPGA BMC-PCIe)
Device added as device "0".
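
For automation, the device table printed by bwconfig can be turned into structured data. The sketch below assumes the output layout is exactly as shown in the two examples above; the regular expression and the function name parse_scan are illustrative and are not part of the toolkit.

```python
import re

# Matches rows such as: [0]:  (175,0,0)  0x12ba  0x0056  (FPGA BMC-PCIe)
SCAN_LINE = re.compile(
    r"^\[(\d+)\]:\s+\((\d+),(\d+),(\d+)\)\s+"
    r"(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+\((.*)\)\s*$"
)


def parse_scan(text):
    """Return one dict per device row found in bwconfig scan output."""
    devices = []
    for line in text.splitlines():
        m = SCAN_LINE.match(line.strip())
        if not m:
            continue  # skips the banner, blank lines, and the "[result]" header
        index, bus, slot, func, vendor, device, name = m.groups()
        devices.append({
            "index": int(index),
            "bdf": (int(bus), int(slot), int(func)),
            "vendor_id": int(vendor, 16),
            "device_id": int(device, 16),
            "name": name,
        })
    return devices


sample = """Scanning for devices

[result]: (bus,slot,func) VendorID, DeviceID (Name)
[0]:      (175,0,0)       0x12ba    0x0056   (FPGA BMC-PCIe)
"""
devices = parse_scan(sample)
print(devices)
```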

Now you should be able to use the toolkit’s bwmonitor utility to read the card’s sensor data:

# bwmonitor --device=0 --read --type=Sensor
12-11-2020 11:26:47
SDR Sensors                    XUPVVH dev 0
 (0)  Board Power              OK                 26 Watts
 (1)  12v Cable Current        OK                 1.00 Amps
 (2)  12v Cable Voltage        OK                 12.12 Volts
 (3)  12v PCIe Current         OK                 1.09 Amps
 (4)  12v PCIe Voltage         OK                 12.01 Volts
 (5)  3.3v MP Voltage          OK                 3.30 Volts
 (6)  3.3v MP Current          OK                 2.40 Amps
 (7)  3.3v MP2 Voltage         OK                 3.30 Volts
 (8)  3.3v MP2 Current         OK                 1.54 Amps
 (9)  DIMM12 Voltage           OK                 1.19 Volts
 (10) DIMM12 Current           OK                 0.06 Amps
 (11) HBM Voltage              OK                 1.19 Volts
 (12) HBM Current              OK                 0.13 Amps
 (13) FPGA Core Voltage        OK                 0.85 Volts
 (14) FPGA Core Current 0      OK                 10.29 Amps
 (16) FPGA Supply Die Temp     OK                 47 degrees C
 (17) FPGA Supply Inductor Temp 0 OK                 34 degrees C
 (18) FPGA Supply Inductor Temp 1 OK                 34 degrees C
 (19) FPGA Slave Supply Temp 0 OK                 44 degrees C
 (20) FPGA Slave Supply Temp 1 OK                 46 degrees C
 (21) FPGA Core Temperature    OK                 37 degrees C
 (22) Board Temperature        OK                 31 degrees C
 (23) Vcc AUX Voltage          OK                 1.76 Volts
 (24) Vcc AUX Current          OK                 0.91 Amps
 (36) Vcc VBRAM Current        OK                 0.11 Amps
 (37) QSFP-0 Temperature       Unavailable
 (38) QSFP-1 Temperature       Unavailable
 (39) QSFP-2 Temperature       Unavailable
 (40) QSFP-3 Temperature       Unavailable
 (41) DIMM-1 Temperature       OK                 26 degrees C
 (42) DIMM-2 Temperature       Unavailable
 (43) DIMM-12 Vpp Voltage      OK                 2.46 Volts
 (44) DIMM-12 Vtt Voltage      OK                 0.59 Volts
 (45) HBM Vpp Voltage          OK                 2.46 Volts
 (46) HBM Vtt Voltage          OK                 0.59 Volts
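
Before deciding which sensors to collect, it can be useful to separate the sensors that return a reading from those that report Unavailable. The following sketch assumes the two status values seen in the output above (OK and Unavailable); lines with any other status are ignored. The function name split_by_status is hypothetical.

```python
import re

# "(index)  sensor name  status  optional reading"
SENSOR_LINE = re.compile(r"^\s*\((\d+)\)\s+(.*?)\s+(OK|Unavailable)\b\s*(.*)$")


def split_by_status(output):
    """Return ({sensor: reading} for OK sensors, [names of unavailable sensors])."""
    ok, unavailable = {}, []
    for line in output.splitlines():
        m = SENSOR_LINE.match(line)
        if not m:
            continue  # date line, header line, or an unrecognized status
        _, name, status, reading = m.groups()
        if status == "OK":
            ok[name] = reading.strip()
        else:
            unavailable.append(name)
    return ok, unavailable


sample = """12-11-2020 11:26:47
SDR Sensors                    XUPVVH dev 0
 (0)  Board Power              OK                 26 Watts
 (17) FPGA Supply Inductor Temp 0 OK                 34 degrees C
 (37) QSFP-0 Temperature       Unavailable
 (42) DIMM-2 Temperature       Unavailable
"""
ok, unavailable = split_by_status(sample)
print(ok)
print(unavailable)
```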

Adding a metric collection script to Bright for gathering the card’s sensor data

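The JSON configuration file for the sensor data should be placed under /cm/local/apps/cmd/scripts/metrics/configfiles/ on the node where the FPGA card is installed. The sensors that reported Unavailable in the bwmonitor output above (the QSFP and DIMM-2 temperatures) have no entry, so they are not collected. The file can be set up as follows: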

# cat sample_fpga_bittware.json
{
  "classdefs":
  {
    "Board Power": {
      "description": "Total board power reading",
      "unit": "W"
    },
    "12v Cable Current": {
      "description": "Current reading for 12v cable",
      "unit": "A"
    },
    "12v Cable Voltage": {
      "description": "Voltage reading for 12v cable",
      "unit": "V"
    },
    "12v PCIe Current": {
      "description": "Current reading for 12v PCIe",
      "unit": "A"
    },
    "12v PCIe Voltage": {
      "description": "Voltage reading for 12v PCIe",
      "unit": "V"
    },
    "3.3v MP Voltage": {
      "description": "Voltage reading for 3.3v MP",
      "unit": "V"
    },
    "3.3v MP Current": {
      "description": "Current reading for 3.3v MP",
      "unit": "A"
    },
    "3.3v MP2 Voltage": {
      "description": "Voltage reading for 3.3v MP2",
      "unit": "V"
    },
    "3.3v MP2 Current": {
      "description": "Current reading for 3.3v MP2",
      "unit": "A"
    },
    "DIMM12 Voltage": {
      "description": "Voltage reading for DIMM12",
      "unit": "V"
    },
    "DIMM12 Current": {
      "description": "Current reading for DIMM12",
      "unit": "A"
    },
    "HBM Voltage": {
      "description": "Voltage reading for HBM",
      "unit": "V"
    },
    "HBM Current": {
      "description": "Current reading for HBM",
      "unit": "A"
    },
    "FPGA Core Voltage": {
      "description": "Voltage reading for FPGA Core",
      "unit": "V"
    },
    "FPGA Core Current 0": {
      "description": "Current reading for FPGA Core",
      "unit": "A"
    },
    "FPGA Supply Die Temp": {
      "description": "Temperature reading for Supply Die",
      "unit": "C"
    },
    "FPGA Supply Inductor Temp 0": {
      "description": "Temperature reading for Supply Inductor 0",
      "unit": "C"
    },
    "FPGA Supply Inductor Temp 1": {
      "description": "Temperature reading for Supply Inductor 1",
      "unit": "C"
    },
    "FPGA Slave Supply Temp 0": {
      "description": "Temperature reading for Slave Supply 0",
      "unit": "C"
    },
    "FPGA Slave Supply Temp 1": {
      "description": "Temperature reading for Slave Supply 1",
      "unit": "C"
    },
    "FPGA Core Temperature": {
      "description": "Temperature reading for FPGA Core",
      "unit": "C"
    },
    "Board Temperature": {
      "description": "Temperature reading for Board",
      "unit": "C"
    },
    "Vcc AUX Voltage": {
      "description": "Voltage reading for Vcc AUX",
      "unit": "V"
    },
    "Vcc AUX Current": {
      "description": "Current reading for Vcc AUX",
      "unit": "A"
    },
    "Vcc VBRAM Current": {
      "description": "Current reading for Vcc VBRAM",
      "unit": "A"
    },
    "DIMM-1 Temperature": {
      "description": "Temperature reading for DIMM-1",
      "unit": "C"
    },
    "DIMM-12 Vpp Voltage": {
      "description": "Voltage reading for DIMM-12 Vpp",
      "unit": "V"
    },
    "DIMM-12 Vtt Voltage": {
      "description": "Voltage reading for DIMM-12 Vtt",
      "unit": "V"
    },
    "HBM Vpp Voltage": {
      "description": "Voltage reading for HBM Vpp",
      "unit": "V"
    },
    "HBM Vtt Voltage": {
      "description": "Voltage reading for HBM Vtt",
      "unit": "V"
    }
  },
  "removeclass": [
    "FPGA"
  ]
}
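
A quick sanity check of the configuration file can catch typos before the collection is added. The sketch below checks only the structure visible in the sample above: a classdefs object whose entries carry description and unit, and a removeclass list. The function name config_problems is hypothetical, and the check is not an official schema.

```python
import json

REQUIRED_KEYS = ("description", "unit")


def config_problems(config):
    """Return a list of structural problems found in a parsed config dict."""
    problems = []
    classdefs = config.get("classdefs")
    if not isinstance(classdefs, dict) or not classdefs:
        problems.append("classdefs must be a non-empty object")
        classdefs = {}
    for name, definition in classdefs.items():
        for key in REQUIRED_KEYS:
            if key not in definition:
                problems.append("{}: missing {}".format(name, key))
    # The collection script indexes configuration["removeclass"] directly.
    if not isinstance(config.get("removeclass"), list):
        problems.append("removeclass must be a list")
    return problems


# A two-entry excerpt of the configuration shown above.
sample = json.loads("""
{
  "classdefs": {
    "Board Power": {"description": "Total board power reading", "unit": "W"},
    "HBM Voltage": {"description": "Voltage reading for HBM", "unit": "V"}
  },
  "removeclass": ["FPGA"]
}
""")
print(config_problems(sample))
```

To check the real file, load it with json.load and pass the result to config_problems.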

The script that gathers the sensor data from the FPGA card should be placed under /cm/local/apps/cmd/scripts/metrics/ on the node where the FPGA card is installed. It can be set up as follows:

# cat sample_fpga_bittware.py
#!/cm/local/apps/python3/bin/python

import sys
import json
import subprocess
import re
import os

# Raw strings keep the regex backslashes from being read as string escapes
LINE_MATCH_PATTERN = r'\([0-9]*[0-9]*\) ([a-zA-Z0-9\. ]+).*'
DATE_PATTERN = r'^(3[01]|[12][0-9]|0[1-9])-(1[0-2]|0[1-9])-[0-9]{4}'

CONFIGFILE = "/cm/local/apps/cmd/scripts/metrics/configfiles/sample_fpga_bittware.json"
BWMONITOR_CMD = ["/opt/bwtk/2020.1/bin/bwmonitor", "--dev=0", "--read", "--type=Sensor"]

def parse_line(line):
    # Do not parse if line contains date
    datematch = re.match(DATE_PATTERN, line)

    if not datematch and "sdr sensors" not in line.lower() and "available" not in line.lower():
        title, value = line.replace('OK', ':').split(':', 1)
        value, unit = value.strip().replace('degrees C', 'C').split(" ", 1)
        regexmatch = re.match(LINE_MATCH_PATTERN, title.replace('  ', ' ').replace('-', ' ').strip())
        if regexmatch:
            regexgroups = regexmatch.groups()
            title = regexgroups[0].strip()
        return title, value, unit
    else:
        return False, False, False


def is_defined(line, definelist):
    match = False
    matchingpattern = ""

    for pattern in definelist:
        if re.search(pattern, line):
            match = True
            matchingpattern = pattern

    return match, matchingpattern


def parse_output(fpga_out, configuration):
    data = {}

    for line in fpga_out.split('\n'):
        match, pattern = is_defined(line, configuration["classdefs"])
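        if match:
            try:
                # parse_line raises ValueError for lines it cannot split,
                # e.g. a sensor whose status is not "OK"; skip those lines
                title, value, unit = parse_line(line)
                if title:
                    data[title] = {"pattern": pattern, "title": title, "value": float(value), "unit": unit}
                    data[title].update(configuration["classdefs"][pattern])
            except ValueError:
                continue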

    return data


def parse_metrics(output, configuration):
    metrics_data = parse_output(output, configuration)
    metrics = []

    for metric in metrics_data:
        metricname = "FPGA.Bittware.{}".format(metrics_data[metric]["title"].replace(" ", "."))
        for rmclass in configuration["removeclass"]:
            metricname = metricname.replace(".{}.".format(rmclass), ".")
        metrics_data[metric]["metric"] = metricname
        metrics_data[metric]["class"] = "FPGA/Bittware"
        del(metrics_data[metric]["pattern"])
        del(metrics_data[metric]["title"])

        metrics.append(metrics_data[metric])

    return metrics

def run_info_command():
    command_out = subprocess.check_output(BWMONITOR_CMD, universal_newlines=True)
    return command_out

def initialize(configuration):
    initialized_metrics = []
    infocmd_out = run_info_command()
    metrics = parse_metrics(infocmd_out, configuration)
    for metric in metrics:
        data = metric
        del(data["value"])
        initialized_metrics.append(data)

    return initialized_metrics

def sample(configuration):
    sampled_metrics = []
    infocmd_out = run_info_command()
    metrics = parse_metrics(infocmd_out, configuration)
    for metric in metrics:
        data = {}
        data["metric"] = metric["metric"]
        data["value"] = metric["value"]
        sampled_metrics.append(data)

    return sampled_metrics

def main():
    metrics = []

    # open() is what raises FileNotFoundError, so it must sit inside the try block
    try:
        with open(CONFIGFILE) as configfile:
            configuration = json.load(configfile)
    except FileNotFoundError:
        print("Configfile \"{}\" not found.".format(CONFIGFILE))
        sys.exit(1)

    if "--initialize" in sys.argv:
        metrics = initialize(configuration)
    else:
        metrics = sample(configuration)

    print(json.dumps(metrics, indent=4))


if __name__ == '__main__':
    main()
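
The metric names that the script reports are derived from the sensor titles. The following standalone sketch reproduces that derivation as it appears in parse_line and parse_metrics above, so the resulting names can be predicted without running bwmonitor. The helper name metric_name is hypothetical and is not part of the script.

```python
def metric_name(sensor_title, removeclass, prefix="FPGA.Bittware"):
    """Derive the metric name the collection script builds for a sensor title."""
    # parse_line replaces hyphens in the title with spaces.
    title = sensor_title.replace("-", " ")
    # parse_metrics joins the words with dots under the fixed prefix.
    name = "{}.{}".format(prefix, title.replace(" ", "."))
    # Each "removeclass" word is dropped when it appears as a middle component.
    for rmclass in removeclass:
        name = name.replace(".{}.".format(rmclass), ".")
    return name


for title in ("FPGA Core Voltage", "DIMM-1 Temperature", "3.3v MP Voltage"):
    print(metric_name(title, ["FPGA"]))
```

Because the prefix itself starts with FPGA without a leading dot, only the repeated FPGA component taken from the sensor title is removed.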

The collection may be added to Bright using cmsh on the cluster’s active head node as follows:

# cmsh
% monitoring setup
% add collection Bittware\ FPGA
% set description Samples\ metrics\ of\ Bittware\ FPGAs
% set script /cm/local/apps/cmd/scripts/metrics/sample_fpga_bittware.py
% set timeout 30s
% nodeexecutionfilters
% add node bittware_fpga_nodes
% set nodes gpu029
% ..
% ..
% commit

A minute or two later, you should be able to view the results as follows:

# cmsh
% device use gpu029
% latestmetricdata | grep FPGA
FPGA.Bittware.12v.Cable.Current                    FPGA/Bittware  1 A                       2m 35s
FPGA.Bittware.12v.Cable.Voltage                    FPGA/Bittware  12.12 V                   2m 35s
FPGA.Bittware.12v.PCIe.Current                     FPGA/Bittware  1.09 A                    2m 35s
FPGA.Bittware.12v.PCIe.Voltage                     FPGA/Bittware  12.01 V                   2m 35s
FPGA.Bittware.3.3v.MP.Current                      FPGA/Bittware  2.4 A                     2m 35s
FPGA.Bittware.3.3v.MP.Voltage                      FPGA/Bittware  3.3 V                     2m 35s
FPGA.Bittware.3.3v.MP2.Current                     FPGA/Bittware  1.54 A                    2m 35s
FPGA.Bittware.3.3v.MP2.Voltage                     FPGA/Bittware  3.3 V                     2m 35s
FPGA.Bittware.Board.Power                          FPGA/Bittware  26 W                      2m 35s
FPGA.Bittware.Board.Temperature                    FPGA/Bittware  30 C                      2m 35s
FPGA.Bittware.Core.Current.0                       FPGA/Bittware  10.29 A                   2m 35s
FPGA.Bittware.Core.Temperature                     FPGA/Bittware  37 C                      2m 35s
FPGA.Bittware.Core.Voltage                         FPGA/Bittware  0.85 V                    2m 35s
FPGA.Bittware.DIMM.1.Temperature                   FPGA/Bittware  25 C                      2m 35s
FPGA.Bittware.DIMM.12.Vpp.Voltage                  FPGA/Bittware  2.46 V                    2m 35s
FPGA.Bittware.DIMM.12.Vtt.Voltage                  FPGA/Bittware  0.59 V                    2m 35s
FPGA.Bittware.DIMM12.Current                       FPGA/Bittware  0.06 A                    2m 35s
FPGA.Bittware.DIMM12.Voltage                       FPGA/Bittware  1.19 V                    2m 35s
FPGA.Bittware.HBM.Current                          FPGA/Bittware  0.13 A                    2m 35s
FPGA.Bittware.HBM.Voltage                          FPGA/Bittware  1.19 V                    2m 35s
FPGA.Bittware.HBM.Vpp.Voltage                      FPGA/Bittware  2.46 V                    2m 35s
FPGA.Bittware.HBM.Vtt.Voltage                      FPGA/Bittware  0.59 V                    2m 35s
FPGA.Bittware.Slave.Supply.Temp.0                  FPGA/Bittware  44 C                      2m 35s
FPGA.Bittware.Slave.Supply.Temp.1                  FPGA/Bittware  46 C                      2m 35s
FPGA.Bittware.Supply.Die.Temp                      FPGA/Bittware  45 C                      2m 35s
FPGA.Bittware.Supply.Inductor.Temp.0               FPGA/Bittware  33 C                      2m 35s
FPGA.Bittware.Supply.Inductor.Temp.1               FPGA/Bittware  33 C                      2m 35s
FPGA.Bittware.Vcc.AUX.Current                      FPGA/Bittware  0.91 A                    2m 35s
FPGA.Bittware.Vcc.AUX.Voltage                      FPGA/Bittware  1.76 V                    2m 35s
FPGA.Bittware.Vcc.VBRAM.Current                    FPGA/Bittware  0.07 A                    2m 35s

To make the setup persist across node reboots, it is recommended that the metric collection script and JSON configuration file be stored in the node’s software image.
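
One way to place the two files in the image is sketched below, to be run on the head node. It assumes the example image path /cm/images/fpga-image used earlier in this article and that both files are available in a single source directory; the function names image_destination and copy_into_image are hypothetical.

```python
import os
import shutil

# Paths taken from this article; fpga-image is the article's example image name.
IMAGE_ROOT = "/cm/images/fpga-image"
NODE_FILES = [
    "/cm/local/apps/cmd/scripts/metrics/sample_fpga_bittware.py",
    "/cm/local/apps/cmd/scripts/metrics/configfiles/sample_fpga_bittware.json",
]


def image_destination(image_root, node_path):
    """Map an absolute path on the node to the same path inside the image."""
    return os.path.join(image_root, node_path.lstrip("/"))


def copy_into_image(source_dir, image_root=IMAGE_ROOT, node_files=NODE_FILES):
    """Copy each file (looked up by base name in source_dir) into the image."""
    copied = []
    for node_path in node_files:
        destination = image_destination(image_root, node_path)
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        source = os.path.join(source_dir, os.path.basename(node_path))
        shutil.copy2(source, destination)  # copy2 also preserves the mode bits
        copied.append(destination)
    return copied


# Show where the files would end up; nothing is copied at this point.
for path in NODE_FILES:
    print(image_destination(IMAGE_ROOT, path))
```

Calling copy_into_image("/root"), for example, would copy both files from /root on the head node into the image.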

Updated on June 20, 2023