How to upgrade DGX A100 Firmware from headnode

This article describes how to stage the NVIDIA DGX A100 Firmware Update Utility for PXE booting from the BCM headnode. Please take a moment to read through the DGX A100 firmware update documentation to understand the overall process.

The procedure below is loosely based on the firmware update ISO method, adapted to PXE boot the DGX A100 firmware update image with the headnode acting as the DHCP and PXE server. This lets users set up the firmware update image once and update several DGX A100 systems in a cluster.

Staging the files

The DGX A100 firmware updates can be downloaded from the NVIDIA Enterprise Support Portal. Log in to the portal, then go to Download -> Servers/Workstations -> DGX -> DGX A100 -> Firmware. If you cannot access the portal, please reach out to NVIDIA support (enterprisesupport@nvidia.com) for assistance.

We will use the DGX A100 System Firmware Update Version 24.6.1 in the examples below.

Note: The procedure below uses the PXE netboot firmware download. Alternatively, the necessary files can be extracted from the ISO image, as described in steps 1 and 2 of the documentation.

  1. From the NVIDIA Enterprise Support Portal, download the DGX A100 firmware update PXE netboot file and copy it to the headnode:
    # scp pxeboot-DGXA100_FWUI-24.6.1.tgz <user>@<headnode_ip>:/tmp
  2. On the headnode, extract the firmware tar file contents to /tmp:
    # cd /tmp
    # tar xzvf pxeboot-DGXA100_FWUI-24.6.1.tgz
    # tree netboot
    netboot
    └── files
        ├── filesystem.squashfs
        ├── initrd
        ├── pxelinux.cfg
        │   └── default
        └── vmlinuz
  3. Create a directory for the firmware update files. It is best to include the firmware version in the directory name:
    # mkdir /tftpboot/a100fw_24.6.1
  4. Move the necessary files to the tftpboot directory created in the previous step and create the symlinks for the BIOS and EFI boot paths:
    # mv /tmp/netboot/files/initrd /tftpboot/a100fw_24.6.1
    # mv /tmp/netboot/files/vmlinuz /tftpboot/a100fw_24.6.1
    # mv /tmp/netboot/files/filesystem.squashfs /tftpboot/a100fw_24.6.1
    # cd /tftpboot/x86_64/bios
    # ln -s ../../a100fw_24.6.1 a100fw_24.6.1
    # cd /tftpboot/x86_64/efi64
    # ln -s ../../a100fw_24.6.1 a100fw_24.6.1
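
Steps 3 and 4 can also be wrapped in a small helper so that the version string is typed only once. The following is a minimal sketch, not part of the NVIDIA or BCM tooling: the function name stage_a100fw and its three arguments are hypothetical, and it assumes the tar file has already been extracted as in step 2.

```shell
# Hypothetical helper (not an NVIDIA/BCM tool): stage the extracted firmware
# files under a TFTP root and create the BIOS/EFI symlinks.
# Usage: stage_a100fw SRC_DIR TFTP_ROOT VERSION
#   e.g. stage_a100fw /tmp/netboot/files /tftpboot 24.6.1
stage_a100fw() {
    src=$1; root=$2; ver=$3
    dest="$root/a100fw_$ver"

    # Stop before touching anything if an expected file is missing.
    for f in initrd vmlinuz filesystem.squashfs; do
        [ -f "$src/$f" ] || { echo "missing: $src/$f" >&2; return 1; }
    done

    mkdir -p "$dest" || return 1
    for f in initrd vmlinuz filesystem.squashfs; do
        mv "$src/$f" "$dest/" || return 1
    done

    # Relative symlinks so that the BIOS and EFI paths resolve to the same files.
    for arch_dir in x86_64/bios x86_64/efi64; do
        mkdir -p "$root/$arch_dir" || return 1
        ln -sfn "../../a100fw_$ver" "$root/$arch_dir/a100fw_$ver" || return 1
    done
}
```

Checking for all three files before moving any of them avoids leaving a half-populated directory behind if the archive layout differs from the one shown above.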

Prepare the PXE boot entries

Next, add template entries to support both BIOS and EFI boot methods. Make sure the label string (a100fw_24.6.1 in the examples below, used in the CMD_PXE_LABEL test of the pxelinux entry and as the --id of the grub entry) matches in both template configurations. This string is used later as the pxelabel for booting the nodes.

  1. Add the new entry to /tftpboot/pxelinux.cfg/template and save:
    LABEL A100FW
    KERNEL a100fw_24.6.1/vmlinuz
    IPAPPEND 3
    MENU LABEL ^A100FW_24.6.1 - Launch A100 FW Update OS
    APPEND vga=normal initrd=a100fw_24.6.1/initrd console=${CMD_CONSOLE} boot=live apparmor=0 elevator=noop nvme-core.multipath=n nouveau.modeset=0 boot-live-env start-systemd-networkd fetch=http://${CMD_SERVER_IP}:8080/tftpboot/a100fw_24.6.1/filesystem.squashfs 
    $(CMD_PXE_LABEL=a100fw_24.6.1?MENU DEFAULT:)
  2. Add the new entry to /tftpboot/grub.cfg/template and save:
    menuentry 'A100FW_24.6.1 - Update DGX A100 Firmware' --id 'a100fw_24.6.1' {
      echo 'Loading Linux'
      linux ${CMD_PROTOCOL}/a100fw_24.6.1/vmlinuz rw root=/dev/ram0 nokeymap BOOTIF=01-${mac_dashed} ip=${net_default_ip}:${net_default_server}:0x${gw_hex}:0x${nm_hex} vga=normal initrd=a100fw_24.6.1/initrd console=${CMD_CONSOLE} boot=live apparmor=0 elevator=noop nvme-core.multipath=n nouveau.modeset=0 boot-live-env start-systemd-networkd fetch=http://${CMD_SERVER_IP}:8080/tftpboot/a100fw_24.6.1/filesystem.squashfs
      echo 'Loading initrd'
      initrd ${CMD_PROTOCOL}/a100fw_24.6.1/initrd
    }
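
Because a mismatch between the two templates is easy to miss, a quick grep can confirm that the same label string appears in both. This is a rough sketch only: check_pxelabel is a hypothetical helper name, and it assumes the entries follow the exact form shown above (a CMD_PXE_LABEL test in the pxelinux template and an --id in the grub template).

```shell
# Hypothetical helper: confirm the same label is present in both templates.
# Usage: check_pxelabel LABEL PXELINUX_TEMPLATE GRUB_TEMPLATE
#   e.g. check_pxelabel a100fw_24.6.1 /tftpboot/pxelinux.cfg/template /tftpboot/grub.cfg/template
check_pxelabel() {
    label=$1; pxe_tpl=$2; grub_tpl=$3

    # pxelinux template: the label is tested in the CMD_PXE_LABEL conditional.
    grep -qF -e "CMD_PXE_LABEL=$label?" "$pxe_tpl" \
        || { echo "label '$label' not found in $pxe_tpl" >&2; return 1; }

    # grub template: the label is the --id of the menuentry.
    grep -qF -e "--id '$label'" "$grub_tpl" \
        || { echo "label '$label' not found in $grub_tpl" >&2; return 1; }
}
```

The function returns non-zero and names the offending file when the label is missing from either template.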

Configure the devices to use the PXE boot entry

Finally, we need to configure each device (or category) to boot with the new PXE entry. The pxelabel is set to the label string used in both templates above (a100fw_24.6.1 in this example). For example, to configure a single node to boot into the firmware update image specified in the configurations above:

# cmsh
% device use dgx-01
set pxelabel a100fw_24.6.1
commit

This will generate a new file in tftpboot based on the template:

root@headnode:~# ls -l /tftpboot/grub.cfg/
total 32
-rw-r--r-- 1 root root 4055 Aug 16 13:19 category.default
-rw-r--r-- 1 root root 4715 Aug 16 13:19 category.dgx-a100   <-- category-specific pxeboot entry
-rw-r--r-- 1 root root 4754 Aug 16 13:19 node.dgx-01         <-- node-specific pxeboot entry
-rw-r--r-- 1 root root 3808 Aug 16 13:15 template
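
When several nodes need the same pxelabel, the cmsh session shown earlier can be run non-interactively with cmsh -c. The loop below is a sketch under assumptions: set_pxelabel and the CMSH override variable are hypothetical names introduced here, and the node names are examples.

```shell
# Hypothetical helper: set the same pxelabel on several nodes.
# Usage: set_pxelabel LABEL NODE...
#   e.g. set_pxelabel a100fw_24.6.1 dgx-01 dgx-02
# CMSH is an assumed override so the command can be replaced for a dry run.
set_pxelabel() {
    label=$1; shift
    for node in "$@"; do
        # Same three commands as the interactive session, one node at a time.
        "${CMSH:-cmsh}" -c "device use $node; set pxelabel $label; commit" || return 1
    done
}
```

Committing per node keeps a failure on one node from leaving uncommitted changes for the others.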

To update the firmware, you must reboot this node. It should automatically boot using the new pxelabel for the firmware update image. 

Open the BMC Web Interface -> remote control console and complete the firmware update process using the documentation for the interactive update. Optionally, a user can SSH to the booted firmware update OS, where the username is ‘fwui’ and the password is ‘fw_update’. The firmware update can be started by executing ‘update_fw all’.

Before rebooting the DGX A100 after the firmware upgrade, we MUST clear the pxelabel for that device; otherwise the node will boot back into the firmware update OS.

# cmsh
% device use dgx-01
clear pxelabel
commit
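
If a batch of nodes was updated, the pxelabel has to be cleared on each of them. A minimal non-interactive sketch follows; clear_pxelabel and the CMSH override variable are hypothetical names, and the node names are examples.

```shell
# Hypothetical helper: clear the pxelabel on each listed node.
# Usage: clear_pxelabel NODE...
#   e.g. clear_pxelabel dgx-01 dgx-02
# CMSH is an assumed override so the command can be replaced for a dry run.
clear_pxelabel() {
    for node in "$@"; do
        "${CMSH:-cmsh}" -c "device use $node; clear pxelabel; commit" || return 1
    done
}
```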

 

Updated on October 17, 2024
