  • How do I add a QCOW image as a software image?

    A software image is a directory on the head node that is used to provision compute nodes. It contains a full Linux filesystem. To create a new software image from a QCOW image, we must first mount the QCOW image and copy its contents. This can be done…

  • How do I upgrade CMDaemon and DCGM packages on head nodes and compute nodes?

    August 2021: Recently uncovered security issues make it necessary to update the cmdaemon and cuda-dcgm packages to mitigate known software defects. The instructions below explain how to identify affected sites and update them to the new packages. We strongly encourage users to update these packages at their earliest convenience. Am I affected?…

  • How do I upgrade to Bright 9.1?

    The upgrade procedure was originally published in parallel with Bright 9.1-7. Please take the time to read the upgrade manual (https://support.brightcomputing.com/upgrade-manuals/9.1/upgrade-manual.pdf) completely before proceeding with the upgrade. As always, please feel free to reach out to the support staff if you need assistance.

  • How do I create an edge test setup?

    Edge setups are characterized by having computational resources in multiple geographic locations. Staging such an environment in a single lab for evaluation or testing purposes can be remarkably challenging. In this article we will describe a configuration that can be used to build a Bright setup spanning several edge…

  • How can I get access to nightly builds of packages?

    The packages you will find in the Bright repositories have gone through a QA process. Updated packages are released roughly every 3-4 weeks for the latest version of Bright. Older versions of Bright will receive updates less frequently. It may be desirable to have access to the latest version of…

  • General considerations for installing a Bright DGX cluster

    Loading the correct kernel modules: If you are going to use the built-in gigabit Ethernet interface as your internal cluster network between the head node(s) and the DGX nodes, no special steps are needed to load kernel modules. This is because the igb module…

  • How should I set up Slurm on a DGX cluster?

    A workload management system helps schedule jobs on a cluster of nodes. The steps below describe how to set up Slurm so that GPUs have to be explicitly requested. This way it becomes much easier to share GPU and CPU compute…

  • How can I run a simple test to stress test my GPUs?

    Make sure CUDA, git and cmake are installed on the head node of the cluster. Clone the Multi GPU Benchmark (mgbench) repository under a user account (e.g. cmsupport). Load the CUDA environment module. Build it. Create a file mgbench.slurm. Submit a number of jobs. Each job…

  • How do I validate that my DGX cluster is working properly?

    One of the best ways to stress test your DGX cluster is to use NVIDIA’s HPC benchmarks, which can be found in NGC. Since this software is packaged as a container image, we will need to use a container runtime engine such as Singularity to run it. It is worth…

  • How can I have multiple network interfaces on a node in the same IP subnet?

    When you configure multiple network interfaces on a single machine with IP addresses in the same IP subnet, you will need to do some additional configuration work to allow the networking stack in the Linux kernel to use these interfaces properly. By default you will find that only one…