This article has been written for existing Kubernetes deployments in Bright versions <= 9.2.
Prerequisites
This article was written with the following in mind.
- OS: Red Hat Enterprise Linux 8 or Ubuntu 20.04.
- Bright Cluster Manager 9.0, 9.1, 9.2.
- Kubernetes version >= 1.21 is installed and operational.
- Kubernetes GPU nodes are provisioned with a supported DGX Software Image (such as RHEL 8 or Ubuntu 20.04).
Regarding the DGX software images: the ones shipped with Bright already have the DGX A100 configuration and the NVIDIA drivers pre-installed. If yours do not, please install them first.
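If in doubt, a quick way to check is to query the package database inside the image. The image path below is the example used in the next section; adjust it to your deployment, and on Ubuntu-based images use dpkg -l instead of rpm -qa.
chroot /cm/images/dgx-a100-image/ rpm -qa | grep -iE 'cuda-driver|nvidia'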
Prepare the DGX software image(s)
The following small changes need to be made to the relevant software images. They are expected to no longer be needed in future versions of Bright Cluster Manager, and we will update this KB article when that is the case; at the time of writing they are still necessary. Please apply them to each of the software images if multiple are in use by the Kubernetes deployment.
[root@headnode ~]# chroot /cm/images/dgx-a100-image/
Once inside the software image, execute the following commands.
# Make the NVIDIA container toolkit libraries resolvable by the dynamic linker
echo "/usr/local/nvidia/toolkit" > /etc/ld.so.conf.d/nvidia.conf
ldconfig
# The NVIDIA container runtime configuration references ldconfig at /sbin/ldconfig.real
ln -s /usr/sbin/ldconfig /sbin/ldconfig.real
# Make the CNI plugins shipped with Bright's Kubernetes available in the standard location
mkdir -p /opt/cni/bin
rsync -raPv /cm/local/apps/kubernetes/current/bin/cni/ /opt/cni/bin/
exit
As can be seen from the rsync command, we assume in this KB article that Kubernetes has already been installed.
At this point ensure that the nodes are provisioned with these changes via an imageupdate or reboot.
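For example, an image update can be pushed from the head node along these lines (the node names are placeholders for your GPU nodes, and -w waits for the update to complete):
cmsh -c "device; imageupdate -w -n dgx01..dgx04"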
Prepare the non-DGX software image(s)
In case you are following this KB article but are dealing with non-DGX software images, please still follow the steps in the previous section for those images. Additionally, it is recommended to make sure that the NVIDIA driver is present in advance. One way to do this is to install the following Bright packages into the software image: cuda-driver and cm-nvidia-container-toolkit.
yum install cm-nvidia-container-toolkit cuda-driver -y
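On Ubuntu-based software images, or when installing from the head node rather than from inside the image, something along these lines can be used instead (the package names are assumed to be the same and the image path is a placeholder):
chroot /cm/images/<software-image>/ apt install -y cm-nvidia-container-toolkit cuda-driver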
Deployment of the GPU operator
From the Head Node, execute the following.
module load kubernetes/default/1.21.4
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
Now continue with the appropriate section for the chosen runtime for Kubernetes. If deployed with the containerd runtime, continue with the next section; for docker, continue with the section after that. Use kubectl get nodes -o wide to see the runtime per Kubernetes node.
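Alternatively, the runtime per node can be listed directly with a jsonpath query:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'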
containerd runtime
In case Kubernetes is using the containerd runtime, use the following helm install. Otherwise skip this section and continue with the docker runtime section instead.
helm install --wait -n gpu-operator --create-namespace \
--version v1.10.1 \
--set driver.enabled=false \
--set operator.defaultRuntime=containerd \
--set toolkit.enabled=true \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml \
gpu-operator nvidia/gpu-operator
The CONTAINERD_CONFIG environment variable is overridden so that the toolkit cooperates with the existing containerd configuration that Bright already manages at that location. (See https://github.com/containerd/containerd/issues/5837 for more details on why the default config path has to be modified.)
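Once the toolkit DaemonSet has run on a GPU node, the generated drop-in should appear at the path CONTAINERD_CONFIG points to. A quick, illustrative check on a GPU node:
ls -l /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml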
docker runtime
Use the following helm install in the case of the docker runtime (which, at the time of writing, is also the default for the NVIDIA GPU operator if left unspecified).
helm install --wait -n gpu-operator --create-namespace \
--version v1.10.1 \
--set driver.enabled=false \
--set operator.defaultRuntime=docker \
--set toolkit.enabled=true \
gpu-operator nvidia/gpu-operator
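After the toolkit Pod has run on a GPU node, the NVIDIA runtime should be registered with Docker. A quick way to check on the node itself (output format may vary by Docker version):
docker info | grep -i runtime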
validation steps
Run some sanity checks to see if the NVIDIA GPU operator is functioning correctly before continuing with the next part.
- kubectl get pod -n gpu-operator – see if all Pods are up and running.
- kubectl describe nodes | grep nvidia.com/gpu.count – see if GPUs are known.
Example output:
[root@headnode ~]# kubectl describe nodes | grep nvidia.com/gpu.count
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.count=1
Continue with checking if the dcgm-exporter part is working too.
- kubectl run -i -t --rm busybox --image=busybox --restart=Never -n gpu-operator /bin/sh – start a shell in the gpu-operator namespace.
- wget -O - http://nvidia-dcgm-exporter:9400/metrics – execute this inside the shell.
Example output:
...
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-44c419ef-98bd-f5b5-5360-655b79ba4e69",device="nvidia0",modelName="Tesla V100-SXM3-32GB",Hostname="nvidia-dcgm-exporter-dpgkt",container="",namespace="",pod=""} 135
...
- exit – exit the shell.
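Optionally, a one-off test Pod can confirm that a GPU can actually be scheduled and used. The CUDA image tag below is only an example; substitute any CUDA base image reachable from your cluster.
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs gpu-smoke-test    # once the Pod has completed
kubectl delete pod gpu-smoke-test
If the GPU stack is working end to end, the Pod logs should show the familiar nvidia-smi table.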
Deployment of the Prometheus Operator Stack
First prepare a values.yaml file.
cat << EOF > values.yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: 'dcgm-exporter'
      metrics_path: '/metrics'
      static_configs:
      - targets: ['nvidia-dcgm-exporter.gpu-operator.svc:9400']
EOF
You can include an additional block for grafana in this values.yaml file if you intend to also expose the grafana dashboard later through path-based ingress, such as https://<headnode_ip>:<ingress_tls_port>/grafana/. See below.
cat << EOF > values.yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: 'dcgm-exporter'
      metrics_path: '/metrics'
      static_configs:
      - targets: ['nvidia-dcgm-exporter.gpu-operator.svc:9400']
grafana:
  grafana.ini:
    server:
      root_url: "%(protocol)s://%(domain)s:%(http_port)s/grafana/"
      serve_from_sub_path: true
EOF
Proceed with the helm install.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install -f ./values.yaml --version 35.5.1 \
--create-namespace --namespace prometheus \
prometheus-operator prometheus-community/kube-prometheus-stack
validation steps
Check that all the Pods in the prometheus namespace are Running without any issues:
kubectl get pod -n prometheus -o wide
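To confirm that Prometheus picked up the additional dcgm-exporter scrape job, the Prometheus API can be queried through a temporary port-forward (illustrative):
kubectl -n prometheus port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | grep -c dcgm-exporter
kill %1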
Deployment of the Prometheus Adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install --version 3.3.1 --set rbac.create=true,prometheus.url=http://prometheus-operated.prometheus.svc.cluster.local,prometheus.port=9090 \
prometheus-adapter prometheus-community/prometheus-adapter
validation steps
If the Prometheus Adapter is working correctly, custom metrics should become available.
root@headnode:~# kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | wc -c
411318
If there are already Pods using GPUs, you should be able to find DCGM-exported metrics as well. The section Extra: Horizontal Pod Autoscaling with GPU Metrics below can be used to run an example Pod so that DCGM-exported metrics appear in the output, as follows.
root@headnode:~# kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq . | grep DCGM_FI_DEV_GPU_UTIL
"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
"name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
"name": "pods/DCGM_FI_DEV_GPU_UTIL",
In case you do not have “jq”, omit it from the command and just see if the metric is in the output or not.
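A more targeted query, once a GPU Pod is running, would look something like the following (the path follows the custom metrics API convention and returns an empty list until DCGM has metrics for a Pod in the default namespace):
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL"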
The next section will show an example of how this can be used in practice.
Extra: Horizontal Pod Autoscaling with GPU Metrics
Let’s create an example deployment and autoscaler in the file gpudeploy.yaml.
cat << EOF > gpudeploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: gpu
  name: gpu
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - image: k8s.gcr.io/cuda-vector-add:v0.1
        command: ["/bin/bash", "-c", "sleep infinity"]
        imagePullPolicy: IfNotPresent
        name: cuda-vector-add
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-gpu
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metricName: DCGM_FI_DEV_GPU_UTIL
      targetAverageValue: 40
EOF
The above creates a Deployment with one replica, and an autoscaler configuration using the DCGM_FI_DEV_GPU_UTIL metric.
The threshold is set to 40; if this value is exceeded, the autoscaler should increase the number of replicas (minimum 1, maximum 3). Let’s apply the YAML.
kubectl apply -f gpudeploy.yaml
The deployment we just applied should result in one idle pod that holds on to one of the GPUs.
[root@headnode ~]# kubectl get pod -l app=gpu
NAME                   READY   STATUS    RESTARTS   AGE
gpu-6bbf7bb786-mgh8l   1/1     Running   1          4h25m
The Horizontal Pod Autoscaler will look as follows.
[root@headnode ~]# kubectl get hpa
NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-gpu   Deployment/gpu   0/40      1         3         1          114m
We can see that the current value of the metric we configured is zero, below the threshold of 40: the GPU is not being utilized at all.
Let’s exec into the pod and start some work by modifying the vectorAdd example.
# kubectl exec -it gpu-6bbf7bb786-mgh8l -- /bin/bash
Once inside the shell paste the following.
sed -ibak 's/vectorAdd<<</while(true)vectorAdd<<</g' vectorAdd.cu
make && ./vectorAdd
This should make the GPU pretty busy, and we can see this reflected in the autoscaler.
NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-gpu   Deployment/gpu   67/40     1         3         1          133m
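While the loop is running, the reported value can also be followed in near real time with:
kubectl get hpa hpa-gpu -w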
After some time, the events for the autoscaler (visible with kubectl describe hpa hpa-gpu) will show that the metric was found above target.
Normal SuccessfulRescale 52m horizontal-pod-autoscaler New size: 2; reason: pods metric DCGM_FI_DEV_GPU_UTIL above target
This results in an extra replica being scheduled.
NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-gpu   Deployment/gpu   0/40      1         3         2          122m
The target value needs some time to update, since the metric now has to be queried for two pods. The output may change to a non-zero value once the metric for the second replica has been read.
Now if we stop the running vectorAdd, we can see the hpa-gpu go back to a value of 0/40 again, and after some delay it will scale back to the minimum number of replicas (1).
Normal SuccessfulRescale 2s (x2 over 45m) horizontal-pod-autoscaler New size: 1; reason: All metrics below target
Extra: Exposing Grafana via (path-based) Ingress
The following workaround is currently needed; it will soon be fixed, and we will update this KB article once the workaround is no longer required.
In the firewall role for the head node(s), if the head node is part of the Kubernetes cluster, modify the “cali+” interface (a regex) as follows.
[headnode]% device use master
[headnode->device[headnode]]% roles
[headnode->device[headnode]->roles]% use firewall
[headnode->device[headnode]->roles[firewall]]% interfaces
[headnode->device[headnode]->roles[firewall]->interfaces]% list
Index  Zone   Interface    Broadcast    Options
------ ------ ------------ ------------ ------------
0      cal    tunl0
1      cal    cali+
[headnode->device[headnode]->roles[firewall]->interfaces]% use 1
[headnode->device[headnode]->roles[firewall]->interfaces[1]]% set broadcast detect
[headnode->device*[headnode*]->roles*[firewall*]->interfaces*[1*]]% set options routeback
[headnode->device*[headnode*]->roles*[firewall*]->interfaces*[1*]]% commit
This will result in a restart of Shorewall. Do this for each Head Node in the case of a Bright HA setup.
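Optionally, verify on the head node that the change made it into the generated Shorewall configuration (assuming the standard /etc/shorewall layout):
grep cali /etc/shorewall/interfaces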
Now we will create the YAML definition for the Ingress rule.
cat << EOF > grafanaingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.provider: nginx
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "900"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
    nginx.ingress.kubernetes.io/rewrite-target: /\$2
  name: prometheus-grafana
  namespace: prometheus
spec:
  rules:
  - http:
      paths:
      - backend:
          service:
            name: prometheus-operator-grafana
            port:
              number: 80
        path: /grafana(/|$)(.*)
        pathType: Prefix
EOF
And we will apply it.
kubectl apply -f grafanaingress.yaml
validation steps
If ingress is running on port 30443, go to a node that is part of the Kubernetes cluster and try it out as follows. Expected output is included below.
[root@headnode ~]# curl -k https://localhost:30443/grafana/
<a href="/grafana/login">Found</a>.
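To log in to the Grafana dashboard, the admin password can be retrieved from the secret created by the chart (the secret name is derived from the Helm release name used above, so it may differ in your setup):
kubectl get secret -n prometheus prometheus-operator-grafana -o jsonpath='{.data.admin-password}' | base64 -d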