This article has been written for existing Kubernetes deployments in Bright versions <= 9.2.
Prerequisites
This article was written with the following in mind.
- OS: Red Hat Enterprise Linux 8 or Ubuntu 20.04.
- Bright Cluster Manager 9.0, 9.1, 9.2.
- Kubernetes version >= 1.21 is installed and operational.
- Kubernetes GPU nodes are provisioned with a supported DGX Software Image (such as RHEL 8 or Ubuntu 20.04).
Regarding the DGX software images: the ones shipped with Bright already have the DGX A100 configuration and the NVIDIA drivers pre-installed. If yours do not, please install them first.
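If in doubt, a quick way to check is to query the package database inside the image. The image path below is the example used in the next section; adjust it to your deployment, and on Ubuntu-based images use dpkg -l instead of rpm -qa.
chroot /cm/images/dgx-a100-image/ rpm -qa | grep -iE 'cuda-driver|nvidia'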
Prepare the DGX software image(s)
The following small changes need to be made to the relevant software images. They are expected to no longer be needed in future versions of Bright Cluster Manager, and we will update this KB article when that is the case; at the time of writing they are still necessary. Please apply them to each of the software images if multiple are in use by the Kubernetes deployment.
[root@headnode ~]# chroot /cm/images/dgx-a100-image/
Once inside the software image, execute the following commands.
# Make the NVIDIA container toolkit libraries resolvable by the dynamic linker
echo "/usr/local/nvidia/toolkit" > /etc/ld.so.conf.d/nvidia.conf
ldconfig
# The NVIDIA container runtime configuration references ldconfig at /sbin/ldconfig.real
ln -s /usr/sbin/ldconfig /sbin/ldconfig.real
# Make the CNI plugins shipped with Bright's Kubernetes available in the standard location
mkdir -p /opt/cni/bin
rsync -raPv /cm/local/apps/kubernetes/current/bin/cni/ /opt/cni/bin/
exit
As can be seen from the rsync command, we assume in this KB article that Kubernetes has already been installed.
At this point ensure that the nodes are provisioned with these changes via an imageupdate or reboot.
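For example, an image update can be pushed from the head node along these lines (the node names are placeholders for your GPU nodes, and -w waits for the update to complete):
cmsh -c "device; imageupdate -w -n dgx01..dgx04"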
Prepare the non-DGX software image(s)
In case you are following this KB article but are dealing with non-DGX software images, please still follow the steps in the previous section for those images. Additionally, it is recommended to make sure that the NVIDIA driver is present in advance. One way to do this is to install the following Bright packages into the software image: cuda-driver and cm-nvidia-container-toolkit.
yum install cm-nvidia-container-toolkit cuda-driver -y
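On Ubuntu-based software images, or when installing from the head node rather than from inside the image, something along these lines can be used instead (the package names are assumed to be the same and the image path is a placeholder):
chroot /cm/images/<software-image>/ apt install -y cm-nvidia-container-toolkit cuda-driver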
Deployment of the GPU operator
From the Head Node, execute the following.
module load kubernetes/default/1.21.4
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
Now continue with the appropriate section for the chosen runtime for Kubernetes. If deployed with the containerd runtime, continue with the next section; for docker, continue with the section after that. Use kubectl get nodes -o wide to see the runtime per Kubernetes node.
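Alternatively, the runtime per node can be listed directly with a jsonpath query:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'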
containerd runtime
In case Kubernetes is using the containerd runtime, use the following helm install. Otherwise skip this section and continue with the docker runtime section instead.
helm install --wait -n gpu-operator --create-namespace \
--version v1.10.1 \
--set driver.enabled=false \
--set operator.defaultRuntime=containerd \
--set toolkit.enabled=true \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml \
gpu-operator nvidia/gpu-operator
The CONTAINERD_CONFIG environment variable is overridden so that the toolkit cooperates with the existing containerd configuration that Bright already manages at that location. (See https://github.com/containerd/containerd/issues/5837 for more details on why the default config path has to be modified.)
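Once the toolkit DaemonSet has run on a GPU node, the generated drop-in should appear at the path CONTAINERD_CONFIG points to. A quick, illustrative check on a GPU node:
ls -l /cm/local/apps/containerd/var/etc/conf.d/nvidia-cri.toml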
docker runtime
Use the following helm install in the case of the docker runtime (which, at the time of writing, is also the default for the NVIDIA GPU operator if left unspecified).
helm install --wait -n gpu-operator --create-namespace \
--version v1.10.1 \
--set driver.enabled=false \
--set operator.defaultRuntime=docker \
--set toolkit.enabled=true \
gpu-operator nvidia/gpu-operator
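After the toolkit Pod has run on a GPU node, the NVIDIA runtime should be registered with Docker. A quick way to check on the node itself (output format may vary by Docker version):
docker info | grep -i runtime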
validation steps
Run some sanity checks to see if the NVIDIA GPU operator is functioning correctly before continuing with the next part.
- kubectl get pod -n gpu-operator – see if all Pods are up and running.
- kubectl describe nodes | grep nvidia.com/gpu.count – see if GPUs are known.
Example output:
[root@headnode ~]# kubectl describe nodes | grep nvidia.com/gpu.count
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.count=1
Continue with checking if the dcgm-exporter part is working too.
- kubectl run -i -t --rm busybox --image=busybox --restart=Never -n gpu-operator /bin/sh – start a shell in the gpu-operator namespace.
- wget -O - http://nvidia-dcgm-exporter:9400/metrics – execute this inside the shell.
Example output:
...
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-44c419ef-98bd-f5b5-5360-655b79ba4e69",device="nvidia0",modelName="Tesla V100-SXM3-32GB",Hostname="nvidia-dcgm-exporter-dpgkt",container="",namespace="",pod=""} 135
...
- exit – exit the shell.
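Optionally, a one-off test Pod can confirm that a GPU can actually be scheduled and used. The CUDA image tag below is only an example; substitute any CUDA base image reachable from your cluster.
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs gpu-smoke-test    # once the Pod has completed
kubectl delete pod gpu-smoke-test
If the GPU stack is working end to end, the Pod logs should show the familiar nvidia-smi table.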
Deployment of the Prometheus Operator Stack
First prepare a values.yaml file.
cat << EOF > values.yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: 'dcgm-exporter'
      metrics_path: '/metrics'
      static_configs:
      - targets: ['nvidia-dcgm-exporter.gpu-operator.svc:9400']
EOF
You can include an additional block for grafana in this values.yaml file if you intend to also expose the grafana dashboard later through path-based ingress, such as https://<headnode_ip>:<ingress_tls_port>/grafana/. See below.
cat << EOF > values.yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
    - job_name: 'dcgm-exporter'
      metrics_path: '/metrics'
      static_configs:
      - targets: ['nvidia-dcgm-exporter.gpu-operator.svc:9400']
grafana:
  grafana.ini:
    server:
      root_url: "%(protocol)s://%(domain)s:%(http_port)s/grafana/"
      serve_from_sub_path: true
EOF
Proceed with the helm install.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install -f ./values.yaml --version 35.5.1 \
--create-namespace --namespace prometheus \
prometheus-operator prometheus-community/kube-prometheus-stack
validation steps
Check that all the Pods in the prometheus namespace are Running without any issues:
kubectl get pod -n prometheus -o wide
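To confirm that Prometheus picked up the additional dcgm-exporter scrape job, the Prometheus API can be queried through a temporary port-forward (illustrative):
kubectl -n prometheus port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/targets | grep -c dcgm-exporter
kill %1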
Deployment of the Prometheus Adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install --version 3.3.1 --set rbac.create=true,prometheus.url=http://prometheus-operated.prometheus.svc.cluster.local,prometheus.port=9090 \
prometheus-adapter prometheus-community/prometheus-adapter
validation steps
If the Prometheus Adapter is working correctly, custom metrics should become available.
root@headnode:~# kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | wc -c
411318
If there are already Pods using GPUs, you should be able to find DCGM-exported metrics as well. The section Extra: Horizontal Pod Autoscaling with GPU Metrics below can be used to run an example Pod so that DCGM-exported metrics appear in the output, as follows.
root@headnode:~# kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq . | grep DCGM_FI_DEV_GPU_UTIL
"name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
"name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
"name": "pods/DCGM_FI_DEV_GPU_UTIL",
In case you do not have “jq”, omit it from the command and just see if the metric is in the output or not.
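A more targeted query, once a GPU Pod is running, would look something like the following (the path follows the custom metrics API convention and returns an empty list until DCGM has metrics for a Pod in the default namespace):
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_FI_DEV_GPU_UTIL"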
The next section will show an example of how this can be used in practice.
Extra: Horizontal Pod Autoscaling with GPU Metrics
Let’s create an example deployment and autoscaler in the file gpudeploy.yaml.
cat << EOF > gpudeploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: gpu
  name: gpu
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - image: k8s.gcr.io/cuda-vector-add:v0.1
        command: ["/bin/bash", "-c", "sleep infinity"]
        imagePullPolicy: IfNotPresent
        name: cuda-vector-add
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-gpu
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metricName: DCGM_FI_DEV_GPU_UTIL
      targetAverageValue: 40
EOF
The above creates a Deployment with one replica, and an autoscaler configuration using the DCGM_FI_DEV_GPU_UTIL metric.
The threshold is set to 40; if this value is exceeded, the autoscaler should increase the number of replicas (minimum 1, maximum 3). Let’s apply the YAML.
kubectl apply -f gpudeploy.yaml
The deployment we just applied should result in one idle pod that holds on to one of the GPUs.
[root@headnode ~]# kubectl get pod -l app=gpu
NAME                   READY   STATUS    RESTARTS   AGE
gpu-6bbf7bb786-mgh8l   1/1     Running   1          4h25m
The Horizontal Pod Autoscaler will look as follows.
[root@headnode ~]# kubectl get hpa
NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-gpu   Deployment/gpu   0/40      1         3         1          114m
We can see that the current value of the metric we configured is zero, below the threshold of 40: the GPU is not being utilized at all.
Let’s exec into the pod and start some work by modifying the vectorAdd example.
# kubectl exec -it gpu-6bbf7bb786-mgh8l -- /bin/bash
Once inside the shell paste the following.
sed -ibak 's/vectorAdd<<</while(true)vectorAdd<<</g' vectorAdd.cu
make && ./vectorAdd
This should make the GPU pretty busy, and we can see this reflected in the autoscaler.
NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-gpu   Deployment/gpu   67/40     1         3         1          133m
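While the loop is running, the reported value can also be followed in near real time with:
kubectl get hpa hpa-gpu -w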
After some time, the events for the autoscaler (visible with kubectl describe hpa hpa-gpu) will show that the metric was found above target.
Normal SuccessfulRescale 52m horizontal-pod-autoscaler New size: 2; reason: pods metric DCGM_FI_DEV_GPU_UTIL above target
This results in an extra replica being scheduled.
NAME      REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-gpu   Deployment/gpu   0/40      1         3         2          122m
The target value needs some time to update, since the metric now has to be queried for two pods. The output may change to a non-zero value once the metric for the second replica has been read.
Now if we stop the running vectorAdd, we can see the hpa-gpu go back to a value of 0/40 again, and after some delay it will scale back to the minimum number of replicas (1).
Normal SuccessfulRescale 2s (x2 over 45m) horizontal-pod-autoscaler New size: 1; reason: All metrics below target
Extra: Exposing Grafana via (path-based) Ingress
The following workaround is currently needed; it will soon be fixed, and we will update this KB article once the workaround is no longer required.
In the firewall role for the head node(s), if the head node is part of the Kubernetes cluster, modify the “cali+” interface (a regex) as follows.
[headnode]% device use master
[headnode->device[headnode]]% roles
[headnode->device[headnode]->roles]% use firewall
[headnode->device[headnode]->roles[firewall]]% interfaces
[headnode->device[headnode]->roles[firewall]->interfaces]% list
Index  Zone   Interface    Broadcast    Options
------ ------ ------------ ------------ ------------
0      cal    tunl0
1      cal    cali+
[headnode->device[headnode]->roles[firewall]->interfaces]% use 1
[headnode->device[headnode]->roles[firewall]->interfaces[1]]% set broadcast detect
[headnode->device*[headnode*]->roles*[firewall*]->interfaces*[1*]]% set options routeback
[headnode->device*[headnode*]->roles*[firewall*]->interfaces*[1*]]% commit
This will result in a restart of Shorewall. Do this for each Head Node in the case of a Bright HA setup.
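Optionally, verify on the head node that the change made it into the generated Shorewall configuration (assuming the standard /etc/shorewall layout):
grep cali /etc/shorewall/interfaces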
Now we will create the YAML definition for the Ingress rule.
cat << EOF > grafanaingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.provider: nginx
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: HTTP
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "900"
    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
    nginx.ingress.kubernetes.io/rewrite-target: /\$2
  name: prometheus-grafana
  namespace: prometheus
spec:
  rules:
  - http:
      paths:
      - backend:
          service:
            name: prometheus-operator-grafana
            port:
              number: 80
        path: /grafana(/|$)(.*)
        pathType: Prefix
EOF
And we will apply it.
kubectl apply -f grafanaingress.yaml
validation steps
If ingress is running on port 30443, go to a node that is part of the Kubernetes cluster and try it out as follows. Expected output is included below.
[root@headnode ~]# curl -k https://localhost:30443/grafana/
<a href="/grafana/login">Found</a>.
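To log in to the Grafana dashboard, the admin password can be retrieved from the secret created by the chart (the secret name is derived from the Helm release name used above, so it may differ in your setup):
kubectl get secret -n prometheus prometheus-operator-grafana -o jsonpath='{.data.admin-password}' | base64 -d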