
Upgrading Kubernetes version 1.21 to 1.24 on a Bright 9.2 cluster.

1. Prerequisites
  • This article is written with Bright Cluster Manager 9.2 in mind, where Kubernetes is currently deployed with the default version 1.21.4 using containerd as its container runtime.
  • The instructions are written with RHEL 8 and Ubuntu 20.04 in mind.
  • These instructions have been run in dev environments a couple of times, and all known caveats should be covered by this KB article. We do, however, recommend making a backup of Etcd so that a roll-back to the older version is possible. This backup can be made without interrupting the running cluster. Please follow the instructions at the following URL to create a snapshot of Etcd: https://kb.brightcomputing.com/knowledge-base/etcd-backup-and-restore-with-bright-9-0/
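For illustration, an Etcd snapshot can be taken along the following lines with etcdctl. The endpoint and certificate paths below are placeholders, not the actual Bright paths; follow the linked KB article for the exact procedure on your cluster.

```shell
# Sketch only: take an Etcd snapshot with etcdctl (API v3).
# The certificate paths are placeholders -- substitute the paths
# used by your Bright deployment (see the linked KB article).
export ETCDCTL_API=3
etcdctl snapshot save /root/etcd-backup-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/path/to/etcd-ca.pem \
  --cert=/path/to/etcd-client.pem \
  --key=/path/to/etcd-client-key.pem

# Verify the snapshot afterwards:
etcdctl snapshot status /root/etcd-backup-$(date +%F).db --write-out=table
```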
2. Upgrade approach
  • For the purposes of this KB article we will use the following example deployment: a six-node Kubernetes cluster consisting of three control-plane nodes (two of them head nodes in an HA setup, which is not a requirement) and three worker nodes.
[root@ea-k8s-a ~]# module load kubernetes/default/1.21.4

[root@ea-k8s-a ~]# kubectl get nodes 
NAME       STATUS   ROLES                  AGE   VERSION 
ea-k8s-a   Ready    control-plane,master   37m   v1.21.4 
ea-k8s-b   Ready    control-plane,master   36m   v1.21.4 
node001    Ready    control-plane,master   37m   v1.21.4 
node002    Ready    worker                 37m   v1.21.4 
node003    Ready    worker                 37m   v1.21.4 
node004    Ready    worker                 37m   v1.21.4

[root@ea-k8s-a ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:16:05Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4", GitCommit:"3cce4a82b44f032d0cd1a1790e6d2f5a55d20aae", GitTreeState:"clean", BuildDate:"2021-08-11T18:10:22Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
3. Prepare a configuration overlay for control-plane

We are upgrading from version 1.21 to 1.24, and new parameters have been added to Kubernetes. If we upgrade the Kubernetes API server without adjusting its configuration first, it will no longer start because of the missing parameters.

We will create a configuration overlay, without any nodes, categories or head nodes assigned to it, for future use.

[ea-k8s-a]% configurationoverlay 
[ea-k8s-a->configurationoverlay]% clone kube-default-master kube-default-master-new 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]]% set priority 520 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]]% clear nodes 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]]% clear categories 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]]% roles 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]->roles*]% use kubernetes::apiserver 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]->roles*[Kubernetes::ApiServer*]]% append options "--feature-gates=LegacyServiceAccountTokenNoAutoGeneration=false" 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]->roles*[Kubernetes::ApiServer*]]% use kubernetes::controller  
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]->roles*[Kubernetes::Controller*]]% set options "--feature-gates=LegacyServiceAccountTokenNoAutoGeneration=false" 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]->roles*[Kubernetes::Controller*]]% use kubernetes::node  
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]->roles*[Kubernetes::Node*]]% set cnipluginbinariespath "/opt/cni/bin" 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]->roles*[Kubernetes::Node*]]% append options "--cgroup-driver=systemd" 
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]->roles*[Kubernetes::Node*]]% commit

To make it easier to apply, here’s the sequence of cmsh commands used there:

configurationoverlay
clone kube-default-master kube-default-master-new
set priority 520
clear nodes
clear categories
roles
use kubernetes::apiserver
append options "--feature-gates=LegacyServiceAccountTokenNoAutoGeneration=false"
use kubernetes::controller 
set options "--feature-gates=LegacyServiceAccountTokenNoAutoGeneration=false"
use kubernetes::node 
set cnipluginbinariespath "/opt/cni/bin"
append options "--cgroup-driver=systemd"
commit

A deprecated option may still be set in the kubernetes::apiserver role. After the commit, run the following commands to make sure:

use kubernetes::apiserver

get options

If --feature-gates=RunAsGroup=true is listed, it needs to be removed. Use these commands to do so:

removefrom options "--feature-gates=RunAsGroup=true"

commit
4. Prepare software images

We will bump the kubernetes package for each software image that is relevant to the Kubernetes cluster. In this example scenario our three compute nodes are provisioned from /cm/images/default-image. We will use the cm-chroot-sw-img program to replace the kubernetes package.

[root@ea-k8s-a ~]# cm-chroot-sw-img /cm/images/default-image/ # go into chroot

$ apt install -y cm-kubernetes121- cm-kubernetes124 # for ubuntu

$ yum swap -y cm-kubernetes121 cm-kubernetes124 # for RHEL

$ exit
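After leaving the chroot, a quick sanity check is to query the kubelet binary inside the image for its version. The binary path below assumes the default Bright layout and may differ on your system.

```shell
# Ask the kubelet inside the software image for its version.
# The path is an assumption based on the default Bright layout.
chroot /cm/images/default-image \
  /cm/local/apps/kubernetes/current/bin/kubelet --version
```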
5. Image update one of the workers

We start with a single worker to see if we can update one of the kubelets. This should give us some confidence before upgrading all of the kubelets. We do not start with the control plane (Kubernetes API server, etc.), since additional command-line flags have been added since Kubernetes version 1.21.

In our example node002 is a worker, and we will first drain the node. See https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/ for more details. This is not strictly necessary, but usually recommended.

[root@ea-k8s-a ~]# kubectl cordon node002                                           # disable scheduling
[root@ea-k8s-a ~]# kubectl drain node002 --ignore-daemonsets --delete-emptydir-data # optionally drain as well
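To verify the drain completed, we can list the pods still bound to the node; apart from DaemonSet-managed pods (which the drain ignores), the list should be empty.

```shell
# Show any pods still scheduled on node002 after the drain.
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=node002
```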

The drain command will evict all Pods and prevent anything from being scheduled on the node. After the command finishes successfully we will issue an imageupdate on node002 via cmsh.

[root@ea-k8s-a ~]# cmsh 
[ea-k8s-a]% device 
[ea-k8s-a->device]% imageupdate -w node002 
Wed Nov 23 15:09:02 2022 [notice] ea-k8s-a: Provisioning started: sending ea-k8s-a:/cm/images/default-image to node002:/, mode UPDATE, dry run = no 
Wed Nov 23 15:09:56 2022 [notice] ea-k8s-a: Provisioning completed: sent ea-k8s-a:/cm/images/default-image to node002:/, mode UPDATE, dry run = no 
imageupdate -w node002 [ COMPLETED ]

We will now restart cmd, kubelet and kube-proxy services on the node.

[root@ea-k8s-a ~]# pdsh -w node002 'systemctl daemon-reload; systemctl restart cmd; systemctl restart kubelet.service; systemctl restart kube-proxy.service'

After a few moments, verify that the kubelet has been updated correctly.

[root@ea-k8s-a ~]# kubectl get nodes 
NAME       STATUS                     ROLES                  AGE   VERSION 
ea-k8s-a   Ready                      control-plane,master   66m   v1.21.4 
ea-k8s-b   Ready                      control-plane,master   66m   v1.21.4 
node001    Ready                      control-plane,master   66m   v1.21.4 
node002    Ready,SchedulingDisabled   worker                 66m   v1.24.0 
node003    Ready                      worker                 66m   v1.21.4 
node004    Ready                      worker                 66m   v1.21.4

Notice how node002 now reports version v1.24.0.

Now we can re-enable scheduling for the node.

[root@ea-k8s-a ~]# kubectl uncordon node002 
node/node002 uncordoned
6. Image update the rest of the workers

This can be done similarly to step 5, one-by-one, or in batches. In the case of this KB article we’ll do the remaining compute nodes node00[3-4] in one go, without draining them first.

  • We issue an imageupdate, but for the whole category in cmsh: device; imageupdate -c default -w
  • We restart the services: pdsh -w node00[3-4] 'systemctl daemon-reload; systemctl restart cmd; systemctl restart kubelet.service; systemctl restart kube-proxy.service'
  • We confirm the version has updated.
[root@ea-k8s-a ~]# kubectl get nodes 
NAME       STATUS   ROLES                  AGE   VERSION 
ea-k8s-a   Ready    control-plane,master   76m   v1.21.4 
ea-k8s-b   Ready    control-plane,master   75m   v1.21.4 
node001    Ready    control-plane,master   76m   v1.21.4 
node002    Ready    worker                 76m   v1.24.0 
node003    Ready    worker                 76m   v1.24.0 
node004    Ready    worker                 76m   v1.24.0
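If you prefer a more cautious per-node rollout instead of the batch above, the loop below sketches the same sequence of commands used in step 5 for each remaining worker.

```shell
# Per-node variant of step 6: drain, image update, service restart,
# and uncordon each remaining worker in turn.
for n in node003 node004; do
  kubectl cordon "$n"
  kubectl drain "$n" --ignore-daemonsets --delete-emptydir-data
  cmsh -c "device; imageupdate -w $n"
  pdsh -w "$n" 'systemctl daemon-reload; systemctl restart cmd; systemctl restart kubelet.service; systemctl restart kube-proxy.service'
  kubectl uncordon "$n"
done
```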
7. Update one of the control-plane nodes

We will pick node001 and add the node to the new overlay created in step 3. If your cluster does not have control-plane nodes running on compute nodes, see the next section on how to update the Head Nodes, and pick a Head Node that runs as a control-plane.

In our example this node has not yet received an image update, because it is in a separate category from the one used by the workers. In that case we need to update its image first:

[root@ea-k8s-update ~]# cmsh
[ea-k8s-update]% device 
[ea-k8s-update->device]% imageupdate -w node001
Wed Nov 23 15:28:44 2022 [notice] ea-k8s-update: Provisioning started: sending ea-k8s-update:/cm/images/default-image to node001:/, mode UPDATE, dry run = no
Wed Nov 23 15:29:32 2022 [notice] ea-k8s-update: Provisioning completed: sent ea-k8s-update:/cm/images/default-image to node001:/, mode UPDATE, dry run = no
imageupdate -w node001 [ COMPLETED ]

Now we proceed with setting up the configuration overlay:

[ea-k8s-a]% configurationoverlay 
[ea-k8s-a->configurationoverlay]% use kube-default-master-new 
[ea-k8s-a->configurationoverlay[kube-default-master-new]]% append nodes node001
[ea-k8s-a->configurationoverlay*[kube-default-master-new*]]% commit

We expect the Kube API server to be restarted automatically; however, we also want to restart the scheduler and controller-manager.

pdsh -w node001 "systemctl daemon-reload; systemctl restart kube-scheduler; systemctl restart kube-controller-manager"
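Before testing the API server, we can quickly check that the control-plane services on node001 came back up. The kube-apiserver unit name below is an assumption and may differ on your deployment.

```shell
# Expect "active" for all three services. The kube-apiserver unit
# name is an assumption; adjust it if your deployment differs.
pdsh -w node001 'systemctl is-active kube-apiserver kube-scheduler kube-controller-manager'
```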

In this case we can try to exercise the API server on the node via curl:

[root@ea-k8s-a ~]# curl -k https://node001:6443; echo
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "Unauthorized",
"reason": "Unauthorized",
"code": 401
}

The authorization error is expected here and not important for now, but for completeness: one way to make an authenticated request is to use a token (which we embed in the kubeconfig for the root user by default):

[root@ea-k8s-a ~]# grep token .kube/config-default
token: 'SOME_LONG_STRING'
[root@ea-k8s-a ~]# export TOKEN=SOME_LONG_STRING
[root@ea-k8s-a ~]# curl -s https://node001:6443/openapi/v2 --header "Authorization: Bearer $TOKEN" --cacert /cm/local/apps/kubernetes/var/etc/kubeca-default.pem | less
8. Updating Head Nodes

First we need to execute step 4 on the Head Nodes. In case there are two, execute the following on both.

[root@ea-k8s-a ~]# apt install -y cm-kubernetes121- cm-kubernetes124 # for ubuntu

[root@ea-k8s-a ~]# yum swap -y cm-kubernetes121 cm-kubernetes124 # for RHEL

We can do kubelet + kube-proxy first as before, or we can do all services at once. Sections 5 and 7 can be referenced for the detailed steps. The imageupdate steps can be omitted, since those are only relevant for Compute Nodes.

We will update the worker services on the active Head Node first, and verify that the version has updated.

First, we add the active Head Node into the overlay that was created for master nodes:

root@ea-k8s-ubuntu-a:~# cmsh
[ea-k8s-ubuntu-a]% configurationoverlay
[ea-k8s-ubuntu-a->configurationoverlay]% use kube-default-master-new
[ea-k8s-ubuntu-a->configurationoverlay[kube-default-master-new]]% append nodes master
[ea-k8s-ubuntu-a->configurationoverlay*[kube-default-master-new*]]% commit

Then we can restart the kubernetes services:

[root@ea-k8s-a ~]# systemctl daemon-reload; systemctl restart kubelet; systemctl restart kube-proxy; 
[root@ea-k8s-a ~]# kubectl get nodes 
NAME       STATUS                     ROLES                  AGE   VERSION 
ea-k8s-a   Ready,SchedulingDisabled   control-plane,master   47m   v1.24.0 
ea-k8s-b   Ready,SchedulingDisabled   control-plane,master   47m   v1.21.4 
node001    Ready                      control-plane,master   47m   v1.24.0 
node002    Ready                      worker                 47m   v1.24.0 
node003    Ready                      worker                 47m   v1.24.0 
node004    Ready                      worker                 47m   v1.24.0

And now we restart the Scheduler and Controller-Manager.

[root@ea-k8s-a ~]# systemctl daemon-reload; systemctl restart kube-scheduler; systemctl restart kube-controller-manager

Finally, we will repeat for the secondary Head Node. And after that, the cluster should be fully updated.

9. Updating Addons

Issuing the following command updates the addons. The output of the command has been omitted to avoid cluttering this KB article, but backups of the original yaml files are written to /cm/local/apps/kubernetes/var/; this information is printed as part of the output.

cm-kubernetes-setup -v --update-addons
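After the addon update completes, it is worth verifying that the addon pods come back healthy and that the yaml backups are in place:

```shell
# All addon pods should eventually reach a Running or Completed state.
kubectl get pods --all-namespaces

# The original yaml files are backed up under this directory,
# as printed in the output of cm-kubernetes-setup:
ls -l /cm/local/apps/kubernetes/var/
```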

The update script will have backed up the old configuration inside NVIDIA Bright Cluster Manager as well:

[ea-k8s-a]% kubernetes 
[ea-k8s-a->kubernetes[default]]% appgroups 
[ea-k8s-a->kubernetes[default]->appgroups]% list
Name (key)                       Applications
-------------------------------- ------------------------------
system                           <13 in submode>
system-backup-2022-11-23-193952  <13 in submode>

Restart ingress-nginx

Due to the differences that exist in the configuration of ingress-nginx between 1.21 and 1.24, its jobs have to be deleted so that cmd can restart them with the proper configuration for 1.24:

[root@ea-k8s-a ~]# kubectl delete job -n ingress-nginx --all

After a few minutes, the job should be visible again:

[root@ea-k8s-a ~]# kubectl get jobs -A 
NAMESPACE       NAME                             COMPLETIONS   DURATION   AGE 
ingress-nginx   ingress-nginx-admission-create   1/1           7s         7m17s 
ingress-nginx   ingress-nginx-admission-patch    1/1           7s         7m17s
10. Finalize the update

Kubernetes should be ready at this point; we can get rid of the old module file and make one final change to the configuration overlays.

[root@ea-k8s-a ~]# pdsh -A rm -rf /cm/local/modulefiles/kubernetes/default/1.21.4 

[ea-k8s-a]% configurationoverlay 
[ea-k8s-a->configurationoverlay]% remove kube-default-master
[ea-k8s-a->configurationoverlay*]% commit
Successfully removed 1 ConfigurationOverlays
Successfully committed 0 ConfigurationOverlays
[ea-k8s-a->configurationoverlay]% set kube-default-master-new priority 510
[ea-k8s-a->configurationoverlay]% set kube-default-master-new name kube-default-master
[ea-k8s-a->configurationoverlay*]% commit
Successfully committed 1 ConfigurationOverlays
[ea-k8s-a->configurationoverlay]% kubernetes  
[ea-k8s-a->kubernetes[default]]% labelsets  
[ea-k8s-a->kubernetes[default]->labelsets]% show  
[ea-k8s-a->kubernetes[default]->labelsets]% use master  
[ea-k8s-a->kubernetes[default]->labelsets[master]]% append overlays kube-default-master 
[ea-k8s-a->kubernetes*[default*]->labelsets*[master*]]% commit
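As a final sanity check, the new module can be loaded and the node versions verified once more:

```shell
# Load the 1.24 module file and confirm every node reports v1.24.0.
module load kubernetes/default/1.24.0
kubectl get nodes
```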
11. Rollback the update

In order to go back to the previous version 1.21, we have to follow the reverse of steps 1-10.

Downgrade the addons

This is only needed if Step 9 was executed.

[root@ea-k8s-a ~]# cmsh 
[ea-k8s-a]% kubernetes  
[ea-k8s-a->kubernetes[default]]% appgroups  
[ea-k8s-a->kubernetes[default]->appgroups]% list 
Name (key)                       Applications                   
-------------------------------- ------------------------------ 
system                           <13 in submode>                
system-backup-2022-11-24-112008  <13 in submode>                
[ea-k8s-a->kubernetes[default]->appgroups]% set system enabled no 
[ea-k8s-a->kubernetes*[default*]->appgroups*]% set system-backup-2022-11-24-112008 enabled yes 
[ea-k8s-a->kubernetes*[default*]->appgroups*]% commit

This should keep Kubernetes busy for a minute. After it is done restoring all the resources, proceed with the following steps:

Downgrading the packages

We need to replace the newly installed cm-kubernetes124 package with cm-kubernetes121 everywhere.

This means the following command needs to be executed on both Head Nodes and in the relevant software images.

apt install -y cm-kubernetes124- cm-kubernetes121  # for ubuntu

yum swap -y cm-kubernetes124 cm-kubernetes121  # for RHEL

Image update relevant nodes

We need to image update the relevant nodes next, so that all Kubernetes nodes have the Kubernetes 1.21 binaries again (e.g. imageupdate -c default -w in cmsh).

Restore the configuration overlay

Depending on whether Step 10 was executed, and whether the kube-default-master-new overlay was already removed, the rollback can be different. In case kube-default-master-new still exists, we can remove + commit it. The lower-priority original kube-default-master overlay should take over the configuration.

[root@ea-k8s-a ~]# cmsh
[ea-k8s-a]% configurationoverlay
[ea-k8s-a->configurationoverlay]% remove kube-default-master-new
[ea-k8s-a->configurationoverlay*]% commit

In the second case, where kube-default-master was updated in Step 10, we have to remove the extra parameters from the roles and restore the original settings as follows.

[ea-k8s-a]% configurationoverlay  
[ea-k8s-a->configurationoverlay]% use kube-default-master 
[ea-k8s-a->configurationoverlay[kube-default-master]]% roles 
[ea-k8s-a->configurationoverlay[kube-default-master]->roles]% use kubernetes::apiserver 
[ea-k8s-a->configurationoverlay[kube-default-master]->roles[Kubernetes::ApiServer]]% removefrom options "--feature-gates=LegacyServiceAccountTokenNoAutoGeneration=false" 
[ea-k8s-a->configurationoverlay*[kube-default-master*]->roles*[Kubernetes::ApiServer*]]% use kubernetes::controller  
[ea-k8s-a->configurationoverlay*[kube-default-master*]->roles*[Kubernetes::Controller]]% removefrom options "--feature-gates=LegacyServiceAccountTokenNoAutoGeneration=false" 
[ea-k8s-a->configurationoverlay*[kube-default-master*]->roles*[Kubernetes::Controller*]]% use kubernetes::node  
[ea-k8s-a->configurationoverlay*[kube-default-master*]->roles*[Kubernetes::Node]]% set cnipluginbinariespath /cm/local/apps/kubernetes/current/bin/cni 
[ea-k8s-a->configurationoverlay*[kube-default-master*]->roles*[Kubernetes::Node*]]% removefrom options "--cgroup-driver=systemd" 
[ea-k8s-a->configurationoverlay*[kube-default-master*]->roles*[Kubernetes::Node*]]% commit

In both cases the Kube API servers may be restarted and can produce errors until we complete the next step.

Restart services

On all the nodes relevant to the Kube cluster, we need to execute the following reload and restarts; in our example setup this looks as follows. Please note that it includes a restart of Bright Cluster Manager (cmd).

[root@ea-k8s-a ~]# pdsh -w ea-k8s-a,ea-k8s-b,node00[1-4] "systemctl daemon-reload; systemctl restart cmd; systemctl restart '*kube*.service'"

We can clean up the module file for version 1.24 to prevent it from popping up in tab-completion.

[root@ea-k8s-a ~]# pdsh -A rm -rf /cm/local/modulefiles/kubernetes/default/1.24.0

All versions should be back to 1.21.4:

[root@ea-k8s-a ~]# kubectl get nodes 
NAME       STATUS   ROLES                  AGE   VERSION 
ea-k8s-a   Ready    control-plane,master   22h   v1.21.4 
ea-k8s-b   Ready    control-plane,master   22h   v1.21.4 
node001    Ready    control-plane,master   22h   v1.21.4 
node002    Ready    worker                 22h   v1.21.4 
node003    Ready    worker                 22h   v1.21.4 
node004    Ready    worker                 22h   v1.21.4

Hopefully resources inside Kubernetes are also running in good health and without issues.

It is very unlikely to be necessary with this downgrade from 1.24 back to 1.21; however, should something get into an invalid, unrecoverable state, we can restore the Etcd database at this point with the snapshot created in Step 1. The instructions for this are explained in the same KB article referenced in Step 1.

Updated on December 29, 2022
