Overview
This article provides instructions for renewing expired certificates in a BCM Kubernetes cluster when all cluster certificates have expired and kubectl commands no longer work.
Problem: When Kubernetes cluster certificates expire, all kubectl commands fail with certificate validation errors, preventing normal certificate renewal procedures.
Root Cause: Kubernetes certificates have a default validity period of one year. When these expire, the cluster API becomes inaccessible, preventing standard renewal methods.
Solution: Use the cm-kubeadm-manage script to renew certificates on each control plane node, starting with one node and propagating the working configuration to the others.
Prerequisites
- BCM Version: 10.x with cmdaemon version 10.25.03 or later
- Operating System: Ubuntu 22.04 (adapt package manager commands for other distributions)
- Cluster State: All certificates expired, kubectl non-functional
- Required Access: Root access to head node and all Kubernetes control plane nodes
Verify Environment
This guide assumes:
- Kubernetes cluster label: default (verify with cmsh -c 'kubernetes list')
- Three control plane nodes: knode01, knode02, knode03
- Three worker nodes: worker01, worker02, worker03
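To confirm the expired state before proceeding (kubectl will not work at this point), the API server certificate on a control plane node can be inspected directly with openssl. This is a minimal check; on BCM the PKI directory includes the cluster label (e.g. /etc/kubernetes/pki/default/, as seen in the log output later in this article), so adjust the path if your label differs:
# print the "Not After" date of the API server certificate (path assumes cluster label 'default')
ssh knode03 'openssl x509 -noout -enddate -in /etc/kubernetes/pki/default/apiserver.crt'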
Solution Steps
Step 1: Update cmdaemon
Note: This step can be deferred to a later time; however, it is preferable to start by updating cmdaemon.
Ensure you have the latest cmdaemon with critical certificate renewal fixes:
apt update
apt install cmdaemon
apt install cm-setup # optionally
Note: Version 10.25.03 includes fixes for CSR request and approval mechanisms.
Verify installed versions:
cm-package-release-info -f cm-setup
cm-package-release-info -f cmdaemon
Step 2: Install cm-kubeadm-manage Script
Download and install the management script:
wget -O /cm/local/apps/cmd/scripts/cm-kubeadm-manage https://support2.brightcomputing.com/etcd/cm-kubeadm-manage
chmod +x /cm/local/apps/cmd/scripts/cm-kubeadm-manage
alias cm-kubeadm-manage='/cm/local/apps/cmd/scripts/cm-kubeadm-manage'
For more information about this script, see: https://kb.brightcomputing.com/knowledge-base/configuring-kubernetes-control-plane-in-bcm-10/#1-prerequisites
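The alias defined above only lasts for the current shell session; optionally, it can be made persistent by appending it to root's shell profile:
# optional: persist the alias across logins
echo "alias cm-kubeadm-manage='/cm/local/apps/cmd/scripts/cm-kubeadm-manage'" >> /root/.bashrc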
Step 3: Locate kubeadm-init Configuration
Find the appropriate kubeadm-init-default.yaml file:
pdsh -w $(hostname),knode01,knode02,knode03 ls -al /root/.kube/kubeadm-init-default.yaml
Example output:
knode01: ls: cannot access '/root/.kube/kubeadm-init-default.yaml': No such file or directory
pdsh@headnode: knode01: ssh exited with exit code 2
knode02: ls: cannot access '/root/.kube/kubeadm-init-default.yaml': No such file or directory
pdsh@headnode: knode02: ssh exited with exit code 2
knode03: -rw------- 1 root root 1328 Jul 16 2024 /root/.kube/kubeadm-init-default.yaml
headnode: -rw------- 1 root root 1328 Jul 16 2024 /root/.kube/kubeadm-init-default.yaml
Use the most recent file found (in this example, the copy on either the head node or knode03).
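If the file is present on more than one node, it can be worth confirming that the copies are identical before picking one; a quick sketch using standard tools, run from the head node:
# compare the head node copy against the copy on knode03; "copies match" means they are identical
ssh knode03 cat /root/.kube/kubeadm-init-default.yaml | diff - /root/.kube/kubeadm-init-default.yaml && echo "copies match"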
Step 4: Distribute kubeadm Configuration
Ensure all control plane nodes have the most recent kubeadm-init-default.yaml file. Copy from the node with the most recent version to all other nodes:
rsync -av /root/.kube/kubeadm-init-default.yaml knode01:/root/.kube/
rsync -av /root/.kube/kubeadm-init-default.yaml knode02:/root/.kube/
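To confirm that all control plane nodes now hold the same file, the checksums can be compared with the same pdsh pattern used in Step 3:
# all nodes should report the same checksum
pdsh -w $(hostname),knode01,knode02,knode03 md5sum /root/.kube/kubeadm-init-default.yaml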
Step 5: Renew Certificates on First Control Plane Node
Choose one control plane node to update first. In this example, we use knode03 because it already has the kubeadm-init-default.yaml file in place. If your chosen node doesn't have the file, copy it there first.
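For example, if knode03 were missing the file, it could be copied over with the same rsync pattern as Step 4 (shown only for completeness, since knode03 already has it in this walkthrough):
rsync -av /root/.kube/kubeadm-init-default.yaml knode03:/root/.kube/
Then renew the certificates on the chosen node: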
cm-kubeadm-manage --kube-cluster=default update_certs knode03
Step 6: Update Kubernetes Components
Update each component on the same node:
cm-kubeadm-manage --kube-cluster=default update_apiserver knode03
cm-kubeadm-manage --kube-cluster=default update_controller_manager knode03
cm-kubeadm-manage --kube-cluster=default update_scheduler knode03
Example output:
root@headnode:~# cm-kubeadm-manage --kube-cluster=default update_apiserver knode03
2025-07-17 02:29:37,746 - cm-kubeadm-manage - INFO - ##### CLI invoked: ['/cm/local/apps/cmd/scripts/cm-kubeadm-manage', '--kube-cluster=default', 'update_apiserver', 'knode03'] #####
2025-07-17 02:29:39,006 - cm-kubeadm-manage - DEBUG - No need to backup /etc/kubernetes/pki/default/etcd/ca.crt
2025-07-17 02:29:39,243 - cm-kubeadm-manage - DEBUG - Not uploading /etc/kubernetes/pki/default/etcd/ca.crt... (already exists)
2025-07-17 02:29:39,259 - cm-kubeadm-manage - DEBUG - Invoking cm-component-certificate to regenerate apiserver-etcd-client.crt on node
2025-07-17 02:29:39,260 - cm-kubeadm-manage - DEBUG - Executing: /cm/local/apps/cluster-tools/bin/cm-component-certificate --component=/etc/kubernetes/pki/default/apiserver-etcd-client.crt -n knode03
2025-07-17 02:29:42,812 - cm-kubeadm-manage - DEBUG - Executing: kubeadm init phase control-plane apiserver --config /root/.kube/kubeadm-init-default.yaml --v=5
[control-plane] Creating static Pod manifest for "kube-apiserver"
2025-07-17 02:29:44,379 - cm-kubeadm-manage - DEBUG - Executing: . /etc/profile.d/modules.sh ; module load containerd ; crictl pods --name kube-apiserver --quiet
2025-07-17 02:29:44,503 - cm-kubeadm-manage - INFO - kube-apiserver pod ID: 6abcdc633d473b555087e9b6587497d007af32289407351d8f0cd40fa37239da
2025-07-17 02:29:44,503 - cm-kubeadm-manage - DEBUG - Executing: . /etc/profile.d/modules.sh ; module load containerd ; crictl stopp 6abcdc633d473b555087e9b6587497d007af32289407351d8f0cd40fa37239da || true
2025-07-17 02:29:45,073 - cm-kubeadm-manage - INFO - Successfully stopped kube-apiserver pod
2025-07-17 02:29:45,073 - cm-kubeadm-manage - DEBUG - Executing: . /etc/profile.d/modules.sh ; module load containerd ; crictl rmp 6abcdc633d473b555087e9b6587497d007af32289407351d8f0cd40fa37239da
2025-07-17 02:29:45,176 - cm-kubeadm-manage - INFO - Successfully removed kube-apiserver pod
root@headnode:~# cm-kubeadm-manage --kube-cluster=default update_controller_manager knode03
2025-07-17 02:30:16,891 - cm-kubeadm-manage - INFO - ##### CLI invoked: ['/cm/local/apps/cmd/scripts/cm-kubeadm-manage', '--kube-cluster=default', 'update_controller_manager', 'knode03'] #####
2025-07-17 02:30:18,178 - cm-kubeadm-manage - DEBUG - Executing: kubeadm init phase control-plane controller-manager --config /root/.kube/kubeadm-init-default.yaml --v=5
[control-plane] Creating static Pod manifest for "kube-controller-manager"
2025-07-17 02:30:20,391 - cm-kubeadm-manage - DEBUG - Executing: . /etc/profile.d/modules.sh ; module load containerd ; crictl pods --name kube-controller-manager --quiet
2025-07-17 02:30:20,494 - cm-kubeadm-manage - INFO - kube-controller-manager pod ID: cdee074f93419ca058c9df3eb62c60d7cc6b2e13b1bb065d527a39f957f64240
2025-07-17 02:30:20,495 - cm-kubeadm-manage - DEBUG - Executing: . /etc/profile.d/modules.sh ; module load containerd ; crictl stopp cdee074f93419ca058c9df3eb62c60d7cc6b2e13b1bb065d527a39f957f64240 || true
2025-07-17 02:30:20,670 - cm-kubeadm-manage - INFO - Successfully stopped kube-controller-manager pod
2025-07-17 02:30:20,671 - cm-kubeadm-manage - DEBUG - Executing: . /etc/profile.d/modules.sh ; module load containerd ; crictl rmp cdee074f93419ca058c9df3eb62c60d7cc6b2e13b1bb065d527a39f957f64240
2025-07-17 02:30:20,827 - cm-kubeadm-manage - INFO - Successfully removed kube-controller-manager pod
root@headnode:~# cm-kubeadm-manage --kube-cluster=default update_scheduler knode03
2025-07-17 02:30:39,719 - cm-kubeadm-manage - INFO - ##### CLI invoked: ['/cm/local/apps/cmd/scripts/cm-kubeadm-manage', '--kube-cluster=default', 'update_scheduler', 'knode03'] #####
2025-07-17 02:30:40,920 - cm-kubeadm-manage - DEBUG - Executing: kubeadm init phase control-plane scheduler --config /root/.kube/kubeadm-init-default.yaml --v=5
[control-plane] Creating static Pod manifest for "kube-scheduler"
2025-07-17 02:30:44,500 - cm-kubeadm-manage - DEBUG - Executing: . /etc/profile.d/modules.sh ; module load containerd ; crictl pods --name kube-scheduler --quiet
2025-07-17 02:30:44,604 - cm-kubeadm-manage - INFO - kube-scheduler pod ID: ac42a8b4c0b0abdf8c37f17a55dc5ba37890d93919f50cb8a83e5ff5598c0bbe
2025-07-17 02:30:44,604 - cm-kubeadm-manage - DEBUG - Executing: . /etc/profile.d/modules.sh ; module load containerd ; crictl stopp ac42a8b4c0b0abdf8c37f17a55dc5ba37890d93919f50cb8a83e5ff5598c0bbe || true
2025-07-17 02:30:44,708 - cm-kubeadm-manage - INFO - Successfully stopped kube-scheduler pod
2025-07-17 02:30:44,708 - cm-kubeadm-manage - DEBUG - Executing: . /etc/profile.d/modules.sh ; module load containerd ; crictl rmp ac42a8b4c0b0abdf8c37f17a55dc5ba37890d93919f50cb8a83e5ff5598c0bbe
2025-07-17 02:30:44,811 - cm-kubeadm-manage - INFO - Successfully removed kube-scheduler pod
Step 7: Verify Node Configuration
SSH to the updated node and verify:
- Check if the kubelet service is running:
systemctl status kubelet
- Verify pods are running:
module load containerd
crictl ps
Example output:
root@knode03:~# crictl ps
WARN[0000] runtime connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
ERRO[0000] validate service connection: validate CRI v1 runtime API for endpoint "unix:///var/run/dockershim.sock": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/dockershim.sock: connect: no such file or directory"
WARN[0000] image connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead.
ERRO[0000] validate service connection: validate CRI v1 image API for endpoint "unix:///var/run/dockershim.sock": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/run/dockershim.sock: connect: no such file or directory"
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
1d2697853abdb 9d3465f8477c6 4 minutes ago Running kube-scheduler 0 f3f24f3ad0d17 kube-scheduler-knode03
12fa7e0335686 bc292d823f05c 5 minutes ago Running worker 1 32df5109de2f3 gpu-operator-1722942924-node-feature-discovery-worker-j5whb
6f91e9499ce46 10541d8af03f4 5 minutes ago Running kube-controller-manager 0 c75432b5a473b kube-controller-manager-knode03
5c9a0a70a5561 9dc6939e7c573 5 minutes ago Running kube-apiserver 0 ff10138958cb8 kube-apiserver-knode03
a522d8d5a7924 6860eccd97258 3 months ago Running promtail 0 8afd71606c737 loki-promtail-kq24n
ffa94698ef2fd 4b57359fd6745 11 months ago Running nfs 0 107fd1e87dd8e csi-nfs-node-hqlbb
b779480da3c8d 50013f94a28d1 11 months ago Running node-driver-registrar 0 107fd1e87dd8e csi-nfs-node-hqlbb
164acfbd8db1f 494ea5379400e 11 months ago Running liveness-probe 0 107fd1e87dd8e csi-nfs-node-hqlbb
e85cfa8777294 72c9c20889862 12 months ago Running node-exporter 0 9eca042e8d27d kube-prometheus-stack-prometheus-node-exporter-sxsnh
c9a3cb50e7cbe 0b888dd0f0dc0 12 months ago Running network-operator-sriov-network-operator 0 bb87d940c97a7 network-operator-sriov-network-operator-9ff6f8ccb-54tzf
b2d5be7cd9c9a 1843802b91be8 12 months ago Running calico-node 0 e36343501bad1 calico-node-5gch5
64aa7144dbeba a3eea76ce409e 12 months ago Running kube-proxy 0 e9b8d665595f5 kube-proxy-pxq9x
Note that the kube-scheduler, kube-controller-manager, and kube-apiserver pods have recently been recreated (shown by their creation time of 4-5 minutes ago).
- Test kubectl with the new admin.conf:
kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes
The command should work at this point.
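Optionally, confirm that the renewed certificates carry new expiry dates. A minimal check with openssl (the PKI path assumes cluster label default; adjust if yours differs):
# print the "Not After" date of every certificate in the cluster's PKI directory
ssh knode03 'for crt in /etc/kubernetes/pki/default/*.crt; do echo -n "$crt: "; openssl x509 -noout -enddate -in "$crt"; done'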
Step 8: Update kubeconfig Files on Node
On the updated node (knode03):
cp -prv /etc/kubernetes/admin.conf /root/.kube/config-default # Note: 'default' is the cluster label
cp -prv /etc/kubernetes/admin.conf /root/.kube/config
rm -rfv /root/.kube/.config-default.hash
Step 9: Distribute Working Configuration
From the head node, copy the working configuration to all locations:
# Copy to head node
rsync -av root@knode03:/etc/kubernetes/admin.conf /root/.kube/config
rsync -av root@knode03:/etc/kubernetes/admin.conf /root/.kube/config-default
rm -fv /root/.kube/.config-default.hash
# Copy to other control plane nodes
rsync -av /root/.kube/config root@knode01:/root/.kube/config
rsync -av /root/.kube/config root@knode02:/root/.kube/config
rsync -av /root/.kube/config-default root@knode01:/root/.kube/config-default
rsync -av /root/.kube/config-default root@knode02:/root/.kube/config-default
ssh knode01 'rm -fv /root/.kube/.config-default.hash'
ssh knode02 'rm -fv /root/.kube/.config-default.hash'
Note: The kubeadm-init-default.yaml file was already distributed to all control plane nodes in Step 4.
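At this point kubectl on the head node should work again with the refreshed configuration (kubectl reads /root/.kube/config by default):
kubectl get nodes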
Step 10: Update Remaining Control Plane Nodes
Repeat the certificate renewal process for each remaining control plane node:
For knode01:
cm-kubeadm-manage --kube-cluster=default update_certs knode01
cm-kubeadm-manage --kube-cluster=default update_apiserver knode01
cm-kubeadm-manage --kube-cluster=default update_controller_manager knode01
cm-kubeadm-manage --kube-cluster=default update_scheduler knode01
For knode02:
cm-kubeadm-manage --kube-cluster=default update_certs knode02
cm-kubeadm-manage --kube-cluster=default update_apiserver knode02
cm-kubeadm-manage --kube-cluster=default update_controller_manager knode02
cm-kubeadm-manage --kube-cluster=default update_scheduler knode02
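Equivalently, the four commands can be run in a short loop per node; a convenience sketch (the per-node commands above remain the authoritative form):
for node in knode01 knode02; do
  for action in update_certs update_apiserver update_controller_manager update_scheduler; do
    /cm/local/apps/cmd/scripts/cm-kubeadm-manage --kube-cluster=default "$action" "$node"
  done
done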
Step 11: Update ConfigMap
Finally, update the kubeadm ConfigMap:
cm-kubeadm-manage --kube-cluster=default update_configmap
This creates an updated kubeadm-init-default.yaml file on all control plane nodes and ensures that kubeadm stores a copy of it inside its ConfigMap.
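To verify, the stored ClusterConfiguration can be inspected in the kubeadm-config ConfigMap (the standard location used by kubeadm):
kubectl -n kube-system get configmap kubeadm-config -o yaml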
Post-Renewal Tasks
Certificate Approval (Older BCM Versions)
For BCM versions prior to 10.25.03, manually approve pending certificate requests:
kubectl get csr | grep -i pending | awk '{ print $1 }' | xargs -r -n 1 kubectl certificate approve
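Afterwards, confirm that no requests remain pending:
kubectl get csr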
Restart Failed Workloads
Some pods may still use expired tokens. Restart affected workloads:
# For Deployments
kubectl -n <namespace> rollout restart deployment <deployment-name>
# For DaemonSets
kubectl -n <namespace> rollout restart daemonset <daemonset-name>
# For StatefulSets
kubectl -n <namespace> rollout restart statefulset <statefulset-name>
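To identify affected workloads, one option is to list pods that are not in the Running or Succeeded phase (a rough filter; pods failing authentication may also surface "Unauthorized" errors in their logs):
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded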
Expired Kubeconfig Files for Regular Users
In case a regular user's /home/user/.kube/config or /home/user/.kube/config-default is expired, you can let BCM regenerate it by deleting the following files:
rm -fv /home/user/.kube/config /home/user/.kube/config-default /home/user/.kube/.config-default.hash
The following helper script can also be used to list the expired kubeconfig files:
wget https://support2.brightcomputing.com/rayb/kube/kube_check_kubeconfig_certs.py
chmod +x kube_check_kubeconfig_certs.py
./kube_check_kubeconfig_certs.py
Example output:
root@headnode:~# ./kube_check_kubeconfig_certs.py
user1 /home/user1/.kube/config certificate ERROR (expired: 2025-07-16 11:36:09, 0 days ago)
user1 /home/user1/.kube/config_orig certificate ERROR (expired: 2025-07-12 13:56:36, 4 days ago)
...
Preventive Measures
To avoid certificate expiration in the future:
Regular Kubernetes Upgrades
Upgrading the Kubernetes cluster (even minor version bumps) causes kubeadm to automatically renew certificates. This is the recommended approach for maintaining certificate validity. See the BCM Containerization Manual for upgrade procedures.
BCM Improvements
Recent BCM versions include improvements for maintaining kubeconfig files and preventing expiration issues:
- Health Check: BCM now includes a health check that warns when kube certificates are expiring within 30 days.
- Kubeconfig file management: BCM actively checks the certificates embedded in kubeconfig files and proactively regenerates them before they expire.
Keeping BCM up to date helps prevent the issue this KB article addresses.
Proactive Certificate Renewal
Use the cm-kubeadm-manage wrapper script to renew certificates before they expire. See Section 4: Rotate All Certificates for detailed instructions on proactive certificate rotation.
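As a minimal sketch, the update_certs subcommand used in Steps 5 and 10 can be run for each control plane node while the certificates are still valid (see the referenced section for the complete rotation procedure, including restarting the control plane components):
# assumes cluster label 'default' and control plane nodes knode01-knode03
for node in knode01 knode02 knode03; do
  /cm/local/apps/cmd/scripts/cm-kubeadm-manage --kube-cluster=default update_certs "$node"
done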