1. Prerequisites
- The following article was written with Bright Cluster Manager 9.0 in mind, but the same steps apply to newer versions as well (at least versions 9.1 and 9.2).
- We assume shared storage is available as a mount in /cm/shared; we will create a target directory there for our Etcd backups.
- The restore part of this KB article should only be followed if your entire Etcd cluster has to be recreated from the backup. If you run a multi-node Etcd cluster, broken members can be replaced or fixed by synchronizing from the remaining working Etcd members. We will add another KB article describing these steps in the near future.
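To confirm that the shared storage is actually mounted before taking any backups, a quick check can be done (a minimal sketch; the exact output depends on your storage setup):
# df -h /cm/shared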
2. Etcd installations
- Bright Kubernetes setups always require an odd number of Etcd nodes.
- Three Etcd nodes are recommended, but single-node Etcd deployments are also possible.
- Etcd nodes are marked as datanodes. This prevents Full Provisioning from unintentionally wiping the Etcd database.
- Etcd stores its data in /var/lib/etcd by default, which is called the spool directory.
The spool directory can be changed; it can be found in the Etcd::Host role via cmsh:
[cluster->configurationoverlay[kube-default-etcd]->roles[Etcd::Host]]% get spool
/var/lib/etcd
In case it’s not /var/lib/etcd, please substitute the correct path in the rest of this article.
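Should you need to change the spool directory, it can be set in the same role (a sketch; /local/etcd is a hypothetical path, and the existing data would have to be moved there before etcd is restarted):
[cluster->configurationoverlay[kube-default-etcd]->roles[Etcd::Host]]% set spool /local/etcd
[cluster->configurationoverlay*[kube-default-etcd*]->roles*[Etcd::Host*]]% commit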
3. Check the cluster health
First, log in to one of the Etcd cluster nodes (any of the devices with the Etcd::Host role).
Then load the module file and check the health:
# module load etcd/kube-default/3.4.4
# etcdctl endpoint health
https://10.141.0.1:2379 is healthy: successfully committed proposal: took = 16.551531ms
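On a multi-node Etcd cluster, all members can be checked in one go (a sketch, assuming the --cluster flag of etcdctl 3.4, which queries every member of the cluster):
# etcdctl endpoint health --cluster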
4. Prepare a location for the backup
# mkdir -p /cm/shared/backup/etcd
5. Create the snapshot
# etcdctl snapshot save /cm/shared/backup/etcd/etcd-$(hostname)-$(date +"%Y%m%d%H%M%S")
{"level":"info","ts":1647867412.727886,"caller":"snapshot/v3_snapshot.go:110","msg":"created temporary db file","path":"/cm/shared/backup/etcd/etcd-node001-20220321135652.part"}
{"level":"info","ts":1647867412.7418113,"caller":"snapshot/v3_snapshot.go:121","msg":"fetching snapshot","endpoint":"https://10.141.0.1:2379"}
{"level":"info","ts":1647867412.841619,"caller":"snapshot/v3_snapshot.go:134","msg":"fetched snapshot","endpoint":"https://10.141.0.1:2379","took":0.110848759}
{"level":"info","ts":1647867412.849341,"caller":"snapshot/v3_snapshot.go:143","msg":"saved","path":"/cm/shared/backup/etcd/etcd-node001-20220321135652"}
Snapshot saved at /cm/shared/backup/etcd/etcd-node001-20220321135652
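The integrity of the snapshot can be verified right away (a sketch; snapshot status prints the hash, revision, total keys, and total size, which will differ per cluster):
# etcdctl snapshot status --write-out=table /cm/shared/backup/etcd/etcd-node001-20220321135652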
6. Restore the snapshot
This section assumes that there is an actual need to restore a snapshot, for example after hardware failure of all Etcd members. If some members are still up and running, then replacing, adding, or removing members while the cluster remains operational is possibly a better solution.
6.1. Important: First stop Etcd service(s)
Even if the nodes are SHUTOFF, we ensure that Bright Cluster Manager won’t try to start Etcd before the backup has been restored.
If an Etcd node comes up with a new HDD, and this disk is provisioned from scratch, Etcd’s spool directory (/var/lib/etcd) will be empty, and Etcd will come up as new, with nothing in its database.
This is a problem: once the connection with Etcd is established again, Kubernetes will consider the empty database the new desired state, and start terminating all containers that do not match this desired state.
Once we have restored the backup, Kubernetes will do its best to make the actual state match the desired state again, but running jobs and so on will already have been interrupted. This is why we need to ensure the etcd services don’t come up before the backup has been restored.
Kube API servers
We are about to unassign the Etcd::Host role from some nodes and, with that, stop the etcd services. The API servers will then be restarted with an empty list of servers for the --etcd-servers parameter (normally something like --etcd-servers=https://10.141.0.1:2379), which will result in them failing to start.
During this period kubectl cannot be used to query Kubernetes resources, but the containerized services running inside Kubernetes should continue to run where possible. When Pods crash, Kubernetes won’t be able to issue reschedules. This will all start working again once the Etcd backup has been restored and we re-assign the Etcd::Host roles.
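The effect on the API servers can be observed on the node(s) running them (a sketch; the systemd unit name kube-apiserver is an assumption and may differ per setup):
# systemctl status kube-apiserver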
How to Stop the Etcd service(s)
- Let’s launch cmsh.
- Remove the node(s) from the Configuration Overlay:
[cluster]% configurationoverlay
[cluster->configurationoverlay]% use kube-default-etcd
[cluster->configurationoverlay[kube-default-etcd]]% removefrom nodes node001
[cluster->configurationoverlay*[kube-default-etcd*]]% commit
Tue Mar 22 07:14:05 2022 [notice] node001: Service etcd was stopped
If the node was already SHUTOFF, Kubernetes is already incapable of committing any changes to its state. If the node is still UP, it will become so after the last etcd service is stopped.
Three extra steps are needed if the node is still UP:
- Log in to the Etcd node.
- Stop the etcd service: systemctl stop etcd
- Move the spool dir out of the way: mv /var/lib/etcd /var/lib/etcd.old
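To double-check before reprovisioning, verify that the service is stopped and the spool directory has been moved (a minimal sketch):
# systemctl is-active etcd
inactive
# ls -ld /var/lib/etcd.old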
Now it’s safe to power on the node and do a FULL provisioning.
6.2. Full re-provision of the Etcd node
Please keep in mind that this step wipes the entire disk for the node, and reprovisions it from scratch.
In this example, we had an HDD failure in node001, so we already lost our Etcd data in /var/lib/etcd. Let’s say this was a single-node Etcd cluster, and we want to do a FULL provisioning, and try to recover Etcd with the snapshot we made in Step 5 once this is completed.
# make sure we allow for FULL provisioning
[cluster->device[node001]]% set datanode no
[cluster->device*[node001*]]% commit
# configure our next provisioning to be FULL
[cluster->device[node001]]% set nextinstallmode full
[cluster->device*[node001*]]% commit
[cluster->device[node001]]% reboot # or start if powered off
node001: Reboot in progress ...
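Before rebooting, the two settings can be verified from the same device mode (a sketch; the values printed by get are illustrative):
[cluster->device[node001]]% get datanode
no
[cluster->device[node001]]% get nextinstallmode
FULL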
6.3. Restore the snapshot
Once the node is back up, log in to the node, and restore the backup with the following commands.
# module load etcd/kube-default/3.4.4
# etcdctl snapshot restore --data-dir=/var/lib/etcd /cm/shared/backup/etcd/etcd-node001-20220321135652
{"level":"info","ts":1647875570.8151839,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"/cm/shared/backup/etcd/etcd-node001-20220321135652","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
{"level":"info","ts":1647875570.862523,"caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":10502}
{"level":"info","ts":1647875570.8769147,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1647875570.9070866,"caller":"snapshot/v3_snapshot.go:300","msg":"restored snapshot","path":"/cm/shared/backup/etcd/etcd-node001-20220321135652","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
# chmod 755 /var/lib/etcd
# chown etcd:etcd -R /var/lib/etcd
Please don’t forget the last two commands (chmod and chown), or Etcd won’t be able to start due to permission issues.
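A quick way to confirm the permissions (a sketch; expect mode 755 and owner/group etcd:etcd on the spool directory):
# ls -ld /var/lib/etcd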
6.4. Start the Etcd Service(s)
Something to keep in mind before continuing: any changes that were made after the snapshot was taken will not match the (restored) desired state. Containers created after the snapshot might still be running, since the connection with Etcd was lost. Once the connection is re-established, Kubernetes will fix the actual state by terminating them. This is not a problem if the snapshot is recent enough.
Having said that, re-assigning the Etcd::Host role to the node(s) can be done by adding them back to the Configuration Overlay:
[cluster->configurationoverlay[kube-default-etcd]]% append nodes node001
[cluster->configurationoverlay*[kube-default-etcd*]]% commit
Bright Cluster Manager will start the Etcd service after a short delay.
Now Kubernetes should be able to reconnect to Etcd.
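At this point the health check from Step 3 can be repeated to confirm the restored member is serving requests again:
# module load etcd/kube-default/3.4.4
# etcdctl endpoint health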
7. Final notes
Please ensure that each Etcd::Host node has the datanode property set to “yes”.
[cluster->device[node001]]% set datanode yes
[cluster->device*[node001*]]% commit
The Kubernetes API servers should automatically start. If not, you can start them manually from cmsh.
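A sketch of doing so from cmsh, assuming the API server runs on the head node and the service is named kube-apiserver (use list in the services submode to see the actual names on your cluster):
[cluster]% device use master
[cluster->device[master]]% services
[cluster->device[master]->services]% list
[cluster->device[master]->services]% start kube-apiserver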