Etcd Backup and Restore with Bright 9.0+

1. Prerequisites
  • This article was written with Bright Cluster Manager 9.0 in mind, but the same procedure also applies to newer versions (at least 9.1 and 9.2).
  • We assume that shared storage is available as a mount on /cm/shared; we will create a target directory there for our Etcd backups.
  • The restore part of this KB article should only be followed if your entire Etcd cluster has to be recreated from the backup. If you run a multi-node Etcd cluster, broken members can be replaced or fixed by synchronizing from the remaining working Etcd members. This is often a better approach (compared to restoring from snapshots) and can be found here: https://kb.brightcomputing.com/knowledge-base/etcd-membership-reconfiguration-in-bright-9-0/. The snapshots are still a good backup to have.
  • Please note that there is a subtle difference between Etcd versions 3.4.x and 3.5.x: we used to set /var/lib/etcd with 0755 permissions, but this has been changed to 0700. (Setting it to 0755 with 3.5.x will result in Etcd not starting.)
    • In the same vein, etcdctl has been replaced by etcdutl for certain operations. This KB article still uses etcdctl, which should still work with all versions in BCM 9.0, 9.1 and 9.2 (see the example below this list).
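
For reference, on Etcd 3.5.x the restore performed in step 6.3 could also be done with etcdutl instead of etcdctl, assuming the etcd module also provides the etcdutl binary; the snapshot filename below is a placeholder:

# module load etcd/kube-default/<version>
# etcdutl snapshot restore /cm/shared/backup/etcd/<snapshot-file> --data-dir=/var/lib/etcd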
2. Etcd installations
  • Bright Kubernetes setups always require an odd number of Etcd nodes.
  • Three Etcd nodes are recommended, but single-node Etcd deployments are also possible.
  • Etcd nodes are marked as datanodes. This prevents Full Provisioning from unintentionally wiping the Etcd database.
  • Etcd stores its data in /var/lib/etcd by default, which is called the spool directory.

The spool directory can be changed; it can be found in the Etcd::Host role via cmsh:

[cluster->configurationoverlay[kube-default-etcd]->roles[Etcd::Host]]% get spool 
/var/lib/etcd

If it is not /var/lib/etcd, please substitute the correct path in the rest of this article.
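
For reference, should the spool directory ever need to be changed, this can be done on the same role and committed (the path below is only an example):

[cluster->configurationoverlay[kube-default-etcd]->roles[Etcd::Host]]% set spool /var/lib/etcd-custom
[cluster->configurationoverlay*[kube-default-etcd*]->roles*[Etcd::Host*]]% commit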

3. Check the cluster health

First, log in to one of the Etcd cluster nodes (any of the devices with the Etcd::Host role).

Then load the module file and check the endpoint health:

# module load etcd/kube-default/<version>
# etcdctl endpoint health
https://10.141.0.1:2379 is healthy: successfully committed proposal: took = 16.551531ms
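
On a multi-node Etcd cluster it can also be useful to check all members and endpoints at once, for example with the following commands (the output will vary per cluster):

# etcdctl member list -w table
# etcdctl endpoint status --cluster -w table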
4. Prepare a location for the backup
# mkdir -p /cm/shared/backup/etcd
5. Create the snapshot
# etcdctl snapshot save /cm/shared/backup/etcd/etcd-$(hostname)-$(date +"%Y%m%d%H%M%S")
{"level":"info","ts":1647867412.727886,"caller":"snapshot/v3_snapshot.go:110","msg":"created temporary db file","path":"/cm/shared/backup/etcd/etcd-node001-20220321135652.part"}
{"level":"info","ts":1647867412.7418113,"caller":"snapshot/v3_snapshot.go:121","msg":"fetching snapshot","endpoint":"https://10.141.0.1:2379"}
{"level":"info","ts":1647867412.841619,"caller":"snapshot/v3_snapshot.go:134","msg":"fetched snapshot","endpoint":"https://10.141.0.1:2379","took":0.110848759}
{"level":"info","ts":1647867412.849341,"caller":"snapshot/v3_snapshot.go:143","msg":"saved","path":"/cm/shared/backup/etcd/etcd-node001-20220321135652"}
Snapshot saved at /cm/shared/backup/etcd/etcd-node001-20220321135652
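
Optionally, the integrity of the snapshot file can be verified before relying on it, for example as follows (on Etcd 3.5.x, etcdutl snapshot status can be used instead):

# etcdctl snapshot status /cm/shared/backup/etcd/etcd-node001-20220321135652 -w table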
6. Restore the snapshot

This section assumes that there is an actual need to restore a snapshot, for example after a hardware failure of all Etcd members. If some members are still up and running, then replacing, adding or removing members while the cluster remains operational may be the better solution.

6.1. Important: First stop Etcd service(s)

Even if the nodes are SHUTOFF, we ensure that Bright Cluster Manager won’t try to start Etcd before the backup has been restored.

If an Etcd node comes up with a new HDD, and this disk is provisioned from scratch, Etcd’s spool directory (/var/lib/etcd) will be empty, and Etcd will come up as a new instance with nothing in its database.

This is a problem: once the connection with Etcd is re-established, Kubernetes will consider the empty database the new desired state, and will start terminating all containers that do not match this desired state.

Once we restore the backup, Kubernetes will do its best to make the actual state match the desired state again, but running jobs and so on will already have been interrupted. This is why we need to ensure the etcd services do not come up before the backup has been restored.

Kube API servers

Since we are about to unassign the Etcd::Host roles from the nodes, and with that stop the etcd services, the API servers will be restarted with an empty list of servers for the --etcd-servers parameter (e.g. --etcd-servers=https://10.141.0.1:2379). This will result in them failing to start.

During this period kubectl cannot be used to query Kubernetes resources, but the containerized services running inside Kubernetes should continue to run where possible. If Pods crash, Kubernetes will not be able to reschedule them. This will all start working again once the Etcd backup has been restored and we re-assign the Etcd::Host roles.
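
While the API servers are down, this can be observed with a simple query from the Head Node (load the kubernetes module first if kubectl is not in the PATH); the command below is expected to fail or time out until the roles are re-assigned in step 6.4:

# kubectl get nodes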

How to Stop the Etcd service(s)

  • Let’s launch cmsh.
  • Remove the node(s) from the Configuration Overlay
[cluster]% configurationoverlay 
[cluster->configurationoverlay]% use kube-default-etcd 
[cluster->configurationoverlay[kube-default-etcd]]% removefrom nodes node001 
[cluster->configurationoverlay*[kube-default-etcd*]]% commit
Tue Mar 22 07:14:05 2022 [notice] node001: Service etcd was stopped

If the node was already SHUTOFF, Kubernetes is already incapable of updating any changes to its state. If the node is still UP, this will become the case after the last etcd service is stopped.

To re-emphasize: if this is a three-node Etcd cluster, do this for all of the Etcd nodes, not just one (an example for multiple nodes is shown below). Etcd needs to be “collectively” not running before we restore the snapshot.
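
Assuming, for example, that the three Etcd nodes are node001..node003 (adjust to the actual node names), they can be removed from the Configuration Overlay in one go:

[cluster->configurationoverlay[kube-default-etcd]]% removefrom nodes node001..node003
[cluster->configurationoverlay*[kube-default-etcd*]]% commit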

Three extra steps are needed if the node is still UP and we are dealing with a Compute Node:

  • Login to the Etcd node.
  • Stop the etcd service:
    systemctl stop etcd

    (Optionally, check first with systemctl status etcd whether stopping it is necessary.)
  • Move the spool dir out of the way:
    mv /var/lib/etcd /var/lib/etcd.old

    If you do not plan to do a FULL reprovisioning of the Compute Node, recreate the directory.
    For Etcd version 3.4.x:
    mkdir /var/lib/etcd; chmod 0755 /var/lib/etcd 
    For Etcd version 3.5.x:
    mkdir /var/lib/etcd; chmod 0700 /var/lib/etcd

Now it is safe to power on the node and do a FULL provisioning (see the next section).

The following steps apply if we are dealing with a Head Node:

  • Stop the etcd service:
    systemctl stop etcd

    (Optionally, check first with systemctl status etcd whether stopping it is necessary.)
  • Move the spool dir out of the way:
    mv /var/lib/etcd /var/lib/etcd.old
  • Recreate the directory.
    For Etcd version 3.4.x:
    mkdir /var/lib/etcd; chmod 0755 /var/lib/etcd 
    For Etcd version 3.5.x:
    mkdir /var/lib/etcd; chmod 0700 /var/lib/etcd

6.2. FULL provisioning

Please keep in mind that this step wipes the entire disk for the node, and reprovisions it from scratch.

In this example, we had an HDD failure in node001, so we already lost our Etcd data in /var/lib/etcd. Let’s say this was a single-node Etcd cluster, and we want to do a FULL provisioning, and try to recover Etcd with the snapshot we made in Step 5 once this is completed.

# make sure we allow for FULL provisioning
[cluster->device[node001]]% set datanode no
[cluster->device*[node001*]]% commit

# configure our next provisioning to be FULL
[cluster->device[node001]]% set nextinstallmode full 
[cluster->device*[node001*]]% commit
[cluster->device[node001]]% reboot  # or start if powered off
node001: Reboot in progress ...

If we were dealing with a three-node Etcd cluster, we would have to do this for all three nodes (see the example below).
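
For example, assuming the Etcd nodes are node001..node003, the same settings can be applied to all of them from device mode; the foreach syntax below is a sketch, adjust the node names as needed:

[cluster->device]% foreach -n node001..node003 (set datanode no; set nextinstallmode full)
[cluster->device*]% commit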

6.3. Restore the snapshot

Once the node is back up, log in to the node, and restore the backup with the following commands.

# module load etcd/kube-default/<your_version>
# etcdctl snapshot restore --data-dir=/var/lib/etcd /cm/shared/backup/etcd/etcd-node001-20220321135652 
{"level":"info","ts":1647875570.8151839,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"/cm/shared/backup/etcd/etcd-node001-20220321135652","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
{"level":"info","ts":1647875570.862523,"caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":10502}
{"level":"info","ts":1647875570.8769147,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1647875570.9070866,"caller":"snapshot/v3_snapshot.go:300","msg":"restored snapshot","path":"/cm/shared/backup/etcd/etcd-node001-20220321135652","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
# chmod 755 /var/lib/etcd  # use chmod 700 for 3.5.x Etcd!
# chown etcd:etcd -R /var/lib/etcd

Please do not forget the last two commands (chmod and chown), or Etcd will not be able to start due to permission issues.

If we were dealing with a three-node Etcd cluster, we would have to do this for all three nodes.

6.4. Start the Etcd Service(s)

Something to keep in mind before continuing: any changes that were made after the snapshot was taken will not match the (restored) desired state. Containers created after the snapshot might still be running, since the connection with Etcd was lost. Once the connection is re-established, Kubernetes will fix the actual state by terminating them. This is not a problem if the snapshot is recent enough.

Having said that, re-assigning the Etcd::Host role to the node(s) can be done by adding them back to the Configuration Overlay:

[cluster->configurationoverlay[kube-default-etcd]]% append nodes node001  
[cluster->configurationoverlay*[kube-default-etcd*]]% commit

Bright Cluster Manager will start the Etcd service after a short delay.

Now Kubernetes should be able to reconnect to Etcd.

If we were dealing with a three-node Etcd cluster, we would have to do this for all three nodes.
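
Once the roles have been re-assigned and the services have started, the recovery can be verified from one of the Etcd nodes, for example with the commands below (load the kubernetes module first if kubectl is not in the PATH):

# module load etcd/kube-default/<version>
# etcdctl endpoint health
# kubectl get nodes
# kubectl get pods --all-namespaces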

7. Final notes

Please ensure that each Etcd::Host node has the datanode property set to “yes”.

[cluster->device[node001]]% set datanode yes 
[cluster->device*[node001*]]% commit

The Kubernetes API servers should start automatically. If not, you can start them manually from cmsh (see the example below).
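
One possible way to do this is via the services submode of a node with the Kubernetes::ApiServer role; node001 and the service name kube-apiserver below are assumptions, the actual service name can be checked with list:

[cluster]% device use node001
[cluster->device[node001]]% services
[cluster->device[node001]->services]% list
[cluster->device[node001]->services]% restart kube-apiserver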

Updated on October 15, 2024
