1. Prerequisites
- The following article was written with Bright Cluster Manager 9.1 in mind but should work the same for versions 9.0 and 9.2.
- The feature used by this article is called Learners in etcd, and more can be read about the design here: https://etcd.io/docs/v3.5/learning/design-learner/
- The minimum required etcd version is 3.4.4 (earlier versions, such as 3.4.3, contain serious bugs).
2. Background
Crucially, a majority of nodes have to agree to elect a new leader (2/3, 4/6, etc.), and if a majority can’t be reached, the entire cluster will be unavailable. What this means in practice is that etcd will remain available as long as a majority of nodes is online.
Source: the etcd documentation.
In etcd terminology a node is also often referred to as a member. In this article we use the term member for the etcd process and reserve the term node for the underlying machine.
This means a three-member cluster can tolerate one broken member. And a five-member cluster can tolerate two broken members.
To ensure that a majority of members stays healthy at all times, it is recommended to always add or remove one member at a time.
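As a quick illustration of the arithmetic (a generic sketch, not specific to Bright Cluster Manager): the quorum for N members is floor(N/2) + 1, so the cluster tolerates N minus that quorum in failed members.
for N in 1 3 5 7; do
  quorum=$(( N / 2 + 1 ))
  echo "members=$N quorum=$quorum tolerated_failures=$(( N - quorum ))"
done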
3. Common scenarios
This KB article focuses on the features that etcd itself provides for bringing up new members. Common scenarios include:
- Extend a single-member etcd cluster to three members.
- Replace a member, due to unexpected hardware failure.
- Migrate one of the members to another node.
Backing up etcd’s data to shared storage using other tools, such as rsync, can in some cases be more practical. We dedicate a separate section (section 7.) to this near the end of this KB article.
4. Create snapshots
Although this article is about adding, removing or replacing members of a running etcd cluster, it might still be worth creating a snapshot of the database first. For details refer to this KB article.
In short, a simple parallel approach to this could be:
- Check each endpoint’s health:
pdsh -w node00[1-3] "module load etcd; etcdctl -w table endpoint health"
- Inspect the output, see if all members that are expected to be healthy are.
- Create a directory on shared storage that is not tied to the node:
mkdir -p /cm/shared/backup/etcd
- Create the snapshots:
pdsh -w node00[1-3] "module load etcd; etcdctl snapshot save /cm/shared/backup/etcd/etcd-\$(hostname)-\$(date +"%Y%m%d%H%M%S")"
- Inspect the output.
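Optionally, you can verify that each snapshot file is readable with etcdctl's snapshot status subcommand (illustrative, assuming the backup path above and that the snapshots are visible from the node you run this on):
module load etcd
for f in /cm/shared/backup/etcd/etcd-node00*; do echo "== $f"; etcdctl snapshot status -w table "$f"; done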
5. Removing a member
- If you wish to only add new members, skip this section and go to section 6.
- If you wish to migrate a member to another node, skip this section first, add the new member (section 6.) and then come back to this section (5.) to remove the old member.
- If you wish to replace an offline broken member, and you want to bring it back up with new hardware, continue with the following steps in this section.
- If you wish to replace an online member, consider taking the following steps:
  - Take the etcd member on the node offline by following section 5.1.
  - Back up the etcd directories by following section 7.1.
  - Make all the needed changes, for example a FULL provisioning of the node.
  - Restore the important directories by following section 7.2.
    (Section 7.2. also includes how to bring the etcd member online.)
5.1. Remove node from the etcd Configuration Overlay
For this example we will remove node003. We do not want Bright Cluster Manager to start the etcd service on node003, so the Etcd::Host role needs to be unassigned as follows:
[root@headnode ~]# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% use kube-default-etcd
[headnode->configurationoverlay[kube-default-etcd]]% removefrom nodes node003
[headnode->configurationoverlay*[kube-default-etcd*]]% commit
[headnode->configurationoverlay[kube-default-etcd]]%
Tue Apr 5 12:19:30 2022 [notice] node003: Service etcd was stopped
This is necessary to prevent etcd from being started once the node comes back up, which would fail with the following error: member 1c38cdf4114b933d has already been bootstrapped
This error is the result of the other etcd members recognizing the host as an existing member with identifier 1c38cdf4114b933d. However, this does not match the internal database on node003, since node003 lost its database.
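As a quick check at this point (a minimal sketch, assuming pdsh access to the node), you can confirm that the etcd service is indeed stopped and disabled on node003:
pdsh -w node003 "systemctl is-active etcd; systemctl is-enabled etcd"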
5.2. Remove member from etcd
SSH to one of the etcd members that are up and running and list all the members to get their identifiers.
[root@node001 ~]# module load etcd/kube-default/3.4.13
[root@node001 ~]# etcdctl member list
10cee25dc156ff4a, started, node002, https://10.141.0.2:2380, https://10.141.0.2:2379, false
4a336cbcb0bafdc0, started, node001, https://10.141.0.1:2380, https://10.141.0.1:2379, false
bd786940e5446229, started, node003, https://10.141.0.3:2380, https://10.141.0.3:2379, false
Save the above output if you wish to re-add a member later with the same endpoint. Then proceed with the removal.
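If you prefer to script the lookup of the identifier, it can be extracted by member name as follows (illustrative only; node003 is the member we are about to remove). The resulting value can then be passed to etcdctl member remove:
[root@node001 ~]# MEMBER_ID=$(etcdctl member list | awk -F', ' '$3 == "node003" {print $1}')
[root@node001 ~]# echo "$MEMBER_ID"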
[root@node001 ~]# etcdctl member remove bd786940e5446229
Member bd786940e5446229 removed from cluster eef6e88516650e5b
[root@node001 ~]# etcdctl member list
10cee25dc156ff4a, started, node002, https://10.141.0.2:2380, https://10.141.0.2:2379, false
4a336cbcb0bafdc0, started, node001, https://10.141.0.1:2380, https://10.141.0.1:2379, false
6. Add new etcd member
In this example we will add a new node, node003, with IP address 10.141.0.3, as a learner. Whether this node has been removed in the previous section doesn’t matter; it might as well be a completely new node. Let’s say the hard drive has been completely replaced, and the node is back online.
6.1. Sanity checks on the node
This step should be unnecessary, but here are the preconditions that have to be met nonetheless:
- Confirm that the service is stopped/disabled with:
systemctl status etcd
- Confirm that the /var/lib/etcd directory is clean:
ls -al /var/lib/etcd/
(If not, please delete its contents, e.g. rm -rf /var/lib/etcd/member.)
- You can ignore the case where the permissions of /var/lib/etcd are erroneously set to 0755; this is a bug at the time of writing that will soon be fixed (the correct permissions are 0700).
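The checks above can also be run remotely in one go (a minimal sketch, assuming pdsh access to node003):
pdsh -w node003 "systemctl is-active etcd; ls -al /var/lib/etcd/"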
6.2. Add the node as an etcd learner
SSH to one of the healthy etcd members and add the node as a learner (note the --learner flag). The terminology comes from the fact that potential new members must first learn the existing cluster’s database; once they have caught up, they can be promoted to full (voting) members.
[root@node001 ~]# etcdctl member add node003 --learner --peer-urls=https://10.141.0.3:2380
Member 690ed538336601f4 added to cluster eef6e88516650e5b
ETCD_NAME="node003"
ETCD_INITIAL_CLUSTER="node002=https://10.141.0.2:2380,node001=https://10.141.0.1:2380,node003=https://10.141.0.3:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.141.0.3:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
The above output first prints a confirmation with the identifier of the new member, followed by a few environment variables. These are meant to be set in the environment before running the etcd service (the etcd binary) as a learner.
In our case we also need Bright Cluster Manager to generate certificates, since we use secure communication. Therefore it is easier to ignore these values and use a different approach: we create a configuration overlay with an equivalent flag (--initial-cluster-state=existing).
6.3. Create etcd learners Configuration Overlay
This only has to be done once; it may also be helpful to keep this configuration overlay around for the future.
[root@headnode ~]# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% clone kube-default-etcd kube-default-etcd-learners
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% show
Parameter Value
-------------------------------- ------------------------------------------------
Name kube-default-etcd-learners
Revision
All head nodes no
Priority 500
Nodes node001,node002
Categories
Roles Etcd::Host
Customizations <0 in submode>
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% set priority 510
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% set allheadnodes no
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% set nodes
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% set categories
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% append nodes node003
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% show
Parameter Value
-------------------------------- ------------------------------------------------
Name kube-default-etcd-learners
Revision
All head nodes no
Priority 510
Nodes node003
Categories
Roles Etcd::Host
Customizations <0 in submode>
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% roles
[headnode->configurationoverlay*[kube-default-etcd-learners*]->roles*]% use etcd::host
[headnode->configurationoverlay*[kube-default-etcd-learners*]->roles*[Etcd::Host*]]% append options "--initial-cluster-state=existing"
[headnode->configurationoverlay*[kube-default-etcd-learners*]->roles*[Etcd::Host*]]% commit
The reason we set “allheadnodes” to “no” and clear “nodes” and “categories” is that we do not want this overlay to apply to any node other than the one we have in mind; the “clone” command copies whatever was in the original configuration overlay.
Note that besides clearing those values, we explicitly appended “node003”, since we want this particular Etcd::Host role, which contains the extra option, to be assigned to it.
If you kept this configuration overlay around from a previous occasion, appending the node would be the only step needed here.
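In that case a one-liner similar to the following should be enough (a sketch, assuming cmsh’s -c batch mode):
[root@headnode ~]# cmsh -c "configurationoverlay; use kube-default-etcd-learners; append nodes node003; commit"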
Wait a while after the commit; some back-and-forth will happen at this point (certificates are created, API servers are restarted, and finally etcd is started). Output inside cmsh might be similar to:
Tue Apr 5 12:37:12 2022 [notice] headnode: New certificate request with ID: 48
Tue Apr 5 12:37:13 2022 [notice] node003: Service etcd was not allowed to restart
Tue Apr 5 12:37:13 2022 [notice] node003: Service etcd was not allowed start
Tue Apr 5 12:37:23 2022 [notice] node001: Service etcd was restarted
Tue Apr 5 12:37:23 2022 [notice] node002: Service etcd was restarted
Tue Apr 5 12:37:25 2022 [notice] headnode: Service kube-apiserver was restarted
Tue Apr 5 12:37:35 2022 [warning] node003: Service etcd died
Tue Apr 5 12:37:35 2022 [notice] node003: Service etcd was not restarted
Tue Apr 5 12:37:35 2022 [notice] headnode: New certificate request with ID: 49
Tue Apr 5 12:37:39 2022 [notice] headnode: New certificate request with ID: 50
Tue Apr 5 12:38:07 2022 [warning] node003: Service etcd died
Tue Apr 5 12:38:15 2022 [notice] node003: Service etcd was restarted
6.4. Confirm and Promote the learner
Confirm this via etcdctl on a working etcd node:
[root@node001 ~]# etcdctl member list
10cee25dc156ff4a, started, node002, https://10.141.0.2:2380, https://10.141.0.2:2379, false
4a336cbcb0bafdc0, started, node001, https://10.141.0.1:2380, https://10.141.0.1:2379, false
690ed538336601f4, started, node003, https://10.141.0.3:2380, https://10.141.0.3:2379, true
The last line in the above output shows that node003 has been added, and the final boolean value “true” indicates that it is a learner. Now we can promote it:
[root@node001 ~]# etcdctl member promote 690ed538336601f4
Member 690ed538336601f4 promoted in cluster eef6e88516650e5b
[root@node001 ~]# etcdctl member list
10cee25dc156ff4a, started, node002, https://10.141.0.2:2380, https://10.141.0.2:2379, false
4a336cbcb0bafdc0, started, node001, https://10.141.0.1:2380, https://10.141.0.1:2379, false
690ed538336601f4, started, node003, https://10.141.0.3:2380, https://10.141.0.3:2379, false
After promoting, we see the learner flag has changed to “false”.
6.5. Move the node to the original Configuration Overlay
Using cmsh, remove the node from “kube-default-etcd-learners”, and add it to “kube-default-etcd”, then commit both.
[root@headnode ~]# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% use kube-default-etcd-learners
[headnode->configurationoverlay[kube-default-etcd-learners]]% removefrom nodes node003
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% ..
[headnode->configurationoverlay*]% use kube-default-etcd
[headnode->configurationoverlay*[kube-default-etcd]]% append nodes node003
[headnode->configurationoverlay*[kube-default-etcd*]]% ..
[headnode->configurationoverlay*]% commit
Successfully committed 2 ConfigurationOverlays
[headnode->configurationoverlay]%
Tue Apr 5 12:46:02 2022 [notice] node003: Service etcd was restarted
This results in a restart once more, because the service is no longer started with the --initial-cluster-state=existing flag.
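To double-check, a quick look at the running process (a sketch, assuming SSH access to node003) should show that the flag is gone:
ssh node003 'ps -o args= -C etcd | grep -o -- "--initial-cluster-state=existing" || echo "flag not present"'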
7. Backup and Restore etcd with rsync
This is useful for backing up etcd members that are still online and scheduled to be replaced. A common scenario for this is a change in disk layout.
Do not follow these steps if you wish to keep the node up; use etcd’s snapshot functionality in that case (see section 4.). This method of backup and restore requires you to stop the etcd service first, since we do not want etcd writing to its spool directory while we create the backup.
There are two important directories:
- /var/lib/etcd (the “spool” directory, contains the data)
- /cm/local/apps/etcd/var/etc/ (contains config and certificates)
Please be aware that these are the default paths and that they can be changed within Bright Cluster Manager. The spool directory is the single most important one; Bright Cluster Manager will automatically re-create the config and certificates in case we do not back them up.
You can find the spool directory configured here:
[root@headnode ~]# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% use kube-default-etcd
[headnode->configurationoverlay[kube-default-etcd]]% roles
[headnode->configurationoverlay[kube-default-etcd]->roles]% use etcd::host
[headnode->configurationoverlay[kube-default-etcd]->roles[Etcd::Host]]% show
Parameter Value
-------------------------------- ------------------------------------------------
Name Etcd::Host
Revision
Type EtcdHostRole
Add services yes
Member Certificate
Member Certificate Key
Provisioning associations <0 internally used>
Etcd Cluster kube-default
Member Name $hostname
Spool /var/lib/etcd
Listen Client URLs https://0.0.0.0:2379
Listen Peer URLs https://0.0.0.0:2380
Advertise Client URLs https://$ip:2379
Advertise Peer URLs https://$ip:2380
Snapshot Count 5000
Options
Debug no
Should the path be different from /var/lib/etcd, please substitute the correct path for the rest of this section.
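To fetch just this value non-interactively, something like the following should also work (a sketch, assuming cmsh’s -c batch mode):
[root@headnode ~]# cmsh -c "configurationoverlay; use kube-default-etcd; roles; use etcd::host; get spool"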
7.1. Backing up
- Follow section 5.1. to remove the node from the etcd configuration overlay. This ensures the service is stopped.
- SSH to the node and ensure that etcd has stopped:
systemctl status etcd
- Prepare a directory where we can store our backup, e.g. on mounted shared storage.
mkdir -p /cm/shared/etcd-backups/$(hostname)/{etcd,etc}
- Rsync the directories to this location.
rsync -raPv --delete /var/lib/etcd/ /cm/shared/etcd-backups/$(hostname)/etcd/
rsync -raPv --delete /cm/local/apps/etcd/var/etc/ /cm/shared/etcd-backups/$(hostname)/etc/
Now the etcd data should be saved in such a way that we can restore it later.
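As an optional sanity check (illustrative only, using the paths from the steps above), verify that the backup matches the source before making any changes to the node:
diff -r /var/lib/etcd/ /cm/shared/etcd-backups/$(hostname)/etcd/ && echo "spool backup matches"
diff -r /cm/local/apps/etcd/var/etc/ /cm/shared/etcd-backups/$(hostname)/etc/ && echo "config backup matches"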
7.2. Restoring
Assuming that the node is now back online, we want to restore the backup we made in section 7.1. before we re-assign the etcd role in cmsh.
- SSH to the node and execute the rsync into the other direction:
rsync -raPv --delete /cm/shared/etcd-backups/$(hostname)/etcd/ /var/lib/etcd/
rsync -raPv --delete /cm/shared/etcd-backups/$(hostname)/etc/ /cm/local/apps/etcd/var/etc/
- Go to cmsh and add the node back into the etcd configuration overlay:
[headnode->configurationoverlay[kube-default-etcd]]% append nodes node003
[headnode->configurationoverlay*[kube-default-etcd*]]% commit
Confirm that the member has been accepted into the cluster by inspecting its status:
[root@node001 ~]# module load etcd/kube-default/3.4.13
[root@node001 ~]# export ETCDCTL_ENDPOINTS=$(etcdctl member list | awk -F ',' '{print $5}' | sed 's/\s//' | paste -sd ",")
[root@node001 ~]# etcdctl endpoint status
https://10.141.0.2:2379, 10cee25dc156ff4a, 3.4.13, 4.8 MB, false, false, 10, 24644, 24644,
https://10.141.0.1:2379, 4a336cbcb0bafdc0, 3.4.13, 4.9 MB, true, false, 10, 24644, 24644,
https://10.141.0.3:2379, 690ed538336601f4, 3.4.13, 4.9 MB, false, false, 10, 24644, 24644,
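Finally, a health check across the same endpoints (using the ETCDCTL_ENDPOINTS variable set above) should report every member as healthy:
[root@node001 ~]# etcdctl -w table endpoint health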