1. Prerequisites
- The following article was written with Bright Cluster Manager 9.1 in mind but should work the same for versions 9.0 and 9.2.
- The feature used by this article is called Learners in etcd, and more can be read about the design here: https://etcd.io/docs/v3.5/learning/design-learner/
- The minimum required etcd version is 3.4.4 (earlier versions, such as 3.4.3, contain serious bugs).
2. Background
Crucially, a majority of nodes have to agree to elect a new leader (2/3, 4/6, etc.), and if a majority can’t be reached, the entire cluster will be unavailable. What this means in practice is that etcd will remain available as long as a majority of nodes is online.
Source: the etcd documentation.
In etcd terminology a node is also often referred to as a member. In this article we use the term member for the etcd process and reserve the term node for the underlying machine.
This means a three-member cluster can tolerate one broken member. And a five-member cluster can tolerate two broken members.
To ensure that a majority of members stays healthy at all times, it is recommended to always add or remove one member at a time.
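As a quick illustration of the arithmetic (a generic sketch, not specific to Bright Cluster Manager): the quorum for N members is floor(N/2) + 1, so the cluster tolerates N minus that quorum in failed members.
for N in 1 3 5 7; do
  quorum=$(( N / 2 + 1 ))
  echo "members=$N quorum=$quorum tolerated_failures=$(( N - quorum ))"
done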
3. Common scenarios
This KB article focuses on the features that etcd itself provides for bringing up new members. Common scenarios include:
- Extend a single-member etcd cluster to three members.
- Replace a member, due to unexpected hardware failure.
- Migrate one of the members to another node.
Backing up etcd’s data to shared storage using other tools, such as rsync, can in some cases be more practical. We dedicate a separate section (section 7.) to this near the end of this KB article.
4. Create snapshots
Although this article is about adding, removing or replacing members of a running etcd cluster, it might still be worth creating a snapshot of the database first. For details refer to this KB article.
In short, a simple parallel approach to this could be:
- Check each endpoint’s health:
pdsh -w node00[1-3] "module load etcd; etcdctl -w table endpoint health"
- Inspect the output, see if all members that are expected to be healthy are.
- Create a directory on shared storage that is not tied to the node:
mkdir -p /cm/shared/backup/etcd
- Create the snapshots:
pdsh -w node00[1-3] "module load etcd; etcdctl snapshot save /cm/shared/backup/etcd/etcd-\$(hostname)-\$(date +"%Y%m%d%H%M%S")"
- Inspect the output.
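Optionally, you can verify that each snapshot file is readable with etcdctl's snapshot status subcommand (illustrative, assuming the backup path above and that the snapshots are visible from the node you run this on):
module load etcd
for f in /cm/shared/backup/etcd/etcd-node00*; do echo "== $f"; etcdctl snapshot status -w table "$f"; done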
5. Removing a member
- If you wish to only add new members, skip this section and go to section 6.
- If you wish to migrate a member to another node, skip this section first, add the new member (section 6.) and then come back to this section (5.) to remove the old member.
- If you wish to replace an offline broken member, and you want to bring it back up with new hardware, continue with the following steps in this section.
- If you wish to replace an online member, consider taking the following steps:
  - Take the etcd member on the node offline by following section 5.1.
  - Back up the etcd directories by following section 7.1.
  - Make all the needed changes, for example a FULL provisioning of the node.
  - Restore the important directories by following section 7.2.
    (Section 7.2. also includes how to bring the etcd member online.)
5.1. Remove node from the etcd Configuration Overlay
For this example we will remove node003. We do not want Bright Cluster Manager to start the etcd service on node003, so the Etcd::Host role needs to be unassigned as follows:
[root@headnode ~]# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% use kube-default-etcd
[headnode->configurationoverlay[kube-default-etcd]]% removefrom nodes node003
[headnode->configurationoverlay*[kube-default-etcd*]]% commit
[headnode->configurationoverlay[kube-default-etcd]]%
Tue Apr 5 12:19:30 2022 [notice] node003: Service etcd was stopped
This is necessary to prevent etcd from being started once the node comes back up, which would fail with the following error: member 1c38cdf4114b933d has already been bootstrapped
This error is the result of the other etcd members recognizing the host as an existing member with identifier 1c38cdf4114b933d. However, this does not match the internal database on node003, since node003 lost its database.
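As a quick check at this point (a minimal sketch, assuming pdsh access to the node), you can confirm that the etcd service is indeed stopped and disabled on node003:
pdsh -w node003 "systemctl is-active etcd; systemctl is-enabled etcd"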
5.2. Remove member from etcd
SSH to one of the etcd members that are up and running and list all the members to get their identifiers.
[root@node001 ~]# module load etcd/kube-default/3.4.13
[root@node001 ~]# etcdctl member list
10cee25dc156ff4a, started, node002, https://10.141.0.2:2380, https://10.141.0.2:2379, false
4a336cbcb0bafdc0, started, node001, https://10.141.0.1:2380, https://10.141.0.1:2379, false
bd786940e5446229, started, node003, https://10.141.0.3:2380, https://10.141.0.3:2379, false
Save the above output if you wish to re-add a member later with the same endpoint. Then proceed with the removal.
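If you prefer to script the lookup of the identifier, it can be extracted by member name as follows (illustrative only; node003 is the member we are about to remove). The resulting value can then be passed to etcdctl member remove:
[root@node001 ~]# MEMBER_ID=$(etcdctl member list | awk -F', ' '$3 == "node003" {print $1}')
[root@node001 ~]# echo "$MEMBER_ID"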
[root@node001 ~]# etcdctl member remove bd786940e5446229
Member bd786940e5446229 removed from cluster eef6e88516650e5b
[root@node001 ~]# etcdctl member list
10cee25dc156ff4a, started, node002, https://10.141.0.2:2380, https://10.141.0.2:2379, false
4a336cbcb0bafdc0, started, node001, https://10.141.0.1:2380, https://10.141.0.1:2379, false
6. Add new etcd member
In this example we will add a new node, node003, with IP address 10.141.0.3, as a learner. Whether this node has been removed in the previous section doesn’t matter; it might as well be a completely new node. Let’s say the hard drive has been completely replaced, and the node is back online.
6.1. Sanity checks on the node
This step should be unnecessary, but here are the preconditions that have to be met nonetheless:
- Confirm that the service is stopped/disabled with:
systemctl status etcd
- Confirm that the /var/lib/etcd directory is clean:
ls -al /var/lib/etcd/
(If not, please delete its contents, e.g. rm -rf /var/lib/etcd/member.)
- You can ignore the case where the permissions of /var/lib/etcd are erroneously set to 0755; this is a bug at the time of writing that will soon be fixed (the correct permissions are 0700).
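The checks above can also be run remotely in one go (a minimal sketch, assuming pdsh access to node003):
pdsh -w node003 "systemctl is-active etcd; ls -al /var/lib/etcd/"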
6.2. Add the node as an etcd learner
SSH to one of the healthy etcd members and add the node as a learner (note the --learner flag). The terminology comes from the fact that potential new members must first learn the existing cluster’s database; once they have caught up, they can be promoted to full (voting) members.
[root@node001 ~]# etcdctl member add node003 --learner --peer-urls=https://10.141.0.3:2380
Member 690ed538336601f4 added to cluster eef6e88516650e5b
ETCD_NAME="node003"
ETCD_INITIAL_CLUSTER="node002=https://10.141.0.2:2380,node001=https://10.141.0.1:2380,node003=https://10.141.0.3:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.141.0.3:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
The above output first prints a confirmation with the identifier of the new member, followed by a few environment variables. These are meant to be set in the environment before running the etcd service (the etcd binary) as a learner.
In our case we also need Bright Cluster Manager to generate certificates, since we use secure communication. Therefore it is easier to ignore these values and use a different approach: we create a configuration overlay with an equivalent flag (--initial-cluster-state=existing).
6.3. Create etcd learners Configuration Overlay
This only has to be done once; it may also be helpful to keep this configuration overlay around for the future.
[root@headnode ~]# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% clone kube-default-etcd kube-default-etcd-learners
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% show
Parameter Value
-------------------------------- ------------------------------------------------
Name kube-default-etcd-learners
Revision
All head nodes no
Priority 500
Nodes node001,node002
Categories
Roles Etcd::Host
Customizations <0 in submode>
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% set priority 510
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% set allheadnodes no
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% set nodes
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% set categories
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% append nodes node003
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% show
Parameter Value
-------------------------------- ------------------------------------------------
Name kube-default-etcd-learners
Revision
All head nodes no
Priority 510
Nodes node003
Categories
Roles Etcd::Host
Customizations <0 in submode>
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% roles
[headnode->configurationoverlay*[kube-default-etcd-learners*]->roles*]% use etcd::host
[headnode->configurationoverlay*[kube-default-etcd-learners*]->roles*[Etcd::Host*]]% append options "--initial-cluster-state=existing"
[headnode->configurationoverlay*[kube-default-etcd-learners*]->roles*[Etcd::Host*]]% commit
The reason we set “allheadnodes” to “no” and clear “nodes” and “categories” is that we do not want this overlay to apply to any node other than the one we have in mind; the “clone” command copies whatever was in the original configuration overlay.
Note that besides clearing those values, we explicitly appended “node003”, since we want this particular Etcd::Host role, which contains the extra option, to be assigned to it.
If you kept this configuration overlay around from a previous occasion, appending the node would be the only step needed here.
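In that case a one-liner similar to the following should be enough (a sketch, assuming cmsh’s -c batch mode):
[root@headnode ~]# cmsh -c "configurationoverlay; use kube-default-etcd-learners; append nodes node003; commit"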
Wait a while after the commit; some back-and-forth will happen at this point (certificates are created, API servers are restarted, and finally etcd is started). Output inside cmsh might be similar to:
Tue Apr 5 12:37:12 2022 [notice] headnode: New certificate request with ID: 48
Tue Apr 5 12:37:13 2022 [notice] node003: Service etcd was not allowed to restart
Tue Apr 5 12:37:13 2022 [notice] node003: Service etcd was not allowed start
Tue Apr 5 12:37:23 2022 [notice] node001: Service etcd was restarted
Tue Apr 5 12:37:23 2022 [notice] node002: Service etcd was restarted
Tue Apr 5 12:37:25 2022 [notice] headnode: Service kube-apiserver was restarted
Tue Apr 5 12:37:35 2022 [warning] node003: Service etcd died
Tue Apr 5 12:37:35 2022 [notice] node003: Service etcd was not restarted
Tue Apr 5 12:37:35 2022 [notice] headnode: New certificate request with ID: 49
Tue Apr 5 12:37:39 2022 [notice] headnode: New certificate request with ID: 50
Tue Apr 5 12:38:07 2022 [warning] node003: Service etcd died
Tue Apr 5 12:38:15 2022 [notice] node003: Service etcd was restarted
6.4. Confirm and Promote the learner
Confirm this via etcdctl on a working etcd node:
[root@node001 ~]# etcdctl member list
10cee25dc156ff4a, started, node002, https://10.141.0.2:2380, https://10.141.0.2:2379, false
4a336cbcb0bafdc0, started, node001, https://10.141.0.1:2380, https://10.141.0.1:2379, false
690ed538336601f4, started, node003, https://10.141.0.3:2380, https://10.141.0.3:2379, true
The last line in the above output shows that node003 has been added, and the final boolean value “true” indicates that it is a learner. Now we can promote it:
[root@node001 ~]# etcdctl member promote 690ed538336601f4
Member 690ed538336601f4 promoted in cluster eef6e88516650e5b
[root@node001 ~]# etcdctl member list
10cee25dc156ff4a, started, node002, https://10.141.0.2:2380, https://10.141.0.2:2379, false
4a336cbcb0bafdc0, started, node001, https://10.141.0.1:2380, https://10.141.0.1:2379, false
690ed538336601f4, started, node003, https://10.141.0.3:2380, https://10.141.0.3:2379, false
After promoting, we see the learner flag has changed to “false”.
6.5. Move the node to the original Configuration Overlay
Using cmsh, remove the node from “kube-default-etcd-learners”, and add it to “kube-default-etcd”, then commit both.
[root@headnode ~]# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% use kube-default-etcd-learners
[headnode->configurationoverlay[kube-default-etcd-learners]]% removefrom nodes node003
[headnode->configurationoverlay*[kube-default-etcd-learners*]]% ..
[headnode->configurationoverlay*]% use kube-default-etcd
[headnode->configurationoverlay*[kube-default-etcd]]% append nodes node003
[headnode->configurationoverlay*[kube-default-etcd*]]% ..
[headnode->configurationoverlay*]% commit
Successfully committed 2 ConfigurationOverlays
[headnode->configurationoverlay]%
Tue Apr 5 12:46:02 2022 [notice] node003: Service etcd was restarted
This results in a restart once more, because the service is no longer started with the --initial-cluster-state=existing flag.
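To double-check, a quick look at the running process (a sketch, assuming SSH access to node003) should show that the flag is gone:
ssh node003 'ps -o args= -C etcd | grep -o -- "--initial-cluster-state=existing" || echo "flag not present"'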
7. Backup and Restore etcd with rsync
This is useful for backing up etcd members that are still online and scheduled to be replaced. A common scenario for this is a change in disk layout.
Do not follow these steps if you wish to keep the node up; use etcd’s snapshot functionality in that case (see section 4.). This method of backup and restore requires you to stop the etcd service first, since we do not want etcd writing to its spool directory while we create the backup.
There are two important directories:
- /var/lib/etcd (the “spool” directory, contains the data)
- /cm/local/apps/etcd/var/etc/ (contains config and certificates)
Please be aware that these are the default paths and that they can be changed within Bright Cluster Manager. The spool directory is the single most important one; Bright Cluster Manager will automatically re-create the config and certificates in case we do not back them up.
You can find the spool directory configured here:
[root@headnode ~]# cmsh
[headnode]% configurationoverlay
[headnode->configurationoverlay]% use kube-default-etcd
[headnode->configurationoverlay[kube-default-etcd]]% roles
[headnode->configurationoverlay[kube-default-etcd]->roles]% use etcd::host
[headnode->configurationoverlay[kube-default-etcd]->roles[Etcd::Host]]% show
Parameter Value
-------------------------------- ------------------------------------------------
Name Etcd::Host
Revision
Type EtcdHostRole
Add services yes
Member Certificate
Member Certificate Key
Provisioning associations <0 internally used>
Etcd Cluster kube-default
Member Name $hostname
Spool /var/lib/etcd
Listen Client URLs https://0.0.0.0:2379
Listen Peer URLs https://0.0.0.0:2380
Advertise Client URLs https://$ip:2379
Advertise Peer URLs https://$ip:2380
Snapshot Count 5000
Options
Debug no
Should the path be different from /var/lib/etcd, please substitute the correct path for the rest of this section.
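To fetch just this value non-interactively, something like the following should also work (a sketch, assuming cmsh’s -c batch mode):
[root@headnode ~]# cmsh -c "configurationoverlay; use kube-default-etcd; roles; use etcd::host; get spool"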
7.1. Backing up
- Follow section 5.1. to remove the node from the etcd configuration overlay. This ensures the service is stopped.
- SSH to the node and ensure that etcd has stopped:
systemctl status etcd
- Prepare a directory where we can store our backup, e.g. on mounted shared storage.
mkdir -p /cm/shared/etcd-backups/$(hostname)/{etcd,etc}
- Rsync the directories to this location.
rsync -raPv --delete /var/lib/etcd/ /cm/shared/etcd-backups/$(hostname)/etcd/
rsync -raPv --delete /cm/local/apps/etcd/var/etc/ /cm/shared/etcd-backups/$(hostname)/etc/
Now the etcd data should be saved in such a way that we can restore it later.
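As an optional sanity check (illustrative only, using the paths from the steps above), verify that the backup matches the source before making any changes to the node:
diff -r /var/lib/etcd/ /cm/shared/etcd-backups/$(hostname)/etcd/ && echo "spool backup matches"
diff -r /cm/local/apps/etcd/var/etc/ /cm/shared/etcd-backups/$(hostname)/etc/ && echo "config backup matches"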
7.2. Restoring
Assuming that the node is now back online, we want to restore the backup we made in section 7.1. before we re-assign the etcd role in cmsh.
- SSH to the node and execute the rsync into the other direction:
rsync -raPv --delete /cm/shared/etcd-backups/$(hostname)/etcd/ /var/lib/etcd/
rsync -raPv --delete /cm/shared/etcd-backups/$(hostname)/etc/ /cm/local/apps/etcd/var/etc/
- Go to cmsh and add the node back into the etcd configuration overlay:
[headnode->configurationoverlay[kube-default-etcd]]% append nodes node003
[headnode->configurationoverlay*[kube-default-etcd*]]% commit
Confirm that the member has been accepted into the cluster by inspecting its status:
[root@node001 ~]# module load etcd/kube-default/3.4.13
[root@node001 ~]# export ETCDCTL_ENDPOINTS=$(etcdctl member list | awk -F ',' '{print $5}' | sed 's/\s//' | paste -sd ",")
[root@node001 ~]# etcdctl endpoint status
https://10.141.0.2:2379, 10cee25dc156ff4a, 3.4.13, 4.8 MB, false, false, 10, 24644, 24644,
https://10.141.0.1:2379, 4a336cbcb0bafdc0, 3.4.13, 4.9 MB, true, false, 10, 24644, 24644,
https://10.141.0.3:2379, 690ed538336601f4, 3.4.13, 4.9 MB, false, false, 10, 24644, 24644,
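Finally, a health check across the same endpoints (using the ETCDCTL_ENDPOINTS variable set above) should report every member as healthy:
[root@node001 ~]# etcdctl -w table endpoint health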