How do I set up high-availability NFS with Bright?

This article describes how to set up a high-availability NFS storage service using Bright.

Architecture

The two NFS servers are directly attached to shared storage (a DAS device). Only one server should mount the storage at any one time, in order to avoid data corruption. The compute nodes should have uninterrupted access to the shared storage as long as one of the two NFS servers is up.
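
Before continuing, it is worth confirming that both NFS servers actually see the same DAS LUN(s). A minimal check, with hypothetical device names:

# run on both nfsserver1 and nfsserver2; the same LUN(s) should show up on each
lsblk
multipath -ll    # if the DAS is accessed via multipath, as in the scripts further below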

Prerequisites

  • You will need CMDaemon revision 29304 or later (a quick way to check the installed revision is shown below)
  • For the DAS device, any type of software RAID is not supported and will most likely lead to data corruption
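
The CMDaemon revision can be checked by querying the cmdaemon package; the exact format of the version string depends on the Bright version in use:

# the revision number is part of the package release field
rpm -q cmdaemon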

Create the HA group

[root@b70-c6 ~]# cmsh
[b70-c6]% partition failovergroups base
[b70-c6->partition[base]->failovergroups]% add storage
[b70-c6->partition*[base*]->failovergroups*[storage*]]% set nodes nfsserver1 nfsserver2
[b70-c6->partition*[base*]->failovergroups*[storage*]]% commit

Create the shared interface

The shared interface will be used to mount the shared storage across the cluster. This shared interface is active only on the active server. The interface is an alias interface, which means the name can be arbitrary. A sensible alias for eth0 is eth0:storage, and it is created on both NFS servers with the same virtual shared IP address, like this:

[root@b70-c6 ~]# cmsh
[b70-c6]% device interfaces nfsserver1
[b70-c6->device[nfsserver1]->interfaces]% add alias eth0:storage
[b70-c6->device*[nfsserver1*]->interfaces*[eth0:storage*]]% set network internalnet
[b70-c6->device*[nfsserver1*]->interfaces*[eth0:storage*]]% set ip 10.141.255.230
[b70-c6->device*[nfsserver1*]->interfaces*[eth0:storage*]]% set startif active
[b70-c6->device*[nfsserver1*]->interfaces*[eth0:storage*]]% commit
 
[b70-c6]% device interfaces nfsserver2
[b70-c6->device[nfsserver2]->interfaces]% add alias eth0:storage
[b70-c6->device*[nfsserver2*]->interfaces*[eth0:storage*]]% set network internalnet
[b70-c6->device*[nfsserver2*]->interfaces*[eth0:storage*]]% set ip 10.141.255.230
[b70-c6->device*[nfsserver2*]->interfaces*[eth0:storage*]]% set startif active
[b70-c6->device*[nfsserver2*]->interfaces*[eth0:storage*]]% commit
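
After committing, the shared IP should be reachable whenever one of the two NFS servers is active, and the alias should only be up on the active server. A minimal check, assuming the address and interface names used above:

# from the head node: the shared IP answers as long as one NFS server is active
ping -c 1 10.141.255.230

# on the active NFS server: the eth0:storage alias carries the shared IP
ip addr show dev eth0 | grep 10.141.255.230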

Automatic Failover

By default automatic failover is disabled for HA groups. It can be enabled as follows:

[b70-c6->partition[base]->failovergroups[storage]]% set automaticfailoveraftergracefulshutdown yes
[b70-c6->partition*[base*]->failovergroups*[storage*]]% commit

Automatic Failover Dead time

The dead time is the period after which an automatic failover is triggered. By default it is set to a very small value (10 seconds). It is better to increase the dead time in order to avoid triggering unnecessary failover events:

[b70-c6->partition[base]->failovergroups*[storage*]]% set deadtime 300
[b70-c6->partition*[base*]->failovergroups*[storage*]]% commit

Exclude the DAS filesystem mountpoints

You will need to exclude the filesystems on the DAS from provisioning operations in order to prevent accidental data loss. This can be done by modifying the category exclude lists:

[bright1->category[nfs]]% set excludelistupdate
[bright1->category[nfs]]% commit

The same needs to be done for the other exclude lists. The entries to add are:

  • Exclude list for updates and SYNC install:
- /cmshared/*
- /home/*
- /apps/*
no-new-files: - /cmshared/*
no-new-files: - /home/*
no-new-files: - /apps/*
  • Exclude list for GRAB and GRAB NEW:
- /cmshared/*
- /home/*
- /apps/*  
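
To confirm that the changes took effect, the modified lists can be read back from the nfs category, for example (using the property name from the cmsh session above):

# run on the head node; prints the current exclude list used for updates
cmsh -c "category use nfs; get excludelistupdate"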

Preventing stale file handles during failover

To prevent stale file handles, the NFS state directory (which includes the NFS pipe filesystem) should reside on the shared storage. This way, when the passive NFS server takes over, it will be able to maintain the current state. To achieve that, move /var/lib/nfs to the shared storage and create a symbolic link as follows:

# on the active NFS server, with the shared storage mounted on /cmshared
cd /cmshared/
mkdir varlibnfs
cp -a /var/lib/nfs/. varlibnfs/    # copy the contents of /var/lib/nfs into /cmshared/varlibnfs
cd /var/lib
mv nfs nfs.OLD                     # keep the original directory as a backup
ln -s /cmshared/varlibnfs nfs

Since /cmshared is excluded from provisioning operations, this change will require full provisioning.
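
A minimal check that the relocation worked, assuming the shared storage is mounted on /cmshared as above:

# /var/lib/nfs should now be a symbolic link to the shared location
ls -ld /var/lib/nfs

# the NFS state files should live under /cmshared/varlibnfs
ls /cmshared/varlibnfs

# rpc_pipefs, if mounted, should now resolve to /cmshared/varlibnfs/rpc_pipefs
grep rpc_pipefs /proc/mounts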

Other considerations

  • For NFSv4 the fsid option needs to be set, with fsid=0 for the root export (a quick check from a compute node is sketched after this list). /etc/exports should look like:
/cmshared                               172.19.0.0/20(rw,fsid=2,no_root_squash,async) 172.18.0.0/20(rw,fsid=2,no_root_squash,async)
/home                                   172.19.0.0/20(rw,fsid=3,no_root_squash,async) 172.18.0.0/20(rw,fsid=3,no_root_squash,async)
/apps                                   172.19.0.0/20(rw,fsid=4,no_root_squash,async) 172.18.0.0/20(rw,fsid=4,no_root_squash,async)
/cm                                     172.19.0.0/20(rw,fsid=1,no_root_squash,async) 172.18.0.0/20(rw,fsid=1,no_root_squash,async)
/                                       172.19.0.0/20(ro,fsid=0,no_root_squash,async) 172.18.0.0/20(ro,fsid=0,no_root_squash,async)
  • Do not assign the storage role to the HA NFS servers, as it can lead to race conditions with respect to the NFS services.
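
As a quick check from a compute node, the exports can be listed against the shared IP. The example assumes the shared address 10.141.255.230 used earlier; the allowed networks in /etc/exports must of course match your own:

# list the filesystems exported by the currently active NFS server
showmount -e 10.141.255.230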

Generic failover script

The generic failover script listed here is a common script that is written once and placed somewhere convenient. Using one generic failover script means it is easier to maintain, and kept in one place. The generic failover script can be called from the user-defined failover scripts. The user-defined failover scripts are composed of the commands the administrator wants to run during the failover stages, and their locations are defined in CMDaemon. These locations can be set via cmsh, for example:

[b70-c6->partition[base]->failover]% set prefailoverscript /root/myprefscript

The user-defined failover scripts that can use the generic failover script are:

  • prefailoverscript
  • postfailoverscript
  • mountscript
  • unmountscript
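
All four locations can also be set non-interactively in one go. The sketch below mirrors the cmsh example above; the script paths are hypothetical and should point at the wrapper scripts listed in the following sections:

# hypothetical paths; adjust them to wherever the wrapper scripts are installed
cmsh -c "partition failover base; \
  set prefailoverscript /cm/local/scripts/pre-failover.sh; \
  set postfailoverscript /cm/local/scripts/post-failover.sh; \
  set mountscript /cm/local/scripts/mount.sh; \
  set unmountscript /cm/local/scripts/unmount.sh; \
  commit"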

The generic failover script reads an argument from one of the user-defined scripts that call it, and then executes accordingly. A generic failover script listing is:

#!/usr/bin/python

import os, sys

# check the command line argument:
if (len(sys.argv) != 2) or (sys.argv[1] not in ('mount', 'umount', 'pre', 'post')):
  print "Provide 1 argument: mount, umount, pre, post"
  sys.exit(1)

import pythoncm

# Connect to the head node with admin certificates
clustermanager = pythoncm.ClusterManager()
cluster = clustermanager.addCluster('https://master:8081', '/cm/local/apps/cmd/etc/cert.pem', '/cm/local/apps/cmd/etc/cert.key')

# if connection fails exit
if not cluster.connect():
  print "Unable to connect"
  print cluster.getLastError()
  sys.exit(1)

# Find your own node configuration
hostname_pipe = os.popen('hostname')
#hostname = os.getenv('CMD_HOSTNAME')
hostname = hostname_pipe.read()
hostname = hostname.replace('\n', '').replace('\r', '')
node = cluster.find(hostname)
if not node:
  print "Unable to find node %s" % hostname
  sys.exit(1)

# Read current state
f = open('/var/spool/cmd/state')
state = f.readline().strip()   # strip the trailing newline so the state comparisons below match
f.close()

print "%s for %s, state = %s" % (sys.argv[0], hostname, state)

# remove CMDaemon fsmounts
def removeFSMount(mountpoint):
  print "Remove mount:", mountpoint
  node.fsmounts = [x for x in node.fsmounts if x.mountpoint != mountpoint]

# add CMDaemon fsmounts: if the mountoptions is not passed, use defaults
def addFSMount(mountpoint, device, filesystem, mountoptions = 'defaults'):
  find = sum([1 for x in node.fsmounts if x.mountpoint == mountpoint])
  if find == 0:  # duplicate check
    print "Adding mount:", mountpoint
    fsmount = pythoncm.FSMount()
    fsmount.mountpoint = mountpoint
    fsmount.device = device
    fsmount.filesystem = filesystem
    fsmount.mountoptions = mountoptions
    node.fsmounts += [fsmount]         # append doesn't work for pythoncm objects.list
  else:
    print "Duplicate mount:", mountpoint

if sys.argv[1] == "pre":
  if state == 'SLAVEACTIVE':
     print "SLAVEACTIVE";
  if state == 'SLAVEPASSIVE':
     print "SLAVEPASSIVE";
elif sys.argv[1] == "post":
  if state == 'SLAVEACTIVE':
     print "SLAVEACTIVE";
     # make sure that the mount points are removed from the passive server
     if hostname=='nfs01':
        print "nfs01"
        node = cluster.find("nfs02")
        removeFSMount('/home')
        removeFSMount('/cmshared')
        removeFSMount('/apps')
     else:
        print "nfs02"
        node = cluster.find("nfs01")
        removeFSMount('/home')
        removeFSMount('/cmshared')
        removeFSMount('/apps')
  if state == 'SLAVEPASSIVE':
     print "SLAVEPASSIVE";

# Add extra mount / exports here
elif sys.argv[1] == "mount":
  addFSMount('/cmshared', '/dev/mapper/mpathd1', 'xfs')
  addFSMount('/home', '/dev/mapper/mpathd3', 'xfs')
  addFSMount('/apps', '/dev/mapper/mpathd2', 'xfs')

# remove extra mount / exports here
elif sys.argv[1] == "umount":
  removeFSMount('/cmshared')
  removeFSMount('/home')
  removeFSMount('/apps')

# All operations done: commit then exit
exit = 0
c = node.commit()
if not c.result:
  print "Commit of %s failed:" % node.resolveName()
  for j in range(c.count):
    print c.getValidation(j).msg
  exit = 1
else:
  print "Committed: %s" % node.resolveName()

cluster.disconnect()
sys.exit(exit)
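
The script can also be run by hand on the active NFS server to verify that it behaves as expected, before wiring it into the failover scripts. The example below assumes it is installed as /cm/local/scripts/generic-nfsserver-ha.py, which is the path the wrapper scripts in the following sections use; note that committing the fsmounts makes CMDaemon act on them:

# add the fsmounts for the shared filesystems on this (active) server
/cm/local/scripts/generic-nfsserver-ha.py mount

# inspect the resulting fsmounts for the node
cmsh -c "device use nfsserver1; fsmounts; list"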

Pre-failover script

#!/bin/bash

if [ "$(cat /var/spool/cmd/state)" == "SLAVEACTIVE" ]
then
        echo "ACTIVE"
        /cm/local/scripts/generic-nfsserver-ha.py pre
else
        echo "PASSIVE"
fi

Mount Script

#!/bin/bash

# make sure that the NFS service is stopped
service nfs stop

# add the required export points and mount the storage
/cm/local/scripts/generic-nfsserver-ha.py mount

# check the available mounts
mounts=`cat /proc/mounts | grep "/dev/mapper/mpathd1" | awk -F" " '{print $1}'`

for i in {1..10}
do
        # check if the mount is available.
        if [ "$mounts" == "/dev/mapper/mpathd1" ]
        then
                service nfs restart
                exportfs -a
                exit 0;
        else
                echo "Mount point not available yet, retrying"
                sleep 5
        fi
        mounts=`cat /proc/mounts | grep "/dev/mapper/mpathd1" | awk -F" " '{print $1}'`
done

# the storage never became available: exit with an error so the problem is visible
echo "Mount point not available, giving up"
exit 1
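
After the mount script has run on the newly active server, a quick check (device names as in the script above) is:

# the NFS exports should be active again
exportfs -v

# the DAS partitions should be mounted
grep mpathd /proc/mounts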

Unmount Script

#!/bin/bash

NFS=nfs
FUSER=/sbin/fuser
if [ -e /etc/SuSE-release ]; then
  NFS=nfsserver
  FUSER=/bin/fuser
fi

MAXRETRIES=50

# unmount the storage
doUmount() {
  retries=1
  while [ $retries -lt $MAXRETRIES ]; do
    if grep -q " $1 " /proc/mounts; then
      echo "(`date`) Unmount $1 ($retries)"
      $FUSER -mk $1
      /bin/umount $1
      sleep 1
      let retries=${retries}+1
    else
      let retries=${MAXRETRIES}
    fi
  done
}

# check if the mount point is still available
mountCheck(){
  if grep -q " $1 " /proc/mounts; then
    echo "Still mounted: $1"
    exit 1
  fi
}


# stop services offering access to the mount point and call the generic script
# with the umount option
service $NFS stop
service rpcidmapd stop
service smb stop
$(dirname $0)/generic-nfsserver-ha.py umount
doUmount /cmshared/varlibnfs/rpc_pipefs
doUmount /cmshared
doUmount /home
doUmount /apps
mountCheck /cmshared
mountCheck /home
mountCheck /apps
echo "remove fsmounts"

Post-failover script

#!/bin/bash

if [ "$(cat /var/spool/cmd/state)" == "SLAVEACTIVE" ]
then
        echo "ACTIVE"
        /cm/local/scripts/generic-nfsserver-ha.py post
else
        echo "PASSIVE"
fi
Updated on October 27, 2020
