ID #1255

How do I set up high-availability NFS with Bright?


Setting up an HA NFS storage


This article describes how to set up a high-availability NFS storage service using Bright.




The two NFS servers are directly attached to shared storage (direct-attached storage, DAS). Only one server may mount the storage at any one time, in order to avoid data corruption. The cluster nodes keep uninterrupted access to the shared storage as long as one of the two NFS servers is up.





  • You will need CMDaemon revision 29304 or later
  • For the DAS device, software RAID of any type is not supported and will most likely lead to data corruption



Create the HA group


[root@b70-c6 ~]# cmsh
[b70-c6]% partition failovergroups base
[b70-c6->partition[base]->failovergroups]% add storage
[b70-c6->partition*[base*]->failovergroups*[storage*]]% set nodes nfsserver1 nfsserver2
[b70-c6->partition*[base*]->failovergroups*[storage*]]% commit



Create the shared interface


The shared interface is used to mount the storage across the cluster. It is active only on the active server. The interface is an alias interface, which means its name can be arbitrary. A sensible alias for eth0 is eth0:storage. It is created on both NFS servers, with a virtual shared IP address, like this:


[root@b70-c6 ~]# cmsh
[b70-c6]% device interfaces nfsserver1
[b70-c6->device[nfsserver1]->interfaces]% add alias eth0:storage
[b70-c6->device*[nfsserver1*]->interfaces*[eth0:storage*]]% set network internalnet
[b70-c6->device*[nfsserver1*]->interfaces*[eth0:storage*]]% set ip
[b70-c6->device*[nfsserver1*]->interfaces*[eth0:storage*]]% set startif active
[b70-c6->device*[nfsserver1*]->interfaces*[eth0:storage*]]% commit
[b70-c6]% device interfaces nfsserver2
[b70-c6->device[nfsserver2]->interfaces]% add alias eth0:storage
[b70-c6->device*[nfsserver2*]->interfaces*[eth0:storage*]]% set network internalnet
[b70-c6->device*[nfsserver2*]->interfaces*[eth0:storage*]]% set ip
[b70-c6->device*[nfsserver2*]->interfaces*[eth0:storage*]]% set startif active
[b70-c6->device*[nfsserver2*]->interfaces*[eth0:storage*]]% commit



Automatic Failover

By default automatic failover is disabled for HA groups. It can be enabled as follows:


[b70-c6->partition[base]->failovergroups[storage]]% set automaticfailoveraftergracefulshutdown yes
[b70-c6->partition*[base*]->failovergroups*[storage*]]% commit



Automatic Failover Dead time

This is the time after which the automatic failover should be triggered. By default it’s set to a very small value (10 seconds). It would be better to increase this dead time to avoid triggering any unnecessary failover events.


[b70-c6->partition[base]->failovergroups*[storage*]]% set deadtime 300
[b70-c6->partition*[base*]->failovergroups*[storage*]]% commit
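The effect of the dead time can be illustrated with a small Python sketch. It assumes, as a simplification, that a server counts as dead once nothing has been heard from it for longer than the dead time. CMDaemon's real detection logic is not shown in this article and may differ, and the function name and timestamps are hypothetical.

```python
def failover_due(last_heartbeat, now, deadtime):
    """Return True once the active server has been silent for more than
    `deadtime` seconds. Illustrative only, not CMDaemon's implementation."""
    return (now - last_heartbeat) > deadtime


# A 45-second network hiccup: last heartbeat seen at t=0, current time t=45.
print(failover_due(0, 45, deadtime=10))    # True: the default would fail over
print(failover_due(0, 45, deadtime=300))   # False: the longer dead time rides it out
```

Under this assumption a short interruption triggers a failover with the 10-second default, but not with a dead time of 300 seconds.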


Exclude the DAS filesystem mountpoints


You will need to exclude the filesystems on the DAS from provisioning operations in order to prevent accidental data loss:


This can be done by modifying the category exclude list:


[bright1->category[nfs]]% set excludelistupdate
[bright1->category[nfs]]% commit




  • Exclude list for updates and SYNC install:
- /cmshared/*
- /home/*
- /apps/*
no-new-files: - /cmshared/*
no-new-files: - /home/*
no-new-files: - /apps/*
  • Exclude list for GRAB and GRABNEW:
- /cmshared/*
- /home/*
- /apps/*  
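When the DAS holds more than a few filesystems, the exclude list entries can be generated rather than typed. The following Python sketch only builds the text lines in the format shown above. The function name is hypothetical, and the output still has to be pasted into the exclude list through cmsh.

```python
def exclude_entries(mountpoints, no_new_files=True):
    """Build exclude-list lines for the given DAS mount points.

    With no_new_files=True the matching 'no-new-files:' lines used in the
    update/SYNC list are appended; with False only the plain lines are
    returned, as used in the GRAB lists.
    """
    cleaned = [mp.rstrip("/") for mp in mountpoints]
    lines = ["- %s/*" % mp for mp in cleaned]
    if no_new_files:
        lines += ["no-new-files: - %s/*" % mp for mp in cleaned]
    return lines


for line in exclude_entries(["/cmshared", "/home", "/apps"]):
    print(line)
```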


Preventing stale file handles during failover


To prevent stale file handles, the NFS pipe filesystem should reside on the shared storage. This way, when the passive NFS server takes over, it is able to maintain the current state. To achieve that, move /var/lib/nfs to the shared storage and create a symbolic link as follows:


cd /cmshared/
mkdir varlibnfs
cp -a /var/lib/nfs/. varlibnfs/
cd /var/lib
mv nfs nfs.OLD
ln -s /cmshared/varlibnfs nfs


Since /cmshared is excluded from provisioning operations, this change will require full provisioning.
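A quick way to confirm the result on each NFS server is to check that /var/lib/nfs really is a symbolic link into the shared storage. The Python sketch below does only that. The function name is hypothetical, and the default paths are the ones used in this article.

```python
import os


def nfs_state_on_shared_storage(state_dir="/var/lib/nfs", shared_root="/cmshared"):
    """Return True if state_dir is a symlink resolving to a path under shared_root."""
    if not os.path.islink(state_dir):
        return False
    target = os.path.realpath(state_dir)
    root = os.path.realpath(shared_root)
    return target == root or target.startswith(root + os.sep)
```

Calling `nfs_state_on_shared_storage()` on the active server, while /cmshared is mounted, should return True once the steps above are done.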


Other considerations


  • For NFSv4 the fsid needs to be set for the root partition. /etc/exports should look like:
/cmshared                     ,fsid=2,no_root_squash,async),fsid=2,no_root_squash,async)
/home                         ,fsid=3,no_root_squash,async),fsid=3,no_root_squash,async)
/apps                         ,fsid=4,no_root_squash,async),fsid=4,no_root_squash,async)
/cm                           ,fsid=1,no_root_squash,async),fsid=1,no_root_squash,async)
/                             ,fsid=0,no_root_squash,async),fsid=0,no_root_squash,async)
  • Do not assign the storage role to HA NFS servers as it can lead to race conditions with respect to the NFS services
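Since each exported filesystem needs its own fsid, a small consistency check on /etc/exports can catch copy-and-paste mistakes before a failover does. The Python sketch below is an illustration under simplifying assumptions: it does not handle continuation lines or quoted paths, and the function names are hypothetical.

```python
import re


def export_fsids(exports_text):
    """Map each exported path to the set of fsid values found on its line."""
    result = {}
    for line in exports_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        path = line.split()[0]
        result[path] = set(re.findall(r"fsid=([^,)\s]+)", line))
    return result


def check_fsids(exports_text):
    """Return a list of problems: a missing fsid, conflicting fsid values on
    one line, or the same fsid used for two different paths."""
    problems = []
    seen = {}
    for path, fsids in sorted(export_fsids(exports_text).items()):
        if not fsids:
            problems.append("%s: no fsid set" % path)
        elif len(fsids) > 1:
            problems.append("%s: conflicting fsid values" % path)
        else:
            fsid = next(iter(fsids))
            if fsid in seen:
                problems.append("%s: fsid=%s already used by %s" % (path, fsid, seen[fsid]))
            else:
                seen[fsid] = path
    return problems
```

Running `check_fsids(open('/etc/exports').read())` on both NFS servers should return an empty list, and both servers should report the same fsid for each path.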


Generic failover script


The generic failover script listed here is a “common” script that the administrator writes and places somewhere convenient. Keeping the shared logic in one generic script makes it easier to maintain. The generic failover script is called from the user-defined failover scripts. The user-defined failover scripts are composed of commands the administrator wants to run during the failover stages, and their locations are defined in CMDaemon. These locations can be set via cmsh, for example:


[b70-c6->partition[base]->failover]% set prefailoverscript /root/myprefscript



The user-defined failover scripts that can use the generic failover script are:

  • prefailoverscript
  • postfailoverscript
  • mountscript
  • unmountscript


The generic failover script reads an argument from one of the user-defined scripts that call it, and then executes accordingly. A generic failover script listing is:



import os, sys

# check the command line argument:
if (len(sys.argv) != 2) or (sys.argv[1] not in ('mount', 'umount', 'pre', 'post')):
  print "Provide 1 argument: mount, umount, pre, post"
  sys.exit(1)

import pythoncm

# Connect to the head node with admin certificates
clustermanager = pythoncm.ClusterManager()
cluster = clustermanager.addCluster('https://master:8081', '/cm/local/apps/cmd/etc/cert.pem', '/cm/local/apps/cmd/etc/cert.key')

# if connection fails exit
if not cluster.connect():
  print "Unable to connect"
  print cluster.getLastError()
  sys.exit(1)

# Find your own node configuration
hostname_pipe = os.popen('hostname')
#hostname = os.getenv('CMD_HOSTNAME')
hostname = hostname_pipe.read()
hostname = hostname.replace('\n', '').replace('\r', '')
print len(hostname)
node = cluster.find(hostname)
if not node:
  print "Unable to find node %s" % hostname
  sys.exit(1)

# Read current state (strip the trailing newline so the comparisons below match)
f = open('/var/spool/cmd/state')
state = f.readline().strip()
f.close()

print "%s for %s, state = %s" % (sys.argv[0], hostname, state)

# remove CMDaemon fsmounts
def removeFSMount(mountpoint):
  print "Remove mount:", mountpoint
  node.fsmounts = [x for x in node.fsmounts if x.mountpoint != mountpoint]

# add CMDaemon fsmounts: if the mountoptions is not passed, use defaults
def addFSMount(mountpoint, device, filesystem, mountoptions = 'defaults'):
  find = sum([1 for x in node.fsmounts if x.mountpoint == mountpoint])
  if find == 0:  # duplicate check
    print "Adding mount:", mountpoint
    fsmount = pythoncm.FSMount()
    fsmount.mountpoint = mountpoint
    fsmount.device = device
    fsmount.filesystem = filesystem
    fsmount.mountoptions = mountoptions
    node.fsmounts += [fsmount]         # append doesn't work for pythoncm objects.list
  else:
    print "Duplicate mount:", mountpoint

if sys.argv[1] == "pre":
  if state == 'SLAVEACTIVE':
     print "SLAVEACTIVE"
  if state == 'SLAVEPASSIVE':
     print "SLAVEPASSIVE"
elif sys.argv[1] == "post":
  if state == 'SLAVEACTIVE':
     print "SLAVEACTIVE"
     # make sure that the mount points are removed from the passive server
     # (nfs01 and nfs02 are the two NFS servers; adjust to your hostnames)
     if hostname=='nfs01':
        print "nfs01"
        node = cluster.find("nfs02")
     else:
        print "nfs02"
        node = cluster.find("nfs01")
     # node now refers to the passive server; drop its DAS mounts
     removeFSMount('/cmshared')
     removeFSMount('/home')
     removeFSMount('/apps')
  if state == 'SLAVEPASSIVE':
     print "SLAVEPASSIVE"

# Add extra mount / exports here
elif sys.argv[1] == "mount":
  addFSMount('/cmshared', '/dev/mapper/mpathd1', 'xfs')
  addFSMount('/home', '/dev/mapper/mpathd3', 'xfs')
  addFSMount('/apps', '/dev/mapper/mpathd2', 'xfs')

# remove extra mount / exports here
elif sys.argv[1] == "umount":
  removeFSMount('/cmshared')
  removeFSMount('/home')
  removeFSMount('/apps')

# All operations done: commit then exit
exit_code = 0
c = node.commit()
if not c.result:
  print "Commit of %s failed:" % node.resolveName()
  for j in range(c.count):
    print c.getValidation(j).msg
  exit_code = 1
else:
  print "Committed: %s" % node.resolveName()
sys.exit(exit_code)
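The duplicate check and the list-rebuild approach used by the two fsmount helpers can be tried without a cluster connection. The Python sketch below mirrors that logic on plain objects. FakeMount and FakeNode are hypothetical stand-ins invented for this illustration, not pythoncm classes, and the helpers take the node as an argument instead of using a global.

```python
class FakeMount(object):
    """Stand-in for a mount entry; holds the same four fields the script sets."""
    def __init__(self, mountpoint, device, filesystem, mountoptions="defaults"):
        self.mountpoint = mountpoint
        self.device = device
        self.filesystem = filesystem
        self.mountoptions = mountoptions


class FakeNode(object):
    """Stand-in for a node object; only carries an fsmounts list."""
    def __init__(self):
        self.fsmounts = []


def add_fsmount(node, mountpoint, device, filesystem, mountoptions="defaults"):
    """Add a mount unless one with the same mount point exists. Returns True if added."""
    if any(m.mountpoint == mountpoint for m in node.fsmounts):
        return False
    # Assign a new list instead of appending, as the script above does.
    node.fsmounts = node.fsmounts + [FakeMount(mountpoint, device, filesystem, mountoptions)]
    return True


def remove_fsmount(node, mountpoint):
    """Drop every mount whose mount point matches."""
    node.fsmounts = [m for m in node.fsmounts if m.mountpoint != mountpoint]
```

Because adding is skipped for an existing mount point and removing a missing one is a no-op, calling the generic script twice with the same argument leaves the node configuration unchanged.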



Pre-failover script



#!/bin/bash
if [ "$(cat /var/spool/cmd/state)" == "SLAVEACTIVE" ]
then
        echo "ACTIVE"
        /cm/local/scripts/ pre
else
        echo "PASSIVE"
fi


Mount Script



#!/bin/bash
# make sure that the NFS service is stopped
service nfs stop

# add the required export points and mount the storage
/cm/local/scripts/ mount

# check the available mounts
mounts=$(grep "/dev/mapper/mpathd1" /proc/mounts | awk '{print $1}')

for i in {1..10}
do
        # check if the mount is available.
        if [ "$mounts" == "/dev/mapper/mpathd1" ]
        then
                service nfs restart
                exportfs -a
                exit 0
        else
                echo "Mount point not available? retrying"
                sleep 5
        fi
        mounts=$(grep "/dev/mapper/mpathd1" /proc/mounts | awk '{print $1}')
done
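The wait-and-retry logic of the mount script can also be expressed in Python, which may be convenient if the check is folded into the generic failover script. This is a sketch under that assumption. The function names are hypothetical, and the defaults of 10 attempts and a 5-second delay are taken from the shell loop above. It compares the device field exactly, so /dev/mapper/mpathd1 does not match an entry for /dev/mapper/mpathd10.

```python
import time


def device_mounted(device, mounts_file="/proc/mounts"):
    """Return True if `device` appears as the first field of a line in mounts_file."""
    with open(mounts_file) as f:
        return any(line.split()[:1] == [device] for line in f)


def wait_for_device(device, attempts=10, delay=5, mounts_file="/proc/mounts"):
    """Poll until the device is mounted; return False if it never shows up."""
    for _ in range(attempts):
        if device_mounted(device, mounts_file):
            return True
        time.sleep(delay)
    return False
```

A caller would restart the NFS service and run `exportfs -a` only after `wait_for_device('/dev/mapper/mpathd1')` returns True.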



Unmount Script



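#!/bin/bash
# fuser lives in a different directory on SUSE (assumed paths; adjust if needed)
if [ -e /etc/SuSE-release ]; then
  FUSER=/bin/fuser
else
  FUSER=/sbin/fuser
fi
MAXRETRIES=10   # assumed value; maximum unmount attempts per mount point

# unmount the storage
doUmount() {
  retries=0
  while [ $retries -lt $MAXRETRIES ]; do
    if grep -q " $1 " /proc/mounts; then
      echo "($(date)) Unmount $1 ($retries)"
      $FUSER -mk $1
      /bin/umount $1
      sleep 1
      let retries=${retries}+1
    else
      let retries=${MAXRETRIES}
    fi
  done
}

# check if the mount point is still available
mountCheck() {
  if grep -q " $1 " /proc/mounts; then
    echo "Still mounted: $1"
    exit 1
  fi
}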

# stop services offering access to the mount point and call the generic script
# with the umount option
service nfs stop
service rpcidmapd stop
service smb stop
$(dirname $0)/ umount
doUmount /cmshared/varlibnfs/rpc_pipefs
doUmount /cmshared
doUmount /home
doUmount /apps
mountCheck /cmshared
mountCheck /home
mountCheck /apps
echo "remove fsmounts"



Post-failover script



#!/bin/bash
if [ "$(cat /var/spool/cmd/state)" == "SLAVEACTIVE" ]
then
        echo "ACTIVE"
        /cm/local/scripts/ post
else
        echo "PASSIVE"
fi
