
How do I create a multi-tenant user environment?

By default, a Bright cluster provides a multi-user environment rather than a multi-tenant environment. In the context of a cluster, the difference between multi-user and multi-tenant systems comes down to visibility of the environment.

In a multi-user system, all users log in to the same set of login nodes and submit their jobs to the same workload management system. Users may have access to directories containing shared data. Multi-user systems tend to work well when all users belong to the same organization and there is no need for isolating groups of people within the organization from each other.

In a multi-tenant system, users may come from different organizations, or from different groups within an organization, that should be isolated from each other. Users may even come from organizations that are direct competitors. In such a scenario, it is important that users who belong to one tenant have no visibility into what users from another tenant are doing on the system. Another use case where a multi-tenant system is useful is when certain types of workload (e.g. classified versus non-classified workloads) must be kept strictly isolated from each other.

In a multi-user system, all computational resources are typically shared through a single workload management system instance. In a multi-tenant system, computational resources are typically partitioned and each partition is dedicated to one particular tenant. Users typically belong to a single tenant (although it is possible that users may belong to multiple tenants). Each partition of the cluster typically runs its own workload management system instance so that users have no visibility into jobs that are being run by other tenants.

Bright Cluster Manager can be used to build a multi-tenant user environment where a single administrator or group of administrators manages the entire cluster, but where each user only has visibility into what happens within the partition of the cluster that belongs to their tenant. Administrators can scale individual partitions up or down by assigning or removing computational resources.

If more isolation is required, Bright Cluster Manager provides Cluster-on-Demand features that allow groups of users (i.e. tenants) to be given their own isolated Bright cluster. Such a cluster can be hosted in AWS, Azure, OpenStack or VMware (as of Bright version 9.1). For more information, please consult the latest Bright Cloudbursting Manual from the documentation page. The rest of this article will focus on creating a multi-tenant user environment within a single Bright cluster.

Bright Cluster Manager 9.0 has the ability to run multiple workload management system instances within the same cluster. To create a new workload management system instance, the cm-wlm-setup utility can be used. Each workload management instance will have a unique name. A number of configuration overlays will be created for each workload management system instance. A configuration overlay is a construct in Bright Cluster Manager that binds roles to individual nodes or categories of nodes. A role causes a node to fulfill a particular task in the cluster (e.g. provisioning node, monitoring node, workload management system client or server). For example:

[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% device list -f hostname,category,status
hostname (key)       category             status              
-------------------- -------------------- --------------------
fire                 login                [   UP   ]
mdv-bigcluster                            [   UP   ]          
node001              default              [   UP   ]          
node002              default              [   UP   ]          
node003              default              [   UP   ]          
node004              default              [   UP   ]          
node005              default              [   UP   ]          
water                login                [   UP   ]

[mdv-bigcluster]% configurationoverlay 
[mdv-bigcluster->configurationoverlay]% list
Name (key)          Priority   All head nodes Nodes                  Categories       Roles           
------------------- ---------- -------------- ---------------------- ---------------- ----------------
slurm-accounting    500        yes                                                    slurmaccounting 
slurm-fire-client   500        no             node003..node005                        slurmclient     
slurm-fire-server   500        no             fire                                    slurmserver     
slurm-fire-submit   500        no             fire,node003..node005                   slurmsubmit     
slurm-water-client  500        no             node001,node002                         slurmclient     
slurm-water-server  500        no             water                                   slurmserver     
slurm-water-submit  500        no             water,node001,node002                   slurmsubmit  

Rather than having all users use the same set of login nodes (as with a multi-user system), or having users log into the head node, in a multi-tenant system it is advisable to create a dedicated login node for each partition.

As can be seen above, we have defined two partitions called water and fire. Each partition has its own login node (namely nodes with the names water and fire).
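
If dedicated login nodes do not yet exist, one way to set them up is to clone an existing category and assign each login node to it. The following is a minimal cmsh sketch, assuming the nodes fire and water have already been added to the cluster:

[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% category
[mdv-bigcluster->category]% clone default login
[mdv-bigcluster->category*[login*]]% commit
[mdv-bigcluster->category[login]]% device
[mdv-bigcluster->device]% use fire
[mdv-bigcluster->device[fire]]% set category login
[mdv-bigcluster->device*[fire*]]% commit

The same steps are then repeated for the water node.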

When creating user accounts, it is a good idea to make users a member of a tenant-specific group.

[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% user list
Name (key)       ID (key)         Primary group    Secondary groups
---------------- ---------------- ---------------- ----------------
alice            1001             alice            water                
bob              1002             bob              water                
charlie          1003             charlie          fire                                        
donna            1004             donna            fire                
ernie            1005             ernie            fire                
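
Once committed, the membership can be verified from any node with the standard id command (the GID shown for the water group here is purely illustrative):

[root@water ~]# id alice
uid=1001(alice) gid=1001(alice) groups=1001(alice),1100(water)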

This group membership can then be used to restrict access to tenant-specific login nodes. For example, to restrict access to the water and fire login nodes, node-specific copies of the PAM configuration and of /etc/security/access.conf can be placed in the software image:

# cd /cm/images/default-image
# mkdir -p cm/conf/node/{water,fire}/etc/security
# mkdir -p cm/conf/node/{water,fire}/etc/pam.d
# cp etc/pam.d/{system-auth,password-auth} cm/conf/node/water/etc/pam.d
# cp etc/pam.d/{system-auth,password-auth} cm/conf/node/fire/etc/pam.d
# cp etc/security/access.conf cm/conf/node/fire/etc/security
# cp etc/security/access.conf cm/conf/node/water/etc/security
# cd cm/conf/node
# echo +:water:ALL  >> water/etc/security/access.conf 
# echo +:root:ALL  >> water/etc/security/access.conf 
# echo -:ALL:ALL  >> water/etc/security/access.conf
# echo +:fire:ALL  >> fire/etc/security/access.conf 
# echo +:root:ALL  >> fire/etc/security/access.conf 
# echo -:ALL:ALL  >> fire/etc/security/access.conf
# echo account required pam_access.so >> fire/etc/pam.d/system-auth
# echo account required pam_access.so >> water/etc/pam.d/system-auth
# echo account required pam_access.so >> fire/etc/pam.d/password-auth
# echo account required pam_access.so >> water/etc/pam.d/password-auth
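
Because these files live under the per-node configuration directory in the software image (cm/conf/node/<hostname>), each login node only picks up its own copies. Still working from the cm/conf/node directory, the appended rules for the water login node can be checked as follows:

# tail -3 water/etc/security/access.conf
+:water:ALL
+:root:ALL
-:ALL:ALL
# tail -1 water/etc/pam.d/system-auth
account required pam_access.so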

To prevent users from logging in directly to compute nodes:

[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% category use default
[mdv-bigcluster->category[default]]% set usernodelogin never
[mdv-bigcluster->category*[default*]]% commit

To prevent users who are not in the admin group from logging into the head node:

# echo account required pam_access.so >> /etc/pam.d/system-auth
# echo account required pam_access.so >> /etc/pam.d/password-auth
# echo +:admin:ALL  >> /etc/security/access.conf 
# echo +:root:ALL  >> /etc/security/access.conf 
# echo -:ALL:ALL  >> /etc/security/access.conf

We have created a setup where ordinary users cannot log into the head node or compute nodes, but must log in to the login node that is part of the cluster partition that they belong to. From this login node, users may submit jobs that will be executed on nodes that have been assigned to the cluster partition that belongs to the user’s tenant.

Example session for user alice:

[alice@water ~]$ module load slurm
[alice@water ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      2   idle node[001-002]
[alice@water ~]$ srun hostname
node001

Example session for user ernie:

[ernie@fire ~]$ module load slurm
[ernie@fire ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      3   idle node[003-005]
[ernie@fire ~]$ srun hostname
node003
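
Batch submission works the same way. A minimal batch script (the script name and options are just an illustration) that alice could submit from the water login node with sbatch:

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out
#SBATCH --ntasks=1

# Prints the name of whichever water node Slurm allocates
srun hostname

Because each partition runs its own Slurm instance, the job can only be scheduled on nodes that belong to the water partition.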

Nodes can be allocated to or de-allocated from partitions by adding them to or removing them from the relevant configuration overlays. To move nodes from one partition to another, the movenodes command in cmsh's configurationoverlay mode is useful:

[mdv-bigcluster->configurationoverlay]% list | grep client
slurm-fire-client   500        no             node003..node005                        slurmclient     
slurm-water-client  500        no             node001,node002                         slurmclient     
[mdv-bigcluster->configurationoverlay]% movenodes slurm-fire-client slurm-water-client -n node003..node004
[mdv-bigcluster->configurationoverlay]% movenodes slurm-fire-submit slurm-water-submit -n node003..node004
[mdv-bigcluster->configurationoverlay*]% list | grep client
slurm-fire-client   500        no             node005                                 slurmclient     
slurm-water-client  500        no             node001..node004                        slurmclient     
[mdv-bigcluster->configurationoverlay*]% commit

This now gives us the following setup:

[root@mdv-bigcluster ~]# ssh root@water "module load slurm; sinfo"
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      4   idle node[001-004]
[root@mdv-bigcluster ~]# ssh root@fire "module load slurm; sinfo"
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1   idle node005
[root@mdv-bigcluster ~]# 

Note that it is a good idea to drain nodes before moving them from one cluster partition to another. This prevents jobs that may still be running on them from crashing.
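
With Slurm, the nodes can be drained from the partition's login node before the move (the reason string is just an illustration; Bright's cmsh also provides drain and undrain commands for nodes, see the Administrator Manual):

[root@fire ~]# module load slurm
[root@fire ~]# scontrol update NodeName=node[003-004] State=DRAIN Reason="moving to water partition"
[root@fire ~]# squeue -w node[003-004]    # wait until no jobs remain on these nodes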

Depending on the level of isolation that is needed, it may be desirable to place nodes that have been assigned to a particular partition into a different category. This would also allow nodes to mount different external storage depending on which partition they belong to. The downside of this approach is that it will require nodes to be rebooted after moving them to a different cluster partition.
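
A minimal sketch of such a setup, assuming a hypothetical category name water-compute:

[root@mdv-bigcluster ~]# cmsh -c "category; clone default water-compute; commit"
[root@mdv-bigcluster ~]# cmsh -c "device; use node003; set category water-compute; commit"

Tenant-specific external storage can then be configured in the fsmounts submode of the new category, so that only nodes in the water partition mount the water tenant's storage.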

When moving nodes between partitions it may be a good idea to also re-image the node from scratch to make sure that there are no leftovers on the file system anywhere (e.g. in /scratch, /tmp or /data). This can be done by setting the nextinstallmode property of the node to FULL, and then rebooting the node.
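
For example, to force a full re-install of node003 on its next boot:

[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% device use node003
[mdv-bigcluster->device[node003]]% set nextinstallmode FULL
[mdv-bigcluster->device*[node003*]]% commit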

Updated on September 16, 2020
