By default a Bright cluster provides a multi-user environment rather than a multi-tenant environment. In the context of a cluster, the difference between a multi-user and a multi-tenant system comes down to visibility: what users can see of the environment and of each other's activity.
In a multi-user system, all users log in to the same set of login nodes and submit their jobs to the same workload management system. Users may have access to directories containing shared data. Multi-user systems tend to work well when all users belong to the same organization and there is no need for isolating groups of people within the organization from each other.
In a multi-tenant system, users may come from different organizations or different groups within an organization that should be isolated from each other. Users may even come from organizations that are direct competitors to each other. In such a scenario, it is important that users that belong to one tenant have no visibility over what users from another tenant are doing on the system. Another use-case where a multi-tenant system is useful is when certain types of workload (e.g. classified versus non-classified workloads) should be kept strictly isolated from each other.
In a multi-user system, all computational resources are typically shared through a single workload management system instance. In a multi-tenant system, computational resources are typically partitioned and each partition is dedicated to one particular tenant. Users typically belong to a single tenant (although it is possible that users may belong to multiple tenants). Each partition of the cluster typically runs its own workload management system instance so that users have no visibility into jobs that are being run by other tenants.
Bright Cluster Manager can be used to build a multi-tenant user environment where a single administrator or group of administrators manages the entire cluster, but where each user only has visibility into what happens within the partition of the cluster that belongs to their tenant. Administrators can scale individual partitions up or down by assigning computational resources to them or taking resources away.
If more isolation is required, Bright Cluster Manager provides Cluster-on-Demand features that allow groups of users (i.e. tenants) to be given their own isolated Bright cluster. Such a cluster can be hosted in AWS, Azure, OpenStack or VMware (as of Bright version 9.1). For more information, please consult the latest Bright Cloudbursting Manual on the documentation page. The rest of this article will focus on creating a multi-tenant user environment within a single Bright cluster.
Bright Cluster Manager 9.0 has the ability to run multiple workload management system instances within the same cluster. To create a new workload management system instance, the cm-wlm-setup utility can be used. Each workload management system instance has a unique name, and a number of configuration overlays are created for it. A configuration overlay is a construct in Bright Cluster Manager that binds roles to individual nodes or categories of nodes. A role causes a node to fulfill a particular task in the cluster (e.g. provisioning node, monitoring node, workload management system client or server). For example:
[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% device list -f hostname,category,status
hostname (key)       category             status
-------------------- -------------------- --------------------
fire                 login                [ UP ]
mdv-bigcluster                            [ UP ]
node001              default              [ UP ]
node002              default              [ UP ]
node003              default              [ UP ]
node004              default              [ UP ]
node005              default              [ UP ]
water                login                [ UP ]
[mdv-bigcluster]% configurationoverlay
[mdv-bigcluster->configurationoverlay]% list
Name (key)          Priority   All head nodes Nodes                  Categories       Roles
------------------- ---------- -------------- ---------------------- ---------------- ----------------
slurm-accounting    500        yes                                                    slurmaccounting
slurm-fire-client   500        no             node003..node005                        slurmclient
slurm-fire-server   500        no             fire                                    slurmserver
slurm-fire-submit   500        no             fire,node003..node005                   slurmsubmit
slurm-water-client  500        no             node001,node002                         slurmclient
slurm-water-server  500        no             water                                   slurmserver
slurm-water-submit  500        no             water,node001,node002                   slurmsubmit
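The configuration overlays listed above were generated by cm-wlm-setup. For reference, an additional instance can be created by launching the utility and following its interactive wizard (a sketch; the exact prompts depend on the Bright Cluster Manager version):
[root@mdv-bigcluster ~]# cm-wlm-setup
The wizard asks which workload management system to deploy (for example Slurm), a name for the new instance, and which nodes or categories should receive the server, client and submit roles; the corresponding configuration overlays are then created automatically.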
Rather than having all users use the same set of login nodes (as with a multi-user system), or having users log into the head node, in a multi-tenant system it is advisable to create a dedicated login node for each partition.
As can be seen above, we have defined two partitions called water and fire. Each partition has its own login node (namely the nodes named water and fire).
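For reference, a dedicated login node can be placed in its own category from cmsh, so that it can carry a different role set (and, if desired, software image) than the compute nodes. A minimal sketch, assuming the login category is cloned from default:
[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% category clone default login
[mdv-bigcluster->category*[login*]]% commit
[mdv-bigcluster->category[login]]% device use water
[mdv-bigcluster->device[water]]% set category login
[mdv-bigcluster->device*[water*]]% commit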
When creating user accounts, it is a good idea to make each user a member of a tenant-specific group.
[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% user list
Name (key)       ID (key)         Primary group    Secondary groups
---------------- ---------------- ---------------- ----------------
alice            1001             alice            water
bob              1002             bob              water
charlie          1003             charlie          fire
donna            1004             donna            fire
ernie            1005             ernie            fire
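New users can be created and assigned to a tenant group from cmsh. A minimal sketch for a hypothetical user frank joining the water tenant (secondary group membership is managed here through the group's members field; exact field names may differ slightly between versions):
[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% user add frank
[mdv-bigcluster->user*[frank*]]% commit
[mdv-bigcluster->user[frank]]% group use water
[mdv-bigcluster->group[water]]% append members frank
[mdv-bigcluster->group*[water*]]% commit
A password can be set with set password in user mode before committing the new account.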
This group membership can then be used to restrict access to tenant-specific login nodes. For example, to restrict access to the water and fire login nodes:
# cd /cm/images/default-image
# mkdir -p cm/conf/node/{water,fire}/etc/security
# mkdir -p cm/conf/node/{water,fire}/etc/pam.d
# cp etc/pam.d/{system-auth,password-auth} cm/conf/node/water/etc/pam.d
# cp etc/pam.d/{system-auth,password-auth} cm/conf/node/fire/etc/pam.d
# cp etc/security/access.conf cm/conf/node/fire/etc/security
# cp etc/security/access.conf cm/conf/node/water/etc/security
# cd cm/conf/node
# echo +:water:ALL >> water/etc/security/access.conf
# echo +:root:ALL >> water/etc/security/access.conf
# echo -:ALL:ALL >> water/etc/security/access.conf
# echo +:fire:ALL >> fire/etc/security/access.conf
# echo +:root:ALL >> fire/etc/security/access.conf
# echo -:ALL:ALL >> fire/etc/security/access.conf
# echo account required pam_access.so >> fire/etc/pam.d/system-auth
# echo account required pam_access.so >> water/etc/pam.d/system-auth
# echo account required pam_access.so >> fire/etc/pam.d/password-auth
# echo account required pam_access.so >> water/etc/pam.d/password-auth
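Bright applies the files placed under cm/conf/node/<hostname>/ in a software image only to the node with that hostname, so each login node ends up with its own PAM and access rules. After the commands above, the node-specific access.conf for the water login node ends with the following rules, which admit members of the water group and root, and reject everyone else:
# tail -3 water/etc/security/access.conf
+:water:ALL
+:root:ALL
-:ALL:ALL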
To prevent users from logging in directly to compute nodes:
[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% category use default
[mdv-bigcluster->category[default]]% set usernodelogin never
[mdv-bigcluster->category*[default*]]% commit
To prevent users who are not in the admin group from logging into the head node:
# echo account required pam_access.so >> /etc/pam.d/system-auth
# echo account required pam_access.so >> /etc/pam.d/password-auth
# echo +:admin:ALL >> /etc/security/access.conf
# echo +:root:ALL >> /etc/security/access.conf
# echo -:ALL:ALL >> /etc/security/access.conf
We have created a setup where ordinary users cannot log into the head node or compute nodes, but must log in to the login node that is part of the cluster partition that they belong to. From this login node, users may submit jobs that will be executed on nodes that have been assigned to the cluster partition that belongs to the user’s tenant.
Example session for user alice:
[alice@water ~]$ module load slurm
[alice@water ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 2 idle node[001-002]
[alice@water ~]$ srun hostname
node001
Example session for user ernie:
[ernie@fire ~]$ module load slurm
[ernie@fire ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 3 idle node[003-005]
[ernie@fire ~]$ srun hostname
node003
Nodes can be allocated to or removed from partitions by adding or removing them from the relevant configuration overlays. To move nodes from one partition to another, the movenodes command in cmsh's configurationoverlay mode is useful:
[mdv-bigcluster->configurationoverlay]% list | grep client
slurm-fire-client 500 no node003..node005 slurmclient
slurm-water-client 500 no node001,node002 slurmclient
[mdv-bigcluster->configurationoverlay]% movenodes slurm-fire-client slurm-water-client -n node003..node004
[mdv-bigcluster->configurationoverlay]% movenodes slurm-fire-submit slurm-water-submit -n node003..node004
[mdv-bigcluster->configurationoverlay*]% list | grep client
slurm-fire-client 500 no node005 slurmclient
slurm-water-client 500 no node001..node004 slurmclient
[mdv-bigcluster->configurationoverlay*]% commit
This now gives us the following setup:
[root@mdv-bigcluster ~]# ssh root@water "module load slurm; sinfo"
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 4 idle node[001-004]
[root@mdv-bigcluster ~]# ssh root@fire "module load slurm; sinfo"
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 1 idle node005
[root@mdv-bigcluster ~]#
Note that it is a good idea to first drain nodes before moving them from one cluster partition to another. This prevents jobs that may still be running on them from crashing.
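For example, with Slurm the nodes can be drained from the fire partition's login node before the move (a sketch; the reason text is arbitrary), and the move can proceed once no jobs remain on the drained nodes:
[root@fire ~]# module load slurm
[root@fire ~]# scontrol update NodeName=node[003-004] State=DRAIN Reason="moving to water partition"
[root@fire ~]# sinfo -R
Alternatively, cmsh provides drain and undrain commands in device mode.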
Depending on the level of isolation that is needed, it may be desirable to place nodes that have been assigned to a particular partition into a different category. This would also allow nodes to mount different external storage depending on which partition they belong to. The downside of this approach is that it will require nodes to be rebooted after moving them to a different cluster partition.
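For example, a category created for the water partition's compute nodes could mount tenant-specific NFS storage via the category's fsmounts submode (a sketch with hypothetical category, server and export names):
[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% category use water-compute
[mdv-bigcluster->category[water-compute]]% fsmounts
[mdv-bigcluster->category[water-compute]->fsmounts]% add /data/water
[mdv-bigcluster->category[water-compute]->fsmounts*[/data/water*]]% set device storage01:/export/water
[mdv-bigcluster->category[water-compute]->fsmounts*[/data/water*]]% set filesystem nfs
[mdv-bigcluster->category[water-compute]->fsmounts*[/data/water*]]% commit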
When moving nodes between partitions it may be a good idea to also re-image the node from scratch to make sure that there are no leftovers on the file system anywhere (e.g. in /scratch, /tmp or /data). This can be done by setting the nextinstallmode property of the node to FULL, and then rebooting the node.
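For example, in cmsh (a sketch for node003):
[root@mdv-bigcluster ~]# cmsh
[mdv-bigcluster]% device use node003
[mdv-bigcluster->device[node003]]% set nextinstallmode FULL
[mdv-bigcluster->device*[node003*]]% commit
[mdv-bigcluster->device[node003]]% reboot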