Purpose
By default, a Bright cluster provides a multi-user environment rather than a multi-tenant environment. In the context of a cluster, the difference between multi-user and multi-tenant systems lies in how much visibility users have into the rest of the environment.
Multi-User System
In a multi-user system, all users log in to the same login nodes and submit their jobs to the same workload management (WLM) system. Users may have access to directories containing shared data.
In a multi-user system, all computational resources are typically shared through a single WLM system instance.
Multi-user systems tend to work well when all users belong to the same organization, and there is no need to isolate groups of people within the organization from each other.
Multi-Tenant System
In a multi-tenant system, users may come from different organizations or groups within an organization that should be isolated from each other. Users may even come from organizations that are direct competitors.
In such a scenario, users who belong to one tenant must have no visibility into what users from another tenant are doing on the system. A multi-tenant system is also useful when certain types of workloads (e.g., classified versus non-classified workloads) must be kept strictly isolated.
In a multi-tenant system, computational resources are typically partitioned, and each partition is dedicated to one particular tenant. Users typically belong to a single tenant (although it is possible that they may belong to multiple tenants). Each partition of the cluster typically runs its own WLM system instance, so users have no visibility into jobs that are being run by other tenants.
Multi-Tenant Environment in BCM
BCM can be used to build a multi-tenant user environment where a single administrator or group of administrators manages the entire cluster, but where each user only has visibility into what happens within the partition of the cluster that belongs to their tenant. Administrators can scale individual partitions up or down by assigning or removing computational resources.
If more isolation is required, BCM also provides Cluster-on-Demand features that allow groups of users (e.g., tenants) to have their own, isolated cluster. Such a cluster can be hosted inside AWS, Azure, OpenStack, or VMware (as of Bright version 9.1). For more information, please consult the latest Cloudbursting Manual from the documentation page. The rest of this article will focus on creating a multi-tenant user environment within a single Bright cluster.
Establishing a Multi-Tenant Environment in BCM
BCM can run multiple WLM system instances within the same cluster. The cm-wlm-setup utility is used to create a new WLM system instance.
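For example, running the utility on the head node without arguments starts an interactive wizard that prompts for the WLM type, an instance name, and the server, client, and submit nodes (the exact prompts vary between BCM versions, and the non-interactive options can be listed with --help):
# cm-wlm-setup           # interactive setup wizard
# cm-wlm-setup --help    # list non-interactive options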
WLM Instance Management
Each WLM instance has a unique name, and several configuration overlays are created for each WLM system instance. A configuration overlay is a construct in BCM that binds roles to individual nodes or categories of nodes. A role assigns a particular function to a node (e.g., provisioning, monitoring, WLM system client, or WLM system server). For example:
# cmsh
% device list -f hostname,category,status
hostname (key)       category             status
-------------------- -------------------- --------------------
fire                 login                [ UP ]
mdv-bigcluster                            [ UP ]
node001              default              [ UP ]
node002              default              [ UP ]
node003              default              [ UP ]
node004              default              [ UP ]
node005              default              [ UP ]
water                login                [ UP ]
% configurationoverlay
% list
Name (key)          Priority   All head nodes Nodes                  Categories       Roles
------------------- ---------- -------------- ---------------------- ---------------- ----------------
slurm-accounting    500        yes                                                    slurmaccounting
slurm-fire-client   500        no             node003..node005                        slurmclient
slurm-fire-server   500        no             fire                                    slurmserver
slurm-fire-submit   500        no             fire,node003..node005                   slurmsubmit
slurm-water-client  500        no             node001,node002                         slurmclient
slurm-water-server  500        no             water                                   slurmserver
slurm-water-submit  500        no             water,node001,node002                   slurmsubmit
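The settings of an individual overlay (priority, nodes, categories, and assigned roles) can be inspected with the use and show commands in the same configurationoverlay mode, for example:
% use slurm-fire-client
% show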
In a multi-tenant system, it is advisable to create a dedicated login node for each partition, rather than having all users share the same set of login nodes (as in a multi-user system) or log in to the head node.
As seen above, we have defined two partitions: water and fire. Each partition has its own login node (the nodes named water and fire, respectively).
WLM Instance Member Management
When creating user accounts, it is a good idea to make each user a member of a tenant-specific group.
# cmsh
% user list
Name (key)       ID (key)         Primary group    Secondary groups
---------------- ---------------- ---------------- ----------------
alice            1001             alice            water
bob              1002             bob              water
charlie          1003             charlie          fire
donna            1004             donna            fire
ernie            1005             ernie            fire
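As a sketch, a new user (here a hypothetical user frank for the water tenant) could be created and added to the tenant group from the head node as follows; the group property name (members) is an assumption and can be verified with the show command in cmsh's group mode:
# cmsh -c "user; add frank; commit"
# cmsh -c "group; use water; append members frank; commit"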
This membership can then be used to restrict access to tenant-specific login nodes.
For example, to limit access to the water and fire login nodes:
# cd /cm/images/default-image
# mkdir -p cm/conf/node/{water,fire}/etc/security
# mkdir -p cm/conf/node/{water,fire}/etc/pam.d
# cp etc/pam.d/{system-auth,password-auth} cm/conf/node/water/etc/pam.d
# cp etc/pam.d/{system-auth,password-auth} cm/conf/node/fire/etc/pam.d
# cp etc/security/access.conf cm/conf/node/fire/etc/security
# cp etc/security/access.conf cm/conf/node/water/etc/security
# cd cm/conf/node
# echo +:water:ALL >> water/etc/security/access.conf
# echo +:root:ALL >> water/etc/security/access.conf
# echo -:ALL:ALL >> water/etc/security/access.conf
# echo +:fire:ALL >> fire/etc/security/access.conf
# echo +:root:ALL >> fire/etc/security/access.conf
# echo -:ALL:ALL >> fire/etc/security/access.conf
# echo account required pam_access.so >> fire/etc/pam.d/system-auth
# echo account required pam_access.so >> water/etc/pam.d/system-auth
# echo account required pam_access.so >> fire/etc/pam.d/password-auth
# echo account required pam_access.so >> water/etc/pam.d/password-auth
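After the login nodes have been rebooted (or otherwise re-provisioned) so that they pick up the node-specific files placed in the image above, the restriction can be sanity-checked from the head node; charlie belongs to the fire tenant and alice to the water tenant:
# ssh charlie@water hostname   # expected to be denied by pam_access (charlie is not in group water)
# ssh alice@water hostname     # expected to succeed (alice is a member of group water)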
Then, to prevent users from logging in directly to compute nodes, we can use the usernodelogin setting in cmsh. This setting restricts direct user logins from outside the WLM:
# cmsh
% category use default
% set usernodelogin never
% commit
To prevent users who are not in the admin group from logging in to the head node, we can apply a similar configuration on the head node itself:
# echo account required pam_access.so >> /etc/pam.d/system-auth
# echo account required pam_access.so >> /etc/pam.d/password-auth
# echo +:admin:ALL >> /etc/security/access.conf
# echo +:root:ALL >> /etc/security/access.conf
# echo -:ALL:ALL >> /etc/security/access.conf
We have created a setup in which ordinary users cannot log into the head node or compute nodes. Instead, they must log in to the login node for their cluster partition.
WLM Instance Job Execution
From this login node, users may submit jobs that will be executed on nodes assigned to the cluster partition belonging to the user’s tenant.
Example session for user alice:
[alice@water ~]$ module load slurm
[alice@water ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 2 idle node[001-002]
[alice@water ~]$ srun hostname
node001
Example session for user ernie:
[ernie@fire ~]$ module load slurm
[ernie@fire ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 3 idle node[003-005]
[ernie@fire ~]$ srun hostname
node003
WLM Instance Node Management
Nodes can be allocated to or removed from a partition by adding them to or removing them from the relevant configuration overlays. To move nodes from one partition to another, the movenodes command in cmsh's configurationoverlay mode is useful:
# cmsh
% configurationoverlay
% list | grep client
slurm-fire-client   500        no             node003..node005                        slurmclient
slurm-water-client  500        no             node001,node002                         slurmclient
% movenodes slurm-fire-client slurm-water-client -n node003..node004
% movenodes slurm-fire-submit slurm-water-submit -n node003..node004
*% list | grep client
slurm-fire-client   500        no             node005                                 slurmclient
slurm-water-client  500        no             node001..node004                        slurmclient
*% commit
This now gives us the following setup:
# ssh root@water "module load slurm; sinfo"
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 4 idle node[001-004]
# ssh root@fire "module load slurm; sinfo"
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 1 idle node005
NOTE: It is a good idea to drain nodes before moving them from one cluster partition to another. Draining allows running jobs to finish and prevents new jobs from being scheduled on the node, so no jobs are killed by the move.
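For example, node003 and node004 could be drained through the fire partition's Slurm instance before being moved, and moved only once they show no running jobs (cmsh also provides drain and undrain commands in device mode):
# ssh root@fire "module load slurm; scontrol update NodeName=node[003-004] State=DRAIN Reason='moving to water'"
# ssh root@fire "module load slurm; sinfo -R"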
Depending on the level of isolation needed, it may be desirable to place nodes assigned to a particular partition into a different category. This would also allow nodes to mount different external storage depending on which partition they belong to. The downside of this approach is that it will require nodes to be rebooted after being moved to a different cluster partition.
When moving nodes between partitions, it may be a good idea to re-image the node from scratch to ensure there are no leftovers on the file system anywhere (e.g., in /scratch, /tmp, or /data). This can be done by setting the nextinstallmode property of the node to FULL and then rebooting the node.
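For example, assuming node003 is being moved into a hypothetical per-tenant category named water-compute, the category change and the full re-install can be combined in a single cmsh session:
# cmsh
% device use node003
% set category water-compute
% set nextinstallmode FULL
% commit
% reboot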