
ID #1344

How do I install HTCondor from sources on top of a Bright Cluster?



HTCondor can be installed on top of a Bright Cluster as follows:

Note: The following instructions have been tested on Bright 7.3 with CentOS 7 as the base OS.


On the head node


1. Install dependencies:

# yum install setools-console policycoreutils-python perl-Date-Manip.noarch


2. Untar the sources:

# tar -xzvf condor-8.4.9-x86_64_RedHat7-stripped.tar.gz


3. Add condor user:

# cmsh

% user add condor

% commit


4. Install Condor using the condor_install script (note that the installation must be owned by a user other than root, which is specified with the --owner option):


# cd condor-8.4.9-x86_64_RedHat7-stripped/


# ./condor_install --prefix=/cm/shared/apps/condor/8.4.9 --owner=condor --install-dir=/cm/shared/apps/condor/8.4.9

Installing Condor from /root/condor-8.4.9-x86_64_RedHat7-stripped to /cm/shared/apps/condor/8.4.9


Condor has been installed into:

   /cm/shared/apps/condor/8.4.9


Configured condor using these configuration files:

 global: /cm/shared/apps/condor/8.4.9/etc/condor_config

 local:  /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/condor_config.local


In order for Condor to work properly you must set your CONDOR_CONFIG

environment variable to point to your Condor configuration file:

/cm/shared/apps/condor/8.4.9/etc/condor_config before running Condor

commands/daemons.

Created scripts which can be sourced by users to setup their

Condor environment variables.  These are:

  sh: /cm/shared/apps/condor/8.4.9/condor.sh

 csh: /cm/shared/apps/condor/8.4.9/condor.csh


5. Copy the Condor environment setup scripts to /etc/profile.d:

# cp --preserve /cm/shared/apps/condor/8.4.9/condor.{sh,csh} /etc/profile.d/
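

To verify that new shells will pick up the Condor environment, the script can be sourced manually and a Condor command run. A quick check (assuming the generated condor.sh sets CONDOR_CONFIG and adds the Condor binaries to the PATH):

# source /etc/profile.d/condor.sh

# echo $CONDOR_CONFIG

# condor_version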



6. Configure Condor with the condor_configure script:

# ./condor_configure --type=manager,submit --verbose --install-dir=/cm/shared/apps/condor/8.4.9

Condor will be run as user: condor

Install directory: /cm/shared/apps/condor/8.4.9

Main config file: /cm/shared/apps/condor/8.4.9/etc/condor_config

Local directory: /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2

Local config file: /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/condor_config.local


Writing settings to file: /cm/shared/apps/condor/8.4.9/etc/condor_config

CONDOR_HOST=ma-c-12-30-b73-c7u2.cm.cluster

COLLECTOR_NAME=

DAEMON_LIST=COLLECTOR MASTER NEGOTIATOR SCHEDD


Configured condor using these configuration files:

 global: /cm/shared/apps/condor/8.4.9/etc/condor_config

 local:  /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/condor_config.local


In order for Condor to work properly you must set your CONDOR_CONFIG

environment variable to point to your Condor configuration file:

/cm/shared/apps/condor/8.4.9/etc/condor_config before running Condor

commands/daemons.

Created scripts which can be sourced by users to setup their

Condor environment variables.  These are:

  sh: /cm/shared/apps/condor/8.4.9/condor.sh

 csh: /cm/shared/apps/condor/8.4.9/condor.csh



7. Modify the condor_config file so that the configuration parameters point to the correct paths, and expand the Condor pool beyond a single host (set ALLOW_WRITE to match all of the cluster hosts):


# cat /cm/shared/apps/condor/8.4.9/etc/condor_config | grep -vE "^#|^$"
RELEASE_DIR = /cm/shared/apps/condor/8.4.9
LOCAL_DIR = /cm/shared/apps/condor/8.4.9/local.$(HOSTNAME)
LOCAL_CONFIG_FILE = /cm/shared/apps/condor/8.4.9/local.$(HOSTNAME)/condor_config.local
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config
use SECURITY : HOST_BASED
ALLOW_WRITE = *.cm.cluster
use ROLE : Personal
CONDOR_HOST = master.cm.cluster
UID_DOMAIN = cm.cluster
FILESYSTEM_DOMAIN = cm.cluster
LOCK = /tmp/condor-lock.0.0129490057743205
CONDOR_IDS = 1001.1001
CONDOR_ADMIN = root@master.cm.cluster
MAIL = /usr/bin/mail
JAVA = /usr/bin/java
JAVA_MAXHEAP_ARGUMENT = -Xmx1024m
DAEMON_LIST = MASTER COLLECTOR SCHEDD NEGOTIATOR
STARTD_DEBUG = D_FULLDEBUG
COLLECTOR_DEBUG = D_FULLDEBUG
COLLECTOR_HOST = $(CONDOR_HOST):9618
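

After editing, the effective configuration can be double-checked with condor_config_val, which reports the configuration files in use and the value of individual parameters (with CONDOR_CONFIG set as described above), for example:

# condor_config_val -config

# condor_config_val ALLOW_WRITE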



8. Create a startup/boot (systemd) service file for starting the Condor services:


# cp --preserve /cm/shared/apps/condor/8.4.9/etc/examples/condor.service /lib/systemd/system/

# cat /lib/systemd/system/condor.service


[Unit]

Description=Condor Distributed High-Throughput-Computing

After=syslog.target network-online.target nslcd.service ypbind.service

Wants=network-online.target


[Service]

Environment=CONDOR_CONFIG=/cm/shared/apps/condor/8.4.9/etc/condor_config

ExecStart=/cm/shared/apps/condor/8.4.9/sbin/condor_master -f

ExecStop=/cm/shared/apps/condor/8.4.9/sbin/condor_off -master

ExecReload=/bin/kill -HUP $MAINPID

Restart=always

RestartSec=1minute

StandardOutput=syslog

LimitNOFILE=16384


[Install]

WantedBy=multi-user.target


# systemctl enable condor.service

Created symlink from /etc/systemd/system/multi-user.target.wants/condor.service to /usr/lib/systemd/system/condor.service.



9. Start the Condor service:

# systemctl restart condor.service

# systemctl status condor.service

● condor.service - Condor Distributed High-Throughput-Computing

  Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled; vendor preset: disabled)

  Active: active (running) since Fri 2016-12-30 11:24:23 CET; 4s ago

Main PID: 15093 (condor_master)

  CGroup: /system.slice/condor.service

          ├─15093 /cm/shared/apps/condor/8.4.9/sbin/condor_master -f

          ├─15118 condor_procd -A /tmp/condor-lock.0.0129490057743205/procd_pipe -L /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/log/ProcLog -R 1000000 -S 60 -C 1001

          ├─15119 condor_collector -f

          ├─15132 condor_negotiator -f

          └─15133 condor_schedd -f


Dec 30 11:24:23 ma-c-12-30-b73-c7u2 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Dec 30 11:24:23 ma-c-12-30-b73-c7u2 systemd[1]: Starting Condor Distributed High-Throughput-Computing...


# ps aux | grep condor

condor     15093  0.0  0.1  42884  5516 ?        Ss   11:24   0:00 /cm/shared/apps/condor/8.4.9/sbin/condor_master -f

root       15118  0.0  0.1  23004  4580 ?        S    11:24   0:00 condor_procd -A /tmp/condor-lock.0.0129490057743205/procd_pipe -L /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/log/ProcLog -R 1000000 -S 60 -C 1001

condor     15119  0.0  0.1  64064  6352 ?        Ss   11:24   0:00 condor_collector -f

condor     15132  0.0  0.1  42884  5480 ?        Ss   11:24   0:00 condor_negotiator -f

condor     15133  0.0  0.1  63268  7144 ?        Ss   11:24   0:00 condor_schedd -f

root       15211  0.0  0.0 112648   956 pts/0    S+   11:25   0:00 grep --color=auto condor




In the software image (assuming default-image is the image currently used by the compute nodes)


1. Install dependencies:


# yum install setools-console policycoreutils-python perl-Date-Manip.noarch --installroot=/cm/images/default-image


2. Set up Condor inside the software image:

  • Create a local configuration directory for each compute node (substitute node001/node002 with the correct node names and repeat for each node; a scripted loop for larger clusters is sketched after the two example commands below):

# cp -r --preserve /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/ /cm/shared/apps/condor/8.4.9/local.node001/

# cp -r --preserve /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/ /cm/shared/apps/condor/8.4.9/local.node002/
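

For a larger number of nodes the copy can be scripted; a minimal sketch, assuming the nodes are named node001, node002, ... (adjust the upper bound of seq to the number of compute nodes):

# for n in $(seq -f "node%03g" 1 2); do cp -r --preserve /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/ /cm/shared/apps/condor/8.4.9/local.${n}/; done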


  • Create a startup/boot script for starting Condor services in the software image:

# cat /cm/images/default-image/lib/systemd/system/condor.service


[Unit]
Description=Condor Distributed High-Throughput-Computing
After=syslog.target network-online.target nslcd.service ypbind.service network.target
Wants=network-online.target network.target

[Service]
Environment=CONDOR_CONFIG=/cm/shared/apps/condor/8.4.9/local.%H/condor_config.local
ExecStart=/cm/shared/apps/condor/8.4.9/sbin/condor_master -f
ExecStop=/cm/shared/apps/condor/8.4.9/sbin/condor_off -master
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=1minute
StandardOutput=syslog
LimitNOFILE=16384

[Install]
WantedBy=multi-user.target
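

The unit also has to be enabled inside the image so that condor_master starts when the compute nodes boot. A sketch of one way to do this from the head node (assuming the head node's systemctl supports the --root option):

# systemctl --root=/cm/images/default-image enable condor.service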


  • Copy the condor_config file to condor_config.local under each local.<node> directory, after making the necessary changes to the DAEMON_LIST (a sketch of these edits follows the listing below):

# cat /cm/shared/apps/condor/8.4.9/local.node001/condor_config.local | grep -vE "^#|^$"

RELEASE_DIR = /cm/shared/apps/condor/8.4.9

LOCAL_DIR = /cm/shared/apps/condor/8.4.9/local.$(HOSTNAME)

LOCAL_CONFIG_FILE = /cm/shared/apps/condor/8.4.9/local.$(HOSTNAME)/condor_config.local

LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config

use SECURITY : HOST_BASED

use ROLE : Personal

CONDOR_HOST = master

ALLOW_WRITE = *

UID_DOMAIN = cm.cluster

FILESYSTEM_DOMAIN = cm.cluster

LOCK = /tmp/condor-lock.0.0129490057743205

CONDOR_IDS = 1001.1001

CONDOR_ADMIN = root@master.cm.cluster

MAIL = /usr/bin/mail

JAVA = /usr/bin/java

JAVA_MAXHEAP_ARGUMENT = -Xmx1024m

COLLECTOR_HOST = $(CONDOR_HOST):9618

DAEMON_LIST = MASTER STARTD
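

A sketch of how such a file can be produced for one node, starting from the global condor_config and adjusting the DAEMON_LIST (the sed expression is illustrative; compare the result against the listing above and adjust any remaining parameters by hand):

# cp --preserve /cm/shared/apps/condor/8.4.9/etc/condor_config /cm/shared/apps/condor/8.4.9/local.node001/condor_config.local

# sed -i 's/^DAEMON_LIST = .*/DAEMON_LIST = MASTER STARTD/' /cm/shared/apps/condor/8.4.9/local.node001/condor_config.local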


3. Reboot the compute nodes so that they are provisioned with the modified software image:
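
One way to trigger the reboot from the head node is through cmsh; a sketch, assuming the device-mode reboot command accepts a node range with -n (adjust the range to the actual node names):

# cmsh -c "device; reboot -n node001..node002"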


Check the status from the head node after the nodes are back up:

# condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime


node001.cm.cluster LINUX      X86_64 Unclaimed Idle      0.000  993  0+00:30:04

node002.cm.cluster LINUX      X86_64 Unclaimed Idle      0.150  993  0+00:00:04

                    Total Owner Claimed Unclaimed Matched Preempting Backfill


X86_64/LINUX     2     0       0         2       0          0        0


              Total     2     0       0         2       0          0        0




Submitting a job


Submitting jobs as root is not allowed, so switch to a non-root user in order to submit jobs.


# su - cmsupport

$ cat hostname.sh

#!/bin/bash

hostname -f

date

 

sleep 20

date

echo "exit"


$ cat hostname.condor

############

# Example job file

############

Universe=vanilla

Executable=/home/cmsupport/hostname.sh

input=/dev/null

output=hostname.out

error=hostname.error

Queue


$ condor_submit hostname.condor

$ condor_q

-- Schedd: ma-c-12-30-b73-c7u2.cm.cluster : <10.141.255.254:50275?...
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  9.0   cmsupport      12/30 17:33   0+00:00:06 R  0   0.0  hostname.sh

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended



$ cat hostname.out

node002.cm.cluster

Fri Dec 30 17:33:58 CET 2016

Fri Dec 30 17:34:18 CET 2016

exit


$ condor_history
ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD            
  9.0   cmsupport      12/30 17:33   0+00:00:20 C  12/30 17:34 /home/cmsupport/hostname.sh
  6.0   cmsupport      12/30 17:22   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  5.0   cmsupport      12/30 16:58   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  8.0   cmsupport      12/30 17:33   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  7.0   cmsupport      12/30 17:32   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  1.0   cmsupport      12/30 16:38   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  2.0   cmsupport      12/30 16:40   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  4.0   condor         12/30 16:43   0+00:00:00 X         ???  /home/condor/hostname.sh

