Cluster Management with Condor, Part 2

Last month's column introduced Condor and presented a sample installation of the software package in a cluster environment. Condor is a system that creates a "high-throughput computing" environment by effectively utilizing computing resources from a pool of cluster nodes and disparate workstations distributed around a network. Like many batch queuing systems, Condor provides a queuing mechanism, scheduling policy, job priority scheme, and resource classification. Unlike most other batch systems, Condor doesn't require dedicated compute servers.

Last month’s column introduced Condor and presented a sample installation of the software package in a cluster environment. Condor is a system that creates a “high-throughput computing” environment by effectively utilizing computing resources from a pool of cluster nodes and disparate workstations distributed around a network. Like many batch queuing systems, Condor provides a queuing mechanism, scheduling policy, job priority scheme, and resource classification. Unlike most other batch systems, Condor doesn’t require dedicated compute servers.

Condor continuously matches job requirements called job ClassAds, akin to classified advertisements, with advertised resource attributes called machine ClassAds. Condor jobs run in one of a number of universes, where the supported universes are standard, vanilla, PVM, MPI, globus, java, and scheduler. The standard universe supports automatic process migration among nodes for serial jobs and remote system calls on the originating hosts, but it restricts what the running programs can do. The vanilla universe provides fewer services, but has very few restrictions.

The PVM and MPI universes provide support for parallel programs written in PVM (Parallel Virtual Machine) and MPI (Message Passing Interface), specifically MPICH, respectively. The globus universe allows users to submit Globus (http://www.globus.org) jobs through Condor, and the java universe supports jobs written for the Java Virtual Machine (JVM). The scheduler universe is used internally to execute a job immediately.

A Condor pool consists of a single machine, the central manager, and a number of other machines that join the pool as participating resources. For a pool consisting of a cluster with dedicated nodes, some basic assumptions are usually made: the front-end node is the central manager, the pool members have a shared filesystem (usually /home), and all pool members are execution hosts. Details of Condor installation and configuration on a Linux cluster can be found in last month’s column. The remaining configuration information and examples included below assume that Condor’s been installed as described last month.

MPI Applications

Most cluster owners will want Condor to schedule and run parallel programs written in PVM or MPI. Doing so requires a few additional configuration steps. Only the MPI setup will be covered here.

For Condor to run MPI jobs, programs must be compiled using mpicc from the MPICH distribution (freely available at http://www-unix.mcs.anl.gov/mpi/mpich). Unfortunately, the latest version of Condor (version 6.5.3) can’t spawn parallel programs compiled with the latest version of MPICH (version 1.2.5). Until the Condor development team fixes the problem, version 1.2.4 of MPICH should be used instead.

Condor can only run MPI jobs on dedicated resources. Dedicated machines execute programs from beginning to end; no job preemption can take place. In a typical cluster, this sort of dedicated configuration for all nodes is typically best; however, Condor was originally written to harvest cycles from disparate workstations, so dedicated resources (machines) must be specifically configured as such.

After performing the initial installation described last month, edit the global configuration file for Condor and the node-specific (local) configurations for each machine.

First, open /home/condor/etc/condor_config, and in “Part 1″ of the file, make the following changes: RELEASE_DIR should be set to /home/condor; LOCAL_DIR should be set to $(RELEASE_DIR)/hosts/$(HOSTNAME); and LOCAL_ CONFIG_FILE should be set to $(RELEASE_DIR)/etc/ $(HOSTNAME).local.

Next, machine-specific local configuration files should be created. A set of local configuration files should be present in the /home/condor/etc directory. They are probably all zero length files except for the file for the central manager (usually the front-end node). The local configuration file for the central manager machine contains the name of machine pool, a list of daemons to run, the paths for the collector and negotiator binaries, and additional settings that can be changed to alter default behavior. The remaining compute nodes need local configuration files that establish them as dedicated resources.

An example of such a local configuration file is contained in /home/condor/etc/examples/condor_config.local.dedicated. resource. This file may be copied and edited, and then used as the local configuration file for each of the compute nodes.

First, the full hostname of the machine from which users will submit MPI jobs must be specified. This should be the front-end node. Next, one of three policy settings must be chosen for the dedicated nodes: run only dedicated jobs; always run jobs, but prefer dedicated ones; or always run dedicated jobs, but allow non-dedicated jobs to run only on an opportunistic basis.

When starting with a copy of the example configuration file, uncomment the settings for the desired policy. Figure One shows the settings used for the second option (without most of the comments provided in the example file). Once the local configuration file is correct for one compute node, it may be copied for use by all other compute nodes. Afterward, Condor should be restarted on every node.

Figure One: Machine-specific local configuration file, specifying a dedicated resource (sans most comments)

DedicatedScheduler = “DedicatedScheduler@node001.cluster.ornl.gov
## 2) Always run jobs, but prefer dedicated ones
START = True
KILL = False
RANK = Scheduler =?= $(DedicatedScheduler)
## Settings you should leave alone, but that
must be defined
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

Running condor_status should now show the state of all compute nodes as Unclaimed. You may notice that condor_status treats each processor as a virtual machine, and each dual-processor node is listed as vm1@nodeNNN and vm2@nodeNNN. Now Condor is ready to schedule and run parallel MPI jobs.

Submitting Jobs to Condor

Jobs are submitted to Condor using the condor_submit command along with the name of a submit description file. The job description file contains everything needed to run the job. It lists the name of the executable to run, the initial run-time directory, command-line arguments to the program, and additional job-related requirements used by Condor to create a job ClassAd. The most basic form of a submit description file for the MPI universe is shown in Figure Two.

Figure Two: hello-world1.condor, an example job submit description file for the MPI universe

# Condor submit description file
universe = MPI
executable = hello-world
machine_count = 16

In the file, the universe is set to MPI, the executable name is provided (hello-world), a number of machines is specified (4), and the queue command is given. While this file is adequate for scheduling a parallel job, the program can get no input, and its output is directed to /dev/null.

A more useful example is shown in Figure Three. In Figure Three, the universe is set to MPI and the executable is hello-world, as in the previous example. However, the input file is explicitly set to /dev/null (since the program takes no input), the output file (to which stdout will be directed) is set to output.$(NODE), and the error file (for stderr) is set to errfile.$(NODE). The macro $(NODE) is defined only for the MPI universe, and is replaced with the rank of the MPI process (starting at zero as usual).

Figure Three: hello-world2.condor, a more useful job submit description file for the MPI universe

# Condor submit description file
universe = MPI
executable = hello-world
log = logfile
input = /dev/null
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 16

The hello-world program can be your favorite short pro-gram which, at a minimum, calls MPI_Init() and MPI_ Finalize(). Listing One contains a sample program that prints Hello, world! and the rank of the process, the size of the communicator, and the processor name.

Listing One: hello-world.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
int rank, size, namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Get_processor_name(processor_name, &namelen);
printf(“Hello world! I’m rank %d of %d on %s\n”,
rank, size,

This program must be compiled prior to submitting the job description file to Condor. The job number is returned to the user, and the job appears in the queue listing obtained by running condor_q. Initially, the job is listed in the Idle state (denoted by the I in the column labeled ST) while it awaits assignment to cluster nodes. Later, the job state changes to R, meaning that it’s running. A look at the status of the machines, obtained using the condor_status command, shows that sixteen of the virtual machines are now listed as Claimed and either Busy or Idle, depending on what point they have reached in executing the very short Hello, World! program. Once the job has completed, it disappears from the Condor queue. The output from each MPI process is now contained in separate output files, one for stdout and one for stderr. The errfile.* files should be empty. Upon job termination, Condor sends an e-mail message to the user describing the resources and machines used, along with the exit status of the program on each node.

The condor_q and condor_status commands can be used to check on jobs and resources, respectively. If a job appears to be stuck in the Condor queue, running condor_q with the -analyze flag should help explain the problem. Jobs may be removed from the queue using the condor_rm command.

Submitted jobs may be placed in a hold state so that they remain in the queue, but are not considered for execution. They may be subsequently removed from the hold state using the condor_release command. Additional commonly-used Condor commands are shown in Table One.

Table One: Condor commands

condor_compile Re-link program with Condor libraries for checkpoint/restart

condor_history Examine the job history file

condor_hold Put a job in the queue into the hold state

condor_prio Alter the priority of a job in the Condor queue

condor_q Show status of submitted jobs

condor_release Release a held job in the Condor queue

condor_rm Remove a jobs from the Condor queue

condor_status Show a listing of available pool resources

condor_submit Submit a job to the Condor queue

condor_userprio Examine or change a user priority

Priorities and Preemption

Condor has two independent priority controls: job priorities and user priorities. Job priorities, which range from -20 (the worst) to +20 (the best), allow the user to control the order of execution of submitted jobs. The condor_prio command can be used to alter the priority of a job in the Condor queue. Job priorities have no effect on user priorities.

User priorities determine how machines are allocated to users’ jobs. A lower value means a higher priority. User priorities can be examined with the condor_userprio command. Using the same command, Condor administrators can change the priority of individual users.

Condor continuously evaluates the share of available machines that each user should be allocated. This share is inversely related to the ratio between user priorities. Condor adjusts priorities to ensure that each user gets his fair share. If the number of machines allocated for a user is less than her priority, her priority will be improved by decreasing the numeric effective priority over time. Likewise, if a user has more machines than his priority would dictate, his priority is worsened by increasing his effective priority value. The rate at which Condor adjusts priorities is established by the PRIORITY_HALFLIFE configuration macro.

Condor enforces fair share allocation when allocating available machines and by preempting jobs currently running on allocated machines. For instance, if a low priority user has jobs running on all available machines and a higher priority user submits new jobs, Condor immediately checkpoints (if running in the standard universe) and vacates jobs belonging to the lower priority user, thereby freeing up machines for the higher priority user’s jobs. Condor preempts only enough jobs so that the higher priority user’s fair share may be realized based on the ratio between user priorities.

When a job must vacate a machine, Condor sends the job an asynchronous signal specified in the KillSig attribute of the job’s ClassAd. The signal may also be specified in the kill_ sig option of the job submit description file. By default it’s SIGSTP (which causes the job to checkpoint and exit) in the standard universe, and SIGTERM in the vanilla universe. If a program needs to perform some special operations to stop and vacate the machine, it may set up a signal handler to use a trappable signal as an indication to clean up. When submitted, this job must specify the desired signal with the kill_sig option.

But Wait! There’s More!

This should be enough information to get you started using Condor for parallel jobs using MPICH. But as you can probably tell, there’s more to Condor than just running MPI jobs. In many cases, Beowulf clusters are used for distributed applications or lots of simultaneous serial jobs. Condor is especially useful for these sorts of “massively serial” tasks. Additional features in Condor that help manage lots of serial jobs will be presented in next month’s column.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory in Oak Ridge, Tennessee. You can reach Forrest at forrest@climate.ornl.gov.

Comments are closed.