Job Scheduling and Batch Systems

One of the measures of success of a Beowulf cluster is the number of people waiting in line to run their code on the system. Build your own low-cost supercomputer, and your cycle-starved colleagues will quickly become your new best friends. But, when they all get accounts and start running jobs, they'll soon find themselves battling each other for the limited resources of the machine.

One of the measures of success of a Beowulf cluster is the number of people waiting in line to run their code on the system. Build your own low-cost supercomputer, and your cycle-starved colleagues will quickly become your new best friends. But, when they all get accounts and start running jobs, they’ll soon find themselves battling each other for the limited resources of the machine.

At that time, you’ll need a package that automatically schedules jobs and allocates resources on the cluster — something akin to the batch job queuing facilities used on the business systems and mainframes of yesteryear. Batch facilities would line up jobs, execute them in turn as the appropriate resources became available, and deliver the output of each job back to the submitter.

Clustered systems and emerging grid technologies have driven the need for new job scheduling packages in the computational science realm. Two scheduling packages that are increasingly being used are OpenPBS (Portable Batch System, http://www.openpbs.org) and Sun Microsystems’s Grid Engine (http://gridengine.sunsource.net). OpenPBS is an open source package offered and supported by Veridian Systems. (Veridian also offers an enhanced commercial version called PBS Pro.) The source for Grid Engine is also available at no cost. Both packages run on a wide

variety of Unix and Linux systems, and may be used for both serial and parallel job control. The month’s column will focus on implementing OpenPBS on a typical Beowulf cluster.

Introducing OpenPBS

OpenPBS consists of three primary components — a job server, a job executor, and a job scheduler — and a set of commands and X-based tools for submitting jobs and monitoring queues. The job server (pbs_server) handles basic queuing services such as creating and modifying a batch job and placing a job into execution when it’s scheduled to be run. The job executor (pbs_mom) is the daemon that actually runs jobs. The job scheduler (pbs_sched) is another daemon. It knows the site’s policies and rules about when and where jobs can be run. In the simplest implementation, pbs_server and pbs_sched are run only on the front-end node, while pbs_mom is run on every node of the cluster that can run jobs, including the front-end node.

Figure One: A typical 8-node Beowulf cluster running OpenPBS

Figure One presents a typical eight-node Beowulf cluster running OpenPBS. In the figure, every node, including the front-end machine, node01, runs the pbs_mom daemon, while only the front-end node runs the pbs_server and pbs_sched daemons. (By the way, a Mom is a node running pbs_mom; the Server is the node running pbs_server and pbs_sched.) OpenPBS commands (qsub, qstat, etc.) are available on every node. In the configuration shown, an external machine, climate .ornl.gov, can access the queues and monitor the batch system remotely.

A variety of schedulers are available for OpenPBS, but only the default scheduler, called fifo, will be discussed here. (Don’t worry. Jobs are not run first-in, first-out by this scheduler, despite what its name suggests.) Third party schedulers may also be used in combination with OpenPBS. For instance, Maui (http://www.supercluster.org) is one scheduler often used with OpenPBS on Beowulf clusters.

Configuring OpenPBS

Source for OpenPBS is available on its website, but you must register on the website and await approval before being allowed to download the code and documentation. RPMs are also available on the site, and these may be used for the implementation described here.

All components may be built using the standard configure, make, and make install procedure. A host of configure options are described in the OpenPBS Administrator’s Guide along with instructions for building and installing OpenPBS on various Unix systems like the Cray, SGI, and IBM SP.

Once the code is built and installed on the front-end node, the pbs_mom and command components should be installed on the remaining cluster nodes. You can install the full suite on every node in the cluster if disk space is ample and you don’t get confused about which node is the server. Run-time information is stored in subdirectories under $PBS_HOME, which is assumed to be /usr/spool/PBS, the default location.

Before starting any daemons, a few configuration files need to be created or updated. First, each node needs to know what machine is running the server. This is conveyed through the $PBS_HOME/server_name file, which, for our configuration, should contain the following line:


Second, the pbs_server daemon must know which nodes are available for executing jobs. This information is kept in a file called $PBS_HOME/server_priv/nodes, and the file appears only on the front-end node where the jobs server runs. You can set various properties for each node listed in the nodes file, but for this simple configuration, only the number of processors is included. $PBS_HOME/server_priv/nodes should contain the following lines:

node01 np=2
node02 np=2
node03 np=2
node04 np=2
node05 np=2
node06 np=2
node07 np=2
node08 np=2

Third, each pbs_mom daemon needs some basic information to participate in the batch system. This configuration information is contained in $PBS_HOME/mom_priv/config on every node. The following lines should be in this file for the example configuration:

$logevent 0x0ff
$clienthost node01
$restricted climate.ornl.gov

The $logevent directive specifies what information should be logged during operation. A value of 0x0ff causes all messages except debug messages to be logged, while 0x1ff causes all messages, including debug messages, to be logged. The $clienthost directive tells each Mom where the Server is — in this case it’s on node01.

$restricted details which hosts are allowed to connect to Mom directly. Hosts allowed to connect can make internal queries of Mom using monitoring tools such as xpbsmon. In this case, climate.ornl.gov is a system external to the cluster that can monitor the batch system (again, see Figure One for the topology).

Starting Up OpenPBS

Once all of the configuration files are ready, the component daemons can be started. It’s easiest if the Moms are started first to be ready to communicate with the Server once it’s launched. For this configuration pbs_mom should be started on all computational nodes, including node01 as follows:

[root@node01 root]# pbs_mom
[root@node02 root]# pbs_mom
[root@node03 root]# pbs_mom

[root@node07 root]# pbs_mom
[root@node08 root]# pbs_mom

Next, the Server should be started on node01. The first time you run pbs_server, start it with the -t create flag to initialize the server configuration.

Once the Server is running, qmgr can be used to construct one or more jobs queues and their properties. Using the commands shown in Figure Two, we create a single execution queue, called penguin_exec, for jobs that run for more than one second and less than 48 hours. The default time for jobs put in the queue is set to thirty minutes. Once the queue is created, we enable and start the queue. The Server is then provided a list of managers, forrest@climate.ornl.gov in this case. Finally, server scheduling is set to true, which causes jobs to be scheduled. The Server configuration can be saved to a file as follows:

Listing Two: jobscript.csh

#PBS -N Hello_job
#PBS -l nodes=2:ppn=2
#PBS -l walltime=05:00
#PBS -m be
echo “The nodefile is ${PBS_NODEFILE} and it contains:”
echo “”
time /usr/local/bin/mpirun -nolocal -machinefile ${PBS_NODEFILE} \
-np `wc -l ${PBS_NODEFILE} | awk ‘{print $1}’` hello_waster

Figure Two: Launching the Server and creating a queue

[root@node01 root]# pbs_server -t create
Then run qmgr to configure the server:
[root@node01 root]# qmgr
Max open servers: 4
Qmgr: create queue penguin_exec
Qmgr: set queue penguin_exec queue_type = execution
Qmgr: set queue penguin_exec resources_max.cput = 48:00:00
Qmgr: set queue penguin_exec resources_min.cput = 00:00:01
Qmgr: set queue penguin_exec resources_default.cput = 00:30:00
Qmgr: set queue penguin_exec enabled = true
Qmgr: set queue penguin_exec started = true
Qmgr: set server managers = forrest@climate.ornl.gov
Qmgr: set server scheduling = true
Qmgr: quit

[root@node01 root]# qmgr -c “print server” \
> /root/server.config

The Server configuration can later be fed back into qmgr to recreate the configuration as follows:

[root@node01 root]# qmgr < /root/server.config

Finally, the default scheduler (fifo) is started using the default configuration (provided with the scheduler):

[root@node01 root]# pbs_sched

The scheduler may be configured with various policies by editing the $PBS_HOME/sched_priv/sched_config file. You can specify job sorting methods, assign individual users elevated priorities, and establish prime-time and holiday scheduling policies.

After the Server has been manually started and configured as described above, scripts should be written and placed in appropriate /etc/rc.d directories to automatically start the three daemons on the front-end node and Mom on all of the nodes at boot. Alternatively, these daemons can be started out of the rc.local file.

Submitting Jobs to OpenPBS

Both serial and parallel jobs can be submitted to OpenPBS queues, but the package has no way of enforcing the use of allocated nodes. A special script file can be used to ensure that the intentions of OpenPBS are met.

First, however, let’s write some parallel code to test the batch queue. The hello_waster.c code shown in Listing One is a standard “Hello World” program that prints the MPI process rank and processor name, and then wastes time so that the job will run for a few minutes.

Listing One: hello_waster.c

#include <stdio.h>
#include <math.h>
#include “mpi.h”

int main(int argc, char **argv)
int me, nprocs, namelen, i, j;
double x;
char processor_name[MPI_MAX_PROCESSOR_NAME];

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &me);
MPI_Get_processor_name(processor_name, &namelen);

printf(“Hello World! I’m process %d of %d on %s\n”, me, nprocs,

/* Waste a little time… */
for (i = 0; i < (2*999); i++)
for (j = 0; j < 999999; j++)
x = sqrt((double)(i * j));


return 0;

To run the program, you submit it to the batch queue with qsub and a special script like the one shown in Listing Two. jobscript.csh is just a shell script, but the lines that start with #PBS (ignored by csh as comments) are OpenPBS directives that describe the job to the Server. PBS directives provide a way of specifying job attributes in addition to qsub command line options.

In this example, the job is named “Hello_ job” (#PBS -N Hello_job), two nodes are requested with two processors per node (#PBS -l nodes=2:ppn=2), wall-clock time is guessed to be 5 minutes (#PBS -l walltime=05:00), and mail should be sent at the beginning and end of the job (#PBS -m be). The script prints the list of nodes that the Scheduler allocates for the job (contained in $PBS_NODEFILE), then executes the program by calling mpirun.

To be sure that all of the allocated nodes are used, the -machinefile flag is passed to mpirun with name of the file which contains the node list.

The -nolocal flag is passed to mpirun so that only the nodes listed in $PBS_ NODEFILE are used and no local process is created. The -np flag of mpirun is used to specify the number of processes. In this example, the number of requested processes is obtained by counting the number of lines in the $PBS_NODEFILE. As a result, the number of processes can be changed for future submissions of this script by simply changing the nodes and ppn values at the top of the script.

The code should be compiled using mpicc, and the job script submitted to the penguin_ exec queue using the qsub command as follows:

[forrest@node01 forrest]$ mpicc -o \
hello_waster hello_waster.c -lm
[forrest@node01 forrest]$ qsub -q penguin_exec \
jobscript.csh 21.node01

The status of queues and jobs can be monitored using the qstat command show in Figure Three.

Figure Three: Monitoring queues with qstat

[forrest@node01 forrest]$ qstat

Job id Name User Time Use S Queue
21.node01 Hello_job forrest 0 R penguin_exec
Job ID Username Queue Jobname SessID NDS TSK Req’d Memory Req’d Time S Elap Time
22.node01 forrest penguin_ Hello_job 21697 2 — — 0:30 R –

[forrest@node01 forrest]$

The output from this job is saved in a file called Hello_ job.o21 shown here.


Hello World! I’m process 2 of 4 on node01
Hello World! I’m process 3 of 4 on node01
Hello World! I’m process 1 of 4 on node02
Hello World! I’m process 0 of 4 on node02
0.110u 0.170s 4:46.01 0.0% 0+0k 0+0io

As requested in the script file, email is sent to the user to inform him that the job has been scheduled, and a second message is sent upon completion of the job. An example of the second email message is shown here:

Date: Tue, 16 Jul 2002 23:06:11 -0400
From: adm <adm@node01>
To: forrest@node01
Subject: PBS JOB 21.node01
PBS Job Id: 21.node01
Job Name: Hello_job
Execution terminated

Scheduling jobs and managing resources on a Beowulf cluster can be challenging when more than a few users want to run parallel codes there. Job queuing, and scheduling packages like OpenPBS, Grid Engine, and others, make job handling much more manageable.

This month we installed OpenPBS and learned how to create, submit, and control jobs. But that’s just the beginning. OpenPBS is quite powerful. For example, you can define rules such as execution order, synchronization, and conditional execution between batch jobs. Future columns will cover other features of job scheduling facilities useful on Beowulf clusters.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. Forrest can be reached at forrest@climate.ornl.gov. You can download the code and scripts used in this column from http://www.linux-mag.com/2002-10/extreme.

Comments are closed.