Job Scheduling with Maui

Scheduling jobs and allocating resources on a Beowulf cluster quickly becomes a challenge once more than a few users start running codes on the system. Manual coordination of runs is tedious, particularly when different codes have very different resource requirements. A job queuing and scheduling facility solves these problems by automatically executing jobs as resources become available, which keeps the cluster well utilized. Moreover, a good job scheduler can be configured to enforce operational policies about when and where jobs belonging to different users may be run.

Last month’s column introduced batch queues and job scheduling for Beowulf clusters. We installed and configured OpenPBS (http://www.openpbs.org/), and created a sample parallel program and execution script for the job queuing system. We also created a single execution queue, and ran jobs using the default OpenPBS scheduler, fifo. Contrary to what its name implies, fifo can do much more than simple first-in first-out scheduling.

As mentioned last month, fifo is just one of many available schedulers. Third party schedulers with more capabilities can also be used with OpenPBS. One particularly popular scheduler is Maui (http://www.supercluster.org). This month’s column focuses on installing and using Maui.

Maui is an advanced batch scheduler designed for high performance computing platforms. As such, it makes decisions about where, when, and how to run jobs based on specified policies, priorities, and system limits. Maui provides extensive administrative control over system resources, handles job reservations, and offers detailed logging support and powerful tracking tools for workload management. Maui runs on Alpha and PC clusters, the IBM SP, and SGI Origin systems. It must be used in conjunction with a resource manager such as OpenPBS, PBSPro, LoadLeveler, or the Sun Grid Engine.








Figure One: A typical 8-node Beowulf cluster running OpenPBS

Installing and Configuring Maui

It’s assumed that OpenPBS is already installed and configured on your Beowulf cluster as described in last month’s column (available online at http://www.linux-mag.com/2002-10/extreme_01.html). That configuration is shown in Figure One.

Notice that node01 is running pbs_sched, the OpenPBS scheduler. We’ll retain that configuration, and install and run Maui on node01, replacing fifo (of course, you can always choose to run the scheduler on any other node).

After downloading and unpacking the sources for Maui, run the configure script to construct both a Makefile and a configuration file for the Maui scheduler. The configure script asks a series of questions: Where should Maui be installed? What compiler should be used? What number should be used as a checksum seed? And which resource manager should be used? For the last question, PBS should be chosen as the resource manager (as shown in Figure Two).




Figure Two: Building the Maui scheduler


[root@node01 maui-3.0.7]# ./configure

Configuring Maui…
checking system configuration…

checking Makefile… (building new Makefile)
Maui Installation Directory? (Default: /usr/local)
NOTE: This is where Maui executables will be copied:
Maui Home Directory? (Default: /usr/local/src/maui-3.0.7)
NOTE: This is where Maui config, log, and checkpoint files are
maintained: /usr/local/maui
Compiler? (Default: gcc)
Checksum Seed? (Any random number between 0 and MAX_INT) 123456789

OPSYS: LINUX
COMPILER: gcc
CHECKSUMSEED: 123456789
MAUI_HOME_DIR: /usr/local/maui
MAUI_INST_DIR: /usr/local
PRIMARY ADMIN: root
SERVERHOST: node01

Correct? [Y|N] (Default: N) Y
Do you want to use PBS? [Y|N] (Default: Y)
PBS Target Directory: (default: /usr/local)

checking maui.cfg… (building new maui.cfg)

NOTE: please link 'docs/mauidocs.html' to your local website for
access to user and administrator documentation
NOTE: latest downloads, patches, etc are available at
'http://supercluster.org/maui'

The configure script automatically sets up Maui so that the user running configure is the default administrator. The default administrator can be modified in the maui.cfg file. However, be careful that the user you choose as the default administrator is also an administrator in OpenPBS. Otherwise, Maui will be unable to communicate with the pbs_server and the pbs_moms.

After running configure, use make install to build and install the Maui binaries, the configuration file (maui.cfg), and the run-time directories. If the installation is successful, you should see the following entry in the maui.cfg file:


RMTYPE[0] PBS

This indicates that PBS — or OpenPBS in this case — is indeed the resource manager. Maui can be tested on a live and active system without interfering with an existing scheduler by setting the SERVERMODE parameter in maui.cfg to TEST. For normal operations, this parameter should be set to NORMAL.

Starting and Using Maui

Before starting Maui in normal mode, the pbs_server daemon should be running on the server host (node01), and pbs_mom daemons should be running on all of the computational nodes (node01 through node08). If the normal OpenPBS scheduler, pbs_sched, is running, it should be killed before you start Maui.

The Maui scheduler can be started by the root user by typing maui, or can be started automatically at boot time by including it in rc.local. Once Maui has been started, jobs may be submitted as described last month using the OpenPBS qsub command. To test the scheduler, you can use the script developed in last month’s column and submit it to the existing job queue a number of times.


[forrest@node01 forrest]$ qsub \
-q penguin_exec jobscript.csh
24.node01
[forrest@node01 forrest]$ qsub \
-q penguin_exec jobscript.csh
25.node01
.
.
.
[forrest@node01 forrest]$ qsub \
-q penguin_exec jobscript.csh
31.node01

While OpenPBS commands can still be used to view queue and job status information, Maui offers alternative commands that often provide additional information or capabilities. For example, if Maui is running, its showq command (which replaces the OpenPBS command qstat) displays the active, idle, and non-queued jobs on the system. Sample output of showq is shown in Figure Three.




Figure Three: View the jobs in the system with Maui’s showq command


[forrest@node01 forrest]$ showq
ACTIVE JOBS--------------------
JOBNAME  USERNAME  STATE    PROC  REMAINING  STARTTIME

24       forrest   Running     4    0:04:54  Sun Aug 4 22:06:55

1 Active Job    4 of 4 Processors Active (100.00%)
                2 of 2 Nodes Active (100.00%)

IDLE JOBS----------------------
JOBNAME  USERNAME  STATE    PROC  WCLIMIT    QUEUETIME

25       forrest   Idle        4    0:05:00  Sun Aug 4 22:06:56
26       forrest   Idle        4    0:05:00  Sun Aug 4 22:06:57
27       forrest   Idle        4    0:05:00  Sun Aug 4 22:06:58
28       forrest   Idle        4    0:05:00  Sun Aug 4 22:06:59
29       forrest   Idle        4    0:05:00  Sun Aug 4 22:06:59
30       forrest   Idle        4    0:05:00  Sun Aug 4 22:07:00
31       forrest   Idle        4    0:05:00  Sun Aug 4 22:07:01

7 Idle Jobs

NON-QUEUED JOBS----------------
JOBNAME  USERNAME  STATE    PROC  WCLIMIT    QUEUETIME

Total Jobs: 8  Active Jobs: 1  Idle Jobs: 7  Non-Queued Jobs: 0

The information provided by showq is more useful and easier to read than that provided by qstat. For instance, showq displays the number of processors required for each job instead of the number of nodes, and it describes the number of processors as well as the number of nodes which are active.
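Because showq reports processors and wall clock limits in regular columns, its output is easy to post-process. The following is a minimal sketch in Python, not part of Maui, that totals the processor-seconds requested by idle jobs. The sample text is copied from the idle jobs listed in Figure Three; the names parse_jobs and JOB_LINE are hypothetical, and the pattern assumes an H:MM:SS time field, so limits longer than a day would need a different expression.

```python
import re

# Three lines copied from the IDLE JOBS section of the showq output above.
SHOWQ_IDLE = """\
25 forrest Idle 4 0:05:00 Sun Aug 4 22:06:56
26 forrest Idle 4 0:05:00 Sun Aug 4 22:06:57
27 forrest Idle 4 0:05:00 Sun Aug 4 22:06:58
"""

# Job id, user, state, processor count, then an H:MM:SS field
# (REMAINING for active jobs, WCLIMIT for idle ones).
# Assumption: the time field never includes a day count.
JOB_LINE = re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+):(\d\d):(\d\d)\s")


def parse_jobs(text):
    """Return one dict per job line; headers, rules, and summaries are skipped."""
    jobs = []
    for line in text.splitlines():
        match = JOB_LINE.match(line.strip())
        if match is None:
            continue
        job_id, user, state, procs, hours, minutes, seconds = match.groups()
        jobs.append({
            "id": int(job_id),
            "user": user,
            "state": state,
            "procs": int(procs),
            "seconds": int(hours) * 3600 + int(minutes) * 60 + int(seconds),
        })
    return jobs


idle = parse_jobs(SHOWQ_IDLE)
# Processor-seconds requested: processors times wall clock limit, per job.
demand = sum(job["procs"] * job["seconds"] for job in idle)
print(len(idle), "idle jobs requesting", demand, "processor-seconds")
```

In practice the text would come from running showq and capturing its output rather than from a string constant.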

You can use the checkjob and checknode commands to show information about individual jobs and nodes, respectively. The checkjob command shows the state of the job number entered, including its queue and start times, the wall clock time it has consumed, and a list of nodes on which the job is executing. The checknode command shows the state of the selected node, some information about its available resources (including number of processors, amount of memory, size of swap space, and number of disks), its load average, and the reservations or job numbers to which it is assigned.

Jobs can be “put on hold” or canceled by an administrator with the sethold and canceljob commands, respectively. Canceled jobs are dropped from the queue, while held jobs are labelled as “non-queued.” Held jobs can be released for execution later with releasehold.

The showstart command displays an estimated start time for idle jobs, and the showstats command shows detailed usage statistics for users, groups, and accounts to which the user has access.

Maui employs a backfill scheduling approach that allows jobs to be run “out of order” without delaying the highest priority jobs in the queue. To maximize the efficiency of this algorithm, an estimate of the job’s required wall clock time must be provided by the user. This time should be slightly over-estimated (in case the scheduler is configured to abort jobs that exceed their time estimates), but not by much. When accurate run time estimates are provided, Maui is more likely to make good use of available resources, meaning faster, more efficient throughput for all jobs.
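To see why the wall clock estimate matters, consider a simplified backfill pass. The sketch below is written in Python and is not Maui's actual algorithm: it assumes a single pool of identical processors, a queue already sorted by priority, and a reservation for only the first blocked job. The names Job and backfill are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int      # processors requested
    walltime: int   # user's wall clock estimate, in seconds


def backfill(total_procs, running, queue):
    """Return the names of queued jobs that may start right now.

    running: list of (procs, remaining_seconds) for executing jobs.
    queue:   list of Job in priority order, highest first.
    """
    running = list(running)
    queue = list(queue)
    free = total_procs - sum(procs for procs, _ in running)
    started = []

    # Start jobs strictly in priority order for as long as they fit.
    while queue and queue[0].procs <= free:
        job = queue.pop(0)
        free -= job.procs
        running.append((job.procs, job.walltime))
        started.append(job.name)
    if not queue:
        return started

    # The head job is blocked. Find when enough processors will be free
    # for it (the "shadow time"), based on the running jobs' estimates.
    head = queue[0]
    avail, shadow = free, None
    for procs, remaining in sorted(running, key=lambda r: r[1]):
        avail += procs
        if avail >= head.procs:
            shadow = remaining
            break
    if shadow is None:
        return started  # simplification: head asks for more than exists

    # Processors that stay free even after the head job starts.
    extra = avail - head.procs

    # A lower-priority job may jump ahead only if it cannot delay the head
    # job: it finishes before the shadow time, or it fits in the extras.
    for job in queue[1:]:
        if job.procs > free:
            continue
        if job.walltime <= shadow:
            free -= job.procs
            started.append(job.name)
        elif job.procs <= extra:
            free -= job.procs
            extra -= job.procs
            started.append(job.name)
    return started


# 8 processors, 6 of them busy for another 600 seconds.
queue = [Job("A", 8, 3600), Job("C", 2, 1200), Job("B", 2, 300)]
print(backfill(8, [(6, 600)], queue))
```

In this example job A must wait 600 seconds for the whole machine. Job B estimates 300 seconds, so it can use the two idle processors without delaying A. Job C asks for the same two processors but estimates 1200 seconds, so it is held back. Had C's owner supplied a tighter estimate, C could have been considered for those processors as well.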

The showbf command reports exactly what resources are available for immediate use, as shown in Figure Four. When used with the -S flag, showbf displays the number of processors, amount of memory, disk, and swap space presently available on the cluster. Users will find this command useful because jobs that utilize only the presently-available resources run as soon as they are submitted. In addition, jobs can be forced to run immediately at any time by an administrator using the Maui runjob command.




Figure Four: showbf shows what resources are available now


[forrest@node01 forrest]$ showbf
backfill window (user: 'forrest' group: 'forrest' partition: ALL) Sun Aug 4 22:30:14

4 procs available with no timelimit

[forrest@node01 forrest]$ showbf -S
HostName  Procs  Memory  Disk  Swap  Time Available
---------------------------------------------------
node01        2     896     1  1506  INFINITY
node02        2     896     1  1514  INFINITY

Now Accepting Reservations

Maui provides a mechanism for reservations, where a reservation guarantees the availability of some set of resources at a specified time. Reservations consist of a list of resources, a time frame, and an access control list.

For instance, a certain node, say node02, may be reserved for a certain user, say Bob, for a certain time frame, say Tuesday, September 17 from Noon to 6:00 p.m. Reservations are created using the setres command. Bob’s reservation is entered as follows:


% setres -s 12:00:00_09/17 -e \
18:00:00_09/17 -u bob node02
reservation 'bob.1' created on one node

While Maui ensures that the specified resource is available only to Bob, it does not guarantee that Bob’s jobs will use this resource during the requested time frame. However, the user can force his job to run within a given reservation by using the FLAGS extension when using qsub to queue the job. In this case, Bob could use the following command to use his reserved node:


% qsub -l nodes=1,walltime=3:00:00 -W \
x="FLAGS:ADVRES:bob.1" bobjob.cmd

The showres command can be used to view detailed information about reservations. The releaseres command removes a reservation, and can be run either by an administrator or by the owner of the reservation being released. Standing reservations can be established (in maui.cfg) when there is a recurring need for a particular type of resource distribution. A third type, the priority reservation, may be created to give priority to large jobs that might otherwise sit in the queue for extended periods of time while smaller jobs continuously use most cluster resources.
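The three parts of a reservation (a list of resources, a time frame, and an access control list) can be modeled in a few lines. The sketch below is an illustration in Python, not Maui code. The Reservation class and its blocks method are hypothetical, the year 2002 is an assumption taken from the date of this column, and the user alice is invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Reservation:
    name: str
    nodes: frozenset     # resources covered by the reservation
    start: datetime      # beginning of the time frame
    end: datetime        # end of the time frame
    users: frozenset     # access control list

    def blocks(self, user, node, job_start, job_end):
        """True if this reservation keeps `user` off `node` for the interval."""
        if node not in self.nodes:
            return False
        # Two intervals overlap when each starts before the other ends.
        overlaps = job_start < self.end and job_end > self.start
        return overlaps and user not in self.users


# Bob's reservation from the text: node02, September 17, noon to 6:00 p.m.
bob1 = Reservation(
    name="bob.1",
    nodes=frozenset({"node02"}),
    start=datetime(2002, 9, 17, 12, 0),   # year is an assumption
    end=datetime(2002, 9, 17, 18, 0),
    users=frozenset({"bob"}),
)

one_pm = datetime(2002, 9, 17, 13, 0)
two_pm = datetime(2002, 9, 17, 14, 0)
print(bob1.blocks("alice", "node02", one_pm, two_pm))  # alice is kept off node02
print(bob1.blocks("bob", "node02", one_pm, two_pm))    # bob may use it
```

Note that the model only says who is excluded. As in Maui, nothing here forces Bob's job onto node02; that is what the ADVRES flag shown above is for.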

Partitions

Another useful feature in Maui is partitions. Partitions are logical constructs that divide available resources. Normally, a job can only use resources within a single partition, and any resource may only be associated with a single partition. Partitions can be used to divide up resources into independently managed sub-clusters with their own policies, limits, and priorities.

For example, a 64-node cluster owned equally by a math department and a physics department could be split into two 32-node partitions, where each partition could have different operating policies.

Partitions are also useful for establishing boundaries due to hardware constraints. For instance, a 128-node cluster may have two 64-port Ethernet switches (which are cheaper than a single 128-port switch), and as a result, running a code on nodes connected to a single switch will be more efficient than running on nodes spanning both switches. Using the partitioning features of Maui, the system could be configured into two different 64-node partitions, allowing jobs to be run in either partition, but not across both.

Partitions are configured in the maui.cfg file. Users can request a specific partition on a per-job basis using the -W x=PARTITION:part flag on the qsub command line (where part is the name of a valid partition).
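The rule that a job may not span partitions has a visible consequence: a job can be left waiting even when enough nodes are free in total. The following Python sketch illustrates that rule only. It is not how Maui selects nodes, and the function pick_nodes, the partition names, and the 8-node layout (a scaled-down version of the two-department example) are all hypothetical.

```python
def pick_nodes(partitions, free_nodes, wanted, partition=None):
    """Return `wanted` free nodes drawn from a single partition, or None.

    partitions: dict mapping partition name to a list of node names.
    free_nodes: set of node names that are currently idle.
    partition:  restrict the search to this partition if given.
    """
    candidates = [partition] if partition is not None else sorted(partitions)
    for name in candidates:
        usable = [n for n in partitions.get(name, []) if n in free_nodes]
        if len(usable) >= wanted:
            return usable[:wanted]
    return None  # no single partition can satisfy the request


partitions = {
    "math": ["node01", "node02", "node03", "node04"],
    "physics": ["node05", "node06", "node07", "node08"],
}
free = {"node03", "node04", "node05", "node06", "node07"}

print(pick_nodes(partitions, free, 3))            # fits in one partition
print(pick_nodes(partitions, free, 4))            # five nodes free, none usable
print(pick_nodes(partitions, free, 2, "math"))    # explicit partition request
```

The second call finds nothing even though five nodes are idle, because no single partition has four free nodes. That is the intended trade-off in the two-switch example: the job waits rather than run across the slower inter-switch link.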

Simply Better Computing

The combination of the OpenPBS resource manager and the Maui scheduler provides a powerful facility for scheduling, running, and managing jobs on Linux clusters. Users quickly tire of competing with colleagues for resources on a cluster without a job queuing system. Once they become accustomed to a queuing facility, they find their productivity improves significantly.

Moreover, system administrators enjoy the benefits of a system that tracks jobs, monitors resource utilization, and delivers output back to users.

If more than two users are trying to run codes on your cluster, try out OpenPBS and Maui. You’ll wonder how you ever lived without them, your users will thank you, and you may even sleep better at night.

Who knows, you might even get to take that Hawaiian vacation this year. First stop: Maui.



Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov.
