
Share and Share Alike: The State of Cluster Resource Management and Scheduling

It's time you got your resources in order! A guide to the resource managers and schedulers that control access to shared clusters, and how to choose the right combination for your site.

In a perfect world, everyone would have their own supercomputer. Sadly, we don’t live in a perfect world, and so clusters and other high performance computing systems tend to be shared resources. Unfortunately, once the number of simultaneous users on a system reaches double digits, scheduling methods that involve yelling down the hall “Is anybody using the machine now?” become impractical. One solution to this is to impose batch processing on the user community. This requires all users to submit their work to a central point of control that handles scheduling access to the system. Batch processing allows equitable access to the computing resource (making everyone more or less equally unhappy), but it also allows the system administrators to schedule the resource based on the goals and policies of the organization.
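
For the user, this typically means wrapping the work in a small shell script and handing it to the batch system rather than launching it by hand. As a rough sketch (the command names here assume a PBS-style system; other resource managers use different but analogous commands):

    $ qsub myjob.sh    # submit the job script; the system replies with a job ID
    $ qstat            # list queued and running jobs to check on its progress

The job then waits in a queue until the scheduler decides its turn has come.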

Resource Managers

First, some basic terminology. A resource is a finite quantity of something that can be used to do work: processors, memory, disk space, even software licenses. A job is a self-contained quantum of work which requires some resources for a period of time, while a job array is a set of related tasks that can be treated as a unit. A resource manager (also known as a batch system) is software that manages the allocation of resources to jobs. It usually consists of at least two components: a queue server that accepts, classifies, and dispatches jobs; and one or more (usually many) execution servers that spawn jobs and monitor their execution. Most modern resource managers are designed to be distributed and fault-tolerant, with many supporting either active/passive or active/active fail-over for the queue server, and the vast majority being tolerant of execution server failures to varying degrees.

The PBS family: OpenPBS, TORQUE, and PBS Pro

The Portable Batch System (PBS) family of resource managers is perhaps the most widely used in scientific computing. PBS was originally created as a replacement for the equally venerable Network Queuing System (NQS) in the Numerical Aerodynamic Simulation facility at NASA Ames Research Center, and it formed the basis for POSIX 1003.2d, the POSIX standard for batch processing. PBS runs on a wide variety of proprietary and open source UNIX-like systems, from Cray vector mainframes running UNICOS to commodity PCs running Linux and FreeBSD. While it started out as an open-source product maintained by a for-profit company, over the years it has split into three distinct code bases:

  • OpenPBS, the original open source version, effectively unmaintained for several years.

  • TORQUE, an open source version forked from OpenPBS with numerous community-provided patches, maintained by Cluster Resources Inc.

  • PBS Pro, a commercial version with improved fault tolerance and grid features, maintained by Altair. Available for free to most academic sites.

These three code bases are interoperable for the most part; job scripts which work in one of them will generally work in the others with little or no modification, and virtually all of the user-level and administrative commands are similar if not identical. The major differences between the three PBS code bases are in scalability and fault tolerance. The default protocol used for communication between the queue and execution servers in OpenPBS is based on UDP, and it is not particularly forgiving of delays or dropped UDP packets. As a result, OpenPBS tends to have scalability problems with more than 200-300 nodes, depending on the performance of the network. In contrast, the communication protocols used by TORQUE and PBS Pro are considerably more robust; both are known to scale to well over 1000 nodes. Another significant difference between the PBSes is support for job arrays. OpenPBS does not support job arrays in any meaningful sense, and TORQUE has only recently added some beta-level support for them. However, PBS Pro has supported job arrays for quite some time.
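
To make the interoperability point concrete, below is a minimal PBS-style job script of the sort that generally runs with little or no modification under OpenPBS, TORQUE, and PBS Pro. The directives shown are standard PBS resource requests; the job name and application are, of course, placeholders:

    #!/bin/sh
    # request 2 nodes with 4 processors each, for at most one hour,
    # and merge stdout and stderr into a single output file
    #PBS -N my_job
    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=01:00:00
    #PBS -j oe

    # PBS starts jobs in the home directory; move to the submit directory
    cd $PBS_O_WORKDIR
    ./my_application

Submitted with qsub, the script's #PBS comment lines are parsed as if they were command-line options.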

One of the unique features pioneered by the PBS family of resource managers is the interactive job, which is an interactive session managed through the batch system as if it were a job. These are tremendously useful for debugging, pre- and post-processing of data sets, and other inherently interactive tasks that may be too resource-intensive to be run in the often tightly controlled environment of a shared login node.
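
In the PBS family, an interactive job is requested with qsub's -I flag, using the same resource syntax as a batch job (the node count and time limit below are arbitrary examples):

    $ qsub -I -l nodes=1:ppn=2 -l walltime=00:30:00
    # once the scheduler allocates the resources, qsub opens an
    # interactive shell on the first allocated compute node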

Sun Grid Engine

Another widely used resource manager is Sun’s Grid Engine (SGE). Grid Engine was originally an academic software package called Distributed Queuing System (DQS), which later evolved into a commercial resource manager called Gridware CODINE. In 2000, Sun purchased Gridware and shortly thereafter released both commercial and open source versions of the newly renamed Grid Engine software. (Note: The term “Grid” is used in many contexts. While Sun Grid Engine can be used to build “Grid like” systems, it is not the same as wide area Grid projects like Globus.) SGE is known for its ability to sustain very high job submission rates and its fail-over capabilities. However, the latter requires the use of a shared file system which is accessible from all the nodes in the cluster.

Grid Engine is available both as open source and as a commercially supported version called N1 Grid Engine 6 (N1GE). The two are functionally identical; the commercial version differs by offering additional QC/QA testing done internally by Sun, localization to non-English languages, global support and professional services, and tight integration with other "N1" management, service, and provisioning technologies. The open source Grid Engine development community is very active and provides How To pages (e.g., MPI integration) and mailing list support for SGE issues. There is also an on-line video from the Grid Engine team that introduces grid computing and explains what Grid Engine is and how it is commonly used.
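
For comparison with the PBS script shown earlier, here is a sketch of the equivalent Grid Engine job script. SGE uses #$ directives rather than #PBS ones, and parallel jobs request slots from a named parallel environment; the environment name ("mpi" below) is site-specific, precisely the sort of detail the How To pages address:

    #!/bin/sh
    # name the job, run it from the submission directory, request 8 slots
    # in the site's "mpi" parallel environment, and set a one-hour limit
    #$ -N my_job
    #$ -cwd
    #$ -pe mpi 8
    #$ -l h_rt=01:00:00

    ./my_application

As with PBS, the script is handed to a qsub command, which queues it for the scheduler.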

SLURM

A relatively recent entry into the world of resource management is the Simple Linux Utility for Resource Management (SLURM). Developed by Lawrence Livermore National Laboratory, SLURM is an intentionally minimalist resource manager focused on portability, fault tolerance, and extreme scalability. This focus on scalability has resulted in its use on several of the largest systems in the world, including the 64k-processor IBM BlueGene/L at LLNL and several very large clusters with 8k or more processors. However, SLURM’s minimalism comes at a price; for instance, its accounting log support could be considered relatively primitive compared with other resource managers.

One of SLURM’s unique features is its built-in MPI support. Whereas most other resource managers either support a single MPI implementation or require third-party software to support execution of MPI parallel jobs within the context and control of the batch system, SLURM’s srun command already supports programs using any of at least half a dozen different MPI implementations, with more being added on a regular basis.
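
In practice this means many MPI programs can be launched directly with srun, with no separate mpirun step (the node and task counts below are arbitrary examples):

    $ srun -N 4 -n 16 ./my_mpi_application   # run 16 tasks across 4 nodes
    # srun's --mpi option selects among the supported MPI plugins when
    # the default does not match the MPI library the program was built with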

Others

In addition to the open source resource managers, there are commercial alternatives to consider. One such offering is Platform Computing's Load Sharing Facility (LSF), which has seen wide use at U.S. Department of Defense and Department of Energy computing centers. Like SGE, LSF is known for being able to sustain very high job submission rates. Platform also offers an extensive portfolio of system management and workflow automation products on top of LSF.
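
LSF's interface will look familiar to PBS and SGE users; a job script carries #BSUB directives and is conventionally fed to the bsub command on standard input (the queue name below is a placeholder):

    #!/bin/sh
    # name the job, request 16 processors and a one-hour wall-clock
    # limit, and submit to a (placeholder) queue named "normal"
    #BSUB -J my_job
    #BSUB -n 16
    #BSUB -W 01:00
    #BSUB -q normal

    ./my_application

Note the submission syntax is bsub < myjob.sh; the redirection matters, as bsub reads the #BSUB directives from its standard input.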

IBM's LoadLeveler is another commercial option. LoadLeveler originated as the batch system for IBM's SP clusters running AIX, but as IBM embraced Linux, LoadLeveler was one of the first technologies ported from AIX to Linux. Like PBS Pro, SGE, and LSF, LoadLeveler is a mature and full-featured product; however, it is really only an option for sites purchasing hardware from IBM.

Table 1: Resource Manager Feature Comparison

Package         | Open Source | Support Contracts | Server Redundancy | Requires Shared File System | Job Steps/Arrays | Interactive Jobs
----------------|-------------|-------------------|-------------------|-----------------------------|------------------|-----------------
OpenPBS         | Y           | N                 | N                 | N                           | N                | Y
TORQUE          | Y           | Y                 | Y (beta)          | N                           | Y (beta)         | Y
Altair PBS Pro  | N           | Y                 | Y                 | N                           | Y                | Y
Sun Grid Engine | Y           | Y                 | Y                 | Y                           | Y                | Y
SLURM           | Y           | Y                 | Y                 | N                           | Y                | Y
Platform LSF    | N           | Y                 | Y                 | N                           | Y                | Y
IBM LoadLeveler | N           | Y                 | Y                 | N                           | Y                | Y

Schedulers

The second piece of the batch processing puzzle is the scheduler, which determines in what order to run the jobs. Where the resource manager implements the mechanisms for running jobs, the scheduler implements the organization's scheduling policies. In essence, the scheduler acts as the brain to the resource manager's brawn. Virtually all resource managers include a built-in scheduler (which in some cases is rather simplistic and in others may be quite feature-rich), but most also include an API or other tools for sites to implement their own external schedulers.

One of the major functions of the queue server part of a resource manager is to assign each job submitted to it to a queue, a list of jobs with similar characteristics and policies. Queue structures can vary from minimalist to very elaborate, and how queues affect scheduling depends a great deal on the scheduler and the site's scheduling policies. Some schedulers treat queues as first-in/first-out (FIFO), others round-robin between queues, and still others use queues simply as an abstract classification mechanism.

FIFO is rarely a good scheduling policy, and schedulers generally have one or more policy mechanisms to prioritize and schedule jobs. The first and simplest policy mechanism is configurable priorities, where a job’s priority is a function of factors such as the size of the job’s resource requirements and how long it has been waiting to run. Another scheduling policy is a fair-share policy, where each user or group is allocated some maximum fraction of the system. Fair-share policies can be either soft, where exceeding the fair share causes a reduction in priority for the user or group’s other jobs, or hard, where the fair share is not allowed to be exceeded; the latter can also be thought of as a throttling policy. A third scheduling mechanism is the advance reservation, in which the idle job (or jobs) with highest priority has resources reserved for it at some time in the future. This enables an optimization called backfill, in which small or short-running jobs are placed into the holes between larger or higher-priority jobs without impacting the wait time of the larger jobs; this often has a substantial benefit in terms of overall system utilization. Finally, some schedulers allow arbitrary quality of service (QoS) levels to be assigned to individual users, groups, or even queues. QoS levels may include increased priority, a target for maximum queue wait time, or even the ability to preempt other running jobs.

Maui

Perhaps the most widely used scheduler that works with multiple types of resource managers is the Maui scheduler. As one might infer from the name, Maui was originally developed for the Maui High Performance Computing Center to schedule LoadLeveler jobs on an IBM SP2 system; it has since been extended to work with the PBSes, LSF, and Grid Engine. Maui is a very feature-rich scheduler, supporting fair-share, advance reservations, backfill, QoS levels assignable on a per-user/group level, and configurable prioritization based on a huge number of factors.

Maui’s flexible reservation model gives it much of its power. It supports dynamically created reservations for scheduled downtimes as well as standing reservations, which are reservations of resources on a periodic or continuous basis. Standing reservations allow for a large number of policy-based job placement options. For instance, standing reservations can be used to assign jobs to a set of nodes, or to reserve resources during business hours for short running debug jobs.
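
As a sketch of how this looks in practice, the maui.cfg fragment below defines a standing reservation that holds four nodes for a short-running debug class during weekday working hours. The node and class names are placeholders, and the exact attribute syntax varies somewhat between Maui versions, so treat this as illustrative rather than definitive:

    # hold node01-node04 for jobs in the "debug" class,
    # 8am to 5pm, Monday through Friday, every week
    SRCFG[debug] HOSTLIST=node01,node02,node03,node04
    SRCFG[debug] PERIOD=DAY DAYS=MON,TUE,WED,THU,FRI
    SRCFG[debug] STARTTIME=08:00:00 ENDTIME=17:00:00
    SRCFG[debug] CLASSLIST=debug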

Since it originated in the realm of IBM SPs, Maui natively supports a number of abstractions used on those systems. For example, Maui supports the definition of partitions, which are logical groupings of nodes. Maui will not allow jobs to span nodes from multiple partitions without permission from the administrator, so partitions are often used to segment nodes with different hardware or network characteristics from each other. Maui also supports generic consumable resources, which can be used to keep track of shared resources such as software licenses. Consumable resources can be either shared across the entire cluster (e.g. floating software licenses) or assigned to an individual node (e.g. node-locked licenses).

Moab

Cluster Resources Inc., the developers of Maui, have also created a follow-on commercial product called Moab. In many ways, Moab is to Maui as PBS Pro is to OpenPBS; Moab is a closed-source product, but it supports a superset of Maui's functionality and scales to much larger systems. For instance, Moab extends Maui's reservation model by allowing unprivileged users to create reservations within "sandboxes" defined by an administrator. Moab also has a number of features of interest to hosting and utility computing services, such as interfaces to host provisioning software and support for virtual private clusters within which users can see only the jobs and nodes accessible to them.

Others

With any resource manager that supports an external scheduler API, there is also the option to implement one’s own custom scheduler; for instance, the San Diego Supercomputer Center has a scheduler called Catalina, and Japan’s National Institute of Advanced Industrial Science and Technology has a scheduler called PluS. A site may choose to do this if none of the existing schedulers implement the desired scheduling policies. The PBS variants even include a special scripting language for implementing schedulers called Batch Scheduling Language (BaSL), as well as external scheduling API bindings for C and Tcl.

Table 2: Scheduler Feature Comparison

Package                  | Supported Resource Managers       | Open Source | Redundant Servers | Fair-share | Advance Reservations & Backfill | Assignable QoS
-------------------------|-----------------------------------|-------------|-------------------|------------|---------------------------------|---------------
OpenPBS/TORQUE built-in  | OpenPBS, TORQUE                   | Y           | N                 | Y          | N                               | N
PBS Pro built-in         | PBS Pro                           | N           | Y                 | Y          | Y                               | N
Sun Grid Engine built-in | SGE                               | Y           | Y                 | Y          | Y                               | N
LoadLeveler default      | LoadLeveler                       | N           | Y                 | Y          | N                               | N
LoadLeveler backfill     | LoadLeveler                       | N           | Y                 | Y          | Y                               | N
LSF built-in             | LSF                               | N           | Y                 | Y          | Y                               | N
Maui                     | LoadLeveler, LSF, PBS, SGE        | Y           | N                 | Y          | Y                               | Y
Moab                     | LoadLeveler, LSF, PBS, SGE, SLURM | N           | Y                 | Y          | Y                               | Y
Catalina                 | LoadLeveler, PBS                  | Y           | N                 | N          | Y                               | N
PluS                     | PBS, SGE                          | Y           | N                 | N          | Y                               | N

Conclusions

Which is the right resource manager/scheduler combination for a given system? That depends on the site's budget, the requirements of the overall system, the skill set of the administrators, and most of all the history and needs of its user community. For instance, a site with a modest budget is probably best served by an open source solution, and a site that has historically used a particular set of resource management software should probably continue to do so unless there is a compelling reason to switch. On the other hand, administrators of a "green field" system with no pre-existing user community may be best served by selecting two or three candidate packages from the field and evaluating those side by side.

Comments on "Share and Share Alike: The State of Cluster Resource Management and Scheduling"

gtelzur

I think no review of this topic would be complete without mentioning Condor (http://www.cs.wisc.edu/condor)!

bsteph97

Where would BMC’s Control-M fit into this discussion, if at all?

deadline

You have a point about Condor. It can be, and is, used as an effective cluster scheduler. However, the majority of cluster users tend to prefer the applications we mention in the article. Next time we will include Condor.

As for BMC Control-M, this is the first time I have seen it mentioned in an HPC cluster context. We will definitely check it out.
