Scheduling HPC In The Cloud

New additions to Sun Grid Engine allow integration with Clouds and Hadoop (map-reduce) jobs

Let’s face it, resource schedulers are not a sexy topic. They are necessary, but often a complicated affair that tends to frustrate users and keep system administrators busy. The most common complaint is, of course, “Why won’t my job run?” The answer often lies in some scheduling policy, or a full queue, or, in the extremely rare case, in a user’s program causing the problem, cough, cough.

If you don’t know what a resource scheduler is, then the next several paragraphs are for you. Advanced students can skip ahead or look for flaws in the following description. The name says it all: you have multiple resources and you have work that needs to be scheduled on those resources in an optimal way. Some common resource schedulers are Sun Grid Engine, Torque/Maui, Moab, PBS, Platform LSF, and Platform Lava. A cluster is a good example of the need for a scheduler. In a cluster you may have 128 nodes, each with eight cores. Most of the user programs need 1-16 cores at a time, but some need 256. The question is: given a list of jobs, what is the best way to schedule them on your cluster so that it is utilized in an optimal fashion?
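The basic resource-packing problem can be sketched in a few lines of code. The toy first-fit policy below is only an illustration of the idea; real schedulers like SGE use far more sophisticated, policy-driven algorithms, and the job and node numbers are just the hypothetical cluster described above.

```python
def first_fit(jobs, nodes, cores_per_node):
    """Assign each job to the first node with enough free cores.

    jobs: list of requested core counts, e.g. [8, 4, 16]
    nodes: number of nodes in the cluster
    cores_per_node: cores available on each node
    Returns a dict mapping job index -> node index (None = must wait).
    """
    free = [cores_per_node] * nodes   # free cores on each node
    placement = {}
    for i, cores in enumerate(jobs):
        placement[i] = None
        for n in range(nodes):
            if free[n] >= cores:      # first node that fits wins
                free[n] -= cores
                placement[i] = n
                break
    return placement

# The 128-node, 8-core cluster from above:
jobs = [8, 4, 4, 16]
print(first_fit(jobs, nodes=128, cores_per_node=8))
# -> {0: 0, 1: 1, 2: 1, 3: None}
```

Note that the 16-core request can never fit on a single 8-core node, which is exactly why real schedulers must also handle parallel jobs that span multiple nodes.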

Users often interact with the resource scheduler by submitting a “job” (usually a script) to the scheduling queue using a command like qsub (queue submit). After the job is submitted, the user can then monitor it using something like qstat (queue status), which prints out a pile of confusing information, none of which answers the question “Why won’t my job run?” (Of course there are options to provide this information, but it always seems easier to fire off an email to the system administrator.)

To make the scheduling problem a little more difficult, in some cases we don’t know how long the applications will take to run, and there may be many other resources that applications need (e.g., memory size, storage, processor type). Thus, the job of the resource scheduler is not easy, but it is very important for improving cluster utilization. Indeed, the advent of multi-core has made kernel-level scheduling more important (and difficult) than in the past. At the kernel level, tasks must be scheduled and moved from core to core based on load and cache locality. Interestingly, the ability of a high-level resource scheduler to reach down into the CPUs and control core placement and affinity is needed for best performance.
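To see how multiple resource requirements complicate matters, here is a small sketch of resource matching: a job asks for cores, memory, and a processor type, and the scheduler must find a node that satisfies all of them at once. The node names and attribute fields are made up for illustration, not SGE’s actual internal representation.

```python
# Hypothetical node inventory: free cores, free memory, CPU type.
nodes = [
    {"name": "n001", "free_cores": 8, "free_mem_gb": 16, "cpu": "x86_64"},
    {"name": "n002", "free_cores": 2, "free_mem_gb": 64, "cpu": "x86_64"},
    {"name": "n003", "free_cores": 8, "free_mem_gb": 64, "cpu": "ppc64"},
]

def find_node(job, nodes):
    """Return the name of the first node satisfying every request,
    or None if the job must wait in the queue."""
    for node in nodes:
        if (node["free_cores"] >= job["cores"]
                and node["free_mem_gb"] >= job["mem_gb"]
                and node["cpu"] == job["cpu"]):
            return node["name"]
    return None

# n001 has the cores but not the memory; n002 has the memory but not
# the cores; n003 is the wrong processor type -- so this job waits.
job = {"cores": 4, "mem_gb": 32, "cpu": "x86_64"}
print(find_node(job, nodes))  # -> None
```

Every extra resource dimension shrinks the set of usable nodes, which is one reason “Why won’t my job run?” rarely has a one-line answer.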

I’ll stop there with the boring resource scheduler lecture and tell you why resource schedulers are going to become the new “cool tool” in HPC and beyond. It is not because of a new GUI or some other arcane feature. The real reason is Cloud computing. Yes, remember Cloud computing? You probably have not heard that word for 5-10 minutes. Let me first state that there are, in my opinion, some issues with cloud computing, but “the cloud” does not seem like it is going anywhere soon. Indeed, resource schedulers may just put the Cloud on the map, as it were.

Recently, I was forwarded the news item Dynamic Resource Reallocation and Apache Hadoop with Sun Grid Engine from David Perel of NJIT. For those who like a deeper dive, check out Dan Templeton’s excellent blog: Welcome Sun Grid Engine 6.2 update 5. There are two sexy items in this news release: the first is Cloud computing and the second is Hadoop, which is kind of a Cloud thing for most people.

Specifically, the new version of SGE (Sun Grid Engine) allows integration with Clouds such as Amazon EC2. Jobs can be submitted, and SGE will manage the connection. To use EC2, users need to compose AMI (Amazon Machine Image) images for their applications and register them with SGE. In addition, they also need to provide SGE with their EC2 account information. Once that is done, users can queue up work and “cloud burst” out to EC2.
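The “cloud burst” idea boils down to a simple dispatch policy: run work locally while there is room, and spill to EC2 when there is not. The sketch below is a hypothetical illustration of that policy only; it is not SGE’s actual API, and a real setup would involve booting the registered AMIs and transferring data.

```python
def dispatch(job_cores, free_local_slots):
    """Decide where a job runs under a naive burst policy.

    job_cores: cores the job requests
    free_local_slots: cores currently free on the local cluster
    """
    if job_cores <= free_local_slots:
        return "local"   # the local cluster can absorb the job
    return "ec2"         # burst out: launch cloud instances instead

print(dispatch(4, free_local_slots=16))   # -> local
print(dispatch(32, free_local_slots=16))  # -> ec2
```

Real policies would also weigh cost, data movement, and queue wait time before paying for cloud cycles, but the decision structure is the same.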

Another new feature is the integration with Hadoop. If you don’t know what Hadoop is, just google “map reduce.” (Nothing like demonstrating a technology while learning about it.) While setting up a Hadoop cluster is not a small task, it is a powerful data-processing paradigm that does not depend on a database. Typically, a map-reduce query is launched across multiple servers, each with a different data set on its local hard drive. SGE has been enhanced to allow direct submission of Hadoop jobs. It also understands HDFS (the Hadoop Distributed File System) and can place work based on data locality. There are some other new features in this SGE version, including topology-aware scheduling, slot-wise subordination, and user-controlled array task throttling, but these are probably boring for many people.
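The data-locality idea is worth a small sketch: given which nodes hold replicas of a job’s HDFS blocks, a scheduler can prefer the node hosting the most of them, so the map tasks read from local disk rather than the network. The block map below is a made-up example, not output from a real HDFS namenode, and this is the concept only, not SGE’s implementation.

```python
from collections import Counter

def best_node(block_locations):
    """Pick the node holding the most of a job's HDFS blocks locally.

    block_locations: one list of replica nodes per block,
    e.g. [["n001", "n002"], ["n001", "n003"], ...]
    """
    counts = Counter(node
                     for replicas in block_locations
                     for node in replicas)
    return counts.most_common(1)[0][0]

# Three blocks, each replicated on two nodes; n001 holds all three:
blocks = [["n001", "n002"], ["n001", "n003"], ["n002", "n001"]]
print(best_node(blocks))  # -> n001
```

Scheduling the work on n001 means every block is a local read; any other choice forces at least one block across the network.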

HPC in the Cloud is a mixed bag at this point. Unless you use a specially designed HPC Cloud like Penguin Computing’s POD service, the I/O resources critical to HPC performance can be quite variable. This may be changing, however, as individual servers gain more cores. In a recent column I mentioned that IDC has reported that 57% of all HPC applications/users surveyed use 32 processors (cores) or fewer. These numbers are confirmed by another poll, in which 55% of those surveyed used 32 or fewer cores for their applications. When clouds start forming around 48-core servers using the imminent Magny-Cours processor from AMD, many applications may fit on one server, thus eliminating the variability of server-to-server communication.

HPC may start to take a very different form as dense multi-core servers enter the Cloud. A user may sit at her desk submitting jobs to her own desktop SGE instance. The resource scheduler will then reach out to local resources or Cloud resources that can run virtualized or bare-metal versions of her applications. The resource scheduler may become as valuable to the HPC desktop as the browser or word processor. Sounds like Grid, only easier.
