Cluster In The Clouds

The new Amazon EC2 Cluster Compute Instance may be a game changer in HPC

Because Amazon had to make the Cluster EC2 announcement right in the middle of my latest summer language topic, I decided to take a brief intermission and comment on this latest Cloud development. I’ll get back to the R language in my next column, but I wanted to discuss what I believe to be a significant development.

In the past, when someone would ask me about “Cloud HPC,” I would note that there has been a significant effort to get HPC applications as close to the hardware as possible, while Cloud tends to move (or abstract) the hardware further away. As a result, performance tends to be less than what you might expect and, more importantly, there is no guarantee of I/O performance (either storage or compute). Thus, HPC in the Cloud would probably work for some applications, but as a general solution, there is more work to be done.

Recently, Amazon Web Services took a step in this direction by introducing Cluster Compute Instances for their EC2 (Elastic Compute Cloud) service. The instance is designed for HPC applications and has the following features:

  • 23 GB of memory
  • 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core Nehalem)
  • 1690 GB of instance storage (in two volumes)
  • 64-bit platform
  • I/O Performance: 10 Gigabit Ethernet

There is a default usage limit for this instance type of 8 instances (providing 64 cores). You can request more instances from Amazon. As for pricing, the reported price is $1.60 per instance per hour, or $12.80 per hour for a cluster of 8 instances (the default limit). Not a bad price if you want to run a few jobs (even overflow work) without investing in new hardware. In addition, what was once a capital expense with significant lead time is now an instantly available operating expense.
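The pricing arithmetic above can be sketched in a few lines of Python. The price, core count, and default limit are the figures quoted in this column; the function names are made up for illustration, and the sketch ignores billing granularity, storage, and data transfer charges.

```python
# Figures quoted in the column; the helper names are hypothetical.
PRICE_PER_INSTANCE_HOUR = 1.60  # USD, reported price per instance per hour
CORES_PER_INSTANCE = 8          # 2 x quad-core Xeon X5570
DEFAULT_INSTANCE_LIMIT = 8      # default cap before asking Amazon for more


def cluster_cores(instances: int) -> int:
    """Total cores provided by a given number of instances."""
    return instances * CORES_PER_INSTANCE


def cluster_cost(instances: int, hours: float) -> float:
    """Compute-only cost in USD for running `instances` for `hours`."""
    if instances < 0 or hours < 0:
        raise ValueError("instances and hours must be non-negative")
    return round(instances * PRICE_PER_INSTANCE_HOUR * hours, 2)


if __name__ == "__main__":
    n = DEFAULT_INSTANCE_LIMIT
    print(f"{n} instances = {cluster_cores(n)} cores")
    print(f"1 hour costs ${cluster_cost(n, 1):.2f}")
    print(f"A 24-hour run costs ${cluster_cost(n, 24):.2f}")
```

Changing the instance count or the run length makes it easy to compare a short burst of overflow work against the cost of buying hardware.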

So far, so good, but is it really effective? The answer, of course, is “It all depends on your application.” In terms of performance, High Performance LINPACK (HPL) results are on par with similar clusters built with 10 GigE. According to this blog, Amazon has run HPL on 880 Cluster Compute instances (7040 cores) and measured the overall performance at 41.82 TeraFLOPS (Intel Compiler, MPI, and MKL). This result places the EC2 cluster at position 146 on the current Top500 list. Looking beyond HPL, I find this quote from Keith Jackson, a computer scientist at the Lawrence Berkeley National Lab, most interesting:

“Many of our scientific research areas require high-throughput, low-latency, interconnected systems where applications can quickly communicate with each other, so we were happy to collaborate with Amazon Web Services to test drive our HPC applications on Cluster Compute Instances for Amazon EC2. In our series of comprehensive benchmark tests, we found our HPC applications ran 8.5 times faster on Cluster Compute Instances for Amazon EC2 than the previous EC2 instance types”.

This improvement is significant because slowdowns of 10 times were the norm for the non-HPC EC2 instances. Of course, it is not as good as the cluster in your data center, but the results are getting close enough for serious consideration.
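To put the HPL number quoted earlier in perspective, here is a rough back-of-the-envelope sketch in Python. The instance count, core count, and 41.82 TeraFLOPS come from the reported run. The 2.93 GHz clock and 4 double-precision floating point operations per cycle per core are assumptions about the Xeon X5570 that are not part of the reported results, so treat the efficiency figure as approximate.

```python
# Reported HPL run (from the text above).
INSTANCES = 880
CORES = 7040
RMAX_TFLOPS = 41.82

# Assumptions about the Xeon X5570, not stated in the reported results.
CLOCK_GHZ = 2.93
FLOPS_PER_CYCLE = 4  # double precision, per core


def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in TeraFLOPS: cores x GHz x flops/cycle gives GFLOPS."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


def hpl_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Fraction of theoretical peak that the HPL run achieved."""
    return rmax_tflops / rpeak_tflops


if __name__ == "__main__":
    rpeak = peak_tflops(CORES, CLOCK_GHZ, FLOPS_PER_CYCLE)
    eff = hpl_efficiency(RMAX_TFLOPS, rpeak)
    per_core = RMAX_TFLOPS * 1000.0 / CORES
    print(f"Cores per instance:   {CORES // INSTANCES}")
    print(f"Assumed peak:         {rpeak:.2f} TFLOPS")
    print(f"HPL efficiency:       {eff:.1%}")
    print(f"Measured per core:    {per_core:.2f} GFLOPS")
```

Under those assumptions, the run lands at roughly half of theoretical peak.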

The absence of InfiniBand is a non-starter for some users, in which case you may want to check out Penguin On Demand from Penguin Computing. The use of 10 GigE is a big plus, however. Of course, embarrassingly parallel applications will work well on this type of hardware. It should also be noted that the HPC instances use hardware-assisted (HVM) virtualization instead of the paravirtualization (e.g. Xen) used by the other EC2 instance types, and they require booting from Elastic Block Store (EBS). This means users must create their own Amazon Machine Image (AMI). There is a CentOS-based AMI available that can be used as a basis for creating your own AMIs.

Speaking of storage, slow storage is a performance killer for many HPC applications. If you have any interest in using EC2 and need fast storage, read and study this blog from the crew at BioTeam. The initial results are interesting. In the EC2 cloud, there is persistent storage (EBS) that is used to hold data between compute instances in the Cloud. Included with each compute instance are two 840 GB ephemeral disk volumes (i.e. they go away when the instance is finished). After testing, they suggest that using the ephemeral disks as temporary scratch storage, possibly with PVFS2 or Gluster, and then moving data to S3 buckets or EBS volumes for long-term storage may be a workable and fast solution. Of course, more testing is needed with actual applications, but ephemeral I/O performance seems to be high and consistent enough for HPC.

One other aspect of EC2 that intrigues me is the integration with Sun Grid Engine, which some are now calling “Oracle Grid Engine” (it just does not sound right to me). This practice is often referred to as Cloud bursting. I don’t know if I like that name either. In any case, imagine sitting at your desk and submitting jobs to either a local cluster or an EC2 HPC instance. Maybe your cluster is fully allocated, maybe it is down, or too slow, or perhaps you don’t even have a cluster. If you have an EC2 account, you can simply “qsub your job.”
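As a minimal sketch of that idea, the Python below picks a queue and builds a Grid Engine qsub command line. The queue names (local.q, ec2.q), the parallel environment name (mpi), and the "free slots" test are all hypothetical placeholders; a real site would define its own queues and let the scheduler make the decision. Only standard qsub options (-N, -cwd, -q, -pe) are used, and the command is printed rather than run.

```python
import shlex
from typing import List

# Hypothetical queue names; a real Grid Engine site defines its own.
LOCAL_QUEUE = "local.q"
CLOUD_QUEUE = "ec2.q"


def choose_queue(slots_needed: int, local_free_slots: int) -> str:
    """Use the local cluster when it has room, otherwise burst to EC2."""
    if slots_needed <= local_free_slots:
        return LOCAL_QUEUE
    return CLOUD_QUEUE


def build_qsub_command(script: str, job_name: str, slots: int,
                       queue: str, parallel_env: str = "mpi") -> List[str]:
    """Build a qsub argument list; `parallel_env` is a site-defined name."""
    return [
        "qsub",
        "-N", job_name,                   # job name
        "-cwd",                           # run from the current directory
        "-q", queue,                      # target queue
        "-pe", parallel_env, str(slots),  # parallel environment and slots
        script,
    ]


if __name__ == "__main__":
    # Pretend the local cluster has only 16 free slots and the job needs 64.
    queue = choose_queue(slots_needed=64, local_free_slots=16)
    cmd = build_qsub_command("run_hpl.sh", "hpl_test", 64, queue)
    # Print instead of executing, since qsub may not be installed here.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

In practice the choice would live inside the scheduler configuration, so the user still just types qsub and never sees which cluster ran the job.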

The promise of simple on-demand HPC is an exciting option. One of the traditional hold-backs for HPC growth has been the need for competent administrators and infrastructure (i.e. space, power, cooling). For instance, if I am a small engineering company, it may not have made sense in the past to invest in a small cluster that may be used sporadically on a per-contract basis. That resource may now, depending on the application, be available over the wire, as it were, at a predictable cost.

Of course, we are not really “there yet” in terms of Cloud HPC. I think Amazon EC2 is a step closer and I expect others to follow. It does have the one feature that I believe fueled cluster HPC from the very beginning: it is cheap to try.
