You have probably noticed that I haven’t spoken much about high performance computing (HPC) in the Cloud. Most SaaS offerings fulfill consumer needs (Flickr, YouTube, etc.) and some business needs (Salesforce, etc.). HPC hasn’t been left out of the mix. Actually, it is fairly well represented in terms of providers. As indicated previously, Tsunamic Technologies, CRL, and numerous others have set up large clusters on which people can buy cycles.
In the case of Amazon EC2, you have to provision the machines yourself as part of the process of using them. A smart provider could build a business atop EC2, creating virtual clusters on the fly for consumers and abstracting or simplifying the tools for end users.
In the case of Tsunamic Technologies and some of the other providers, such as CRL and the IBM on-demand groups, you can get a Linux system pre-configured and ready to use. You just need a fast network pipe to their systems. In these cases, you are effectively running “on demand” on a shared cluster rather than on a dedicated resource. Isolation is usually via normal *nix user methods, which are good enough for most HPC users. If you require dedicated resources due to data sensitivity, you need to consider VMs or real dedicated hardware for isolation, but you will pay a performance cost for the VM, and a premium for dedicated isolated hardware.
The issues most users run into when using these resources center, not surprisingly, on licensed applications. These applications often require FLEXlm daemons or similar for license compliance monitoring. ISVs that write these applications would be very well served by the on-demand model. They could get a more accurate count of usage and lower the entry barriers to use. That moves users along the demand curve, so that the cost to use the software drops while the number of (possible) users expands dramatically.
Viewed another way, this model of HPC actually allows the ISV to increase their total addressable market. The numerical size of the market increases as the barriers for consumers to use the product drop. The value of the market size could also increase with careful pricing strategies.
That said, one of the major differences between the HPC-in-a-Cloud model and the Cloud model as a whole is that HPC users often focus upon getting the fastest system they can afford. That is, they are trying to maximize affordable performance. This goal has implications for hardware refresh rates. The SaaS user generally doesn’t care if Salesforce runs on a Celeron or an Opteron, but the HPC user will need to know about the underlying hardware and network interconnect, as well as the storage system.
The S3 model may work fine for some of these cases, but users will generally need high performance storage tightly coupled to their on-demand resource. Unfortunately, few on-demand service providers have the combination of tightly coupled high performance computing and storage resources. You probably don’t want to run large applications on Slicehost clusters; you will not be as happy with the performance as you would be if they were running natively.
To perform HPC in a Cloud, you need to start out with a good understanding of what resources your code needs. Not all codes can run in an embarrassingly parallel mode similar to Folding@Home. A fairly large subset of codes require very tightly coupled resources to properly function. You can have clouds of these resources, but they are decidedly non-commodity resources.
Once you understand which hardware your code requires, you need to decide whether virtualization is an acceptable way to present that hardware to the code. Virtualization provides many interesting benefits. Everything comes with a price, however, and in the case of virtualization, this is paid in performance and latency. Virtual machines can operate under a hypervisor or a para-virtualized model. The latter is closer to the silicon, but you lose some of the benefits of the hypervisor approach. In the para-virtualized world, you are running less on a virtual machine and more within an operating system container of sorts. This will get you closer to base machine performance. But all I/O, including the latency and bandwidth sensitive kind, will still have to traverse additional layers, which will reduce performance. A simple graphical rule of thumb would look like this:
Figure One: Latency and Bandwidth Requirements In Relation To Virtualization
Most of the Cloud Computing platform providers will have hardware that is appropriate for applications towards the lower left of Figure One, that is, applications that are not very latency or bandwidth sensitive. These are loosely coupled codes that spend most of their time calculating, and very little time in I/O or data motion to other nodes. Moreover, they are largely insensitive to when new data arrives. These are typically not MPI programs. More often they are isolated programs performing in-core computing, possibly running on more than one processor within a single machine. This situation does in fact describe a fair number of HPC applications, which are considered loosely coupled or even uncoupled. These tend to be (nearly) embarrassingly parallel.
The other end of this spectrum is represented by the upper edge, or the right edge. These are actually different cases, and usually represent different coding styles. Codes at the upper edge of this picture are very latency sensitive. How quickly you can move data between processes is a major factor in determining whether or not they can run quickly. In this case, adding latency is not a good idea. These are the tightly coupled codes.
Similarly, the right edge is where bandwidth sensitive codes live, including codes doing out-of-core operations and other I/O intensive operations. These high bandwidth codes require huge I/O capacity, which VM systems generally impede. You, as the user of these resources, need to decide if that slow-down is worth the benefits the Cloud can provide.
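To make the rule of thumb concrete, here is a minimal Python sketch that places a code on the chart described above. The 0.0 to 1.0 sensitivity scale, the 0.5 cut-off, and the names `Workload` and `classify` are all assumptions made for illustration; they are not measurements, and you would need to profile your own code to estimate where it sits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Position of a code on the chart (hypothetical 0.0-1.0 scales)."""
    name: str
    latency_sensitivity: float    # vertical axis of the chart
    bandwidth_sensitivity: float  # horizontal axis of the chart


# Assumed cut-off between the "lower left" region and the edges.
THRESHOLD = 0.5


def classify(w: Workload, threshold: float = THRESHOLD) -> str:
    """Return a rough, rule-of-thumb placement for the workload."""
    for value in (w.latency_sensitivity, w.bandwidth_sensitivity):
        if not 0.0 <= value <= 1.0:
            raise ValueError("sensitivities must be between 0.0 and 1.0")

    latency_bound = w.latency_sensitivity >= threshold
    bandwidth_bound = w.bandwidth_sensitivity >= threshold

    if latency_bound and bandwidth_bound:
        return "upper right: both latency and bandwidth sensitive"
    if latency_bound:
        return "upper edge: tightly coupled, avoid added latency"
    if bandwidth_bound:
        return "right edge: I/O intensive, expect a VM slow-down"
    return "lower left: loosely coupled, a reasonable fit for a VM"


if __name__ == "__main__":
    # Illustrative workloads only; the numbers are made up.
    examples = [
        Workload("parameter sweep", 0.1, 0.2),
        Workload("tightly coupled MPI solver", 0.9, 0.3),
        Workload("out-of-core sort", 0.2, 0.9),
    ]
    for w in examples:
        print(f"{w.name}: {classify(w)}")
```

The single threshold is the crudest possible choice. In practice the boundary is fuzzy, and a code that sits near it deserves a real benchmark on the target platform rather than a lookup.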
We have discussed some of the issues surrounding the infrastructure for HPC in a Cloud. Amazon EC2 lets you create loosely coupled systems with relative ease, though you will not get significant I/O bandwidth or low latency interconnects. So the HPC systems you can build with it will typically sit in the lower half and left side of the chart. These will generally operate reasonably well as VMs.
The opportunity for HPC in the Cloud brings us full circle to the fundamental reason for Cloud Computing: low cost of entry. Computers in the clouds may not be the fastest systems you can buy, but they offer a very low cost to start using them. For HPC, this can be significant. Remember, ISVs want your hardware cost to be as low as possible, so that you have more money in your declining budgets for their product offerings. If you can avoid paying the initial up-front costs to get the systems, and simply start using them, you can (potentially) be more productive.
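The cost-of-entry argument can be turned into simple arithmetic. The sketch below, in Python, assumes a deliberately simplified model: owning a cluster costs an up-front amount plus a running cost per hour, while renting costs only a (higher) rate per hour. The function name `break_even_hours` and every number in the example are hypothetical, not real prices from any provider.

```python
from typing import Optional


def break_even_hours(upfront_cost: float,
                     owned_cost_per_hour: float,
                     rented_cost_per_hour: float) -> Optional[float]:
    """Hours of use after which owning becomes cheaper than renting.

    Simplified model (an assumption, not a pricing study):
        owning  = upfront_cost + owned_cost_per_hour * hours
        renting = rented_cost_per_hour * hours
    Returns None when renting never costs more per hour than owning,
    in which case owning never catches up.
    """
    if min(upfront_cost, owned_cost_per_hour, rented_cost_per_hour) < 0:
        raise ValueError("costs must be non-negative")
    saving_per_hour = rented_cost_per_hour - owned_cost_per_hour
    if saving_per_hour <= 0:
        return None
    return upfront_cost / saving_per_hour


if __name__ == "__main__":
    # Made-up figures: 100,000 up front, 2 per hour to run, 12 per hour to rent.
    hours = break_even_hours(100_000, 2.0, 12.0)
    print(f"Owning pays off after {hours:.0f} hours of use")
```

Below the break-even point the on-demand system is the cheaper option, which is the low cost of entry in a nutshell. The model ignores hardware refresh, staffing, and the performance differences discussed earlier, all of which shift the answer.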
This opportunity is where the Cloud providers are placing their bets. You don’t have to install anything more than a web browser to use Gmail, Google Earth, and many other “Cloud” services. What if your HPC could be delivered that way? HPC still does require more specification and interaction with the underlying hardware or VM, though Cloud based HPC tools are being developed. If you can solve the deployment and provisioning, and make that work on demand, then you are off and running with your own flexible HPC Cloud system.
It is not a foregone conclusion that Cloud Computing will be successful in HPC. It has been tried before with different names. And it has failed, badly. What is different now? Economics, computing power, and bandwidth. Economics make it easy and inexpensive to scale up what are effectively disposable computing elements. Computing power has followed Moore’s law for the most part. And there is significantly more bandwidth available to end users, though many (correctly) argue that it isn’t growing fast enough for them.
As with all other computing platforms, Cloud Computing needs a few things to grow and become a stable HPC option. I have talked about most of them above. The one that has not developed just yet is Cloud-based applications. Once applications can be hosted and run effectively in the Cloud, it is likely that we will see a large in-rush of HPC players.
Recently, I have had multiple conversations with ISVs on these types of issues. Many HPC ISVs don’t understand what the Cloud can do for them. Currently my company is working with an ISV that does understand the value proposition. Our efforts are focused on helping them get their code up and available to customers over the web. This client certainly understands that such a capability can potentially increase the size of their user base, reduce barriers and costs to usage, and enable the ISV to capture more revenue. Lots of people are watching the clouds to see if it does.