Does Cloud Computing have a play in HPC? Join us for a pragmatic view of "the cloud" and how it may expand the horizons of HPC.
This means that to minimize the cost of standing up your infrastructure, you should select your OS and software appropriately. Yes, you will still have TCO costs. While Cloud Computing focuses on the producer side in terms of the time and effort to stand up applications, it is curiously also a TCO reduction play: drive your TCO so low that the marginal cost of adding capacity becomes the majority of the cost over the lifetime of the system, and make management as simple as possible.
This might be why the CEO of Microsoft recently started talking about Cloud Computing versions of their Windows product line. Azure will be an interesting addition to the Cloud Computing landscape. The point, though, is that adding or reducing capacity on demand is one of the hallmarks of Cloud Computing.
Of course, everything comes at a price, and in this case, the marginal cost benefits you gain come at the cost of control over the hardware. You no longer have the box sitting next to you, or even in “your” data center. This is, in part, why Richard Stallman noted in an interview:
“that Cloud Computing was simply a trap aimed at forcing more people to buy into locked, proprietary systems that would cost them more and more over time.”
I am not sure I agree with the word “trap”, but he is right that this converts a one-time asset purchase into a subscription model. His concerns appear to center on loss of control over the hardware and, more importantly, over the data and programs on that hardware.
Some of his concerns have merit, and they should not be dismissed lightly, especially if you are looking at Cloud Computing for disaster recovery. Think of it this way: if you lose control over your VM (Virtual Machine), say because your Cloud Computing provider goes out of business, or decides it no longer wishes to be in that business, do you have the freedom to pick up your VM and move it? If you don’t own the machine upon which the VM resides, this requires asking their permission. Moreover, what if the VM is a proprietary system, say the Microsoft or VMware offering, and the powers that be decide enough is enough and will no longer support or allow their VM systems to run?
This scenario is somewhat rhetorical, as the VMware server is freely available as long as you have serial numbers, so at least in theory you can boot the machine and move your data off. The Microsoft offering may also be freely available; the download is free. But as we have seen with other systems with embedded DRM, such as music sold by Walmart, it is possible that once those in charge decide you may no longer use something, they can throw a switch, and voilà: you can no longer use it.
This loss of freedom to do whatever you want with your data, rather than only what someone else allows, is the likely source of Mr. Stallman's concern.
There is a cost and risk to everything, though, and you have to decide on the risk model most appropriate for your needs. That said, keeping your important data in multiple baskets, along with data and VM mobility, appear to be some of the most useful mitigations. Likely we will see VM or storage escrow emerge, so that even in the event of a disaster you can recover your data, even if the owners of the VM machines and storage are no longer around to have a say one way or the other.
That isn’t the only action you can take to mitigate risk, but it makes sense to have a guarantee of VM access and escrow in the event of a failure or issue on the proprietary provider’s side. If you use a FOSS VM such as KVM or Xen on Linux, you have the freedom to move your VM wherever you wish. The only real risk you need to mitigate is access in the event of a failure.
Obviously VMs aren’t the only method one can use to create infrastructure on the provider side. You can leverage pre-built systems, existing clusters, and “grids” with pre-defined configurations. You don’t need your Cloud service to provide virtual bare metal for you, especially when a pre-defined configuration may work well for your needs. Moreover, VMs may not scale well for your applications, so you might require somewhat more specialized hardware. Here the Cloud provider can be difficult to distinguish from a hosting provider.
This brings up the emergence of providers of Cloud infrastructure. Amazon EC2 is one of the best examples, though not the only one; they provide a variety of hardware types for running your VMs. We have seen and heard of people using Slicehost for standing up virtual servers, as well as more HPC-focused groups such as Tsunamic Technologies and Tata’s CRL, and more general computing groups such as IBM’s On-Demand initiative, Sun’s Solaris-centric Network.com, and others. Dell has long been rumored to be getting into this game. Microsoft may get into this as well, as they are announcing initiatives to kick-start a Windows Cloud. I am not sure what that will be; hopefully we will hear more about it soon.
With Amazon, you create VM images that you can then launch into (or drop out of) the Cloud. These images are stored on the S3 storage “Cloud”. Basically, you can minimize data motion by keeping data close to the VM.
There is risk. Software is not without bugs; services are not without failure modes. Risk mitigation is about minimizing the impact of those failure modes. A RAID1 mirror is a bet with the disk vendor that a disk will indeed fail, and that you will have sufficient time to replace it before the second disk fails. But what if the RAID card fails? Or, in the case of S3, what if it goes offline and doesn’t let you get to your data for a while? This is sadly not a theoretical prospect. It has happened. In February 2008, S3 went down, and it took quite a few web startups with it until it could be restored. They were, proverbially, up a very large river without their data. Amazon worked diligently to rectify the problem and was back online within 14 hours.
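That RAID1 bet can be put in rough numbers. Here is a back-of-the-envelope sketch; the annualized failure rate and rebuild window are illustrative assumptions, not vendor figures, and it assumes independent failures at a constant rate:

```python
# Back-of-the-envelope odds of losing a RAID1 mirror: the probability that
# the surviving disk also fails before the rebuild completes.
# AFR and rebuild window below are assumed values for illustration.

HOURS_PER_YEAR = 24 * 365  # 8760

def mirror_loss_probability(afr, rebuild_hours):
    """Rough chance the second disk fails within the rebuild window,
    given the first disk has already failed."""
    return afr * (rebuild_hours / HOURS_PER_YEAR)

# A 3% annualized failure rate and a 12-hour rebuild window:
p = mirror_loss_probability(afr=0.03, rebuild_hours=12)
print(f"{p:.6f}")  # small, but distinctly non-zero
```

Small odds, but multiply by enough arrays and enough years and the "what if" in the paragraph above stops being hypothetical.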
Amazon wasn’t the only one hit by outages. A month before Amazon was hit, Joyent was down for over a week due to a bug in their underlying infrastructure OS and file system layer. Detailed explanations were given by the CEO and staff. The much-celebrated ZFS file system appears to have been the source of the bug, triggered by an OS update.
Omnidrive is a case in point where business risk and failure resulted from a lack of revenue, taking down a service that customers relied upon for permanent storage. When Omnidrive failed, it took customer data down with it, permanently. The point that shouldn’t be missed is that failures happen, and just because your machines or applications are in the “Cloud” doesn’t mean they cannot fail.
Another sobering lesson from another market in the last several weeks is that volatility can take down even very large companies, so size isn’t necessarily a predictor of success or failure.
Another model used in Cloud Computing, apart from PaaS (Processing as a Service), is Software as a Service (SaaS). Yes, this is as buzzword-enabled an article as you will get from me, and I am not done yet. SaaS is a rebirth of the ASP model from several years ago. In the ASP model (which largely failed, with one or two notable exceptions), your application and data were hosted, and you moved data to the application. ASP failed for a number of reasons, not least that the economics of the systems were simply not amenable to the model. Currently, SaaS runs on small machines with multiple VMs. Had you said this in 2001, at the end of the previous dot-bomb era, you would have been laughed at as you were escorted out of the facility. In those days, SaaS required big machines running very expensive disks, big expensive networks, and all sorts of management tools. Today’s SaaS has a much smaller price footprint per platform, with better performance than in 2001. Moore’s law gives you an order of magnitude performance increase every ~5 years or so, so you need much less hardware, at lower cost, to host your application.
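That order-of-magnitude figure is easy to sanity-check: the classic "doubling every 18 months" rule compounds to roughly a factor of ten over 60 months.

```python
# Compound a doubling every 18 months over an arbitrary span.
def growth_factor(months, doubling_period_months=18.0):
    return 2 ** (months / doubling_period_months)

five_year_gain = growth_factor(60)
print(round(five_year_gain, 2))  # ~10x, i.e. an order of magnitude
```

So the 2001-era big-iron SaaS platform fits, performance-wise, in a fraction of a commodity box today.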
There is still, unfortunately, a problem due to some web browsers’ lack of adherence to standards. As Firefox continues to gain momentum, joined by Opera and others in a cross-platform manner (Safari doesn’t run on Linux, last I checked), this is rapidly becoming less of a problem. Combining these browsers with rich content delivered by cross-platform tools such as Adobe’s Flash, you can get generally excellent interactive web application support. (The absence of a 64-bit Linux Flash plug-in is still an issue, but I digress.)
Salesforce is a service. As is Flickr. And YouTube. And, …, you get the idea. Basically, the intent is to reduce what customers need in order to access your value: make it as painless as possible. Lowering the cost of access moves customers along the demand curve, giving them an effectively lower cost of accessing that value.
Again, as with the PaaS offerings, the SaaS offerings carry their own risks. Stallman’s words should ring prophetic: you could potentially lose data you will not be able to get out if an application service provider goes out of business, or loses access to or control over its systems. There was a small Gmail outage recently that interrupted service for a short period and probably got people thinking about their reliance on someone or something else for important activities.
If you have a business that depends upon minimizing the latency of information access, for example various financial service calculations, you simply cannot afford to inject additional seconds or minutes of latency into a process. There is a finite risk of substantially more latency, and even a non-infinitesimal risk of infinite latency: you send a query and never get anything back. As with PaaS, you have to mitigate this risk. Similar techniques apply here, though if your business or work depends critically upon a single vendor’s application, you may not be able to get the identical service elsewhere. So data mobility is needed. You need to be able to pull data in and out, so that in the worst case you can programmatically push it into a new system. With open source applications such as SugarCRM and vTiger, you don’t have to rely only upon the hosted system; you can install local versions of the tool. Data motion is still an issue, though.
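A minimal sketch of that data-mobility point: keep an export path that dumps your hosted records to a neutral format (CSV here), so the worst case is a programmatic re-import into a replacement system. The record fields are hypothetical, not any particular CRM's schema.

```python
import csv
import io

# Hypothetical CRM records; the field names are illustrative only.
records = [
    {"id": "1", "name": "ACME Corp", "status": "active"},
    {"id": "2", "name": "Initech", "status": "prospect"},
]

# Export: hosted system -> vendor-neutral CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "status"])
writer.writeheader()
writer.writerows(records)

# Import: vendor-neutral CSV -> replacement system.
restored = list(csv.DictReader(io.StringIO(buf.getvalue())))
assert restored == records  # the round trip preserves the data
```

The format matters less than the guarantee: as long as the export runs regularly and lands somewhere you control, a provider failure costs you migration effort, not the data itself.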