Do we need a software standard for clusters? Future software may make standards obsolete.
Recently IBM announced the open availability of their xCAT software. For those who are not familiar with xCAT, I quote from the website: “xCAT is a scalable distributed computing management and provisioning tool that provides a unified interface for hardware control, discovery, and OS diskful/diskfree deployment”. In more official terms, xCAT is referred to as The IBM HPC Open Software Stack®, which indicates that it is integrated and tested by IBM. In a nutshell, xCAT makes it easier to create, provision, and manage your cluster.
Many years ago, I started a “good news/bad news” joke about Linux HPC clusters. The good news is Linux HPC clusters give you the freedom to design and build your HPC machine to your needs. The bad news is Linux HPC clusters give you the freedom to design and build your HPC machine to your needs. One needs to be careful with that freedom thing. In particular, there are two areas where this “freedom feature” can hurt the marketplace. Tools like xCAT go a long way in helping manage the “freedom” we’re all looking for.
One area of hurt in the cluster HPC space is the ISVs (Independent Software Vendors). Imagine trying to design your software for the myriad of cluster configuration options available to the end user. Of course, there are the Rocks and Oscar kits, but even with these there is ample room for users and administrators to change things just enough so that an ISV code has some issues. (Note: Rocks and Oscar are full fixed cluster distributions, whereas xCAT is a way to manage a cluster software environment.) Variations in a software environment place the ISV in a tough position. If they want to sell their product, they may end up performing cluster support for their customer (which is entirely different from product support). The attraction to a Microsoft type of solution is evident in this case. To the ISV, having a firm line between the application and the cluster is very appealing in an “it is not my problem” sort of way. Targeting a reference cluster environment is one solution, and things like Intel Cluster Ready certainly help the situation.
A second area of hurt is in the end user camp. I can speak to this from personal experience. In many cases, end users of smaller clusters have managed to get their cluster working just fine, except when it comes time to upgrade some aspect of the cluster. There is often a rather intricate dependency tree that is easily broken by upgrading something. At times it is easier to just reinstall the whole cluster with a new (more manageable) distribution or framework. When it is upgrade time I often get a call. It seems that the original person(s) who did the install is not around any more, or no one can remember what they did, or the vendor got it working somehow. In any case, the user wants to do science or engineering and not learn the nuances of creating RPMs or installing software. Managing all this can be quite daunting.
At this point one would conclude that we need an “Open Cluster Standard” that is vendor neutral and can be used by ISVs and end users alike, much like the Linux Standard Base (LSB). It is certainly a good idea, but I don’t think it will ever happen. Or maybe I should say it does not need to happen. Some explanation may help here.
If you think about it, when an application runs on a group of nodes (ignoring multi-core issues for the moment) it needs very little: the Linux operating system, some libraries, and the executable program (of course, some scheduling and monitoring software may be needed). From an application support perspective it is pretty simple. Even with these modest requirements, most ISVs still qualify their application against a particular distribution of Linux and only support those installations. This puts quite a constraint on some clusters.
One way to handle this situation is to continue developing tools, like xCAT, that manage from a higher level. Instead of managing just one cluster, such a tool could be designed to manage various application clusters. An application cluster could be spawned on various cluster nodes by any number of queuing systems. Of course, when one provisions nodes by pushing disk-full images, this solution sounds a bit inefficient. However, when one considers disk-less provisioning in the form of virtualization, ramdisk methods, NFS, and NBD (Network Block Devices), one may just start thinking differently about a truly flexible run-time provisioning environment.
For example, imagine that you have an ISV code that requires RHEL4, while at the same time you have users that want to use RHEL5. If you had a way to boot nodes in either of these two environments at run-time, many of your problems would simply disappear. In other words, just as cluster hardware is built around a problem set, we could build software environments around the application sets. The application will lead the way and not the operating system, the MPI libraries, or any other software aspect of the cluster. Instead of trying to write a standard that covers all the cluster minutia, why not write a standard that allows you to encapsulate and run your application cluster? The application cluster can contain any standard or custom software environment the user or ISV desires. From a community/market standpoint, all you need is a standard way of launching application clusters on the hardware.
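To make the RHEL4/RHEL5 example a little more concrete, here is a minimal Python sketch of the idea. Everything in it is hypothetical: the names `AppCluster`, `Node`, `launch`, and `release` are invented for illustration, the image labels are just strings, and the "boot" step is a stand-in for whatever diskless mechanism (virtualization, ramdisk, NFS, NBD) would actually do the work. It is not xCAT's interface or any existing tool's.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AppCluster:
    """Hypothetical 'application cluster': an application plus the
    software environment it was qualified against."""
    name: str
    os_image: str  # a label such as "rhel4" or "rhel5", not a real image path


@dataclass
class Node:
    """A compute node; booted_image is None while the node is idle."""
    hostname: str
    booted_image: Optional[str] = None


def launch(app: AppCluster, nodes: List[Node], count: int) -> List[Node]:
    """Claim `count` idle nodes and mark them as booted into the
    environment the application requires."""
    idle = [n for n in nodes if n.booted_image is None]
    if len(idle) < count:
        raise RuntimeError(
            f"{app.name} needs {count} nodes, only {len(idle)} idle"
        )
    claimed = idle[:count]
    for node in claimed:
        # Stand-in for a real run-time provisioning step.
        node.booted_image = app.os_image
    return claimed


def release(claimed: List[Node]) -> None:
    """Return nodes to the idle pool once the application finishes."""
    for node in claimed:
        node.booted_image = None


if __name__ == "__main__":
    pool = [Node(f"n{i}") for i in range(4)]
    isv_job = launch(AppCluster("isv_solver", "rhel4"), pool, 2)
    user_job = launch(AppCluster("user_code", "rhel5"), pool, 2)
    for node in pool:
        print(node.hostname, node.booted_image)
    release(isv_job)
    release(user_job)
```

The point of the sketch is that the environment travels with the application, so the same hardware pool can serve an ISV code and a user code side by side without anyone agreeing on a single cluster-wide distribution.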
Of course, such technology does not exist just yet, but instead of sitting around the table and arguing about where the MPI libraries should live, why not figure out a way to shove it all into a virtual pneumatic tube system? Tools like xCAT are pointing in this direction. In the end, we may all get what we really want: the freedom to do it our way and not be concerned with your way.