How can the community help create better HPC software?
Tuesday, July 28th, 2009
Editor’s Note: This article was originally published in ClusterWorld Magazine, March 2004
Is cluster software any good? Just looking at the applications, from Google to fluid modeling, the answer is clearly yes. Clusters and the software that enables them have made vast amounts of computing power available to thousands of people. But there is a selection effect here. People who can use clusters to get their work done have, by definition, found the software at least adequate. What about everyone else? What about the people who talk of a software crisis, or the scientists who have tried clusters and given up on them? Can we avoid becoming another elite group, composed of those people stubborn or influential enough to get the software that they want? Is cluster software good or bad?
The answer, as with so many things, is both and neither. Portable, commodity software standards such as MPI (and MPI-2) and the UNIX API allow users to develop software on a laptop, move it to a cluster, and even move it from the cluster to a dedicated, high-performance system like the Earth Simulator. Libraries of parallel software have simplified the creation of parallel applications, allowing at least some computational scientists to go back to focusing on the science instead of the software. But the available software still has many problems, ranging from the usual bugs, mismatches with the abilities of cluster hardware, missing capabilities, and fragmentation of the de facto standards to missing or inappropriate standards.
Solving the software issue is not easy. The various problems are in tension with each other. New features often get precedence over bug fixes (when did you last see a software package proudly advertise its bug fixes?). Added features may make the software easier for some users but may also introduce new bugs. And applications that come to depend on those unique new features sacrifice their portability. Finally, no one ever does enough testing; clusters complicate the problem even more because now all combinations of hardware and software may be used in the same run of an application.
Part of the solution is to emphasize commodity software: that is, software written to an agreed-upon standard. Applications that use commodity software can pick and choose their software platform in much the same way that commodity hardware makes it possible to pick and choose the hardware platform. But there is danger here too. If we insist on the current set of standards, we stifle innovation and prevent the development of better standards.
I/O for clusters provides an example of this danger. Many users believe that POSIX I/O is the relevant standard and that a high-performance, parallel file system should support POSIX. Yet few users actually use a POSIX file system. NFS is not POSIX; yes, the API that provides access to it has POSIX syntax, but NFS does not fully implement the POSIX semantics, because doing so would cripple the performance of the file system. Nevertheless, users are (usually) happy with NFS, and many of their remaining complaints concern exactly the consistency guarantees that the POSIX semantics exist to provide. Adopting POSIX as the I/O standard for a cluster file system would force us to accept higher complexity and lower performance without any real gain. Instead, what we need is a file system that provides a sensible set of semantics and fits the dominant programming models. At a high level, the parallel I/O model in MPI-2 provides a good starting point, but it is not sufficient. At a low level, parallel file systems such as PVFS provide some of the necessary support but are also not sufficient. This is where the cluster community must begin developing the appropriate standards, using an open process that includes researchers, vendors, and users alike.
Other dangers exist. Too many software projects provide a small improvement in usability or performance at the cost of adopting a unique API. This is like making a small improvement to a CPU's instruction set that changes the behavior of a basic operation like “LOAD.” It might seem like a good idea, but you're no longer part of the commodity path. For any change to be worthwhile, the benefit must be enormous.
As a community, we can work together to make our commodity software as strong an advantage as our commodity hardware. As users, we should embrace standards and eschew changes that give only minor benefit (don't be seduced by the dark side of proprietary features). We should make our needs known to software developers and participate in standardization processes. As software developers, we should stick to standards and avoid gratuitous differences, developing internal interfaces to interact with other components and sharing those interfaces with friends and competitors alike. Developers must also be users.
Elite supercomputing is not the answer. We must insist on commodity software!