Information about high-performance computing (HPC) cluster use is hard to find. Not only is the market for such systems relatively small, but clusters are often home-grown creations that fly under the radar screens of traditional market watchers.
How, then, can one find information on clusters? This very question prompted me to place small, single-question polls on the ClusterWorld.com web site over the last two years (see www.clusterworld.com/pollBooth.pl). The polls were entirely unscientific: a respondent could easily defeat the “one browser, one vote” rule by removing a cookie, and there were no demographics on the population of respondents. Still, each poll drew a minimum of 100 responses, and the surveys yielded some nuggets of information.
High-performance computing professionals often hear about a big (or really big) cluster and the machine’s ranking on the “Top 500” (http://www.top500.org) list. Many believe that the Top 500 is really a “high tech pissing contest.” If you read the press releases, the list is certainly used that way.
To see just how relevant the Top 500 is to most cluster users, look at Figure One, where respondents were asked about cluster size. The most interesting result is that clusters consisting of 32 nodes or fewer accounted for over 50 percent of all the systems represented in the survey.
As a follow-up, readers were then asked to measure the maximum scalability of their applications. In this poll, 50 percent of the respondents had applications that cannot scale above 64 nodes, which means a larger, faster cluster would do no good. Of course, such “small” cluster users could run more copies of their program on a larger cluster, but even that configuration fails to use the combined capacity of all of the processors, as the “Top 500” seems to measure.
With these results in hand, one has to wonder: why all the interest in the “Top 500”? The processor count in the “Top Ten” is easily one or two orders of magnitude greater than most codes can use. Why, then, do most people care about this list? It’s like using the results of a NASCAR race to help decide what kind of car I should buy. In addition, the large number of smaller clusters also supports the notion that scalability limits many applications. When you think about it in these terms, the Top 500 seems kind of silly.
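To make the scalability ceiling concrete, here is a minimal sketch using Amdahl's law, a standard textbook model of parallel speedup. The polls did not measure anything like this: the function name `amdahl_speedup` and the 2 percent serial fraction are assumptions chosen purely for illustration.

```python
def amdahl_speedup(serial_fraction: float, nodes: int) -> float:
    """Ideal speedup on `nodes` nodes when `serial_fraction` of the
    run time cannot be parallelized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)


if __name__ == "__main__":
    serial_fraction = 0.02  # assumed value, for illustration only
    for nodes in (16, 64, 256, 1024):
        speedup = amdahl_speedup(serial_fraction, nodes)
        print(f"{nodes:5d} nodes -> speedup {speedup:5.1f}x")
```

Under this model the speedup can never exceed 1 / serial_fraction (50x for the assumed value), no matter how many nodes are added, which is one way to see why a code that stops scaling at 64 nodes gains little from a much larger machine.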
The other interesting thing about the results is that they seem to support the idea of Blue-Collar Computing. Indeed, at the high end, there seem to be plenty of “heroic” codes that use more than 1,024 CPUs. Then there’s a valley in the middle and another maximum at 16 processors or fewer. This trend can best be seen by looking at the data plotted as a bar graph on the ClusterWorld.com web site.
Of course, there are applications that justifiably need larger numbers of processors, but the focus on large clusters is more about press releases and less about getting work done. For example, when the Virginia Tech PowerMac cluster, “System X”, was being built, much fanfare was made about its fourth place finish on the Top 500 list (in November 2003). Today, I would be very curious to see how those 2,200 processors are being utilized. My guess is that most of the jobs run on fewer than 64 processors. Alas, as of June of this year System X has dropped to 14th place on the Top 500. And the people who are actually using it probably don’t really care.
So much for the contest.
Next, an interesting question about optimization was asked in the polls. Optimization can take on many dimensions, so to keep the poll simple it asked, “How do you optimize your cluster?” Indeed, how do you know the new kernel you just loaded is working like you would expect? Figure Three provides some data on how people handle this situation.
The majority of respondents do not do anything in this area. About 27 percent use a test suite or profiler. The rest just take what they can get. Interestingly, 18 percent said they do not know how to optimize a cluster. Cheap and fast hardware is definitely a disincentive to optimize, but with processor clocks stabilizing as multiple CPUs are packed on the die, optimization and scalability may take on a whole new meaning.
This situation is actually more troubling than one might imagine. Dual-core CPUs introduce yet another level of complexity to a cluster. For instance, is it better to run on 8 dual-core nodes or 16 single-core nodes? Indications are that there can be a big difference. If most cluster managers do not know how to optimize for current clusters, then clusters based on multi-core designs will present an even bigger challenge.
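For anyone wondering where to start, one simple way to check whether a change (a new kernel, or a different node layout) behaves as expected is to time a fixed workload before and after and compare the two. The sketch below is only an illustration: `workload`, `median_runtime`, `is_regression`, and the 10 percent tolerance are hypothetical names and values, and the stand-in loop should be replaced with a real application run.

```python
import statistics
import time


def workload(n: int = 200_000) -> float:
    """Stand-in compute kernel (hypothetical); swap in a real application."""
    total = 0.0
    for i in range(1, n):
        total += (i % 7) * 0.5
    return total


def median_runtime(func, repeats: int = 5) -> float:
    """Run `func` several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def is_regression(current: float, baseline: float, tolerance: float = 0.10) -> bool:
    """True when `current` is more than `tolerance` (as a fraction) slower than `baseline`."""
    return current > baseline * (1.0 + tolerance)


if __name__ == "__main__":
    baseline = median_runtime(workload)  # e.g. measured before the change
    current = median_runtime(workload)   # e.g. measured after the change
    status = "regression" if is_regression(current, baseline) else "ok"
    print(f"baseline {baseline:.4f}s, current {current:.4f}s: {status}")
```

The median is used rather than the mean so that a single noisy run does not dominate the comparison. The same pattern could be applied to the 8 dual-core versus 16 single-core question by recording one timing per layout.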
So, What’s Stopping You?
The final poll asked researchers to identify deterrents (or challenges, to be more politically correct) to using clusters. Figure Four shows these results.
The three big issues are people, programming, and management. You could lump software tools into the programming category, which pretty much makes that the clear winner (or loser, depending on how you look at things).
These results actually mirror what pretty much everyone in the cluster community knows: clusters are hard to manage and program, and finding good people is even more difficult. If clusters are going to become more mainstream, these issues need to be addressed.
As stated, the polls shown here were not conducted with great scientific rigor, but they do suggest some trends that probably should be the focus of further surveys. Any conclusions are tenuous at best. Next time, we’ll talk about some real surveys.