The HPC Software Conundrum

Can a solution for HPC software live within MPI, OpenMP, CUDA, OpenCL, and/or Ct?

In the computer world there are concepts that appear to be universal. The adage "software moves hardware" is one example. Talk to anyone about the history of computers in this context and they will nod their heads in agreement. Many of the same people, however, will also get whipped into a frenzy when new hardware is announced. Such enthusiasm was perfectly understandable when processor clocks were increasing each year. That was what I call the Free Lunch era of computing.

Lunchtime is over. Events in the processor market, and particularly the HPC market, have changed the game. Software applications that once enjoyed performance increases from rising clock speeds have not seen any significant gains in recent years. Celebrating and ranking processors based on clock speed is therefore of little value. Indeed, the current hardware advances in both multi-core and streaming-core (GP-GPU) do nothing for many single-threaded applications. And yet, there is always celebration of these new hardware advances.

Looking closer, only a small subset of software has actually been adapted to use these new hardware designs. Therefore, jumping up and down because your favorite processor vendor has added a core or two, or two hundred, is like getting all excited about hydrogen cars. Good idea, but not going to work until we get enough hydrogen stations. Likewise, multi-core and streaming-core need software in a big way. And there is no sense in bragging about how well the hardware works unless there is an adequate software base to drive the hardware to its intended performance.

As an aside, the other analogy I was considering was two neighbors having a bragging contest about the performance of the fighter jets they both keep in their back yards. What does it matter if neither of them can fly the darn things?

The above argument is why I don't put much credence in FLOPS numbers for multi-core and streaming-core. Of course the numbers are correct for that benchmark, but until that kind of performance is generally available to Joe Programmer, the extra hardware is, in a sense, superfluous.

Of course the hardware companies realize this fact and have made various tools available for programmers. For the HPC practitioner there is MPI, but MPI currently does not work in all cases. It has been the mainstay of "parallel computing" for years, but it does not map well to stream processors, although this situation may be changing (see the MIMD On GPU work at the University of Kentucky).

There are also OpenMP, CUDA, and OpenCL, and other solutions, such as Ct from Intel, are hitting the market. None of these answers the big question, "How should I write/update my code to use all this great hardware in my cluster?" To help answer this question, a summary table may be useful. I have included the current methods that are generally available. I should point out that there are edge cases for the table, but I have tried to indicate the general use of each method. Also, I assume that a cluster is a collection of SMP multi-core servers.

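[Table: general applicability of MPI, OpenMP, CUDA, OpenCL, and Ct to each hardware target (SMP multi-core; SMP multi-core server plus GP-GPU). CUDA entry: NVidia Only.]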

As you can see, there is no one-size-fits-all solution, no silver bullet as it were. This situation is both a problem and an opportunity, and the answer to the question above is a bit unclear right now. There has been considerable work in the area of hybrid computing, that is, combining the above methods in a single program. While this works in many cases, I still get the feeling that it is a kludge of some sort and that there should be a way to express what you want to do in a parallel cluster environment using a clear, concise, standard level of abstraction.

Thus, there is an opportunity for a solution. The hardware vendors are doing their part and, of course, pushing solutions that work best with their hardware. I also believe that any solution should be an open standard so that it can be freely implemented by anyone. Locking someone to your hardware through some software secret sauce is a recipe for failure. Open access to tools and standards is what grows hardware markets. Furthermore, I'm not so sure we can expect hardware companies to push solutions beyond their products; they are, after all, tasked with selling their stuff. And I am not sure how it is all going to turn out. On the one hand, the HPC market has seen huge growth based on an open software infrastructure. On the other, the issue of multi-core and streaming-core seems to be developing solution fiefdoms that don't necessarily cooperate outside their own hardware domain.

I started out wanting to write about CUDA, but somehow drifted into what I call the software conundrum, i.e., we need a solution that lifts us above all the hardware, but historically, efforts that come from vendors cannot be expected to support this goal. Of course, a vendor (or vendors) who supports open cooperation as the way forward may actually have the fastest hardware in a practical sense, i.e., it is what gets used. By promising everyone a ride in their jet, they can get everybody to help them get it off the ground.
