The HPC Software Conundrum

Can a solution for HPC software live within MPI, OpenMP, CUDA, OpenCL, and/or Ct?

In the computer world there are concepts that appear to be universal. The adage “software moves hardware” is one example. Talk to anyone about the history of computers in this context and they will nod their head in agreement. Many of the same people, however, will also get whipped into a frenzy when new hardware is announced. Such enthusiasm was perfectly understandable when processor clocks were increasing each year. This is what I call the Free Lunch era of computing.

Lunchtime is over. Events in the processor market, and particularly the HPC market, have changed the game. Software applications that once enjoyed performance increases from rising clock speeds have not seen any significant bumps in recent years. Celebrating and ranking processors based on clock speed is therefore of little value. Indeed, the current hardware advances in both multi-core and streaming-core (GP-GPU) do not help many single-threaded applications. And yet, there is always celebration of these new hardware advances.

Looking closer, only a small subset of software has actually been adapted to use these new hardware designs. Therefore, jumping up and down because your favorite processor vendor has added a core or two — or two hundred — is like getting all excited about hydrogen cars. Good idea, but not going to work until we get enough hydrogen stations. Likewise, multi-core and streaming-core need software in a big way. And, there is no sense in bragging about how well the hardware works unless there is an adequate software base to drive the hardware to its intended performance.

As an aside, the other analogy I was considering was two neighbors having a bragging contest about the performance of the fighter jets they both kept in their back yards. What does it matter if you can’t fly the darn things?

The above argument is why I don’t put much credence in FLOPS numbers for multi-core and streaming-core. Of course the numbers are correct for that benchmark, but until that kind of performance is generally available to the Joe Programmer, the extra hardware is, in a sense, superfluous.

Of course the hardware companies realize this fact and have made various tools available for programmers. For the HPC practitioner there is MPI, but MPI does not currently cover all cases. It has been the mainstay of “parallel computing” for years, yet it does not map well to stream processors, although this situation may be changing (see the MIMD On GPU work at the University of Kentucky).

There is also OpenMP, CUDA, and OpenCL. Of course there are other solutions hitting the market like Ct from Intel. None of these answer the big question, “How should I write/update my code to use all this great hardware in my cluster?” In order to answer this question a summary table may be helpful. I have included the current methods that are generally available. I should point out that there are edge cases for the table, but I have tried to indicate the general use of each method. Also I assume that a cluster is a collection of SMP multi-core servers.

[Table omitted in this copy: a summary of each programming method and the hardware it targets. Surviving fragments of the column headings include “SMP multi-core server” and “SMP multi-core server plus NVidia GPU (NVidia only).”]
As you can see, there is no one size fits all, no silver bullet as it were. This situation is both a problem and an opportunity, and the answer to the question above is a bit unclear right now. There has been considerable work in the area of hybrid computing, that is, combining the above methods in a single program. While this works in many cases, I still get the feeling that it is a kludge of sorts, and that there should be a way to express what you want to do in a parallel cluster environment using a clear, concise, standard level of abstraction.
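To make the wished-for abstraction concrete, here is a minimal sketch in Python, standing in for no particular vendor tool: the computation is expressed once as a high-level map, and the execution strategy is chosen underneath. The `parallel_map` helper and its backend names are invented for illustration; in a real system the backends would be OpenMP threads, MPI ranks, or GPU kernels.

```python
# Toy illustration of a single high-level abstraction with swappable
# execution strategies.  The backend names ("serial", "threads",
# "processes") are invented for this sketch.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def parallel_map(fn, data, backend="serial", workers=4):
    """Apply fn to every element of data; how is a backend detail."""
    if backend == "serial":
        return [fn(x) for x in data]
    pool_cls = {"threads": ThreadPoolExecutor,
                "processes": ProcessPoolExecutor}[backend]
    with pool_cls(max_workers=workers) as pool:
        return list(pool.map(fn, data))

def work(x):
    return x * x

# The answer is independent of the execution strategy.
assert parallel_map(work, range(8), "serial") == \
       parallel_map(work, range(8), "threads")
```

The point is not the helper itself but that the program text stays the same while the mapping to hardware changes underneath, which is exactly what no current standard provides across MPI, OpenMP, CUDA, and OpenCL.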

Thus, there is an opportunity for a solution. The hardware vendors are doing their part and, of course, pushing solutions that work best with their hardware. I also believe that any solution should be an open standard so that it can be freely implemented by anyone. Locking someone to your hardware through some software secret sauce is a recipe for failure. Open access to tools and standards is what grows hardware markets. Furthermore, I’m not so sure we can expect hardware companies to push solutions beyond their products; they are, after all, tasked with selling their stuff. And, I am not sure how it is all going to turn out. On the one hand, the HPC market has seen huge growth based on an open software infrastructure. On the other, the issue of multi-core and streaming-core seems to be developing solution fiefdoms that don’t necessarily cooperate outside the hardware domain.

I started out wanting to write about CUDA, but somehow drifted into what I call the software conundrum, i.e., we need a solution that lifts us above all the hardware, but historically efforts that come from vendors cannot be expected to support this goal. Of course, a vendor (or vendors) who supports open cooperation as the way forward may actually have the fastest hardware in a practical sense, i.e., it is what gets used. By promising everyone a ride in their jet, they can get everybody to help them get it off the ground.

Comments on "The HPC Software Conundrum"


Having spent a fair amount of time trying to solve that problem myself, I have to wonder whether it is really possible. The technical requirements of any such programming model (or language) would be that: (1) it has to be sufficiently expressive to be useful for describing a rich variety of algorithms, (2) it must be translatable into highly optimized machine code for architectures that are very different from each other, especially in their performance characteristics and trade-offs. These requirements are very much in conflict with each other.

I have seen (and designed languages for) functional, object-oriented, logic and procedural parallel models. About the most promising approach I have seen so far is a hybrid approach, where major computational elements are sewn together using a data-flow, or functional, or object-oriented high-level language. The individual computational elements can then be expressed in a language that best suits the architecture.

For example, your program might read in a set of matrices, perform an FFT to translate the data from the time to the frequency domain, pass the results through several high-pass and low-pass filters, then pass those results through an inverse FFT to translate it back to the time domain.

The high-level operations — FFT->filters->inverse FFT — are invariant. They, along with descriptions of your data, represent the semantic content of your program. The individual implementations of the FFT, etc., may vary from one system to the next to take advantage of specific hardware. Think of it as a really fast implementation of LAPACK or math.h if you like.
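The pipeline just described can be sketched as code. This is a toy illustration, not a production FFT: the pipeline composition is the invariant semantic content, while each stage (here a textbook radix-2 FFT and a crude brick-wall filter) is a swappable implementation that could be replaced per architecture.

```python
# Sketch of the dataflow idea: FFT -> filter -> inverse FFT is the
# invariant pipeline; each stage's implementation is swappable.
import cmath

def fft(x, inverse=False):
    """Textbook recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    # Unnormalized inverse transform, divided by N once at the top.
    return [v / len(x) for v in fft(x, inverse=True)]

def low_pass(spectrum, keep):
    # Crude brick-wall filter: zero every bin above `keep`.
    return [v if i < keep or i >= len(spectrum) - keep else 0
            for i, v in enumerate(spectrum)]

def pipeline(signal, stages):
    # The high-level composition: apply each stage in order.
    for stage in stages:
        signal = stage(signal)
    return signal

signal = [complex(i % 4) for i in range(16)]
smoothed = pipeline(signal, [fft, lambda s: low_pass(s, 2), ifft])
```

A tuned system would bind `fft` to a vendor library (CUDA on a GPU, a threaded FFT on an SMP node) while the `pipeline` expression, the part the programmer writes, never changes.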

The idea that you can specify the low level details of an algorithm (e.g., matrix inversion) in a machine independent form that can be executed efficiently on architectures as different as GPGPUs and large-scale distributed clusters and BlueGene and Larrabee and TMC’s CM-1 (remember them?) is … well … a nice idea but, IMHO, not much more than that. In order for the compiler to have enough flexibility to make efficient choices for the architecture in question, the algorithm *must* be expressed at a fairly high level, the higher the better.


“…until that kind of performance is generally available to the Joe Programmer, the extra hardware is, in a sense, superfluous.” – Why would you think something like that? HPC programming work was never, and never will be, something that Joe Programmer could handle; there will be libraries and similar stuff built for this poor guy, but the core work will simply have to be handled by competent people. I also see no issue with having so many APIs available: I have used all four you mentioned, and a dozen more (anyone remember p4, the precursor of MPI?), and I can say I liked each one I used. I see the many APIs as just a natural process of evolution toward an eventual ultimate programming model (or models). But until that point is reached, there are always clients in demand of speeding up their codes, so instead of whining over the sad state regarding the incompatibility of the current batch of tools, I prefer simply to enjoy coding.


I think it’s a question of time. First comes the hardware development, with software development always lagging behind. Remember SGI’s C$DOACROSS? Ideal for the main loops but also for small ones (I vaguely remember a break-even point of 400 clock cycles or 100 multiplications). Now we’ve got OpenMP with many features for fine-tuning. And Portland already has an Accelerator at the directive/pragma level for GP-GPUs like the Tesla, so there is no need to program at the CUDA level unless you want to squeeze the last drop out of the Tesla. And from what I’ve read you can mix OpenMP with Accelerator directives. What I still don’t understand is the lack of high-level tools for MPI (remember BERT?). Why is there no initiative to develop high-level directives for hiding most if not all MPI stuff? With a set of pre-defined communication structures such that at least the most common applications can be parallelized in less than an hour or so. I myself invested a few weeks in my SPMDlib on top of MPI, which provides a basis for SPMDdir, a small directive set, but, of course, I am doing research in a specific area and don’t have the time to develop generic tools.
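A minimal sketch of this SPMD-directive idea, with simulated ranks in place of real MPI processes (the `spmd_apply` helper is hypothetical, loosely inspired by the SPMDlib description above): the scatter/compute/gather communication structure is pre-defined and hidden, so the user supplies only the per-rank kernel.

```python
# Toy version of a pre-defined communication structure: one call
# hides scatter, per-rank compute, and gather.  The "ranks" here are
# simulated serially; a real implementation would sit on top of MPI.
def spmd_apply(kernel, data, nranks=4):
    # Scatter: deal data round-robin into nranks nearly equal chunks.
    chunks = [data[r::nranks] for r in range(nranks)]
    # Compute: every simulated rank runs the same kernel on its chunk.
    partials = [kernel(rank, chunk) for rank, chunk in enumerate(chunks)]
    # Gather: combine the partial results on "rank 0".
    return sum(partials)

def dot_kernel(rank, chunk):
    # The only code the user writes: a per-rank sum of squares.
    return sum(x * x for x in chunk)

total = spmd_apply(dot_kernel, list(range(10)))
# Same answer as the serial sum of squares of 0..9, i.e. 285,
# regardless of how many ranks the runtime chooses.
```

The communication structure (here, scatter/gather for an associative combine) is what a directive set could standardize, leaving only the kernel to the application programmer.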

I really think that directives can solve most problems efficiently, with a top-down structure MPI-OpenMP-Accelerator, but the bottom-up link, from Accelerator to MPI, might be even more important in order to decide at the MPI level what the most efficient solution might be. If Accelerator is intelligent enough to do a break-even code analysis at the CUDA level, it should be possible to do the same at the MPI level, with or without the use of OpenMP and other tools. And now back to my DEadline :-)


In response to the concerns expressed by dmpase, I have to wonder if we’re not at the same sort of crossroads as we were in the transition from assembly language to high-level languages. It wasn’t that long ago that people thought it was unthinkable that compilers could produce code as tight and efficient as a programmer could do by hand when programming close to the hardware. CPU architectures were so vastly different as to seem irreconcilable: special purpose registers vs general purpose, different bit and byte orders, different memory and cache architectures. How could a compiler efficiently manage all those differences better than a human being, and still make loops tight enough to fully exploit the small instruction cache’s locality of reference? Well, compilers got a whole lot better at dealing with all that, and today very few programmers seriously consider programming so close to the hardware themselves that they need to worry about these details. Certainly in the ’80s, though, there were a lot of die-hards who refused to program in anything but assembly language because they thought all high-level languages were resource hogs.

Maybe in another decade or two, processor cores will be managed automatically by higher-level language compilers the way registers are now, and we won’t care how many general or special purpose cores (or nodes!) a system has. It will probably mean moving to a higher-level language than what’s commonly used now, so we’re not programming as close to the hardware as we are now, and leaving the details of what depends on what results and what can be done in parallel (and on which cores) to the compiler and/or OS. Worrying about such details may seem as quaint as worrying about CPU registers today. Of course, the big question is how to get to that point. A lot of hard work went into making today’s compilers as sophisticated and efficient as they are, and this next challenge seems even more daunting. But I wouldn’t doubt that it’s possible.


In our 1600 core cluster, ~60% of the jobs use a single core. Researchers make a choice… they can spend months or years to take old working code and try to parallelize it, or they can run the old working code on a single processor, and stage a few hundred jobs on iterations of the data set.

Given the cost of modifying working code, and the lack of talented programmers who know how to do so, many researchers prefer to just run lots of single-core jobs to get their work done. Sure, many of the savvy researchers with grant-based funding have gotten their code to scale on multiple cores, but none are even considering the daunting task of migrating to a GPU, if they even understood how to do it.

So your observations are very correct. In reality, very large supercomputers, like those at Los Alamos and Livermore, are very efficient at generating heat, but very inefficient at generating results due to the enormous programming effort required to take advantage of such scale.


I (and the company I work for) believe that some form of heterogeneous computing is the only way to solve this dilemma (we call it “circumventing the laws of physics”). Take the same number of transistors used to implement a general-purpose instruction set (e.g. x86_64) and use those transistors to implement a specific algorithm. It’ll *always* be more efficient (not counting infrastructure changes). That’s what GPGPUs are all about, as well as where other hardware-based solutions like FPGAs come in.

The challenge becomes integrating a heterogeneous computing solution into your current favorite computing platform. If you have to use some new programming language or a dialect that’s harder than actually restructuring your app to take advantage of a multi-core processor, it’s probably not worth it. If it’s relatively easy, then you get performance/watt increases beyond what’s possible with off-the-shelf processors, without totally rewriting your application.

I compare it to using an attached array processor back in the ’80s. Worked great, but most of the time wasn’t worth it. It wasn’t until minisupercomputers came along, with integrated vector instructions, that you could get performance increases without working in two different development & runtime environments.


I appreciate the comments of grdetil. They make me pause and think about what allowed us to transition from assembler to higher level languages (HLLs). I believe it is that the HLLs transitioned to useful abstractions that compilers could work with across the variety of machines they needed to support, and at the same time reduce the effort of programming. I also think that it helps a great deal that the current architectures are still relatively similar (i.e., von Neumann load-store architectures).

So, if the HLL for parallel programming were to support higher abstractions for operators, such as reductions, partial reductions, etc. — operators that could be parallelized across a wide variety of parallel architectures — then I can see how that might be a path for success. It takes advantage of the best features of the hybrid approach without some of its complexity.
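A small sketch of the kind of operator meant here: the programmer promises an associative reduction, and the runtime is then free to evaluate it as a log-depth tree (level by level, in parallel) rather than a sequential loop. The `tree_reduce` helper is illustrative only.

```python
# Sketch: declare an associative reduction; the runtime may evaluate
# it as a parallel-friendly tree.  Associativity is what licenses
# reordering the left-to-right loop into a log-depth tree.
from functools import reduce

def tree_reduce(op, values):
    values = list(values)
    while len(values) > 1:
        # One tree level: combine adjacent pairs (parallelizable).
        paired = [op(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # odd element passes through
            paired.append(values[-1])
        values = paired
    return values[0]

# Same answer as the sequential reduction, by associativity.
nums = list(range(1, 9))
assert tree_reduce(lambda a, b: a + b, nums) == reduce(lambda a, b: a + b, nums)
```

A compiler that knows the operator is a reduction can schedule the tree levels across cores, GPU warps, or cluster nodes without the programmer saying which; that is the portability argument in miniature.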

The success of HLLs over assembly came about because they reduced the complexity of programming. They allowed the programmer to stop thinking about low-level details that were incidental to the algorithm. Examples of those details might include the number of registers available, whether the data to be operated on was in this register or that, and whether branch conditions were handled through direct operators or by testing a condition in one instruction and branching based on a condition register in the next.

In short, the level of abstraction was raised and this helped the programmer, while at the same time it stayed fairly close to the architectures they needed to support. The range in architectures is also pretty narrow, so that helps a lot.

So, could a parallel language be created that would span the spectrum of parallel architectures? I’m still a bit skeptical, because there are huge differences in parallel architectures, much larger than what we see separating, say, CISC from RISC. But if there is hope it is in identifying a useful set of operators, expressive enough to easily describe a large set of programs, while at the same time compilable into low level code for the available spectrum of parallel architectures. (The hybrid approach assumes no such operators exist and allows the user to define their own.)
