Where’s the Parallel Beef?

Yet another parallel programming rant. Has the cluster market all but killed parallel tools development?

Years ago there was an ad campaign by the Wendy’s hamburger chain that asked the question “Where’s the beef?” The commercials were rather funny, and “Where’s the beef?” has become a way to ask “where is the substance?” or to call attention to the lack thereof. Since before GP-GPU, multi-core, and clusters, I have been asking a similar question about HPC development tools. In particular, “Where are the parallel programming tools?” This question has become fundamentally important to the future of computing, and the answer is not quite clear.

In the past, vendors learned quickly that in order to sell hardware you need software, and to create software you need tools. Every processor has a development platform of some sort. If your market is small, you may have to supply the environment yourself, which might look like a machine code assembler, a C compiler, and a debugger. On the other hand, if you sell into a large hardware market, you will be fortunate to have, in addition to your own tools, many software vendors that supply tools for your hardware. In the x86 world, for example, there are too many languages and vendors to list. There is also a huge pile of freely available software tools, of varying quality, from which to choose.

The HPC story has always been a bit different. Back in the day, when you purchased an “integrated” supercomputer (e.g. Cray, Convex), there was a set of sophisticated software tools and compilers that aided software development. These tools were usually part of the system purchase and represented some of the best optimization technology available. When parallel computers and eventually clusters entered the scene, the three key development tools were a compiler (Fortran or C), MPI, and, if you were lucky, a parallel debugger. In a way, you were “on your own.”

The change from an expensive integrated system to a multi-sourced cluster created a drastic reduction in price, often a factor of 10 or more, but pretty much removed any incentive for the component vendors or integrators to build commercial parallel programming tools. Basically, a compiler and MPI libraries (and print statements) were how it was done. It has been well over ten years since clusters hit the scene, and with an estimated annual market size getting close to $10 billion, you would think there would be a large incentive to create parallel programming tools. There is some progress, but little has changed over the past two decades.

Admittedly, software tools are a small market and, by many standards, the HPC market is “not that big.” There is a small group of developers (academic, government, and commercial) who actually write and develop parallel software. Many of the top scientific applications have been ported to some form of parallel computer (MPI, multi-core, GP-GPU). The HPC sector is growing, and now the rest of the market is going parallel in a big way.

So, “where are the parallel HPC tools?” There are some. Intel certainly has a vested interest in parallel tools, and much of their software focus has been in this direction in recent years. There are other companies selling or providing tools in this area as well. However, you would be hard pressed to find “stand-alone” companies that exclusively sell HPC tools. I have run into many of these companies in the past; none seem to have survived, despite some reasonably good ideas. In a search for current vendors I found the equivalent of a parallel computing ghost town page. It has generally been my experience that parallel software tools are a tough battle, which is partially why the whole parallel computing market is littered with dead companies (this link is from the Internet Wayback machine and may take some time to download).

I attribute the dearth of tools to three issues. The first is lack of economic incentive (i.e. the market is too small and the grad students are cheap). The second is more subtle. In order to sell programming tools you need hard ROI numbers. A good compiler, debugger, or profiler can show a pretty quick return because it works in virtually all cases. Automatic parallelization tools or languages usually work in some cases, but are not as universal and can be a tough sell. And finally, parallel programming is a really hard problem, which is what makes the previous point an issue.

Of course, there are OpenMP, CUDA, OpenCL, MPI, Pthreads, etc., but these tend to move the applications closer to the hardware than before. When I think of parallel tools, I want an application that helps me parallelize my code or allows me to easily express or imply parallelism. And yes, I know about UPC, Cilk, Co-array Fortran, Fortress, and Chapel, not to mention Linda, Haskell, Erlang, and everything else. There are several real challenges facing all these efforts. Perhaps the biggest issue is the “type” of parallelism in the market today. There is multi-core, cluster, GP-GPU, or any combination thereof, and no unified model for this situation. My belief, though, is that the GP-GPU will become an embedded “SIMD unit,” like the FPU, in future processors and will be handled in some fashion by the compiler.

Make no mistake, the biggest impediment to better tools is the difficulty of parallel programming. It is a tough nut to crack, and there do not seem to be any real breakthroughs on the horizon. Meanwhile, the “parallel beef” issue is getting bigger. In my estimation it has caused the demise of many companies and is probably the biggest “hold back” for HPC today. We need more ideas and money applied to this problem, because very soon a whole lot of other people are going to be asking “Where is the parallel beef?”

Comments on "Where’s the Parallel Beef?"


Intel has a bunch of pretty decent-looking stuff for C++: intel-sdp-products (TBB, IPP). Matlab is also finally getting their stuff better parallelized.

But basically you want a program (compiler?) to analyze your as-is (kind-of serial) code (any language) and spit out parallel instructions that exhibit the same behavior (I mean, correctness)? So would everyone else! The easiest thing to do, I would think, is to move straight to a parallel language (or one that makes it simple) when possible. Those languages exist, but I suspect most developers want their Java, C/C++, Python etc. to be parallel.


Parallelism, while an important part of HPC, is only part of the picture. Intel has done a lot with TBB, ABB, and its Parallel Studio to help us optimize the parallel aspects of software. But there are memory and process affinities within and among compute components that become important. The compute fabric is a system, made more complicated by the dynamic nature of the software ‘task’ structure and various data partitioning strategies. Often, there is more than one right answer, and looking exclusively at (massive) parallelization may sometimes be misguided.


IMHO, you’re looking in the wrong place. I write *HIGHLY* parallel code on a daily basis, in an obscure language called VHDL (yes, it’s a hardware description language, not what most folks think of as “software”).

The hardware world is inherently parallel, and you cannot code for it using the same abstractions that have made simplistic serial programming all the rage. Until the SW world sheds the single execution paradigm and truly begins to think in parallel at the lowest level, I think the best hope for significant progress is going to come from advances already made in the realm of hardware design.

Today I write code that turns into millions of gates inside an FPGA. At the big chip vendors, they’re managing code that executes in parallel on *BILLIONS* of individual logic elements. It’s far less of a stretch to imagine that code running on millions of individual execution units (CPU/GPU/Whatever) than crafting a magic compiler that can effectively turn a sequential C program into millions of parallel instructions.


    I was going to mention data-flow programming (see VHDL) but figured I would just leave my comment more general. But yeah, going data flow makes a lot of problems simpler and more parallel, if a little bit unfamiliar (to the typical programmer)
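To make the data-flow idea concrete, here is a minimal Python sketch (not VHDL, and not any particular data-flow framework). The graph, the node names, and the `run_dataflow` helper are all hypothetical, invented for illustration. Each node declares only which inputs it needs; the scheduler runs every node whose inputs are ready, so independent nodes execute concurrently without the programmer ordering them. For simplicity it fires nodes in waves rather than the instant each input arrives.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data-flow graph: node name -> (function, names of input nodes).
GRAPH = {
    "a": (lambda: 3, []),
    "b": (lambda: 4, []),
    "a_sq": (lambda a: a * a, ["a"]),
    "b_sq": (lambda b: b * b, ["b"]),
    "total": (lambda x, y: x + y, ["a_sq", "b_sq"]),
}

def run_dataflow(graph, max_workers=4):
    """Evaluate every node, running each one only after its inputs exist."""
    results = {}
    pending = dict(graph)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # Nodes whose inputs are all computed do not depend on each other,
            # so they can be submitted to the pool together.
            ready = [name for name, (_, deps) in pending.items()
                     if all(d in results for d in deps)]
            if not ready:
                raise ValueError("cycle or missing input in graph")
            futures = {
                name: pool.submit(pending[name][0],
                                  *(results[d] for d in pending[name][1]))
                for name in ready
            }
            for name, fut in futures.items():
                results[name] = fut.result()
                del pending[name]
    return results

values = run_dataflow(GRAPH)
print(values["total"])
```

Here `a_sq` and `b_sq` never mention each other, which is exactly what lets the scheduler run them side by side; the parallelism falls out of the data dependencies rather than from explicit threading code.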


Everything said in the article is correct. But where is the journalistic/novelty beef???


Exactly, where is the beef?

There are a few tools out there; names such as Eclipse PTP, TotalView, DDT, Intel Amplifier, Intel Cluster Tools, VampirTrace, Scalasca and many others come to mind … it would be nice if somebody tried to analyze these tools against each other.

I have been so far a very satisfied user of TotalView and Intel Amplifier, but I would appreciate an overview of the field.


ZeroMQ is an excellent parallel tool. By removing all locks, semaphores, memory barriers, etc. from an application you can speed up parallel applications and also get rid of concurrency bugs that appear when a system is under load.
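The message-passing style described here can be sketched with Python's standard library alone. This is a stand-in, not the ZeroMQ API: `queue.Queue` plays the role of a socket, and the `worker` and `sum_with_messages` names are hypothetical. The application code takes no locks because each worker owns its state privately and only exchanges messages; the queue itself still synchronizes internally.

```python
import queue
import threading

def worker(inbox, outbox):
    """Keep all mutable state private; communicate only through messages."""
    total = 0  # owned by this thread alone, so no lock is needed
    while True:
        msg = inbox.get()
        if msg is None:        # sentinel: no more work is coming
            outbox.put(total)  # report the partial result as a message
            return
        total += msg

def sum_with_messages(numbers, n_workers=4):
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(q, outbox))
               for q in inboxes]
    for t in threads:
        t.start()
    # Deal the work out round-robin, one message per number.
    for i, x in enumerate(numbers):
        inboxes[i % n_workers].put(x)
    for q in inboxes:
        q.put(None)
    for t in threads:
        t.join()
    # Combine exactly one partial result per worker.
    return sum(outbox.get() for _ in range(n_workers))

result = sum_with_messages(range(1, 101))
print(result)
```

Because no two threads ever touch the same variable, there is no shared counter to race on; whether this is faster than a lock-based design depends on the message sizes and the workload.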



These tools have some very good features. I think they are currently available only for the Xmos family of microprocessors, but perhaps they could be ported to others.



The XC concurrent programming language extends the sequential programming capabilities of the C language, providing explicit control of concurrency, communication, timing and input-output. It supports deterministic concurrent programming; compile-time checks eliminate race-conditions.

Development Tools and Software IP

The XMOS development tools are built on industry-standard platforms making use of open-source compiler technology. They enable the XC language to be combined with C and C++. Using XC, accurate timing of critical paths through programs guarantee that deadlines are met, enabling programs to behave as predictably as hardware.

A library of reference designs for popular applications and software modules makes it easy to start designing.


Although Ada was designed for multi-tasking, its possible use for parallel programming is discussed at this link.


However distribution of Ada tasks onto processors requires use of a rather complex “Distributed Systems Annex”. It would be easier if Ada had available something like the parallel extensions of XC.

But it could be a promising language and some high quality GPL Ada compilers are available here.


If you can’t use explicit parallelism via message passing (PVM/MPI), there’s OpenMP, which is essentially a set of compiler directives that you can sprinkle around regular C/Fortran code to enable parallelism by loop distribution, etc. GCC has supported OpenMP since version 4.2 or thereabouts.
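The loop-distribution idea behind such directives can be sketched in Python. This is not OpenMP and uses no OpenMP API; the function names are hypothetical. It splits the iteration range into contiguous chunks, hands each chunk to a worker, and then combines the partial results in a reduction step, which is roughly what a parallel loop with a sum reduction does. Note that in standard CPython, threads will not speed up a CPU-bound loop like this one; the sketch shows the structure only.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(start, stop):
    """Loop body for one worker: sum of i*i over the range [start, stop)."""
    return sum(i * i for i in range(start, stop))

def parallel_sum_of_squares(n, n_workers=4):
    # Split 0..n into contiguous chunks, similar to a static schedule.
    chunk = max(1, (n + n_workers - 1) // n_workers)
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda b: partial_sum(*b), bounds)
        # Reduction: combine the per-chunk results into one value.
        return sum(parts)

total = parallel_sum_of_squares(1000)
print(total)
```

The serial loop only parallelizes this cleanly because its iterations are independent and the combining operation (addition) is associative; loops that carry state from one iteration to the next need restructuring first.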


Interesting article with a lot of good points. What I have observed is that the shops who are interested in HPC parallelism, like the National Labs, migrated their algorithms to a data-flow, SIMD format a long time ago and are well positioned to take advantage of the multicore and GP-GPU architectures as they appear. The problem is that those who are not already programming in a SIMD mode need to make a paradigm shift in their thinking and approach to convert their specific algorithm to SIMD. As noted, vendors are providing some tools to allow serial programs to be converted to SIMD, but they are not often efficient. In fact, if the wrong programming model is used on certain multicore processors, performance can actually get worse.



