Small HPC

Will multi-core split HPC into two programming camps? Which one will you join?

Later this week I am heading off to my 30-year college reunion (yes, I am that old). I attended Juniata College in central Pennsylvania, one of the small liberal arts colleges for which Pennsylvania and the northeast are well known. I recall an upperclassman telling me during my freshman year, “You will never pass Organic Chemistry. It is where they weed out the pre-med students. Just forget it.”

As an easily impressed freshman, I became suitably scared. I did have one thing in my favor, though: I was not pre-med. I wanted to learn chemistry. In any case, my sophomore year rolled around and I found myself sitting in “Organic” (as it was called) with a bunch of knee-shaking pre-meds. As I learned organic chemistry, I found it not that hard and actually interesting. Good thing I did not take the upperclassman’s advice about Organic or any other high-level classes. As a matter of fact, to many pre-med students I was somewhat of an anomaly because I passed organic chemistry but never considered medical school. The things you learn in college.

I mention my past experiences with “opinions and advice” because I have been reading comments about parallel computing on the web. I take discussion as a good sign, although, judging by the comments I see on various sites, there are a bit more opinions than experience in what I read. For instance: parallel computing is way too hard for most people … or, what’s the big deal, it’s rather simple …. Both of these comments are, in my opinion, far from the truth and don’t reflect actual experience. The web is a great repository for the opinions of the inexperienced. I would caution those new to HPC and parallel computing to be a bit skeptical and form your own opinions.

I have written my fair share of software (both sequential and parallel), and my opinion, or rather my experience, goes like this. Writing good software is hard. Period. Writing good parallel software is harder still, but not impossible. Understanding the basics is essential in either case. The basics. You know, the boring stuff, the “oh yeah” stuff.

The advent of multi-core has added a bit of confusion to the basic idea of parallel computing. A little history may help. For the most part, HPC parallel computing used to mean many processors, each with private memory, communicating with other processors. There were variations, and designs that looked like today’s multi-core processors, but they existed as discrete processors. The fundamental idea is that communication between processors happens by passing messages. When HPC clusters hit the scene, the predominant design was a dual-processor motherboard (single-core processors), private memory, and some kind of interconnect (e.g., Ethernet or Myrinet).

Today the predominant HPC programming model is MPI. Recall that MPI is a programming API that allows multiple processes to communicate with one another. Mapping the processes to the processors is part of the spawning process. Early parallel computers assumed one process per processor, but clusters changed that idea a bit. With clusters, processors live in “nodes,” and in most cases each node had at least two single-core processors. It is possible to run an MPI job using all the cores on a node, some of the cores on a node, or even by oversubscribing the nodes.
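The message-passing model itself can be sketched without MPI. Here is a minimal Python analogue using the standard library’s multiprocessing module as a stand-in for MPI processes; the `rank`/`size` names mimic MPI conventions, but this is an illustration of the model, not the MPI API (and it assumes a Unix-style fork start method):

```python
# Sketch of the message-passing model, with Python's multiprocessing
# standing in for MPI. Each "rank" is a separate process with private
# memory; results come back to rank 0 only as messages.
from multiprocessing import Process, Queue

def worker(rank, size, queue):
    # Each rank sums its own strided slice of 0..99 -- private data.
    queue.put(sum(range(rank, 100, size)))  # the "message" back to rank 0

def message_passing_sum(size=4):
    queue = Queue()
    procs = [Process(target=worker, args=(r, size, queue)) for r in range(size)]
    for p in procs:
        p.start()
    # Rank 0 plays the role of an MPI-style reduce: collect and combine.
    total = sum(queue.get() for _ in range(size))
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(message_passing_sum())  # 4950 == sum(range(100))
```

The key property is that the worker processes share nothing; the only data that crosses process boundaries is what is explicitly sent.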

MPI is a data copying protocol. It essentially copies a chunk of memory from one MPI process to another. Each process has exclusive control of all its memory, i.e., no other process can touch it unless it sends a message. It is important to understand that when an 8-way MPI code is run on an 8-way multi-core node, memory is still copied from process to process through messages, even though the transport mechanism may use shared memory. In a cluster, the transport mechanism between nodes is the interconnect (GigE, 10-GigE, InfiniBand). MPI programs can span processors, nodes, and clusters. If you can send a message, MPI can run. Notice I said run, not run optimally; that is a bit trickier and very application dependent, but it is under the control of the programmer.
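The copy semantics are worth seeing concretely: what arrives on the other side is a copy, not a reference, so mutating the sender’s buffer after the send cannot change what was received. A small sketch, again using Python multiprocessing as a stand-in for an MPI send/receive pair (`receiver` and `send_then_mutate` are illustrative names, not MPI calls):

```python
# Messages are copies: the receiver gets a private copy of the data,
# serialized at send time. (Python multiprocessing stands in here for
# an MPI send/recv pair.)
from multiprocessing import Process, Pipe

def receiver(conn):
    data = conn.recv()   # arrives as a private copy of the sender's list
    conn.send(data)      # echo it back unchanged so the sender can check

def send_then_mutate():
    parent, child = Pipe()
    p = Process(target=receiver, args=(child,))
    p.start()
    buffer = [1, 2, 3]
    parent.send(buffer)  # message content is copied (pickled) at send time
    buffer[0] = 99       # mutating afterwards cannot affect the message
    echoed = parent.recv()
    p.join()
    return echoed

if __name__ == "__main__":
    print(send_then_mutate())  # [1, 2, 3] -- the receiver's copy was unaffected
```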

There is now a big interest in programming multi-core systems. For non-HPC users, these are multi-core workstations, servers, and even desktops. Writing a parallel program for these systems often uses a shared-memory or threaded model. Because the number of these systems is increasing, one has to assume that multi-core programming tools will also expand. One of the more popular tools is OpenMP (compiler directives for thread-based parallelism). One important distinction of shared-memory models is that they are designed for single-memory-domain machines (i.e., a single motherboard). As a result, these codes do not work well on clusters unless you are using something that makes a bunch of nodes look like a large SMP system; for example, see ScaleMP.
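The shared-memory model is the opposite arrangement: every worker sees the same data directly, with no message needed. OpenMP itself is compiler directives for C, C++, and Fortran; the sketch below uses Python threads purely to show the shared-address-space idea, with each thread writing its result straight into a list all of them share:

```python
# Shared-memory sketch: all threads live in one address space and
# write directly into the same list -- no copies, no messages.
import threading

def threaded_sum(nthreads=4, n=100):
    results = [0] * nthreads          # one slot per thread, shared by all

    def worker(tid):
        # Each thread sums a strided slice and stores it in shared memory.
        results[tid] = sum(range(tid, n, nthreads))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

if __name__ == "__main__":
    print(threaded_sum())  # 4950 -- same answer, no message passing
```

Each thread writes to its own slot, so no lock is needed here; the moment two threads update the same location, shared-memory programming demands explicit synchronization, which is where much of its difficulty hides.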

Given that the number of cores in a processor continues to grow (e.g., the new six-core processor from AMD), single memory domains (motherboards) may have anywhere between 12 and 32 cores in the near future. Here is an interesting scenario. Let’s assume that 12- to 32-core systems become commonplace. If this is enough computing power for your tasks, then how will you approach HPC programming? Will you use MPI because you may want to scale the program to a cluster, or will you use something like OpenMP or a new type of multi-core programming tool because it is easier or works better? Could a gulf in HPC programming develop? Perhaps MPI will still be used for “big cluster HPC” and other methods for “small motherboard HPC.” Of course, MPI can always be used on small core counts, but will some point-and-click thread-based tool attract more users because “MPI is too hard to program”?

If one assumes that core counts will continue to increase, the 64-core workstation may not be that far off. Back in the day, a 64-processor cluster was something to behold. Many problems still do not scale beyond this limit. Could we see a split in HPC? I don’t know, and don’t read too much into my opinion, because there are plenty of devils in the details. Just remember, no matter how you cut it, programming anything well takes diligence and hard work. And don’t let the upperclassman tell you any different.

Comments on "Small HPC"



This is a great start on what’s going on in HPC. But there is much more than multi-core CPUs. Consider the MPI implications between clusters of multi-core CPUs, each core spawning MPP kernels on hundreds of GPU process cores. Some of those simple global or shared memory sections now have multiple routing options, such as xGigE or InfiniBand.

While I see a great paradigm shift in learning parallel programming, the compute fabrics that result from hybrid, heterogeneous compute clusters mean that networked parallel/serial fabrics offer breathtaking power, but come with great complexity.

This is the playing field for serious HPC software developers today.

Ken Lloyd
Director of Systems Science
Watt Systems Technologies Inc.


I expect that shared-memory, multi-threaded HPC will become quite democratic and ubiquitous, since multi-core will be everywhere and the programming tools for this (e.g., OpenMP) are relatively simple. I think of this as “supply-led” HPC.

However, the real heavyweight macho HPC world is more “demand led”, and will continue to exploit the power of distributed-memory systems, while still press-ganging multi-core processors, GPGPUs, FPGAs and whatever other accelerator or configuration will let them run ever bigger models, ever faster.

Difficulty with programming may limit the adoption (and business success) of some “accelerators”, but there will probably always be a hard core (pardon the pun…) of deviant weirdos for whom performance is everything, and who are prepared to go to heroic lengths to exploit parallelism on even the most unwieldy hardware.

Certainly, a “holy grail” for HPC would be a high-level programming language that could compile an executable to run “transparently” over distributed memory. Some such grails, of intermediate holiness, already exist. Otherwise, a tool to automatically decompose a serial application for MPI would be nice; something analogous to what OpenMP does for multi-threading.




One of the biggest challenges for any Distributed Shared Memory (DSM) system is ‘thread migration’. While most DSM systems have managed to migrate complete processes from one node to another (and thereby spread the load), almost all of them struggle with individual thread migration. Since most of today’s applications are multi-threaded as opposed to multi-process, the effectiveness of such a solution is limited. DSM developers are getting there, but it will take a while to perfect it.

In any case, application developers will now have to look at writing hybrid parallel applications, i.e., applications that are multi-process as well as multi-threaded (e.g., the Apache HTTP Server worker module), to really exploit the power of today’s systems, be it hardware-based SMP (multi-core) or software-based SMP (DSM).
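The hybrid pattern the comment describes can be sketched in stdlib Python: processes stand in for the message-passing layer (as MPI would across nodes), and each process runs several threads over its own shared memory (as OpenMP would within a node). All names here are illustrative, and the sketch assumes a Unix-style fork start method:

```python
# Hybrid sketch: message passing between processes, shared memory
# (threads) within each process -- the "MPI + threads" pattern.
from multiprocessing import Process, Queue
import threading

def node_worker(rank, nprocs, nthreads, queue):
    partial = [0] * nthreads          # shared among this process's threads only

    def thread_work(tid):
        # Global stride interleaves work across every thread of every process.
        stride = nprocs * nthreads
        start = rank * nthreads + tid
        partial[tid] = sum(range(start, 100, stride))

    threads = [threading.Thread(target=thread_work, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    queue.put(sum(partial))           # one message per process, not per thread

def hybrid_sum(nprocs=2, nthreads=2):
    queue = Queue()
    procs = [Process(target=node_worker, args=(r, nprocs, nthreads, queue))
             for r in range(nprocs)]
    for p in procs:
        p.start()
    total = sum(queue.get() for _ in range(nprocs))
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(hybrid_sum())  # 4950
```

The payoff of the hybrid layout is visible in the reduce step: only one message per process crosses the (expensive) process boundary, while the (cheap) intra-process combination happens in shared memory.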

In short, both DSM and MPI developers face the same set of challenges.

In the future, DSM will be used for day-to-day applications, while MPI, because of its sheer ability to scale massively, will continue to be used for hardcore number-crunching stuff.

Note: HPC is a relative term. What you call ‘Small HPC’, or what Microsoft calls a ‘Personal SuperComputer’, will in the future simply be called the humble ‘Personal Computer’, or ‘PC’.

- Indivar Nair


Memory on a per-node basis will become the cost point. With the six-core AMD processor supporting upwards of 256GB of memory, one has to think of memory cost when building a large multi-core system.
This fact will come into play with application development as another possible divergence emerges: a small number of multi-core processors with large local memory, or a large number of multi-core processors with significantly smaller amounts of local memory.

OpenMP benefits from the large shared memory with any number of multi-core processors. MPI can do both, but will definitely succeed in a large multi-core system with a smaller individual node memory footprint.

There is a definite need for something to assist MPI programming along the lines of OpenMP pragmas, or compiler smarts to hide all the details. Otherwise, in the case of tens of thousands of MPI tasks (one task per core), you wind up with the old adage of “not being able to see the forest because you’re lost among the individual trees.”

Jerry Heyman



