In last month’s “Cluster Rant,” I explored the effects that multi-core systems will have on writing future applications. Unless your code is happy with a single CPU topping out at about 3 GHz, you’ll need to make a strategic commitment to adapt and embrace multi-core. Before taking a look at your options, though, let’s review.
A Tale of Two Architectures
Applying multiple processors to a computing task is something people in high-performance computing (HPC) have been doing for quite a while. The approach is called parallel computing and it allows your program to effectively use more than one processor at the same time.
In almost all cases, processors harnessed together into a “team” have to communicate with each other as your program runs. If each processor has its own memory (so-called local memory machines), interprocess communication is realized by passing messages between the processors. (A message can be thought of as a memory copy over some type of network to the remote processor.) This architecture is typical of most HPC clusters.
On the other hand, if the processors share a pool of memory (so-called shared memory machines), a shared memory location acts as the conduit for all communication. There is no memory copying, as all data is shared “in place” between processors, but there is a need to protect memory so that processors do not “stomp” on each other’s toes. This architecture is often called symmetrical multi-processor (SMP) systems, because they can run multiple, single CPU tasks at the same time.
In software, there are two ways to “express” concurrency in a program: messages and threads. (Other methods do exist, but aren’t widely used.)
And, as was stressed in the previous column, it’s important to remember that the expression of concurrency is not necessary dictated by the architecture of the underlying hardware. Both messages and threads can be implemented on shared memory machines (SMP) and local memory machines (clusters). The issue for the programmer is, as always, efficiently and portability.
Get The Message
Historically, message passing technology reflected the design of early, local memory, parallel computers. In other words, messages copy data from one system to another. Common message passing application programming interfaces (APIs) include Parallel Virtual Machine
(PVI) and the Message Passing Interface
(MPI). MPI is available from a variety of sources (see http://www.mpi-forum.org
, and http://www.mcs.anl.gov/mpi/mpich/
), both as open and proprietary source, and is used by most parallel programmers.
Message passing can work well both on SMP machines and on local memory clusters. (On an SMP machine, memory is still copied, but instead of sending it over a network, it is just a moved across system memory.) In fact, message passing applications can be moved between SMPs and clusters easily, and can even be developed and run on a single CPU (provided there is enough memory to hold each process). Even if you have a single-processor laptop, you are just one tarball away from using MPI. In addition, using message passing makes it is easy to add machines if you want to scale up your application.
However, because memory must be copied from one processor to another, the latency and speed at which messages can be transmitted and processed are often the limiting factors of message passing models.
Pulling On Threads
Operating system threads were developed because shared memory SMP designs allowed very fast, in-memory communication and synchronization between concurrent parts of a program. Threads work well on SMP systems because communication is at memory read/write speeds. As long as you isolate local data from global data, programs work efficiently and properly.
In contrast to messages, a large amount of copying can be eliminated with threads because data are shared between processes. Linux supports POSIX
threads, but instead, you’d be wise to consider the OpenMP
programming model (http://www.openmp.org
). Internally, OpenMP makes use of threads, but simplifies the process of writing code for shared memory systems.[ You can find a three-part, hands-on series about OpenMP beginning with the January 2004 “Extreme Linux” column, available online at http://www.linux-mag.com/2004-01/extreme_01.html
.] Versions of OpenMP are available from most commercial compiler vendors, and OpenMP is now finding its way into the GNU tools (See Gomp
The problem with threads — and hence OpenMP — is that it is difficult to extend the threading model beyond one SMP machine. Once you fill all the processor slots on your SMP, you cannot scale your program any further. OpenMP does not “spread codes across motherboards” like MPI.
Cache coherency can also be an issue. If data is in the cache of processor A and processor B needs to read from or write to that data from shared memory, then the shared memory becomes “dirty” and must be updated from processors A’s cache before the processor B can use it. In general, memory locality can be an issue with thread based programs. Indeed, there are cases where an MPI program can run faster on a cluster than an similarly written OpenMP application on a SMP machine.
One would assume sharing memory would be faster, but this may not necessarily be the case in all situations. Sharing memory sometimes introduces memory contention that doesn’t exist in a local memory MPI environment.
MPI or OpenMP
After all the discussion presented in this and the previous column, the question really boils down to this: When you’re sitting there, looking at your new quad desktop system (two dual-core CPUs), what are are you going to use, MPI or OpenMP?
Of course, it all depends on your application. If you think that you may never need more than four CPUs, then OpenMP would seem to be the best way to go. Conversely, if you think you may build an application that uses eight or more CPUs, you may want to start thinking about MPI. In particular, if you think you want to scale your application in the future, you may want to seriously consider MPI. And finally, if you want portability between SMP machines and clusters, MPI is the only API that can deliver on this promise. The situation is summarized in Table One.
||Shared memory SMP machine cluster performance
||Local memory scalability performance
There’s No Turning Back
No matter what method you choose, your program will have to be finely-tailored to that method. There is no #ifdef solution in this case. Indeed, the algorithm may be completely different when you are finished writing your code. This conclusion is clearly detailed by Gregory Pfister in his book In Search of Clusters.
If you choose OpenMP, you’re probably locked into the SMP world, unless you are willing to rewrite critical parts of your code in MPI at some point.
Your Head is Humming
Now that I have you totally convinced that this decision is not going to be easy, let me make it harder. First, writing anything in parallel is hard, debugging is harder still, and optimization can be almost impossible. If you have not run away yet, I’ll offer an some encouragement, at least for the message passing side of things.
If you were to optimize a message passing application, you would need to know the communication speed and latency of your interconnect and the amount of computation that needs to be done. In this way you can, in theory, determine the best way to decompose your program and run it in parallel. The main thing is that you have consistent parameters with which to work. If you have two communication speeds, one between multi-cores and one between nodes in a cluster, then the optimization gets harder, but chances are that using the slower of the two speeds will yield a good result. In general the problem is solvable.
With OpenMP applications, the situation gets a little more sticky. The key to optimizing these types of applications is data locality or making sure your processors caches are working for you instead of against you. This issue is particularly difficult when you have processes allowed to roam around on CPUs in an SMP environment.
The situation gets much worse when you have a Non-Uniform Memory Access (NUMA) system (like a Opteron SMP). The optimization problem is hard and not the kind of thing –O3 just does for you. The first thing you need is some way to keep the process on a particular core of a particular CPU so that you can then worry about the best way to feed it. The problem is not one that most developers like to think about, and it’s also why execution times on NUMA machines are sometimes inconsistent.
As You Wind On Down the Road…
There are no easy answers. Study your code, think about how and where you want to use it, consult your horoscope, and pick a path.
Personally, I’m betting on the cluster thing and messages. I like the idea that I can at least run my codes on an SMP system as well as a cluster. Besides, clusters have more blinking lights.