
MPI on Multicore, an OpenMP Alternative?

No matter how you cut it, coding for multicore is really just parallel programming. Doug Eadline explains the differences between OpenMP and MPI, when it's smart to reuse existing code, and when it's time to rewrite an application to scale better on multicore systems.

No matter how you cut it, coding for multicore is really just parallel programming. Once you’ve realized that, it’s time to look at your options: whether your existing codebase will scale, or whether you need to rewrite your code, and if so, how.

As stated in The Multicore Programming Challenge, parallel programming can be difficult. It moves the programmer closer to the hardware and further from their application space or problem. Fortunately, people like rocket scientists have been writing parallel software for quite some time in the HPC (High Performance Computing) sector.

As any good programmer knows, an existing code base can be valuable to current programming projects. First, the possibility of re-using existing code is a major incentive. Also, learning how someone else attacked a similar problem is very valuable.

In the HPC sector, most parallel programs are written using the Message Passing Interface (MPI). While MPI is normally used on large computing systems (clusters), it can also be used on a multicore processor. The “MPI proposition” may seem counter to conventional wisdom because MPI was designed for distributed memory (i.e., each core/processor has its own private memory), whereas OpenMP was designed for shared memory.

The lazy assumption suggests that OpenMP is a better solution because it was designed for shared memory. However, the possibility of re-using an existing MPI code base is worth considering before you spend a month (or months) re-inventing the software wheel. Ultimately, the question is really about efficiency. Namely, how does the performance of MPI compare to OpenMP on a multicore system?

The answer to this question is important. If I can re-use MPI codes that work well enough on multicore, then there is no need to (re)write my application using OpenMP. If, on the other hand, OpenMP or threads provide sufficient scaling benefits to justify re-writing the code, then investing the time in re-coding might be in order.

Although your own application is always the ultimate test of hardware, a comparison of the same programs written in both MPI and OpenMP would be interesting. Fortunately for us, the people at NASA (the rocket science guys) have an interest in such things as well. The venerable NAS Parallel Benchmarks suite is now available in MPI, OpenMP, Java, and HPF versions.

This enhancement means a head-to-head comparison of MPI and OpenMP is possible. (I’ll leave the Java and HPF runs as an exercise for the reader.) Before we get to the main event, however, some background on how OpenMP and MPI differ may be helpful.

OpenMP and MPI Primer

Because native Pthreads programming can be cumbersome, a higher level of abstraction, called OpenMP, has been developed. As with all higher-level approaches, OpenMP sacrifices flexibility for ease of writing code. At its core, OpenMP uses threads, but the details are hidden from the programmer.

OpenMP is implemented as compiler directives, which take the form of special comments in Fortran or pragmas in C/C++. Typically, computationally heavy loops are augmented with OpenMP directives that the compiler uses to automatically “thread the loop”. This approach has the distinct advantage that it may be possible to leave the original program essentially untouched (except for the directives) and simply recompile for a sequential (non-threaded) version in which the OpenMP directives are ignored. (Read the OpenMP Web site to get the complete picture.)
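To make the idea concrete, here is a minimal sketch in C (the array names and sizes are made up for illustration, and this is not the matrix code used later). A single pragma is all that is needed to parallelize the loop; compiled without -fopenmp, the pragma is ignored and the program runs sequentially.

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];
    int i;

    /* fill the input arrays */
    for (i = 0; i < N; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    /* The compiler converts this loop into threaded code when
       built with -fopenmp; otherwise the pragma is ignored. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}

Building with gcc -fopenmp produces the threaded version; omitting -fopenmp gives the sequential one, which is exactly the pattern used for the matrix multiplication test below.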

For those who don’t follow software trends, but instead rely on the crack Linux Magazine columnists to provide them with all the important advances, GCC 4.2 (and later) has support for OpenMP. This is important for the open source crowd, because OpenMP was only available in commercial compilers before GCC 4.2 was released.

GCC 4.2 has not found its way into all distributions, so you may need to download and build it from source if you want to play along with this article. Of course, if you have a commercial compiler, it probably already has OpenMP support.

For gcc and gfortran, OpenMP programs can be compiled by including the -fopenmp option. In order to test this new capability, I found an OpenMP version of the ubiquitous matrix multiplication program. I built two versions of the program, one with OpenMP enabled and one without:

$ gfortran -fopenmp -o matmult_omp matmult.f
$ gfortran -o matmult matmult.f

Then I ran the sequential version on an Intel Core 2 Duo system (two cores):

$ time ./matmult

real    0m9.079s
user    0m8.988s
sys     0m0.012s

The OpenMP version was run as well. Note that there is an environment variable called OMP_NUM_THREADS that tells OpenMP binaries how many threads to use. If it is not defined, one thread per CPU (core) is used. Ultimately, however, the maximum number of threads may be set by the program itself. The OpenMP results for two cores are shown below.

$ time ./matmult_omp

real    0m4.967s
user    0m9.783s
sys     0m0.018s

The OpenMP version reduced the wall clock time by forty-five percent. Astute readers may be wondering why the user time is almost double the real time. This effect is due to using two cores; i.e., the total CPU time is the sum of the time spent on each core your application uses. As we will see below, the user time can be quite a bit higher than the real time for eight cores.
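If you want to confirm how many threads the runtime will actually use (and see the effect of OMP_NUM_THREADS), a tiny check such as the following will do. This is just a sketch for illustration, not part of the matrix multiplication code.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* upper bound on the number of threads the runtime will create */
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        /* let a single thread report the team size */
        #pragma omp single
        printf("threads in this region: %d\n", omp_get_num_threads());
    }
    return 0;
}

Compile it with gcc -fopenmp and run it with and without OMP_NUM_THREADS set (for example, OMP_NUM_THREADS=2) to see the difference.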

In contrast to OpenMP, MPI uses a software library to send data from one process to another. Each process has its own memory space, and thus MPI is basically a message copying methodology. In addition, MPI makes no distinction about where a process runs: it can be on the same machine or on another machine. If one were to time 8-way OpenMP and MPI versions of the same program, the following would result (OpenMP is run first):

$ time bin/cg.B
real    1m11.735s
user    9m23.287s
sys     0m2.012s

$ time mpirun -np 8 bin/cg.B.8
real    1m16.138s
user    0m0.000s
sys     0m0.004s

In the first case, OpenMP shows a real time of just over a minute with a user time of almost nine and a half minutes, indicating a good speedup. In the second case, the MPI run shows a comparable real time, but zero user time. This result is easily understood in terms of how MPI jobs are run. The mpirun command starts each separate MPI process and then waits until they are finished, so it accumulates essentially no user time of its own. OpenMP jobs, however, run as a single process, so the OS attributes the CPU time of all the threads to that one process.
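For readers who have not seen MPI code, here is a minimal sketch in C of the model described above: mpirun starts one process per rank, each with its own private memory, and data moves only through explicit messages. The program is for illustration only (it needs at least two ranks) and is not part of the NAS suite.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes?  */

    if (rank == 0) {
        value = 42;  /* lives in rank 0's private memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("rank 0 of %d sent %d\n", size, value);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Compile it with mpicc and launch it with mpirun -np 2 (or more); each rank shows up as a separate process in top, which is what you will see in Figure Two below.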

The Process View

While we are talking about OpenMP and MPI, there’s one big difference between these programming methods in terms of the OS process space. OpenMP programs run as a single process, and the parallelism is expressed as threads. This behavior can be seen quite clearly on an eight-core server (two quad-core processors). For instance, examining a running OpenMP program using top shows only a single process running. (See Figure One.)

Figure One: OpenMP program (cg.B) running on eight cores.

In contrast to OpenMP, MPI actually starts one process per core via the mpirun -np 8 ... command. This situation is shown in Figure Two, where an MPI version of the same program is running; note that there are now eight processes. The processor (core) loads are about the same in both cases, however.

Figure Two: MPI program (cg.B.8) running on eight cores.

One final and subtle point: in OpenMP, communication is through shared memory, which means threads share access to the same memory locations. With MPI programs on SMP systems, communication also goes through shared memory, but processes send messages by copying data from their private memory into shared buffers.

Obviously, sharing memory locations seems more efficient than sending copies of memory to other processes, but it all depends. In the MPI process model, each process has exclusive access to all of its own memory. For some programs this may be more efficient, because it is better to copy data (send a message) than to wait for shared memory access. In the OpenMP model, on the other hand, threads can share access to all memory in the process space. In this case, some programs may be much more efficient because the large overhead of copying memory is not needed.

Looking at the Numbers

An eight-core Intel server (two quad-core Clovertown processors) was used to run the tests. The OpenMP tests used gcc/gfortran version 4.2. The MPI tests used LAM version 7.1.2. The OpenMP and MPI suites have six programs in common, and each of these was run five times and averaged (Class B problem sizes were used). The results are given in Mops (million operations per second) in Table One. The percent difference is also shown.

Test   OpenMP (gcc/gfortran 4.2)   MPI (LAM 7.1.2)   Percent Difference
CG          790.6*                      739.1                 7%
EP          166.5*                      162.8                 2%
FT         3535.9*                     2090.8                69%
IS           51.1                       122.5*              139%
LU         5620.5*                     5168.8                 9%
MG         1616.0                      2046.2*               27%

Table One: Results for the OpenMP/MPI benchmarks in Mops. (The winning result in each row is marked with an asterisk.)

Tests CG and EP are about the same. Indeed, EP is a good check, as both methods should produce a similar result because there is very little communication. OpenMP is the clear winner with FT, but MPI does surprisingly better with the latency-sensitive IS benchmark. In the remaining tests, OpenMP does best with the LU benchmark, while MPI does best with MG. Overall, the comparison is a bit of a draw.

The results are clear on one point: there is no definitive winner in this match-up. This may come as a surprise to those who would assume OpenMP would easily beat MPI on a multicore machine (or any SMP machine, for that matter). Maybe MPI is good enough to stand toe-to-toe with OpenMP for many applications.

In only one case (FT) did OpenMP run away from MPI. In other cases (IS and MG), MPI was the clear winner, and taking the time to convert that code to OpenMP would actually result in a performance loss. The story is far from over; more benchmarks are in order using other hardware and commercial compilers.

Other Things to Consider

Getting back to our question, “do I need to re-code my MPI programs for these multicore thingies?,” the answer is a resounding maybe not. MPI may just be good enough in many cases. Again, more data, and results for your own application, are needed before making solid recommendations.

Another important question to ask is how scalable your application is. As more processors are added, parallel execution will always hit a point of diminishing returns, which means that creating more threads or processes will not improve performance and may actually hurt it. The size of your data set may also come into play. One of the advantages of distributed MPI programs is the ability to spread large data sets over many processors, thereby solving problems that would never fit in a single SMP memory space.

If you’re considering writing a new application from scratch, the choice of OpenMP or MPI involves other considerations. OpenMP is designed for shared memory (SMP) machines. As multicore continues to advance, the number of cores in an SMP machine will continue to grow, but OpenMP is not designed to run across multiple machines the way MPI is.

If you want your application to be portable across clusters and SMP machines, MPI might be the best solution. If, however, you do not envision using more than eight or sixteen cores, then OpenMP is probably one of your best choices, provided the benchmarks point in that direction. From a conceptual standpoint, those with experience in both paradigms say that OpenMP and MPI involve a similar learning curve and level of nuance. There are no shortcuts or free lunches with OpenMP, or MPI for that matter.

Comments on "MPI on Multicore, an OpenMP Alternative?"

hzmonte

Could you post (the link to) the OpenMP and MPI source code for OpenMP/MPI benchmark apps on the web?

deadline

There is a link in the article. Here it is again:

NAS Parallel Suite:

http://www.nas.nasa.gov/Resources/Software/npb.html

tutu

love this article, it helps me alot


Dear Douglas Eadline,

The results and the conclusions are very interesting. I do have two related queries:

1. Are the timings for the simulations obtained by running the programs in a Release (production) mode? We did some preliminary tests using the OpenMP constructs i.e. mainly breaking the loops into threads. The Debug version did give reasonable scaling. The Release version gave literally no scaling. Have you encountered such phenomena?

2. Do you have some corresponding results/observations for engineering based problems involving CFD/FEM codes (with domain decomposition). Is there some material (publications) on the above?

Thanks for your time,
Ravi.

