MPI in Thirty Minutes

Learn how to obtain, build, and use an MPI stack for Linux machines. This tutorial will take you from "hello world" to parallel matrix multiplication in a matter of minutes.

OpenMP has several strong points: it is a very simple system to use and it is widely available in compilers for most major platforms. There are, however, other methods to express parallelism in your code. On distributed parallel systems, like Linux clusters, the Message Passing Interface (MPI) is widely used. MPI is not a programming language, but rather a standard library that is used to send messages between multiple processes. These processes can be located on the same system (a single multi-core SMP system) or on a collection of distributed servers. Unlike OpenMP, the distributed nature of MPI allows it to work on almost any parallel environment.

On a multi-core system a single shared address space is the basis for a shared memory programming model. OpenMP makes relatively easy work of this environment. Sharing nothing between “threads”, or independent processes running on possibly independent machines, is typically called a shared nothing programming model, or more succinctly, a distributed programming model. There is nothing that precludes a shared nothing or distributed program from running on a shared memory system.

Distributed Programming Models

Distributed memory models assume that every CPU has access to its own memory space and can alter its own local variables. To alter the memory or variables of another (possibly remote) process, you need to send that process a message.

Again, the remote process can be on the same machine or on another machine, and it has to agree to accept the message. In essence, distributed memory programming requires copying memory from one memory space to another by sending and receiving messages. Unlike the shared memory model, you, the programmer, have to consciously move data between “threads” or processes. The compiler will not handle this for you as it does in OpenMP. This requirement often makes MPI programming a bit more difficult than using OpenMP.

In addition to using physically remote machines, MPI has the additional “advantage” of forcing you to divide your memory use between processes. In essence, you have greater control over locality. On modern CPUs with multiple cores, caches and memory hierarchies, you want to enable as much data re-use as possible. The distributed model forces you to think about that, as you have to decide what to move, and when to move it.

The rest of this article will focus upon distributed memory programming using MPI (Message Passing Interface) on a single multi-core system. We will provide some basic examples similar to the OpenMP tutorial.

Ingredients

  1. Source Code to parallelize using MPI
  2. One multi-core (or cluster) computer
  3. A few hundred megabytes of disk space (for compiler)
  4. A few gigabytes of ram (for program examples)
  5. An editor (pick your favorite)
  6. A compiler
  7. An MPI stack

The preparation time before programming is a little more than an hour if you have to build your own compiler and MPI packages, or just a few minutes if you can install pre-built compiler and MPI packages.

If you have already checked your cores, memory size, file space, and installed the GNU compilers from the previous tutorials, then you can skip to the Installing Open MPI section.

Preliminary Configuration

We are going to assume a Linux based machine with a few cores. I am using a Pegasus many-core workstation. To see how many cores you have in your unit, try this:

grep 'processor.*:'  /proc/cpuinfo  | wc -l

The Pegasus system I’m using reports 8 CPUs. We need to see how much disk space we have to work with. This is fairly easy to do. The df command is your tool of choice. A quick df -h will tell us what we have available, though it might not tell us if we can use it.

df -h `pwd`
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3              62G   29G   33G  48% /

This tells us that the filesystem holding our home directory has 33GB available. So let’s make ourselves a workspace, and continue.

mkdir ~/workspace
cd ~/workspace

We would like to know how much physical RAM we have, so try this:

grep "MemTotal:" /proc/meminfo

The Pegasus machine has 8184008 kB, or about 8 GB ram. For my editor, I may use Vim or pico. Any text editor should work.

We need a compiler. The Free Software Foundation’s GNU Compiler Collection (GCC) provides excellent free compilers, and commercial compilers are available from Intel, the Portland Group, and others; all of them support building and using MPI stacks. This may represent a dilemma for you: why should you purchase another compiler if you can get GCC for free? In short, the commercial compilers do offer more in the way of optimization, support, and added elements of value.

If you can get GCC packages for your distribution, it is usually easiest to install those. Unfortunately, your system administrators may not let you modify the machines, so you might be forced to install the packages yourself into your own directories. Luckily, it is not hard. To build GCC, you need GMP and MPFR. Get GMP, build it, and install it as follows:

cd ~
wget http://ftp.sunet.se/pub/gnu/gmp/gmp-4.2.2.tar.bz2
cd workspace
tar -xjvf /home/joe/gmp-4.2.2.tar.bz2
cd gmp-4.2.2/
./configure --prefix=/home/joe/local
make -j4
make install

Get MPFR, build it, and install it:

cd ~
wget http://www.mpfr.org/mpfr-current/mpfr-2.3.0.tar.bz2
cd workspace
tar -xjvf /home/joe/mpfr-2.3.0.tar.bz2
cd mpfr-2.3.0
./configure --prefix=/home/joe/local --with-gmp=/home/joe/local
make -j4
make install

Next, get GCC, uncompress, configure, and build it as follows:

cd ~
wget ftp://ftp.gnu.org/gnu/gcc/gcc-4.2.2/gcc-4.2.2.tar.bz2
cd workspace
tar -xjf /home/joe/gcc-4.2.2.tar.bz2
mkdir build
cd build
../gcc-4.2.2/configure --prefix=/home/joe/local          \
        --with-gmp=/home/joe/local                       \
        --with-mpfr=/home/joe/local                      \
        --enable-languages=c,c++,fortran                 \
        --disable-multilib

time make -j8 bootstrap
make install

Compilation required 12.25 minutes on 8 processors. It would take about eight times longer on one processor, which is why we passed the -j8 switch to make (a parallel build).

Now to use this compiler, we need to set an environment variable. This setting enables the linker to find the libraries.

export LD_LIBRARY_PATH=/home/joe/local/lib64:/home/joe/local/lib

The compiler itself is located at /home/joe/local/bin/gcc, which can be set in the CC variable in the Makefile. Of course your path will be different.

One more thing is needed to correct a bug in the installation for 64 bit platforms. (It shouldn’t be needed on 32 bit platforms.)

ln -s /home/joe/local/lib64/libgomp.spec  \
      /home/joe/local/lib/libgomp.spec

Without this, we might get errors that look like the following on 64 bit platforms:

gcc: libgomp.spec: No such file or directory

As a sanity check, let’s compile a simple test case. Here’s the classic “hello world” C program.

#include <stdio.h>

int main(int argc, char *argv[])
{
  printf("hello MPI user!\n");
  return(0);
}

We will call this hello.c and place it in our home directory. To verify, let’s compile and run it:

/home/joe/local/bin/gcc hello.c -o hello.exe
./hello.exe

If everything worked, when you run it you should see:

hello MPI user!

Our new Fortran compiler should also work. Here is a quick translation of the hello.c code into Fortran:

program hello

print *,"hello Fortran MPI user!"
end

And a quick compilation and execution:

/home/joe/local/bin/gfortran hello.f \
        -o hello-fortran.exe
./hello-fortran.exe

If everything worked you should see something like the following when you run this code:

hello Fortran MPI user!

Installing Open MPI

There are multiple open source versions of MPI available on the Web, but for the purposes of this article, we’ll use OpenMPI. Get OpenMPI, uncompress, configure, and build it.

cd ~
wget http://www.open-mpi.org/software/ompi/v1.2/downloads/openmpi-1.2.4.tar.bz2
cd workspace
tar -xjf /home/joe/openmpi-1.2.4.tar.bz2
cd openmpi-1.2.4
CC=/home/joe/local/bin/gcc                      \
CXX=/home/joe/local/bin/g++                     \
F77=/home/joe/local/bin/gfortran                \
FC=/home/joe/local/bin/gfortran                 \
LDFLAGS=-L/home/joe/local/lib                   \
 ./configure --prefix=/home/joe/local           \
        --enable-mpirun-prefix-by-default       \
        --enable-static
time make -j8
make install

Now you should have an operational MPI stack to work with.

Introduction to MPI

Now that we have an operational MPI stack, we can start to explore MPI. At its core, MPI is a set of function calls and libraries that implement a distributed execution of a program. Distributed does not necessarily mean that you must run your MPI job on many machines. In fact, you could run multiple MPI processes on a laptop.

MPI assumes you will work with processes, which do not share the same memory address space as the other processes that MPI launches. So if one process makes a change to a variable, then none of the other processes can see that change. If you want another process to be aware of the change, you have to explicitly pass this information over to the other process.

This model is somewhat harder to grasp. Unlike in OpenMP, you cannot imagine all your variables in a big block of memory, and all processes can see all the variables. You have to think about things in terms of N independent programs running through your code all at the same time.

All the processes can perform IO operations: file, print, and so on. So things as simple as our “hello world” application are able to run in parallel. Serialization (non-parallel execution) happens either because you want it (time-ordered file output) or because external factors enforce it (a non-parallel IO file system).

MPI basics

MPI operates via function calls. Fortran users should use something like this to make use of MPI:

call MPI_Bcast( ... )

In C and C++, it’s quite similar:

#include <mpi.h>
// ...
int rc;
// ...
rc = MPI_Bcast( ... )

Unlike OpenMP, you shouldn’t think of “parallel regions.” Your entire code is running in parallel. You should exploit this. To initialize the MPI environment, you need to include some function calls in your code. The following are typically used:

    /* add in MPI startup routines */
    /* 1st: launch the MPI processes on each node */
    MPI_Init(&argc,&argv);

    /* 2nd: request a thread id, sometimes called a "rank" from
            the MPI master process, which has rank or tid == 0
     */
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);

    /* 3rd: this is often useful, get the number of threads
            or processes launched by MPI; ranks run from 0
            to this number minus 1
     */
    MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

When the program reaches that first MPI_Init call, it passes the arguments it received on the command line to the MPI runtime, which coordinates the copies of the binary that the launcher started on the processors.

At this point in time, you have gone from a single process to NCPU (or N Cores) coordinated processes. These processes do not share variables. Your job is to figure out what process needs what data, when they need it, and get it to them.

The MPI_Comm_rank function call returns data on a specific process. A process id of 0 is the “master” process, the original process which did in fact launch all the others. The process id or “rank” is used to identify the different processes in your program.

The MPI_Comm_size function call returns the number of threads or processes being used for this program.

With this information, your program is now aware of the parallel environment. This is explicitly the case. Note that you could run your code on NCPU=1, in which case you have effectively serial operation.

A way to visualize this is to imagine an implicit loop around your program, where you have NCPU iterations of the loop. These iterations all occur at the same time, unlike an explicit loop. The number of CPUs or cores is controlled by specifying a value to the mpirun or mpiexec command using the -np argument. If it is not set, it could default to 1 or the number of CPUs/Cores on your machine.

That done, let’s parallelize hello.c. (Note: all source code used in this article is available for download.) We will put in the explicit function calls for the MPI processes, and add some additional “sanity checking” so we know where and how the code ran. The original code looks like this:

#include <stdio.h>

int main(int argc, char *argv[])
{
  printf("hello MPI user!\n");
  return(0);
}

The MPI version of the code looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* for gethostname() */

#include <mpi.h>
int main(int argc, char *argv[])
{
 int tid,nthreads;
 char *cpu_name;

  /* add in MPI startup routines */
  /* 1st: launch the MPI processes on each node */
  MPI_Init(&argc,&argv);

  /* 2nd: request a thread id, sometimes called a "rank" from
          the MPI master process, which has rank or tid == 0
   */
  MPI_Comm_rank(MPI_COMM_WORLD, &tid);

  /* 3rd: this is often useful, get the number of threads
          or processes launched by MPI; ranks run from 0
          to this number minus 1
   */
  MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

  /* on EVERY process, allocate space for the machine name */
  cpu_name    = (char *)calloc(80,sizeof(char));

  /* get the machine name of this particular host ... well
     at least the first 80 characters of it ... */
  gethostname(cpu_name,80);

  printf("hello MPI user: from process = %i on machine=%s, of NCPU=%i processes\n",
         tid, cpu_name, nthreads);
  MPI_Finalize();
  return(0);
}

Unlike OpenMP, there is no “parallel region” to enclose; the entire code is a “parallel region”, but nothing is shared. Let us create a simple Makefile to handle building this. The Makefiles are available with the Source Code. You may need to adjust the CC and LDFLAGS to reflect your environment.

### Basic Makefile for MPI

CC      = /home/joe/local/bin/mpicc
CFLAGS  = -g -O0
LD      = /home/joe/local/bin/mpicc
LDFLAGS = -g

PROGRAM = hello-mpi

all:    ${PROGRAM}.exe

${PROGRAM}.exe:         ${PROGRAM}.o
        ${LD} ${LDFLAGS} $< -o ${PROGRAM}.exe

${PROGRAM}.o:           ${PROGRAM}.c
        ${CC} ${CFLAGS} -c $< -o ${PROGRAM}.o

clean:
        rm -f ${PROGRAM}.o ${PROGRAM}.exe

The large initial spaces are tabs; ordinary spaces won’t work with make. With this Makefile, we can automate the build and rebuild of these programs.

make -f Makefile.hello-mpi

This should result in:

/home/joe/local/bin/mpicc  -g -O0  -c hello-mpi.c -o hello-mpi.o
/home/joe/local/bin/mpicc  -g  hello-mpi.o -o hello-mpi.exe

and a working binary executable called hello-mpi.exe. To run the binary, use the mpirun command below. In this example we are starting eight processes. Each process will be called hello-mpi.exe and they will communicate with each other using MPI.

/home/joe/local/bin/mpirun -np 8 -hostfile hostfile ./hello-mpi.exe
hello MPI user: from process = 0 on machine=pegasus-i, of NCPU=8 processes
hello MPI user: from process = 1 on machine=pegasus-i, of NCPU=8 processes
hello MPI user: from process = 3 on machine=pegasus-i, of NCPU=8 processes
hello MPI user: from process = 4 on machine=pegasus-i, of NCPU=8 processes
hello MPI user: from process = 5 on machine=pegasus-i, of NCPU=8 processes
hello MPI user: from process = 2 on machine=pegasus-i, of NCPU=8 processes
hello MPI user: from process = 6 on machine=pegasus-i, of NCPU=8 processes
hello MPI user: from process = 7 on machine=pegasus-i, of NCPU=8 processes

There are some additional arguments worth discussing. First, the -hostfile argument provides a filename that tells mpirun the host or hosts on which to run the program. For this tutorial, we will use a single hostfile with “localhost” as our only host. You can use any name for this file. Second, since we have multiple cores we can adjust the number of processes to match the number of cores by using the -np parameter.

Note: -np is the number of MPI processes to run. If multiple hosts are in the hostfile, then mpirun will attempt to run processes on those hosts; you will need to have MPI properly installed on all the listed hosts. It is also possible to designate the number of cores for each host in the hostfile. Most mpirun commands will traverse the list of hosts and assign processes to each host. If the number of processes (-np) is greater than the number of hosts, then mpirun will loop back to the first host in the hostfile and continue until all processes have been assigned. In our case we are using one host (localhost), so all processes run on a single machine. Consult the man page for mpirun for more information.
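For reference, the hostfile for this tutorial needs only one line. Something like the following should work with Open MPI, where the optional slots parameter caps how many processes mpirun places on that host before considering it oversubscribed (the value 8 matches our core count and is just an example):

```
localhost slots=8
```

With multiple machines, you would list one host per line, each with its own slots value.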

By adjusting the value of the -np command-line argument, we can adjust the number of execution processes. If we set 1 process we get one print statement:

/home/joe/local/bin/mpirun -np 1 -hostfile hostfile ./hello-mpi.exe
hello MPI user: from process = 0 on machine=pegasus-i, of NCPU=1 processes

We can set more processes than CPUs/Cores, but this just means they are “oversubscribed.”

/home/joe/local/bin/mpirun -np 16 -hostfile hostfile ./hello-mpi.exe
hello MPI user: from process = 0 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 1 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 2 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 3 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 4 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 5 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 6 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 7 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 8 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 9 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 10 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 12 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 13 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 14 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 15 on machine=pegasus-i, of NCPU=16 processes
hello MPI user: from process = 11 on machine=pegasus-i, of NCPU=16 processes

This result is nice: we have a working MPI. It is worth noting at this point that you only need a few MPI functions for most of the programs you will write. MPI provides many function calls, offering extremely fine-grained control and data transfer elements, but in many cases you will not need these, certainly not for initial ports to MPI. Subsequent rewrites of various algorithms can make use of the more complex functions.

Before we do something useful with this, let’s explore a few techniques that might be helpful. We can use several MPI function calls to query and control our environment. Our new program incorporates several of these, along with a few “tricks” I have found useful over the years.

The new hello-mpi code now looks like this (named hello-mpi-2.c):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* for gethostname() */

#include <mpi.h>
int main(int argc, char *argv[])
{
 int tid,nthreads;
 char *cpu_name;
 double time_initial,time_current,time;

  /* add in MPI startup routines */
  /* 1st: launch the MPI processes on each node */
  MPI_Init(&argc,&argv);
  time_initial  = MPI_Wtime();
  /* 2nd: request a thread id, sometimes called a "rank" from
          the MPI master process, which has rank or tid == 0
   */
  MPI_Comm_rank(MPI_COMM_WORLD, &tid);

  /* 3rd: this is often useful, get the number of threads
          or processes launched by MPI; ranks run from 0
          to this number minus 1
   */
  MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

  /* on EVERY process, allocate space for the machine name */
  cpu_name    = (char *)calloc(80,sizeof(char));

  /* get the machine name of this particular host ... well
     at least the first 80 characters of it ... */
  gethostname(cpu_name,80);
  time_current  = MPI_Wtime();
  time  = time_current - time_initial;
  printf("%.3f tid=%i : hello MPI user: machine=%s [NCPU=%i]\n",
         time, tid, cpu_name, nthreads);
  MPI_Finalize();
  return(0);
}

We can compile and run it with 8 processes.

make -f Makefile.hello-mpi-2
/home/joe/local/bin/mpicc  -g -O0  -c hello-mpi-2.c \
                -o hello-mpi-2.o
/home/joe/local/bin/mpicc  -g  hello-mpi-2.o \
                -o hello-mpi-2.exe

/home/joe/local/bin/mpirun -np 8 -hostfile hostfile ./hello-mpi-2.exe
0.000 tid=0 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=1 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=3 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=4 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=5 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=2 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=6 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=7 : hello MPI user: machine=pegasus-i [NCPU=8]

The first number you see is the elapsed time since the start of the program. The second number is the process or thread number/ID (the tid variable in the program). Notice that the output does not necessarily come out in thread or process order. It doesn’t necessarily come out in time order either, though it is hard to tell from this example. One of the tricks that I have learned and use is to tag each line with either the thread id or the time, and then sort it.

/home/joe/local/bin/mpirun -np 8 -hostfile hostfile ./hello-mpi-2.exe | sort -n
0.000 tid=0 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=1 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=2 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=3 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=4 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=5 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=6 : hello MPI user: machine=pegasus-i [NCPU=8]
0.000 tid=7 : hello MPI user: machine=pegasus-i [NCPU=8]

Now we can tell which thread did what, and when it did it. Tricks like this are helpful when debugging parallel programs. Now it is time to move on to where the majority of MPI’s power becomes apparent to end users.

Programs With Loops

MPI helps you in a number of clever ways, allowing you to add multiple processes to your program without thinking through all the details of process setup and tear-down, and the details of inter-process communication. You can focus on your code, hand data to function calls, and not worry about opening sockets, or moving data yourself — unless you wish to do this.

As usual, it helps to start out with an understanding of where your program is spending its time. Without that, you could wind up porting code that takes very little time in the overall execution path, and leave the time expensive code unchanged. Additionally, it is important to test various sizes of runs, so you understand which portions of your code scale well, and which portions may need assistance.

Profiling is the best way to do this. We will use a “poor man’s profiler”: basically, timing calipers around sections of the code. This technique will give us approximately millisecond-resolution data (it is limited to the clock timer tick interrupt rate, which could be anywhere from 1 to 10 milliseconds). You will not get single-shot data more accurate than the timer resolution. In order to get better resolution, you need to iterate enough that you can calculate an average time per iteration. It is also worth noting that OS jitter (interrupts for disk I/O, network I/O, and running an OS in general) will also impact your measurements somewhat, so please take this into account if you would like more precise data.

We implement calipers in the code using some library routines, a data structure, a counter, and a loop at the end. Our test code is named rzf.c. It computes the Riemann Zeta Function of order n. It does this by computing a sum using a loop. Take a look at the comments for more details, but it computes the zeta function:


zeta(n) = sum_{k=1}^{infinity} 1/k^n        (Source: Wikipedia)

To run the serial version of the code, first build the executable (make -f Makefile.rzf), then run it as follows:

./rzf.exe -l INFINITY -n n

Where INFINITY is an integer value (e.g. 1000000) giving how many terms you would like to use for your sum, and n is the argument to the Riemann Zeta Function. If you use 2 for n, then it will calculate π for you as well. The sum is basically a loop that looks like the following:

sum     = 0.0;
for( k=1 ; k<=inf ; k++)
 {
  sum += 1.0/pow(k,(double)n);
 }

As it turns out, for numerical stability (roundoff error accumulation) reasons, you should run the sum in the other order (from inf down to 1), which we do in the code.

Unlike OpenMP, we cannot think of simply parallelizing this at a loop level. More about this in a moment. This loop is an example of something called a reduction operation. Specifically, the loop is a sum reduction. A reduction operation occurs (in a technical sense) when you reduce the rank or number of dimensions of something. In this case, imagine that we have a big long string (or vector) of numbers.

[1^-2, 2^-2, 3^-2, ... , inf^-2]

We are going to reduce these numbers to a single number by summing them (taking a dot product with [1, 1, 1, ..., 1]). First things first: how does the code perform, and specifically, where is the slow area? Let’s run it and find out.

time ./rzf.exe -n 2 -l 1000000000
D: checking arguments: N_args=5
D: arg[0] = ./rzf.exe
D: arg[1] = -n
D: N found to be = 2
D: should be 2
D: arg[2] = 2
D: arg[3] = -l
D: infinity found to be = 2
D: should be 1000000000
D: arg[4] = 1000000000
D: running on machine = pegasus-i
zeta(2)  = 1.644934065848226
pi = 3.141592652634863
error in pi = 0.000000000954930
relative error in pi = 0.000000000303964
Milestone 0 to 1: time = 0.000s
Milestone 1 to 2: time = 50.946s

real    0m50.949s
user    0m50.911s
sys     0m0.040s

Using -n 2 and -l 1000000000, the loop took about 51 seconds to execute. The loop is the most time consuming portion of this program. It makes sense to pay attention to it, and see what we can do with it.

The wallclock and user times are important as well. In this case, the wallclock is about 3 milliseconds more than the loop time, and the user CPU time is a little less than the wallclock. The sum of the user and sys times very nearly equals the wallclock time. This result is good.

It is important to note that as you scale up your parallel application, this time sum should remain very nearly constant. This goal may sound counter-intuitive, but it means that you are effectively partitioning work among the N CPUs/Cores you are using. If this sum increases significantly with increasing processor count, you may not be operating as effectively as you could be, and your code won’t scale.

Can we get any benefit out of running this in parallel? And how hard is it? Well, let’s add in the MPI bits as we did before, and then think about how to run this loop in parallel. First make a copy of the program and Makefile (Note: the source code contains the program and Makefile with the changes we are making), rename them rzf_2.c and Makefile.rzf_2, and add in the MPI elements. The rzf_2.c is more complex, and requires some explanation.

Remember, the loop is the time expensive portion of this code, so after setting up the MPI processes, wouldn’t it be nice to partition or decompose the loop such that the calculations occurred in parallel? Imagine that our loop had an “outer loop” around it, with some additional helper bits. The “outer loop” sets the lower and upper bounds of the index for the inner loop.

    loop_min    = 1 +  (int)((tid + 0)  *  (inf-1)/NCPUs);
    loop_max    =      (int)((tid + 1)  *  (inf-1)/NCPUs);

    for(i=loop_max-1;i>=loop_min;i--)
       {
          sum += 1.0/pow(i,(double)n);
       }

The “outer loop” is indexed by the value of tid. Imagine (though you don’t see it there) that you have:

for(tid=0; tid < NCPU; tid++)
 {
   /*  the above inner loop */
 }

This is effectively what MPI provides you, the caveat being that MPI is running these loop bodies in parallel (simultaneously) on different processes, and you cannot trivially share data between loops (tid) without explicitly moving it.

Before we continue, look at the construction of loop_min and loop_max. Note that we have something that looks like tid * inf, and for large enough inf and tid we will overflow the integer maximum (2^31-1). Specifically, let’s use Octave to investigate (Matlab users should be able to perform the same calculations without changing the code):

octave:1> inf=1000000000
inf =  1.0000e+09
octave:2> tid=[0:7]
tid =

   0   1   2   3   4   5   6   7

octave:3> temp = tid .* (inf - 1)
temp =

 Columns 1 through 6:

   0.0000e+00   1.0000e+09   2.0000e+09   3.0000e+09   4.0000e+09   5.0000e+09

 Columns 7 and 8:

   6.0000e+09   7.0000e+09

octave:4> dec2bin([temp,2^31-1])
ans =

000000000000000000000000000000000
000111011100110101100100111111111
001110111001101011001001111111110
010110010110100000101110111111101
011101110011010110010011111111100
100101010000001011111000111111011
101100101101000001011101111111010
110100001001110111000010111111001
001111111111111111111111111111111

The second bit in from the left is the sign bit for 32 bit integers. If it is "1", then the quantity is negative (the 32 bit integer happily throws away the bits above it); that is, you get a wrap-around effect, which you will see reflected in the index calculation when some indices come out negative.

This is an issue because you want to perform the multiplication before the division, so as not to lose bits of precision; to do that safely, you need to operate at a higher precision. The rzf_3.c code has been modified for this reason, and we use this version so you can see the difference between it and the naive rzf_2.c version.

When we run this version, each process sets its own value of sum based upon the calculation over the index range it had. These are partial sums. We have to further reduce them to a sum. Hence we will use a sum reduction. First, it is worth your time to relabel the sum in the inner loop as p_sum to reflect its nature as a partial sum. Second, we have to add an explicit sum reduction call to move the data back to the main process. It turns out this is fairly easy. We add in a line after the summation finishes, that looks like this:

MPI_Reduce(&p_sum,&sum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);

This calls the MPI_Reduce function, taking values from every processes double precision p_sum variable, and combining them using sum reduction into the tid == 0 variable named sum.

That's our message passing. Not hard at all, but you still have to think about it.

Now, recall that the sequential loop took about 50.95 seconds. For 8 processors, if the work is evenly divided among processes, we should get about 50.95/8, or roughly 6.37 seconds, of run time for the loop. Don't forget to make the program (make -f Makefile.rzf_3). The output will look something like this (trimmed for space):

time /home/joe/local/bin/mpirun -np 8 -hostfile hostfile ./rzf_3.exe -n 2 -l 1000000000
D[tid=0]: checking arguments: N_args=5
D[tid=0]: arg[0] = ./rzf_3.exe
D[tid=0]: NCPUs found to be = 8
D[tid=0]: arg[1] = -n
D[tid=0]: n found to be = 2
D[tid=0]: should be 2
D[tid=0]: NCPUs found to be = 8
D[tid=0]: arg[2] = 2
D[tid=0]: NCPUs found to be = 8
D[tid=0]: arg[3] = -l
D[tid=0]: infinity found to be = 2
D[tid=0]: should be 1000000000
D[tid=0]: NCPUs found to be = 8
D[tid=0]: arg[4] = 1000000000
D[tid=0]: NCPUs found to be = 8
D[tid=0]: running on machine = pegasus-i
D[tid=0]: loop_max=124999999, loop_min=1
D[tid=1]: checking arguments: N_args=5
D[tid=1]: arg[0] = ./rzf_3.exe
D[tid=1]: NCPUs found to be = 8
D[tid=1]: arg[1] = -n
D[tid=1]: n found to be = 2
D[tid=1]: should be 2
D[tid=1]: NCPUs found to be = 8
D[tid=1]: arg[2] = 2
D[tid=1]: NCPUs found to be = 8
D[tid=1]: arg[3] = -l
D[tid=1]: infinity found to be = 2
D[tid=1]: should be 1000000000
D[tid=1]: NCPUs found to be = 8
D[tid=3]: checking arguments: N_args=5
D[tid=3]: arg[0] = ./rzf_3.exe
D[tid=3]: NCPUs found to be = 8
D[tid=3]: arg[1] = -n
D[tid=3]: n found to be = 2
D[tid=3]: should be 2
D[tid=3]: NCPUs found to be = 8
D[tid=3]: arg[2] = 2
D[tid=3]: NCPUs found to be = 8
...
D[tid=7]: NCPUs found to be = 8
D[tid=7]: arg[2] = 2
D[tid=7]: NCPUs found to be = 8
D[tid=7]: arg[3] = -l
D[tid=7]: infinity found to be = 2
D[tid=7]: should be 1000000000
D[tid=7]: NCPUs found to be = 8
D[tid=7]: arg[4] = 1000000000
D[tid=7]: NCPUs found to be = 8
D[tid=7]: running on machine = pegasus-i
D[tid=7]: loop_max=999999999, loop_min=875000000
[tid=5] Milestone 1 to 2: time = 6.335s
[tid=1] Milestone 1 to 2: time = 6.336s
[tid=2] Milestone 1 to 2: time = 6.341s
[tid=7] Milestone 1 to 2: time = 6.337s
[tid=4] Milestone 1 to 2: time = 6.344s
[tid=6] Milestone 1 to 2: time = 6.349s
[tid=3] Milestone 1 to 2: time = 6.421s
zeta(2)  = 1.644934065848226
pi = 3.141592652634863
error in pi = 0.000000000954930
relative error in pi = 0.000000000303964
[tid=0] Milestone 0 to 1: time = 0.000s
[tid=0] Milestone 1 to 2: time = 6.421s

real    0m8.520s
user    0m0.008s
sys     0m0.008s

What we measure is 6.33-6.41 seconds, quite close to the 6.37 we predicted. Notice that the estimated error is different than in the single core run. This result leads into a whole other discussion, which we will touch on very briefly below. The point is that our code, or at least that loop, is now about 8 times faster on 8 CPUs than on one CPU. That is quite good, especially considering how little work it required.

Notice also that the real (wall clock) time is about 2 seconds longer than the loop time. This difference is the cost of setting up and launching processes on this system: about 2 seconds to launch 8 processes, averaging about 0.25 seconds per process. This launch time is expensive relative to the calculation time for this particular example, but in general, the launch time should be a small fraction of the overall run time.

Very briefly, the way roundoff error accumulates in these calculations does have an impact upon the final results. Running this sum in the other direction, which 99+% of people would normally do, accumulates error in a different order, leading to a different result. That is, when you change the order of your floating point computations, you change the roundoff error accumulation. It is possible to observe a catastrophic loss of accuracy, impacting the first and second digits of a summation like this, by not being aware of these issues.

Application: Matrix Multiply

We now have had a fairly "trivial" set of examples, and we have used up about one half of our 30 minutes. The hello case is great to use as a sanity check, specifically something to use to verify that MPI is installed and operational. The rzf code is an example of an effectively embarrassingly parallel code: there is no communication between parallel processes until the very end, where a sum reduction is performed.

A more complicated problem is matrix multiplication. The core algorithm in the matmul code is the triply nested for loop. You can see what happens when we run it with reasonable values of DIM. Use make -f Makefile.matmul to make the program for yourself. You can easily change the value of DIM by using matmul.exe -n DIM where DIM is the size of the array. If no arguments are given, then the default DIM is 4000.

Table One: Matrix size vs run time

N (matrix dimension)    T(N) in seconds
 100                      0.003
 200                      0.023
 400                      0.553
1000                     14.46
1500                     57.46
2000                    137.1
4000                   1107

A 4000 x 4000 matrix contains 16 million elements. Each element is 8 bytes (double precision floating point). This means each array is 128,000,000 bytes: 128 MB, or 122.07 MiB (remember that an MB is different from an MiB).

Looking over the calipers for the 4000x4000 run (enter ./matmul.exe -n 4000), we see:

milestone 0 to 1 time=0.337 seconds
milestone 1 to 2 time=0.819 seconds
milestone 2 to 3 time=1.004 seconds
milestone 3 to 4 time=1106.836 seconds
milestone 4 to 5 time=0.004 seconds

This shows that the matrix fill time (calipers 1 to 2) and the normalization time (calipers 2 to 3) together account for less than 0.2% of the execution time. So we should focus our effort upon the matrix multiplication loops.

Now that we have established what takes the most time, how do we parallelize this? As it turns out, we use similar methods to what was used before. The main multiplication loop looks like this:

/* matrix multiply
 *
 *   c[i][j] = a_row[i] dot b_col[j]  for all i,j
 *             a_row[i] = a[i][0 .. DIM-1]
 *             b_col[j] = b[0 .. DIM-1][j]
 */
for(i=0;i<DIM;i++)
 {
  for(j=0;j<DIM;j++)
   {
     dot=0.0;
     for(k=0;k<DIM;k++)
       dot += a[i][k]*b[k][j];
     c[i][j]=dot;
   }
 }

You will likely notice that the interior loop is a sum reduction; it is basically a dot product between two vectors. That inner loop is executed N^2 times, so it is not the appropriate place to perform our decomposition.

In general, for most parallelization efforts, you want to enclose the maximum amount of work within the parallel region. This goal suggests you really want to put the directives outside the outer most loop, or as high up in the loop hierarchy as possible.

So we use a similar technique to what was used in rzf.c: precompute the loop indexes. We will compute "panels", that is, outer loop index ranges, of the c matrix on each CPU.

    loop_min    =(int)((long)(tid + 0) *
                        (long)(DIM)/(long)NCPUs);
    loop_max    =(int)((long)(tid + 1) *
                        (long)(DIM)/(long)NCPUs);

    for(i=loop_min;i<loop_max;i++)
     {
      for(j=0;j<DIM;j++)
       {
         dot=0.0;
         for(k=0;k<DIM;k++)
           dot += a[i][k]*b[k][j];
         c[i][j]=dot;
       }

     }

Again, each process computes its own minimum and maximum for the outer loop index. These computations only require knowing the total number of processes in the MPI session, and the particular process number.

Before we discuss the rest of the code, please note that we did not restrict the argument parsing to the first process. There is no need to: it would take more effort to parse on the master node and then distribute the arguments than it does to parse on all nodes at once. Also note that we construct the initial matrices (a, b) on all of the nodes, the same way, simultaneously. In real codes, we would probably not want to do this, as the matrix would be passed (sent) to the nodes, or constructed in some other manner. However, if you can divide and conquer the construction, it could be worth the effort in some cases.

At the end of the calculation, we use MPI_Send and MPI_Recv pairs to send every row of the computed panel back to the master process. This method is not meant to be efficient. It is designed solely to show that you need to pair MPI_Send with MPI_Recv. If a posted Send is not correctly completed with a matching Receive, your code will block.

To move the data back to the master process, our loops look like this:

    for(i=1;i<NCPUs;i++)
     {
       /* compute the panel bounds for the i_th_ process */
       loop_min =(int)((long)(i + 0) *
                        (long)(DIM)/(long)NCPUs);
       loop_max =(int)((long)(i + 1) *
                        (long)(DIM)/(long)NCPUs);

       /* have the master process receive the rows from the i_th_ process */
       if (tid == 0)
        {
          for(k=loop_min;k<loop_max;k++)
           {
             MPI_Recv(c[k],DIM,MPI_DOUBLE,i,i*k,
                      MPI_COMM_WORLD,&stat);
           }
        }

       /* send the rows from the i_th_ process to the master process */
       if (tid == i)
        {
          for(k=loop_min;k<loop_max;k++)
           {
             MPI_Send(c[k],DIM,MPI_DOUBLE,0,i*k,
                      MPI_COMM_WORLD);
           }
        }
     }

As mentioned, this code is not efficient, and it does not make good use of MPI. Its entire purpose is to illustrate MPI Send and Receive pairs. You can view the code in matmult_mpi.c. To build the code, enter make -f Makefile.matmul_mpi, then to run on four cores enter mpirun -np 4 -hostfile hostfile ./matmul_mpi -n 1000. (Adjust the -np argument to reflect your number of cores.)

We could effectively remove the inner loop over k with a single, larger Send/Receive pair. This way, instead of DIM Send/Receive pairs, we have NCPU Send/Receive pairs. If this is one of the optimizations you would like to make, first make the code work correctly, then apply the optimization.

As for performance, this is what we observe for the N=4000 case:

Table Two: Speed-up vs number of cores for first version of matmul_mpi.c

Number Cores    Time (s)    Speed-up    Efficiency
1               1111.0      1.00        100.0%
2                573.40     1.94         96.9%
4                302.60     3.67         91.8%
8                197.60     5.62         70.3%

This isn't bad: we measure a 5.62x speedup on 8 CPUs, about 70% efficiency. The parallel version takes 197.6 seconds as compared to over 1111 seconds for the serial version.

Can we make it go any faster? There are two possibilities: get better speedup using MPI, or tweak the code slightly and get better absolute performance as well as good parallel performance.

The answer is yes to both. With a relatively simple code adjustment (that some compilers might do for you if you can coax them to), you can see significantly better performance in parallel. What we do is increase the amount of work done in parallel. We do this by unrolling a loop. Our code now looks like this (see matmult_mpi2.c):

    loop_min    =(int)((long)(tid + 0) *
                        (long)(DIM)/(long)NCPUs);
    loop_max    =(int)((long)(tid + 1) *
                        (long)(DIM)/(long)NCPUs);

    for(i=loop_min;i<loop_max;i+=4)
     {
      for(j=0;j<DIM;j++)
       {
         dot[0]=dot[1]=dot[2]=dot[3]=0.0;
         for(k=0;k<DIM;k++)
             {
           dot[0] += a[i+0][k]*b[k][j];
           dot[1] += a[i+1][k]*b[k][j];
           dot[2] += a[i+2][k]*b[k][j];
           dot[3] += a[i+3][k]*b[k][j];
             }
         c[i+0][j]=dot[0];
         c[i+1][j]=dot[1];
         c[i+2][j]=dot[2];
         c[i+3][j]=dot[3];

       }

     }

With this modification, and a few additional ones to the definition and use of the variable dot (now an array), we again measure the performance for the N=4000 case. To build the code, enter make -f Makefile.matmul_mpi2, then to run on four cores enter mpirun -np 4 -hostfile hostfile ./matmul_mpi2 -n 4000. Also note that since we "unrolled" the loop by 4, we need to make sure the array dimension (the -n value) divided by the number of processes (the -np value) is divisible by 4. Otherwise the program will crash; fixing this is left as an exercise for the reader.

Table Three: Speed-up vs number of cores for second version of matmul_mpi2.c

Number Cores    Time (s)    Speed-up    Efficiency
1               311.18      1.00        100.0%
2               160.22      1.94         97.1%
4                87.70      3.55         88.7%
8                60.87      5.11         63.9%

This version of the program operates upon 4 rows at a time, thus increasing the amount of work done per iteration. We have also reduced the cache miss penalty per iteration by reusing some of the more expensive elements (the b[k][j]). If we were sufficiently clever, we would unroll by 2 on the j loop, which would allow us to use a cache-line at a time of b[k][j .. j+1], which could lead to even better overall performance. Of course, after all of these modifications, we check the results to make sure that the addition of parallelization has not also caused the addition of bugs.

In the final version of this code (matmult_3.c), we fix the MPI_Send/MPI_Recv pairs so that they transfer a panel at a time. This improves the performance by not having so much work focused upon data transfer at the tail end of the computation. We could further overlap communication and computation by using MPI_Isend/MPI_Irecv pairs so that after we finish each panel row or column, we start a data transfer operation. This is one of the more interesting aspects of MPI, and allows you to overlap operations, effectively hiding the latency associated with longer delay operations.

Summary

We have shown in this article how to obtain, build, and use an MPI stack for Linux machines. The MPI examples shown range from simple "hello" examples through parallel matrix multiplication. All codes demonstrate excellent performance. The exercise takes slightly more than 30 minutes and allows one to develop and run MPI codes on a multi-core server or on an HPC cluster.
