MPI-2: The Future of Message Passing

The Message Passing Interface (MPI) has become the application programming interface (API) of choice for data exchange among processes in parallel scientific programs. While Parallel Virtual Machine (PVM) is still a viable message passing system offering features not available in MPI, it's often not the first choice for developers seeking vendor-supported APIs based on open standards. Of course, standards evolve, and the MPI standard is no different.

The evolving series of MPI standards, available on the web at http://www.mpi-forum.org, describes the features of the programming interface and serves as a guideline for those developing MPI implementations for various computer platforms. The MPI-1.0 standard was produced way back in May 1994, and the MPI-1.1 standard was released in June 1995. Within a few years, the MPI-1.2 and MPI-2 standards followed. The MPI-1.2 standard consists primarily of clarifications and corrections to MPI-1.1, plus one new routine. Forward compatibility is preserved, so a valid MPI-1.1 program is both a valid MPI-1.2 program and a valid MPI-2 program.

Most MPI implementations in common use — including the two most popular ones for Linux clusters, called MPICH and LAM/MPI — comply with the MPI-1.2 standard. LAM/MPI and a new version of MPICH available in beta, called MPICH2, already support many of the features new to MPI-2. The most important MPI-2 features include process creation and management; one-sided communication; collective operations on intercommunicators; external interfaces to error handlers, data types, and requests; parallel input/output (also known as MPI-IO); bindings for C, C++, Fortran 77, and Fortran 90; and a variety of miscellaneous new routines and data types.

While LAM/MPI and MPICH2 in their present forms support only subsets of the new MPI-2 features, one or both may already implement the code needed for your programming project. Both provide MPI-IO support using ROMIO, offer the new, portable mpiexec command-line startup, provide dynamic process management, offer C++ bindings, and implement basic one-sided communication. Even if LAM and MPICH2 don’t provide all the MPI-2 features you may ultimately want, they have enough of the new stuff to make it worth taking them out for a spin.

Trying Out LAM and MPICH2

At the time of this writing, LAM/MPI is at version 7.0.6, and MPICH2 is at 0.96p2. LAM can be downloaded from http://www.lam-mpi.org, and MPICH2 is available at http://www-unix.mcs.anl.gov/mpi/mpich2. Both use autoconf and are built in the standard ./configure; make; make install fashion. For initial testing of these MPI implementations, you may wish to install one or both in your home directory until you’re ready to make them accessible to all users.

Figure One and Figure Two show sample output from installing and testing LAM and MPICH2, respectively, in a typical home directory. (These tests were performed on a cluster running Red Hat Linux 7.3.) Both installations begin with ./configure --prefix=directory, where --prefix specifies the directory prefix to use for installation. After that, running make builds the software, and typing make install installs the software in the directory specified earlier with --prefix.




Figure One: Installing and testing LAM/MPI 7.0.6

Unpack, build, and install the LAM source code into your home directory.


[node01 src]$ tar xzf lam-7.0.6.tar.gz
[node01 src]$ cd lam-7.0.6
[node01 lam-7.0.6]$ ./configure --prefix=$HOME/lam-7.0.6
[node01 lam-7.0.6]$ make
[node01 lam-7.0.6]$ make install
[node01 lam-7.0.6]$ cd

Next, add the LAM installation to your path by adding the line…


export PATH=$HOME/lam-7.0.6/bin:$PATH

… to your .bashrc.

Change your shell settings by loading .bashrc. You should now be able to run LAM commands such as laminfo.


[node01 forrest]$ source .bashrc
[node01 forrest]$ laminfo
LAM/MPI: 7.0.6
Prefix: /home/forrest/lam-7.0.6
Architecture: i686-pc-linux-gnu
Configured by: forrest
Configured on: Tue Jun 1 22:50:18 EDT 2004
Configure host: node01.cluster.ornl.gov
C bindings: yes
C++ bindings: yes
Fortran bindings: yes
C profiling: yes
C++ profiling: yes
Fortran profiling: yes
ROMIO support: yes
IMPI support: no
Debug support: no
Purify clean: no
SSI boot: globus (Module v0.5)
SSI boot: rsh (Module v1.0)
SSI coll: lam_basic (Module v7.0)
SSI coll: smp (Module v1.0)
SSI rpi: crtcp (Module v1.0.1)
SSI rpi: lamd (Module v7.0)
SSI rpi: sysv (Module v7.0)
SSI rpi: tcp (Module v7.0)
SSI rpi: usysv (Module v7.0)

Now, create a list of hosts that will run LAM.


[node01 forrest]$ cat >hostfile
node01
node02
node03
node04
node05
^D

And start LAM on all of those hosts with lamboot.


[node01 forrest]$ lamboot -v hostfile
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
n-1<29549> ssi:boot:base:linear: booting n0 (node01)
n-1<29549> ssi:boot:base:linear: booting n1 (node02)
n-1<29549> ssi:boot:base:linear: booting n2 (node03)
n-1<29549> ssi:boot:base:linear: booting n3 (node04)
n-1<29549> ssi:boot:base:linear: booting n4 (node05)
n-1<29549> ssi:boot:base:linear: finished

Finally, show the list of available nodes with lamnodes, run an example program (shown in Listing One), and then shut down LAM with lamhalt.


[node01 forrest]$ lamnodes
n0 node01.cluster.ornl.gov:1:origin,this_node
n1 node02.cluster.ornl.gov:1:
n2 node03.cluster.ornl.gov:1:
n3 node04.cluster.ornl.gov:1:
n4 node05.cluster.ornl.gov:1:

[node01 forrest]$ mpicc -O -o mpi2_get.lam mpi2_get.c
[node01 forrest]$ mpiexec -n 5 ./mpi2_get.lam
Hello world! I'm rank 0 of 5 on node01.cluster.ornl.gov running MPI 1.2
Hello world! I'm rank 2 of 5 on node03 running MPI 1.2
Hello world! I'm rank 1 of 5 on node02 running MPI 1.2
Hello world! I'm rank 4 of 5 on node05 running MPI 1.2
Hello world! I'm rank 3 of 5 on node04 running MPI 1.2
Process 0 has neighbor 4
Process 1 has neighbor 0
Process 2 has neighbor 1
Process 4 has neighbor 3
Process 3 has neighbor 2

[node01 forrest]$ lamhalt -v
LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
Shutting down LAM
hreq: waiting for HALT ACKs from remote LAM daemons
hreq: received HALT_ACK from n1 (node02.cluster.ornl.gov)
hreq: received HALT_ACK from n2 (node03.cluster.ornl.gov)
hreq: received HALT_ACK from n3 (node04.cluster.ornl.gov)
hreq: received HALT_ACK from n4 (node05.cluster.ornl.gov)
hreq: received HALT_ACK from n0 (node01.cluster.ornl.gov)
LAM halted

LAM Installation

Once LAM is built and installed, the path to its executable files (in the example, $HOME/lam-7.0.6/bin) should be added to your path. This is best done by editing your .bashrc file (assuming you use bash as your login shell), since .bashrc is also read by the non-interactive shells used when processes are spawned on other cluster nodes. Next, source your .bashrc file so that LAM is added to your path. Once your path is modified, the laminfo command can be used to find out which version of LAM is available, where it’s located, and which options were compiled into the software.

LAM and MPICH2 both now use daemons running on cluster nodes for starting up MPI programs. This method provides faster startup of codes, particularly on clusters with many nodes. These daemons must be started separately prior to executing the MPI program. The LAM developers don’t allow their daemon to run as the root user for security reasons. As a result, each user must start his or her own set of daemon processes on the desired cluster nodes.

The MPICH2 developers, on the other hand, encourage sites to run a single daemon as the root user, started at system boot time, to service all users’ MPICH2 jobs. This prevents the users from having to start their own daemons and shut them down when their jobs are complete. To test the MPI implementations shown here, individual user daemons are started, used, and halted.

LAM daemons are started using the lamboot command. This command needs a list of node hostnames on which daemons should be started. As shown in Figure One, a list of nodes is written to hostfile, and lamboot is executed with -v (for verbose output) and hostfile as arguments. After the daemons are started, executing lamnodes shows the list of nodes on which the daemons are running.

MPI-2 Test Code

To test the basic features of LAM and MPICH2, a simple MPI program is compiled and executed. The program, called mpi2_get.c, is shown in Listing One. The code starts off like a typical “Hello World!” MPI program. MPI is initialized by calling MPI_Init(); the process rank is obtained from MPI_Comm_rank(); the number of processes running the code in parallel (the size) is obtained by calling MPI_Comm_size(); and the hostname of the node on which the process is running is obtained from MPI_Get_processor_name().




Listing One: mpi2_get.c


#include <stdio.h>
#include "mpi.h"

int main(int argc, char** argv)
{
  int rank, size, namelen, version, subversion, neighbor;
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  MPI_Win win;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Get_processor_name(processor_name, &namelen);
  MPI_Get_version(&version, &subversion);
  printf("Hello world! I'm rank %d of %d on %s running MPI %d.%d\n",
         rank, size, processor_name, version, subversion);

  /* Expose this process's rank through a one-sided communication window. */
  MPI_Win_create(&rank, sizeof(rank), sizeof(int), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &win);
  MPI_Win_fence(0, win);
  /* Fetch the rank of the left neighbor; rank 0 wraps around to the last. */
  MPI_Get(&neighbor, 1, MPI_INT, (rank ? rank - 1 : size - 1), 0, 1,
          MPI_INT, win);
  MPI_Win_fence(0, win);
  printf("Process %d has neighbor %d\n", rank, neighbor);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}

Next, MPI_Get_version() inquires which version of the MPI standard the running implementation claims to fully support. This routine was added in the MPI-1.2 standard so that developers can tell which version and subversion of the ever-evolving standard is being used with their code. This provides a mechanism for choosing between old and new communication routines at run time. As usual, all of this information, including the version and subversion numbers, is printed by every process.
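
For example, a code could branch on the reported version number at run time. (This fragment is just a sketch; the two helper routines named here are hypothetical placeholders, not part of Listing One.)

int version, subversion;

MPI_Get_version(&version, &subversion);
if (version >= 2) {
  exchange_with_rma();      /* hypothetical routine built on MPI-2 one-sided calls */
} else {
  exchange_with_sendrecv(); /* hypothetical MPI-1 point-to-point fallback */
}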

To see if the simplest MPI-2 one-sided communication routine works, the subsequent MPI calls are used to “grab” the process rank of each process’s left neighbor. The zeroth process reaches around to the last process to obtain the rank of its neighbor. This sort of remote memory access (RMA) is new to MPI-2. After a “window” of memory is made available, one MPI process can get data from, or put data into, the memory of another.

This new one-sided communication mechanism can be used to avoid global computations and the periodic polling for messages that often occurs in traditional MPI programs.

First, a window of memory is created by calling MPI_Win_create(). This is a collective operation, meaning it must be called on every process in the given communicator. In this case, the communicator is MPI_COMM_WORLD, which refers to all processes running the program.

The memory window on each process is then accessible to all of the other processes. MPI_Win_create() is provided the base address of the memory segment (&rank); the size of the window in bytes (sizeof(rank)); the local unit size for displacements into the window in bytes (sizeof(int)); an info argument (MPI_INFO_NULL); a communicator (MPI_COMM_WORLD); and a pointer to an opaque object that is filled in by the call (&win). This window object represents the group of processes that own and access the window and the attributes of the window.
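
For reference, the MPI-2 C binding for MPI_Win_create() takes those arguments in exactly this order:

int MPI_Win_create(void *base, MPI_Aint size, int disp_unit,
                   MPI_Info info, MPI_Comm comm, MPI_Win *win);

Note that size is an MPI_Aint, an address-sized integer type defined by MPI.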

Then MPI_Win_fence() is called to synchronize the processes. This collective call forces the completion of any previous RMA communications. In this case, the fence starts a new series of RMA communications (called an access epoch) for the specified window (win), since it’s followed by a subsequent MPI_Win_fence() call with an RMA communication between them.
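
The corresponding C binding is short:

int MPI_Win_fence(int assert, MPI_Win win);

The assert argument can carry optimization hints about the epoch (for instance, MPI_MODE_NOPRECEDE); Listing One passes 0, which makes no assertions.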

Next, MPI_Get() is called to obtain the rank value for the left neighbor of each MPI process. Passed to MPI_Get() are: the address on the local process where the data are to be stored (&neighbor, called the origin address); the number of entries in the buffer (1); the origin data type (MPI_INT); the rank of the process whose memory is to be accessed (called the target rank); the displacement into the window of the target buffer (0); the number of entries in the target buffer (1); the target data type (MPI_INT); and the window object used for communication (win).
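
Again for reference, the MPI-2 C binding for MPI_Get() lists those arguments in the same order:

int MPI_Get(void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
            int target_rank, MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win);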

After the MPI_Get() call returns, MPI_Win_fence() is called again to complete the one-sided communication, thereby ending the access epoch. Now, we can be assured that neighbor is filled with the desired value, so the process rank and neighbor rank are printed. Finally, the memory window is freed by calling MPI_Win_free() with a pointer to the window object, and MPI is shut down with a call to MPI_Finalize().
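
MPI_Put() works the same way in the opposite direction. As an illustration only (this fragment is not part of Listing One, and it assumes rank and size have already been obtained as shown there), each process could instead push its rank into its right neighbor’s window and arrive at the same result:

int neighbor = -1;
MPI_Win win;

/* Expose this process's neighbor variable as the RMA window. */
MPI_Win_create(&neighbor, sizeof(neighbor), sizeof(int), MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);
MPI_Win_fence(0, win);
/* Put my rank into the window of my right neighbor, wrapping at the end. */
MPI_Put(&rank, 1, MPI_INT, (rank + 1) % size, 0, 1, MPI_INT, win);
MPI_Win_fence(0, win);
/* neighbor now holds the rank of my left neighbor, just as before. */
MPI_Win_free(&win);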

When compiled and run with LAM using mpiexec, as shown in Figure One, you should see correct output. Each process prints out its rank, the total number of processes being used, the hostnames of the nodes, and the MPI version and subversion numbers. LAM claims to be fully MPI-1.2 compliant. (Although some MPI-2 features obviously work in LAM, it can’t claim to be MPI-2 compliant until all its features are implemented and tested.) Lastly, each process prints the results of its RMA: the process rank of its left neighbor.

Once you’re finished running MPI programs, your personal LAM daemons should be stopped using lamhalt.

Installing MPICH2

Just like LAM, MPICH2 is built by running ./configure --prefix=directory (specifying a directory for the installation), make, and make install. As before, the directory of executable files must be added to your path in .bashrc, or in some other startup file if bash isn’t your login shell. Be careful to have only one of the LAM or MPICH2 paths active at a time; since both provide mpicc and mpiexec, it’s easy to lose track of which implementation you’re using.
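
One simple way to manage that (a sketch only, assuming the home-directory installations used in the figures; adjust the paths to match your own prefixes) is to define a pair of small shell functions in .bashrc and call just one of them per login session:

# In ~/.bashrc: enable exactly one MPI implementation per session.
use_lam()    { export PATH=$HOME/lam-7.0.6/bin:$PATH; }
use_mpich2() { export PATH=$HOME/mpich2-0.96p2/bin:$PATH; }

After sourcing .bashrc, run use_lam or use_mpich2, then check which mpicc to confirm which implementation comes first in your path.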

The startup daemon under MPICH2, called mpd, is written in Python 2. Daemon processes must be spawned with mpdboot before running an MPICH2 program.

Under Red Hat Linux 7.3, mpd fails to start because of a syntax error. The problem is that mpd and other Python 2 scripts in the mpich2-0.96p2/bin/ directory refer to python as the interpreter, but Python 2 is called python2 under Red Hat 7.3.

To circumvent this problem, simply edit all of the scripts in this directory and change python to python2.
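
One way to make the change (a sketch only; back up the bin directory first and spot-check a script afterward with head -1) is to rewrite python to python2 in every script whose interpreter line mentions python:

cd $HOME/mpich2-0.96p2/bin
for f in *; do
  head -1 "$f" | grep -q python && perl -pi -e 's/\bpython\b/python2/g' "$f"
done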

As shown in Figure Two, a file called .mpd.conf must be created containing secretword=XXXX, where XXXX is some secret word or phrase.




Figure Two: Installing and testing MPICH2 0.96p2

Unpack, build, and install the MPICH2 source code into your home directory.


[node01 src]$ tar xzf mpich2-beta.tar.gz
[node01 src]$ cd mpich2-0.96p2
[node01 mpich2-0.96p2]$ ./configure --prefix=$HOME/mpich2-0.96p2
[node01 mpich2-0.96p2]$ make
[node01 mpich2-0.96p2]$ make install
[node01 mpich2-0.96p2]$ cd

Add the MPICH2 installation to your path by adding the line…


export PATH=$HOME/mpich2-0.96p2/bin:$PATH

… to your .bashrc or other login shell startup file.

Set the path by loading .bashrc, create the secret word and hosts files, boot MPICH2 on all of the nodes, verify the nodes, run the sample code in Listing One, and terminate all of the daemons.


[node01 forrest]$ source .bashrc
[node01 forrest]$ echo "secretword=MyLittleSecret" > .mpd.conf
[node01 forrest]$ chmod 600 .mpd.conf
[node01 forrest]$ cat > mpd.hosts
node01
node02
node03
node04
node05
^D
[node01 forrest]$ mpdboot -r rsh -v
starting local mpd on node01.cluster.ornl.gov
starting remote mpd on node02
starting remote mpd on node03
starting remote mpd on node04
starting remote mpd on node05
1 out of 5 mpds started; waiting for more …
5 out of 5 mpds started
[sci1-1 forrest]$ mpdtrace
node01
node04
node02
node05
node03
[node01 forrest]$ mpicc -O -o mpi2_get.mpich2 mpi2_get.c
[node01 forrest]$ mpiexec -n 5 ./mpi2_get.mpich2
Hello world! I'm rank 2 of 5 on node02 running MPI 1.2
Hello world! I'm rank 0 of 5 on node01.cluster.ornl.gov running MPI 1.2
Hello world! I'm rank 1 of 5 on node04 running MPI 1.2
Hello world! I'm rank 3 of 5 on node05 running MPI 1.2
Hello world! I'm rank 4 of 5 on node03 running MPI 1.2
Process 0 has neighbor 4
Process 4 has neighbor 3
Process 2 has neighbor 1
Process 1 has neighbor 0
Process 3 has neighbor 2
[node01 forrest]$ mpdallexit

In addition, a list of node hostnames is needed to specify where daemons should be started. This list is stored in mpd.hosts. Once that file is created, mpdboot can spawn the daemons on the hosts listed in mpd.hosts. By default, mpdboot starts daemons using ssh, but it can use (the insecure) rsh if you specify -r rsh, as shown in Figure Two. The -v argument turns on verbose output. Running mpdtrace shows a list of nodes running mpd.

After recompiling the mpi2_get.c program using MPICH2’s mpicc, the code is run with mpiexec. The output of mpi2_get, as shown in Figure Two, is the same as was produced with LAM. Finally, the MPICH2 daemons are halted by running mpdallexit.

The Tip of the Iceberg

Now that you have LAM and/or MPICH2 installed, you can explore many of the new MPI-2 features and start using them in your own codes.

With both vendor-supported and free MPI-2 implementations quickly becoming available, it’s time to see if some of its features can be used to speed up or improve the scaling of your programs.

Look for discussions and demonstrations of more MPI-2 features in upcoming columns. Stay tuned!



Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov.
