Using MPI-2

Last month's "Extreme Linux" introduced MPI-2, the latest Message Passing Interface (MPI) standard. MPI has become the preferred programming interface for data exchange -- called message passing -- for parallel, scientific programs. MPI has evolved since the MPI-1.0 standard was released in May 1994. The MPI-1.1 standard, produced in 1995, was a significant advance, and the MPI-2 standard clarifies and corrects the MPI-1.1 standard while preserving backward compatibility: a valid MPI-1.1 program is a valid MPI-2 program.

Both MPICH (http://www-unix.mcs.anl.gov/mpi/mpich2) and LAM/MPI (http://www.lam-mpi.org), the two most popular MPI implementations for Linux clusters, comply fully with the MPI-1.2 standard, which is described in the MPI-2 standard document. Additionally, LAM/MPI and a new version of MPICH called MPICH2, available in beta, already support subsets of the features new to MPI-2. Installation and testing instructions for both of these implementations and a short MPI-2 program demonstrating one-sided communication were included in last month’s column. This month, let’s discuss some of the new MPI-2 features and test them with some sample code.

The most important MPI-2 features include process creation and management; one-sided communications; collective operations on intercommunicators; external interfaces to error handlers, data types, and requests; parallel input/output (MPI-IO); bindings for C, C++, Fortran 77, and Fortran 90; and a variety of miscellaneous new routines and data types. Both LAM/MPI and MPICH2 provide MPI I/O support using a package called ROMIO, offer the new portable mpiexec command-line startup mechanism, provide dynamic process management, offer C++ bindings, and implement basic one-sided communication.

Process Management

MPI-1 applications are static, because the number of processes is specified at startup and cannot change as the application runs. Due to user demand, the MPI-2 specification allows for process creation and management after an MPI application has started. Parallel Virtual Machine (PVM) has offered such capabilities for years, and experiences with PVM (both good and bad) helped provide a context for the MPI-2 process model development.

In many ways, process management could naturally intrude upon job scheduling and resource allocation, which are functions of the operating system and job scheduling packages. Moreover, a specification for process creation and management must be general enough to apply across the wide range of parallel architectures, from custom supercomputers to networks of workstations. To avoid potential pitfalls, the MPI-2 process model doesn’t alter the parallel environment in which it executes and doesn’t change the deterministic definition of a communicator.

The MPI-2 process model provides a mechanism for creation of processes after an MPI application has started, as well as for the cooperative termination of processes at the end of a program. It establishes communication between the new processes and the existing application, and it provides a mechanism to establish communication between two existing MPI applications, even if one did not spawn the other.

New processes, called child processes, can be started in MPI-2 using calls to MPI_Comm_spawn() and MPI_Comm_spawn_multiple(). An info argument to these calls may be used to tell the runtime environment where or how to spawn the child processes. The MPI_UNIVERSE_SIZE attribute on the MPI_COMM_WORLD communicator describes the size of the initial runtime environment (it usually contains the number of processors available for the job). By subtracting the size of MPI_COMM_WORLD from MPI_UNIVERSE_SIZE, one can figure out how many processes can be spawned for effective use of the allocated processors.

An example using the process management interface is included below, but first a word of caution is in order. While a single MPI process could routinely be used to spawn all the additional processes for an otherwise static MPI application, this practice is discouraged since it hinders the performance of application startup.

It’s best to start all processes at once in regular MPI-1 fashion, if possible. On the other hand, there may be good reasons why some applications need to be able to spawn various processes depending on the run-time environment or other factors. For example, a parallel library may need to spawn processes to solve a particular class of mathematical problem.

Spawning Child Processes

Listing One is a program that acts as the manager in a master-worker (or master/slave) parallel program. This MPI-2 manager spawns a fixed set of worker processes, but it chooses among different worker executables depending on the number of processes available in the run-time environment. That number is usually the number of processors requested when the job was submitted to the batch scheduling system, but on a network of workstations it may be determined by the run-time environment itself (LAM/MPI, for example, derives it from the number of daemons booted).




Listing One: mpi2_manager.c

/* MPI2 Manager Code */
#include <stdio.h>
#include <string.h>
#include "mpi.h"

/* Choose a worker executable based on the number of available processes */
void select_worker_program(char *worker_program, int num_procs)
{
    if (num_procs < 128)
        strcpy(worker_program, "./mpi2_worker");
    else if (num_procs < 256)
        strcpy(worker_program, "./mpi2_worker_mid");
    else
        strcpy(worker_program, "./mpi2_worker_big");
    return;
}

int main(int argc, char** argv)
{
    int rank, size, namelen, version, subversion, iflag, fflag,
        universe_size, *universe_sizep, attr_flag;
    MPI_Comm family_comm;
    char processor_name[MPI_MAX_PROCESSOR_NAME], worker_program[100];

    MPI_Initialized(&iflag); MPI_Finalized(&fflag);
    printf("MPI Initialized=%d, Finalized=%d\n", iflag, fflag);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(processor_name, &namelen);
    MPI_Get_version(&version, &subversion);
    printf("I'm manager %d of %d on %s running MPI %d.%d\n",
           rank, size, processor_name, version, subversion);

    universe_size = 1;
    if (size != 1) {
        if (!rank)
            printf("Error: Only one manager process should be "
                   "running, but %d were started.\n", size);
    }
    else {
        /* Query the size of the initial run-time environment */
        MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                     &universe_sizep, &attr_flag);
        if (!attr_flag) {
            printf("Warning: This MPI implementation does not "
                   "support the MPI_UNIVERSE_SIZE attribute.\n");
            printf("Setting universe_size to 5.\n");
            universe_size = 5;
        }
        else
            universe_size = *universe_sizep;
    }
    if (universe_size < 2)
        printf("Worker processes *not* spawned.\n");
    else {
        select_worker_program(worker_program, universe_size - 1);
        printf("Spawning %d worker processes running %s\n",
               universe_size - 1, worker_program);
        /* Spawn the workers and obtain an intercommunicator to them */
        MPI_Comm_spawn(worker_program, MPI_ARGV_NULL, universe_size - 1,
                       MPI_INFO_NULL, 0, MPI_COMM_SELF, &family_comm,
                       MPI_ERRCODES_IGNORE);
    }

    MPI_Initialized(&iflag); MPI_Finalized(&fflag);
    printf("%d: MPI Initialized=%d, Finalized=%d\n", rank, iflag, fflag);

    MPI_Finalize();

    MPI_Initialized(&iflag); MPI_Finalized(&fflag);
    printf("%d: MPI Initialized=%d, Finalized=%d\n", rank, iflag, fflag);
    return 0;
}

The manager code demonstrates two functions that describe the state of MPI. MPI_Initialized(), included in the MPI-1 specification, returns a value of true if MPI_Init() has already been called. This allows a parallel library to determine if MPI has already been initialized. Unfortunately, it’s possible that MPI was finalized before the library was called. So, in MPI-2, a new routine, MPI_Finalized(), was added. It returns true if MPI_Finalize() has already been called. The manager code calls both of these routines and prints their results prior to initialization, after finalization, and between initialization and finalization.
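
For a parallel library, these two routines might be combined into a small guard before any other MPI calls are made. The following sketch uses a hypothetical library_init() entry point; it is only meant to illustrate how the two status calls could be used together:

/* Hypothetical library entry point: initialize MPI only if it is
 * neither already initialized nor already finalized. */
#include "mpi.h"

int library_init(int *argc, char ***argv)
{
    int initialized, finalized;

    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);      /* new in MPI-2 */

    if (finalized)
        return -1;                  /* too late: MPI_Finalize() has already run */
    if (!initialized)
        MPI_Init(argc, argv);       /* safe to initialize on the caller's behalf */
    return 0;
}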

As usual, the program initializes MPI by calling MPI_Init(). Next, the rank of the process, the size of the process group, and the processor name are obtained by calling MPI_Comm_rank(), MPI_Comm_size(), and MPI_Get_processor_name(), respectively. At this point, the program queries the MPI version in use by calling MPI_Get_version() and then prints out all of the information.

To ensure that only a single manager process exists, the code checks the value of size (obtained from MPI_Comm_size()). If it’s not equal to one, an error is printed by the first process (which has rank zero). Otherwise, MPI_Attr_get() is called to get the value for the MPI_UNIVERSE_SIZE attribute associated with the MPI_COMM_WORLD communicator.

The MPI_Attr_get() routine is passed the relevant communicator (MPI_COMM_WORLD), the attribute to query (MPI_UNIVERSE_SIZE), the address of a pointer in which a pointer to the attribute value is returned (&universe_sizep), and a pointer to an integer flag (&attr_flag). If the attribute is not supported by the MPI implementation, MPI_Attr_get() sets the flag to zero, and our code then falls back to five as the default value for universe_size.
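
Note that MPI-2 also provides MPI_Comm_get_attr() as the preferred name for this routine (MPI_Attr_get() is retained for MPI-1 compatibility), so the same query could be written as in this small sketch:

/* Equivalent attribute query using the MPI-2 routine name */
int *universe_sizep;
int attr_flag;

MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE,
                  &universe_sizep, &attr_flag);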

In the next block of code, the universe_size value is checked. If it’s less than two, no child processes can be started, so a message to that effect is printed. Otherwise, select_worker_program() is called to determine the name of the executable to spawn based on the number of processes to be started. Then MPI_Comm_spawn() is called to actually start the worker processes.

MPI_Comm_spawn() is passed the name of the executable program to start (contained in worker_program), the list of command-line arguments (in this case MPI_ARGV_NULL, since there aren’t any), the number of processes to spawn (universe_size-1), an info object containing additional parameters (in this case MPI_INFO_NULL, since there aren’t any), the rank of the process that will evaluate these arguments (0), the intracommunicator spawning the processes (MPI_COMM_SELF), a pointer to a communicator object in which the new intercommunicator will be stored (&family_comm), and an array of error codes (in this case MPI_ERRCODES_IGNORE, since we want to ignore all errors).
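
Although the manager passes MPI_INFO_NULL, an info object can be used to pass placement hints to the run-time system. The MPI-2 standard reserves keys such as wdir, path, and host for this purpose, though an implementation is free to ignore them. A minimal sketch, with purely illustrative key values, might look like this:

/* Sketch: passing reserved spawn hints through an info object.
 * The key/value pairs below are illustrative; support is
 * implementation-dependent. */
MPI_Info spawn_info;
MPI_Comm family_comm;

MPI_Info_create(&spawn_info);
MPI_Info_set(spawn_info, "wdir", "/home/user/MPI2");   /* hypothetical working directory */
MPI_Info_set(spawn_info, "host", "node02");            /* hypothetical preferred host */

MPI_Comm_spawn("./mpi2_worker", MPI_ARGV_NULL, 4, spawn_info,
               0, MPI_COMM_SELF, &family_comm, MPI_ERRCODES_IGNORE);

MPI_Info_free(&spawn_info);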

When MPI_Comm_spawn() is called, it starts the worker processes and establishes an intercommunicator through which the manager process and the new worker processes may communicate. Once the child processes are spawned, MPI is finalized and the program exits.

The corresponding worker program code is shown in Listing Two. This program initializes MPI and obtains the process rank, the size of its MPI_COMM_WORLD communicator, the processor name, and the MPI version and sub-version numbers. It reports this information, and then calls MPI_Comm_get_parent() to obtain the intercommunicator for message exchange with the spawning process, which it stores in parent. If a NULL communicator is returned, the program knows that no parent MPI process spawned it, so it prints an error and exits.




Listing Two: mpi2_worker.c


/* MPI2 Worker Code */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char** argv)
{
    int rank, size, namelen, version, subversion, psize;
    MPI_Comm parent;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(processor_name, &namelen);
    MPI_Get_version(&version, &subversion);
    printf("I'm worker %d of %d on %s running MPI %d.%d\n",
           rank, size, processor_name, version, subversion);

    /* Obtain the intercommunicator connecting us to the manager */
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        printf("Error: no parent process found!\n");
        exit(1);
    }
    /* The remote group (the manager) should contain exactly one process */
    MPI_Comm_remote_size(parent, &psize);
    if (psize != 1) {
        printf("Error: number of parents (%d) should be 1.\n", psize);
        exit(2);
    }

    /* Lots of interesting parallel computations go here... */
    printf("Worker %d: Success!\n", rank);

    MPI_Finalize();
    return 0;
}

Next, the program calls MPI_Comm_remote_size() to discover the number of MPI processes participating in the execution of the parent program. If this number is more than one, the program exits with an error. At this point, the worker code should actually do some parallel computations, but for this example it simply prints a Success! message, finalizes MPI, and exits.
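
Neither listing actually exchanges data, but once the intercommunicator exists, the manager and workers can communicate with point-to-point calls or with the intercommunicator collectives added in MPI-2. As a sketch (the variable names are illustrative), the manager could broadcast a work size to all of the workers like this:

/* Manager side, after MPI_Comm_spawn() has returned family_comm:
 * MPI_ROOT marks this process as the broadcast root in its group. */
int work_size = 1000;
MPI_Bcast(&work_size, 1, MPI_INT, MPI_ROOT, family_comm);

/* Worker side, after MPI_Comm_get_parent() has returned parent:
 * the root argument is the manager's rank (0) in the remote group. */
int work_size;
MPI_Bcast(&work_size, 1, MPI_INT, 0, parent);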

Output One shows the results of compiling and running these programs using LAM/MPI 7.0.6. The mpi2_manager.c and mpi2_worker.c programs are separately compiled using mpicc, and lamboot is used to start up five LAM daemons on the machines listed in the file hostfile. Next, the mpi2_manager program is run as a single process using mpiexec -n 1.




Output One: Running the MPI-2 sample code


[node01 MPI2]$ mpicc -O -o mpi2_manager mpi2_manager.c
[node01 MPI2]$ mpicc -O -o mpi2_worker mpi2_worker.c
[node01 MPI2]$ lamboot -v hostfile

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<32744> ssi:boot:base:linear: booting n0 (node01)
n-1<32744> ssi:boot:base:linear: booting n1 (node02)
n-1<32744> ssi:boot:base:linear: booting n2 (node04)
n-1<32744> ssi:boot:base:linear: booting n3 (node03)
n-1<32744> ssi:boot:base:linear: booting n4 (node05)
n-1<32744> ssi:boot:base:linear: finished

[node01 MPI2]$ mpiexec -n 1 ./mpi2_manager

MPI Initialized=0, Finalized=0
I'm manager 0 of 1 on node01.cluster running MPI 1.2
Spawning 4 worker processes running ./mpi2_worker
0: MPI Initialized=1, Finalized=0
I'm worker 3 of 4 on node03 running MPI 1.2
Worker 3: Success!
I'm worker 2 of 4 on node04 running MPI 1.2
Worker 2: Success!
I'm worker 1 of 4 on node02 running MPI 1.2
Worker 1: Success!
0: MPI Initialized=1, Finalized=1
I'm worker 0 of 4 on node01.cluster running MPI 1.2
Worker 0: Success!

[node01 MPI2]$ lamhalt -v

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

Shutting down LAM
hreq: waiting for HALT ACKs from remote LAM daemons
hreq: received HALT_ACK from n1 (node02.cluster)
hreq: received HALT_ACK from n3 (node03.cluster)
hreq: received HALT_ACK from n2 (node04.cluster)
hreq: received HALT_ACK from n4 (node05.cluster)
hreq: received HALT_ACK from n0 (node01.cluster)
LAM halted

Prior to calling MPI_Init(), both MPI_Initialized() and MPI_Finalized() report false (0). The manager program prints out its rank, processor name, and MPI version number before spawning four worker processes. At this point, the MPI_Initialized() call returns true (1) and MPI_Finalized() reports false (0). Next, output from the four worker child processes can be seen.

Finally, the manager program calls MPI_Finalize() and then reports the values returned by MPI_Initialized() and MPI_Finalized(), both of which are now true (1). Once this test is complete, the LAM daemons are halted using lamhalt.

While this worked as expected using LAM/MPI, this behavior cannot necessarily be expected from other MPI implementations. Since users typically must start their own LAM daemons to essentially create a virtual machine, it’s reasonable that this MPI implementation would correctly define the MPI_UNIVERSE_SIZE attribute based upon the number of available daemons.

The MPI_UNIVERSE_SIZE attribute is not defined under MPICH2 version 0.96p2, and this manager/worker code doesn’t run as expected. This is probably because single MPICH2 daemons are usually run by the root user on each node. User MPICH2 codes typically use these always-running daemons, and the notion of an individual user’s virtual machine doesn’t exist. So it’s reasonable for MPICH2 not to support the MPI_UNIVERSE_SIZE attribute.

More to Come!

The process creation and management features of MPI-2 are useful for some parallel applications. Remember, though, that if you know the run-time configuration ahead of time, better performance is obtained by starting all MPI processes up front using the MPI-1 scheme rather than resorting to calls to MPI_Comm_spawn().

More MPI-2 features will be discussed in future columns, so stay tuned!



Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov.
