
Programming with MPI Communicators and Groups, Part 2

Last month's issue of Linux Magazine was dedicated to cluster computing, allowing leaders in the field to present a wide range of topics about Beowulf-style clusters. Last month's issue also introduced model coupling with some example code. This month we return our attention to advanced Message Passing Interface (MPI) features by continuing the discussion of MPI groups and communicators begun in May.

Remember that MPI groups and communicators provide a mechanism for parallel code modularity and offer communication contexts for groups of processes. The default communicator, MPI_COMM_WORLD, refers to all processes involved in the MPI computation and defines the default global context for message passing. Additional communicators, which provide a handle for a communications "channel" much like one on a CB radio, can be created and destroyed to suit the needs of a parallel code or a coupled model.

Communicators are significant because they provide a group scope for collective operations, such as broadcasts, gathers/scatters, and reductions. If these operations are relevant only on a subset of processes, communicators provide a communications channel for one or more process groups. Intracommunicators, which provide for point-to-point and collective communications within a group of processes, were introduced in Part 1. That discussion continues this month along with an introduction to intercommunicators, which provide point-to-point communication between two groups of processes.

As in previous columns, MPI routines and descriptions are based on the MPI-1.1 Standard. MPI-2 provides additional features, and some MPI-2 implementations are beginning to appear on clusters and supercomputers. The MPICH implementation (available online at http://www-unix.mcs.anl.gov/mpi/mpich/index.html) is used for the examples below. Error checking is omitted from the examples to save space.

Partitioning Tasks Among Processes

In Part 1, we learned that multiple contexts (obtained by having multiple communicators) provide a safe way for communications, and even identical message tags, to overlap without interfering. This sort of modularity is required when combining models or when implementing a parallel library, so that message passing within the library routines does not interfere with that performed in the user's code. This feature of communicators was demonstrated with a library call that performed a simple matrix transpose.
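In miniature, the pattern looks like the sketch below. The routine name and body are hypothetical (this is not last month's transpose code): a library routine duplicates the caller's communicator on entry, performs all of its internal message passing in that private context, and frees the duplicate before returning.

#include "mpi.h"

/* Hypothetical library routine: duplicate the caller's communicator so
 * that messages sent inside the library cannot collide with messages
 * sent by the caller, even if the tags happen to match. */
void lib_do_work(MPI_Comm user_comm)
{
    MPI_Comm private_comm;
    int rank, size;

    MPI_Comm_dup(user_comm, &private_comm);  /* new, private context */
    MPI_Comm_rank(private_comm, &rank);
    MPI_Comm_size(private_comm, &size);

    /* ... all library message passing uses private_comm only ... */

    MPI_Comm_free(&private_comm);            /* release the context */
}

Because MPI_Comm_dup() is collective, every process in the caller's communicator must enter the routine together.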

We’ve already seen two communicator constructors: MPI_Comm_dup(), which duplicates an existing communicator, and MPI_Comm_create(), which creates a new communicator from a process group definition. The third constructor, MPI_Comm_split(), partitions a group of processes associated with an existing communicator into non-overlapping subgroups.

While some parallel applications are designed to create new processes dynamically as needed to perform discrete parallel tasks, standard MPI codes typically initiate all processes at the beginning of a run. Then, at some point within the code, those processes are partitioned into subgroups that perform specific tasks or run certain submodels. The MPI_Comm_split() routine provides a powerful mechanism for splitting up processes into subgroups based on a requested subgroup number or “color.”

The function (see its syntax in Table One) partitions the group associated with comm into disjoint subgroups, one for each value of color. Each subgroup contains all of the processes of the same color, and the processes are ranked in the order defined by the value of the key argument. Ties among processes with the same color and key arguments are broken according to their rank in the old group. The function automatically creates a new communicator for each subgroup.
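As a minimal illustration of the color and key arguments (a sketch with invented variable names, separate from Listing One below), the call here would split MPI_COMM_WORLD into even- and odd-ranked groups and reverse the rank ordering within each new group by using a key that decreases as the old rank increases:

int world_rank, world_size;
MPI_Comm parity_comm;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

/* color 0 = even old ranks, color 1 = odd old ranks; a decreasing key
 * reverses the rank order within each new group */
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2,
               world_size - world_rank, &parity_comm);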




Table One: MPI routines for manipulating communicators

ACCESSORS
int MPI_Comm_size(MPI_Comm comm, int *size)
int MPI_Comm_rank(MPI_Comm comm, int *rank)
int MPI_Comm_compare(MPI_Comm comm1, MPI_Comm comm2, int *result)


CONSTRUCTORS
int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)


DESTRUCTORS
int MPI_Comm_free(MPI_Comm *comm)

Listing One contains a C program, split.c, that demonstrates the use of the MPI_Comm_split() routine. As usual, the MPI header file (mpi.h) is included at the top. Inside main(), MPI is initialized with MPI_Init(), and the rank, size, and processor name are obtained from calls to MPI_Comm_rank(), MPI_Comm_size(), and MPI_Get_processor_name(), respectively. As in previous examples, this information is printed to the display, beginning with "Hello world!" Notice that this time rank and size are two-element arrays; that's because a second process rank and communicator size will be obtained after the processes are partitioned.




Listing One: split.c demonstrates MPI_Comm_split()


#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank[2], size[2], namelen, color;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    char *cname[] = { "BLACK", "WHITE", "BLUE" };
    MPI_Comm comm_work;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank[0]);
    MPI_Comm_size(MPI_COMM_WORLD, &size[0]);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("Hello world! I'm rank %d of %d on %s\n",
           rank[0], size[0], processor_name);

    /* Partition the processes into three groups by color */
    color = rank[0] % 3;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank[0], &comm_work);
    MPI_Comm_rank(comm_work, &rank[1]);
    MPI_Comm_size(comm_work, &size[1]);
    printf("%d: I'm rank %d of %d in the %s context\n",
           rank[0], rank[1], size[1], cname[color]);

    MPI_Comm_free(&comm_work);
    MPI_Finalize();
    return 0;
}

Next, a color value (which must be a non-negative integer) is set to the rank of the process in the global context modulo 3. As a result, three different color values are established and three groups are defined as long as at least three processes are available. The cname[] array contains the names for these three groups: BLACK, WHITE, and BLUE. Now that each process has a color, the three new communicators can be easily and simultaneously created by calling MPI_Comm_split().








Figure One: The eight MPI processes are split into three groups: Black, White, and Blue. Each has a comm_work intracommunicator, defining its local context.

The first argument, MPI_COMM_WORLD, is the originating communicator. The color argument is 0, 1, or 2. The key value is rank[0], the rank of the process in the originating communicator. By using the original rank as the key, the rank order is preserved in the new context. Finally, the fourth argument, &comm_work, is a pointer to the new communicator on each process. This call actually generates three unique new communicators, one for each new context, as shown in Figure One.

The new rank and size are obtained by calls to MPI_Comm_rank() and MPI_Comm_size(), using the new communicator comm_work. Each process then prints out its new rank, size, and color/context name. At this point in a real model, the processes in each context would work collectively on some computation and afterward might return to the global context to summarize results. At the end of the code, the new communicator is destroyed using MPI_Comm_free(), and MPI is shut down with MPI_Finalize().
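That per-group work could be as simple as a collective reduction over comm_work. The fragment below is a hypothetical addition to split.c (the per-process value is invented for illustration) that would go just before the communicator is freed; each color group would compute its own independent sum.

/* Hypothetical addition to split.c: each color group sums a
 * per-process value using a collective over its own communicator. */
int local_val = rank[1] + 1;   /* assumed per-process contribution */
int group_sum;

MPI_Allreduce(&local_val, &group_sum, 1, MPI_INT, MPI_SUM, comm_work);
printf("%d: group sum in the %s context is %d\n",
       rank[0], cname[color], group_sum);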

The results of compiling and running split.c on eight processes are shown in Figure Two. As usual, the output is in no particular order. Notice that process zero becomes rank 0 in the BLACK context, one becomes rank 0 in the WHITE context, two becomes rank 0 in the BLUE context, and so on, as shown in Figure One.




Figure Two: Running split.c


[forrest@node01 comm]$ mpicc -O -o split split.c
[forrest@node01 comm]$ mpirun -np 8 split
Hello world! I'm rank 0 of 8 on node01
0: I'm rank 0 of 3 in the BLACK context
Hello world! I'm rank 6 of 8 on node07
6: I'm rank 2 of 3 in the BLACK context
Hello world! I'm rank 3 of 8 on node04
3: I'm rank 1 of 3 in the BLACK context
Hello world! I'm rank 4 of 8 on node05
4: I'm rank 1 of 3 in the WHITE context
Hello world! I'm rank 7 of 8 on node08
7: I'm rank 2 of 3 in the WHITE context
Hello world! I'm rank 2 of 8 on node03
2: I'm rank 0 of 2 in the BLUE context
Hello world! I'm rank 5 of 8 on node06
5: I'm rank 1 of 2 in the BLUE context
Hello world! I'm rank 1 of 8 on node02
1: I'm rank 0 of 3 in the WHITE context

For completeness, a table of all the MPI routines used to manipulate communicators (Table One) and a table of all the routines for manipulating groups (Table Two) are included here. All three communicator constructors have been demonstrated in examples. Only a couple of the group constructors appeared in the examples, but a wide variety of group constructors is available for defining process groups; a brief sketch using a few of them follows Table Two.




Table Two: MPI routines for manipulating groups


ACCESSORS
int MPI_Group_size(MPI_Group group, int *size)
int MPI_Group_rank(MPI_Group group, int *rank)
int MPI_Group_translate_ranks(MPI_Group group1, int n, int *ranks1,
    MPI_Group group2, int *ranks2)
int MPI_Group_compare(MPI_Group group1, MPI_Group group2, int *result)

CONSTRUCTORS
int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
int MPI_Group_union(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
int MPI_Group_intersection(MPI_Group group1, MPI_Group group2,
    MPI_Group *newgroup)
int MPI_Group_difference(MPI_Group group1, MPI_Group group2,
    MPI_Group *newgroup)
int MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)
int MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)
int MPI_Group_range_incl(MPI_Group group, int n, int ranges[][3],
    MPI_Group *newgroup)
int MPI_Group_range_excl(MPI_Group group, int n, int ranges[][3],
    MPI_Group *newgroup)

DESTRUCTORS
int MPI_Group_free(MPI_Group *group)
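
As promised above, here is a brief sketch using a few of the group constructors from Table Two. It builds a communicator containing only the even-ranked processes of MPI_COMM_WORLD; the variable names and the fixed-size ranks array are assumptions made for illustration, not code from the listings.

MPI_Group world_group, even_group;
MPI_Comm even_comm;
int world_size, i, n = 0;
int ranks[256];                 /* assumes no more than 256 processes */

MPI_Comm_size(MPI_COMM_WORLD, &world_size);
for (i = 0; i < world_size; i += 2)
    ranks[n++] = i;             /* collect the even global ranks */

MPI_Comm_group(MPI_COMM_WORLD, &world_group);        /* group behind the communicator */
MPI_Group_incl(world_group, n, ranks, &even_group);  /* keep only the even ranks */
MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

/* even_comm is MPI_COMM_NULL on the odd-ranked processes */
MPI_Group_free(&even_group);
MPI_Group_free(&world_group);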

Using Intercommunicators to Pass Messages Between Groups

Up to now, all the communicators discussed have been intracommunicators. These are used to communicate within a group of processes. Another type of communicator, called an intercommunicator, makes it possible to communicate between two groups of processes. Intercommunicators can be used for all point-to-point communications (MPI_Send(), MPI_Recv(), etc.), but may not be used to perform collective operations. This limitation goes away in MPI-2, but for those using MPI-1, these collective operations can be implemented by the programmer using combinations of MPI-1 routines.
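For example, a one-way broadcast from one group to the other can be pieced together from point-to-point calls. The helper below is only a sketch, and its name and calling convention are invented here (they are not part of MPI): the local leader of the sending group transmits the buffer to each process in the remote group, and every process in the receiving group posts a matching receive. Remember that in an intercommunicator, source and destination ranks always refer to the remote group.

/* Hypothetical helper: emulate a broadcast across an intercommunicator
 * using only MPI-1 point-to-point calls.  Processes in the sending group
 * call it with sender = 1; processes in the receiving group call it with
 * sender = 0.  Only the sending group's local leader (rank 0) transmits. */
void intercomm_bcast(void *buf, int count, MPI_Datatype type,
                     int sender, MPI_Comm intercomm)
{
    int local_rank, remote_size, i;
    MPI_Status status;

    MPI_Comm_rank(intercomm, &local_rank);   /* rank in the local group */
    if (sender) {
        if (local_rank == 0) {
            MPI_Comm_remote_size(intercomm, &remote_size);
            for (i = 0; i < remote_size; i++)
                MPI_Send(buf, count, type, i, 0, intercomm);
        }
    }
    else {
        /* Source rank 0 refers to rank 0 in the remote (sending) group */
        MPI_Recv(buf, count, type, 0, 0, intercomm, &status);
    }
}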

An intercommunicator is created by a collective call, MPI_Intercomm_create(), executed in the two groups to be connected. Arguments to this call (see Table Three) are the local intracommunicator, the local process leader (i.e., the rank of a process in the local context), a peer or parent communicator, the rank of the remote group’s leader in the peer/parent communicator context, a “safe” message tag to be used for communications between the process leaders, and a pointer to the new intercommunicator.




Table Three: MPI routines for manipulating intercommunicators


ACCESSORS
int MPI_Comm_test_inter(MPI_Comm comm, int *flag)
int MPI_Comm_remote_size(MPI_Comm comm, int *size)
int MPI_Comm_remote_group(MPI_Comm comm, MPI_Group *group)

CONSTRUCTORS
int MPI_Intercomm_create(MPI_Comm local_comm, int local_leader,
    MPI_Comm peer_comm, int remote_leader, int tag, MPI_Comm *newintercomm)

OPERATIONS
int MPI_Intercomm_merge(MPI_Comm intercomm, int high, MPI_Comm *newintracomm)

DESTRUCTORS
int MPI_Comm_free(MPI_Comm *comm)

Both groups must select the same peer communicator, and the peer communicator must contain all the members of the two groups for which the intercommunicator is being created. Although the rank of the process leader in each group does not matter, all participants in the operation must nominate the same process. Intercommunicators can be destroyed using the same call used to destroy intracommunicators: MPI_Comm_free().

Listing Two contains a modified version of the split.c code, called split-inter.c, which performs a point-to-point communication within the local group (after the processes are partitioned with MPI_Comm_split()) and then links the BLACK and WHITE contexts with an intercommunicator. This intercommunicator is then used to pass a value from each of the BLACK group members to the WHITE group member with the corresponding local rank. These operations are illustrated in Figure Three.




Listing Two: split-inter.c


#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank[2], size[2], namelen, color;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    char *cname[] = { "BLACK", "WHITE", "BLUE" };
    int i, buf, val;
    MPI_Comm comm_work, intercomm;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank[0]);
    MPI_Comm_size(MPI_COMM_WORLD, &size[0]);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("Hello world! I'm rank %d of %d on %s\n",
           rank[0], size[0], processor_name);

    color = rank[0] % 3;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank[0], &comm_work);
    MPI_Comm_rank(comm_work, &rank[1]);
    MPI_Comm_size(comm_work, &size[1]);
    printf("%d: I'm rank %d of %d in the %s context\n",
           rank[0], rank[1], size[1], cname[color]);

    val = rank[1];
    if (rank[1]) {
        /* Have every local worker send its value to its local leader */
        MPI_Send(&val, 1, MPI_INT, 0, 0, comm_work);
    }
    else {
        /* Every local leader receives values from its workers */
        for (i = 1; i < size[1]; i++) {
            MPI_Recv(&buf, 1, MPI_INT, i, 0, comm_work, &status);
            val += buf;
        }
        printf("%d: Local %s leader sum = %d\n",
               rank[0], cname[color], val);
    }

    /* Establish an intercommunicator for message passing between
     * the BLACK and WHITE groups */
    if (color < 2) {
        if (color == 0) {
            /* BLACK group: create intercommunicator and send to
             * corresponding member in WHITE group */
            MPI_Intercomm_create(comm_work, 0, MPI_COMM_WORLD, 1,
                                 99, &intercomm);
            MPI_Send(&val, 1, MPI_INT, rank[1], 0, intercomm);
            printf("%d: %s member; sent value = %d\n",
                   rank[0], cname[color], val);
        }
        else {
            /* WHITE group: create intercommunicator and receive
             * from corresponding member in BLACK group */
            MPI_Intercomm_create(comm_work, 0, MPI_COMM_WORLD, 0,
                                 99, &intercomm);
            MPI_Recv(&buf, 1, MPI_INT, rank[1], 0, intercomm, &status);
            printf("%d: %s member; received value = %d\n",
                   rank[0], cname[color], buf);
        }
        MPI_Comm_free(&intercomm);
    }

    MPI_Comm_free(&comm_work);
    MPI_Finalize();
    return 0;
}

The first part of the code is the same as in split.c, but after the new rank, size, and color are printed, each process sets val to the value of its (local) rank in the new context. If that rank is not zero, the process sends this value to the process of rank 0 in the local context using MPI_Send(). Correspondingly, each process of rank 0 loops over the other processes in its local group, receiving the sent values with MPI_Recv(). The rank-zero processes accumulate these values into val and then print the sums.

Next, an intercommunicator is established between the BLACK context and the WHITE context. Only processes with a value of color less than 2 execute this block of code, because processes in the BLUE context are not involved in establishing this intercommunicator. The BLACK and WHITE groups establish the intercommunicator by calling MPI_Intercomm_create() with consistent arguments. The first argument is the local communicator, called comm_work in both contexts. The second argument is the local process leader; all processes nominate the local process of rank 0 as leader.








Figure Three: A message is passed from each member of the BLACK group to its corresponding member in the WHITE group by way of an intercommunicator.

The peer intracommunicator in the third argument is the one to which all affected processes belong. Here MPI_COMM_WORLD is used. The fourth argument is the rank of the remote leader in the context of the peer intracommunicator. For the BLACK group, the remote leader is process zero in the WHITE context, which is process one in MPI_COMM_WORLD. For the WHITE group, the remote leader is process zero in the BLACK context, which is process zero in MPI_COMM_WORLD.

A tag value of 99 is provided as the fifth argument. This value of tag must be the same in both calls to MPI_Intercomm_create(). It should represent a "safe" tag for communications between the two process leaders in the MPI_COMM_WORLD context; therefore, this tag should not be used anywhere else in the code. Finally, a pointer to the new intercommunicator (&intercomm) is supplied as the sixth argument.

Once the intercommunicator has been established, the processes in the BLACK group send the contents of val to the processes with the corresponding rank in the WHITE context. This is accomplished by calling MPI_Send() using the local rank (rank[1]) as the destination and the new intercommunicator (intercomm) as the communicator. Likewise, members of the WHITE group call MPI_Recv(), using the local rank as the source along with the new intercommunicator just created.

After the point-to-point communication completes, processes in both groups report the values sent or received. Then the intercommunicator is destroyed using MPI_Comm_free(), the intracommunicator comm_work is freed, and MPI is finalized using MPI_Finalize().

Figure Four shows the results of compiling and running split-inter on eight processes. As with split.c, each process prints a "Hello world!" message followed by its rank and the size of the global context, along with the processor hostname. When the new intracommunicator is created, each process prints the rank, size, and color/context name of the new context.




Figure Four: The output of split-inter.c


[forrest@node01 comm]$ mpicc -O -o split-inter split-inter.c
[forrest@node01 comm]$ mpirun -np 8 split-inter
Hello world! I'm rank 0 of 8 on node01
0: I'm rank 0 of 3 in the BLACK context
0: Local BLACK leader sum = 3
0: BLACK member; sent value = 3
Hello world! I'm rank 6 of 8 on node07
6: I'm rank 2 of 3 in the BLACK context
6: BLACK member; sent value = 2
Hello world! I'm rank 2 of 8 on node03
2: I'm rank 0 of 2 in the BLUE context
2: Local BLUE leader sum = 1
Hello world! I'm rank 1 of 8 on node02
1: I'm rank 0 of 3 in the WHITE context
1: Local WHITE leader sum = 3
1: WHITE member; received value = 3
Hello world! I'm rank 4 of 8 on node05
4: I'm rank 1 of 3 in the WHITE context
4: WHITE member; received value = 1
Hello world! I'm rank 7 of 8 on node08
7: I'm rank 2 of 3 in the WHITE context
7: WHITE member; received value = 2
Hello world! I'm rank 5 of 8 on node06
5: I'm rank 1 of 2 in the BLUE context
Hello world! I'm rank 3 of 8 on node04
3: I'm rank 1 of 3 in the BLACK context
3: BLACK member; sent value = 1

Next, local process leaders (those processes with a rank of zero in the new context) print out the sum of values received from the other process group members. The local process leaders have ranks of zero (BLACK leader), one (WHITE leader), and two (BLUE leader) in the global context. The sums for BLACK and WHITE are 3 (the local ranks 0 + 1 + 2); the sum for BLUE is 1 (0 + 1).

Finally, the BLACK members all report the values they sent over the intercommunicator; the WHITE members all report the values they received. It should be clear that BLACK 0 sent to WHITE 0, BLACK 1 sent to WHITE 1, and BLACK 2 sent to WHITE 2, which maps to 0 sending to 1, 3 sending to 4, and 6 sending to 7 in the global (MPI_COMM_WORLD) context.

A Lot of Power

As you can see, MPI offers a powerful set of tools for manipulating process groups and communicators. These constructs allow for good code modularity, portable parallel library creation, and easy coupling of multi-task models. In addition, intercommunicators can be established as needed to provide communications between process groups.

This article rounds out a set of columns on advanced features of MPI. When MPI is used for message passing, some pretty neat things can be done on parallel computers.



Forrest Hoffman (forrest@climate.ornl.gov) is a computer modeling and simulation researcher at Oak Ridge National Laboratory. Download the source code used in this article from http://www.linux-mag.com/downloads/2003-07/extreme.
