Avoiding Common MPI Programming Errors

Convenience and ease-of-use were NOT design goals of MPI. Learn how to avoid some common pitfalls.

The Message-Passing Interface (MPI) has been around since 1994. It defines an API that allows programmers to transfer data between concurrent processes. MPI is very versatile and gives programmers fine control over all aspects of inter-process communication and synchronization. Therefore, MPI generally performs better than higher-level parallel programming methods. However, the programmer is responsible for each message and must coordinate all of the communication between processes. Many users equate MPI programming to assembly language programming. Few will claim that MPI is easy to use. Indeed, in a retrospective on MPI (“Are We Stuck with MPI Forever?” Cluster World, March 2005), Rusty Lusk and Bill Gropp, two of the original authors of the MPI standard, wrote that “convenience and ease of use were not high-priority goals of the MPI design…” In spite of this, MPI is the lingua franca for HPC because of its portability, generality, and perhaps most important, attainable performance.

This article shows some of the MPI programming errors that are frequently found in customer programs during application projects at Intel. It is assumed that the reader is already familiar with the MPI distributed-memory programming model. (Programmers who are new to MPI should take a look at Joe Landman’s excellent “MPI in Thirty Minutes” tutorial.) It is assumed that the reader knows how to send messages in MPI, understands the difference between blocking and non-blocking communication, and is familiar with MPI collective operations. This article focuses on common programming errors related to the MPI message passing model, API, and implementation-specific details.

Pitfalls in Blocking Point-to-Point Communication

Communication and synchronization are tightly coupled in MPI programs. Nothing illustrates this more clearly than blocking, point-to-point communication. For example, an MPI process sends a message to another process. In blocking communication, both the sender and the receiver wait for the message to be transferred before continuing. (Later we’ll see that there are exceptions, but for now let’s assume that this is always true.) Because blocking communication is entirely synchronous, deadlock will occur unless each send has a matching receive. In Listing 1A, even-numbered ranks send data to the odd-numbered ranks, but a logic error causes deadlock. [Note: Complete source for the examples can be found here.] Specifically, the sends and receives are not properly matched because the odd ranks compute their message source as rank + 1 instead of rank - 1. Similarly, the code in Listing 1B only works correctly for an even number of MPI processes. While the receivers and senders have the correct source and destination ranks, an odd number of processes means there is one more receiver than there are senders.

Listing 1A: Deadlock due to unmatched sends and receives

   if (rank % 2 == 0)
   {
      dest = (rank + 1 < n_proc) ? rank + 1 : MPI_PROC_NULL;
      MPI_Send (buf, N, MPI_INT, dest, 0, MPI_COMM_WORLD);
   }
   else
   {
      source = (rank + 1 < n_proc) ? rank + 1 : MPI_PROC_NULL;
      MPI_Recv (buf, N, MPI_INT, source, 0, MPI_COMM_WORLD, &stat);
   }

Listing 1B: Another example of unmatched sends and receives when the number of MPI processes is odd

   if (rank % 2 == 0)
   {
      source = rank + 1;
      MPI_Recv (buf, N, MPI_INT, source, 0, MPI_COMM_WORLD, &stat);
   }
   else
   {
      dest = rank - 1;
      MPI_Send (buf, N, MPI_INT, dest, 0, MPI_COMM_WORLD);
   }
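
Listing 1B can be repaired in a similar way. The following is only a sketch of one possible fix, not necessarily the one used in the complete source: guard the receive on the highest even rank with MPI_PROC_NULL so that a matching operation is still posted (and completes immediately) when n_proc is odd and no sender exists above it.

   if (rank % 2 == 0)
   {
      /* The highest even rank has no odd neighbor when n_proc is odd;
         receiving from MPI_PROC_NULL returns immediately. */
      source = (rank + 1 < n_proc) ? rank + 1 : MPI_PROC_NULL;
      MPI_Recv (buf, N, MPI_INT, source, 0, MPI_COMM_WORLD, &stat);
   }
   else
   {
      dest = rank - 1;
      MPI_Send (buf, N, MPI_INT, dest, 0, MPI_COMM_WORLD);
   }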

Unmatched sends and receives are the first, and most obvious, pitfall of point-to-point communication. Nonetheless, the code in Listing 2A will deadlock even though the sends and receives are properly matched. This is because each MPI process performs a blocking receive before a corresponding send is posted. By the same logic, the program in Listing 2B should also deadlock because each MPI process performs a blocking send before a corresponding receive is posted. However, this is not always the case. The program does deadlock, with an informative error message, when run with only one process. When run with more than one MPI process, it surprisingly works.

Listing 2A: MPI ranks passing data in a ring but performing blocking receives before the corresponding sends are posted, resulting in deadlock

   right = (rank + 1 == n_proc) ? 0 : rank + 1;
   left = (rank - 1 < 0) ? n_proc - 1 : rank - 1;

   MPI_Recv (buf, N, MPI_INT, left, 0, MPI_COMM_WORLD, &stat);
   MPI_Send (buf, N, MPI_INT, right, 0, MPI_COMM_WORLD);

Listing 2B: Performing a blocking send before the corresponding receive is posted should cause deadlock, but sometimes it does not. The error output is from the Intel MPI Library, which is based on MPICH2.

#include "mpi.h"
#include <stdio.h>

#define N 65536

int main (int argc, char *argv[])
{
   int rank, n_proc, right, left, buf[N];
   MPI_Status stat;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
   MPI_Comm_size (MPI_COMM_WORLD, &n_proc);

   right = (rank + 1 == n_proc) ? 0 : rank + 1;
   left = (rank - 1 < 0) ? n_proc - 1 : rank - 1;

   MPI_Send (buf, N, MPI_INT, right, 0, MPI_COMM_WORLD);
   MPI_Recv (buf, N, MPI_INT, left, 0, MPI_COMM_WORLD, &stat);

   printf ("Rank %d sent message to rank %d\n", rank, right);
   printf ("Rank %d received message from rank %d\n", rank, left);

   MPI_Finalize ();
}

$ mpicc listing-2b.c -o listing-2b.exe
$ mpirun -np 1 ./listing-2b.exe
[cli_0]: aborting job:
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(175): MPI_Send(buf=0x7ffffffbde50, count=65536, MPI_INT, dest=0, tag=0, MPI_COMM_WORLD) failed
MPID_Send(57): DEADLOCK: attempting to send a message to the local process without a prior matching receive
rank 0 in job 51  compute-00-00_41645   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9 

$ mpiexec -n 4 ./listing-2b.exe
Rank 2 sent message to rank 3
Rank 2 received message from rank 1
Rank 0 sent message to rank 1
Rank 0 received message from rank 3
Rank 3 sent message to rank 0
Rank 3 received message from rank 2
Rank 1 sent message to rank 2
Rank 1 received message from rank 0

Remember the previous assertion that blocking communication is entirely synchronous and that neither the sender nor receiver can continue until message transfer is complete? This is not entirely true for most MPI implementations, which brings us to the topic of internal buffering. Notice the size of the messages being sent in Listing 2B. In most MPI implementations, MPI_Send actually behaves like a non-blocking send when the message size is below a certain threshold. However, the MPI specification is very clear that programs must not rely on this behavior:

"The reluctance of MPI to mandate whether standard sends are buffering or not stems from the desire to achieve portable programs. Since any system will run out of buffer resources as message sizes are increased, and some implementations may want to provide little buffering, MPI takes the position that correct (and therefore, portable) programs do not rely on system buffering in standard mode." [source]

Since this threshold can vary, incorrect programs may work fine with one MPI library but deadlock with another. Such problems often escape detection because debugging workloads tend to have smaller messages than production workloads, or because the MPI library used during development has a larger internal buffer than the end user's library. This error is also common in MPI programs that were translated from PVM, because programmers mistakenly assume that since pvm_send is a non-blocking function, MPI_Send is also non-blocking. Seeing MPI_Send behave like a non-blocking function confirms their assumption. Nevertheless, their program now contains a dormant deadlock.
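
One deadlock-free way to write the ring exchange of Listing 2B, shown here only as a sketch that reuses the rank, n_proc, right, left, and stat variables from that listing, is to let MPI pair the operations with MPI_Sendrecv; a non-blocking MPI_Isend/MPI_Irecv pair followed by MPI_Waitall works equally well. The sendbuf and recvbuf arrays are hypothetical names introduced because MPI_Sendrecv requires distinct send and receive buffers:

   int sendbuf[N], recvbuf[N];

   /* Post the send and the receive together, so correctness no longer
      depends on the library's internal buffering or eager threshold. */
   MPI_Sendrecv (sendbuf, N, MPI_INT, right, 0,
                 recvbuf, N, MPI_INT, left,  0,
                 MPI_COMM_WORLD, &stat);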
