Avoiding Common MPI Programming Errors

Convenience and ease-of-use were NOT design goals of MPI. Learn how to avoid some common pitfalls.

Pitfalls of Non-Blocking Point-to-Point Communication

Non-blocking point-to-point communication allows a program to initiate a message (send or receive) and then proceed without waiting for the message to complete. This enables the overlap of communication with computation in order to hide communication latency. In other words, the program can do useful work while messages are in transit. (At least this is the theory. Most MPI implementations, even modern ones, are weak in this regard. The state-of-the-art hasn’t changed much since White and Bova published “Where’s the Overlap? An Analysis of Popular MPI Implementations” almost ten years ago.) The only restriction is that the program must not access the send or receive buffers while the communication is in progress. Listing 3A illustrates an obvious data race on a send-buffer. The program must verify that the point-to-point communication is finished before it is safe to reuse the message buffers. Strictly speaking, overwriting the send-buffer with the same data or even reading the send-buffer while the message is in transit violates the MPI standard, so the code in Listings 3B and 3C is also incorrect.

It’s easy to understand why modifying the send-buffer during message transfer is unsafe, but it’s not immediately apparent why reading the send-buffer is unsafe. A real-world example helps to explain. Let’s say an MPI program is running on a heterogeneous cluster containing big-endian and little-endian systems. An MPI library that supports this environment may convert the send-buffer from big-endian to little-endian and vice versa. This would make the value of the variable “var” indeterminate in Listing 3C. The MPI-2.2 standard may relax this restriction, but for now, it is unsafe to even read the send-buffer while the message is in transit.

Listing 3A: This code contains a data race because the send-buffer is modified while a message is still in transit.

   buf = 1;
   var = 2;
   MPI_Isend (&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
   buf = var;
   MPI_Wait (&req, &stat);

Listing 3B: Overwriting the send-buffer, even with the same data, is not allowed while a message is in transit.

   buf = var = 1;
   MPI_Isend (&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
   buf = 1;
   MPI_Wait (&req, &stat);

Listing 3C: Even reading the send-buffer while a message is in transit violates the MPI standard. The MPI-2.2 standard may relax this restriction.

   MPI_Isend (&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
   var = buf;
   MPI_Wait (&req, &stat);
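
For comparison, the safe pattern is straightforward: complete the request with MPI_Wait (or poll it with MPI_Test) before the send-buffer is read or written again. A minimal sketch, reusing the variables from the listings above:

   buf = 1;
   MPI_Isend (&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
   /* ... useful work that does not touch buf ... */
   MPI_Wait (&req, &stat);   /* message complete; buf is safe to access again */
   var = buf;
   buf = 2;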

Listing 3D shows how race conditions are also possible when using buffered communication. The program in Listing 3D runs correctly most of the time because the attached buffer is almost always large enough to hold the messages that are in flight at any given moment. “Almost” is never good enough in parallel programming. The program fails if a previously buffered send has not completed by the time the next message needs buffer space. Detaching and reattaching the buffer inside the loop solves the race condition, because MPI_Buffer_detach blocks until every buffered message has been delivered, but it hurts performance. Increasing the buffer size is a better solution; a sketch of how to size the buffer appears after Listing 3D's output.

Listing 3D: Performing buffered sends inside of a loop can also cause a race condition if the attached buffer is too small. This example is adapted from the book Using MPI (2nd Edition) by W. Gropp, E. Lusk, and A. Skjellum.

#include "mpi.h"
#include <stdio.h>

int main (int argc, char *argv[])
{
   int i, rank, bsize, s, r;
   char sbuf[sizeof (int) + MPI_BSEND_OVERHEAD];   /* room for a single buffered int */
   void *bptr;                                     /* receives the buffer address on detach */
   MPI_Status stat;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &rank);

   if (rank == 0)
   {
      bsize = sizeof (sbuf);              /* enough for only one int message plus overhead */
      MPI_Buffer_attach (sbuf, bsize);

      for (i = 0; i < 25; i++)
      {
         s = i;
         MPI_Bsend (&s, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      }
      MPI_Buffer_detach (&bptr, &bsize);  /* blocks until all buffered sends complete */
   }
   else if (rank == 1)
   {
      for (i = 0; i < 25; i++)
      {
         MPI_Recv (&r, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
         printf ("Rank %d received %d\n", rank, r);
      }
   }

   MPI_Finalize ();
   return 0;
}

$ mpiexec -n 2 ./a.out

[cli_0]: aborting job:
Fatal error in MPI_Bsend: Invalid buffer pointer, error stack:
MPI_Bsend(184).......: MPI_Bsend(buf=0x7fffffffde98, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
MPIR_Bsend_isend(338): Insufficient space in Bsend buffer; requested 4; total buffer size is 99
rank 0 in job 130  compute-00-00_41645   caused collective abort of all ranks
  exit status of rank 0: return code 13 
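
One way to size the attached buffer reliably is to ask the library how much space each buffered message actually needs. The fragment below is a sketch of that approach (a variation on Listing 3D, not code from Using MPI), reusing the listing's variables and assuming <stdlib.h> is included for malloc: MPI_Pack_size reports the packed size of one int, and the buffer is allocated to hold all 25 messages plus the per-message bookkeeping overhead.

   int   pack_size;
   char *bbuf;

   MPI_Pack_size (1, MPI_INT, MPI_COMM_WORLD, &pack_size);

   /* Room for 25 buffered messages, each with its own overhead */
   bsize = 25 * (pack_size + MPI_BSEND_OVERHEAD);
   bbuf  = (char *) malloc (bsize);
   MPI_Buffer_attach (bbuf, bsize);

   for (i = 0; i < 25; i++)
   {
      s = i;
      MPI_Bsend (&s, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
   }

   /* Detach blocks until every buffered message has been delivered */
   MPI_Buffer_detach (&bbuf, &bsize);
   free (bbuf);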

When using non-blocking communication, good MPI programming practice dictates that every pending message should be checked for completion. Failure to do so can create data races, as shown above, but it can also lead to another problem: overflowing the message queue. Most MPI implementations can handle a large number of pending messages so this problem often escapes detection during development. Over time, however, production-scale workloads fill the queue with pending messages and prevent further communication, ultimately causing the program to fail. This is akin to a memory leak.
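
A simple defensive habit is to keep every request handle and complete it explicitly before the handles go out of scope. The fragment below is a sketch of that pattern using MPI_Waitall; MPI_Test or MPI_Testsome can be used instead when the program cannot afford to block.

   #define NMSG 100
   MPI_Request reqs[NMSG];
   MPI_Status  stats[NMSG];
   int         vals[NMSG], i;

   for (i = 0; i < NMSG; i++)
   {
      vals[i] = i;
      MPI_Isend (&vals[i], 1, MPI_INT, 1, i, MPI_COMM_WORLD, &reqs[i]);
   }

   /* Complete every pending request so none accumulates in the message queue */
   MPI_Waitall (NMSG, reqs, stats);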

MPI Resource Leaks

In addition to overflowing the message queue, other "leaks" are possible in the MPI library. Two such leaks concern communicators and attached message buffers.

The default communicator, MPI_COMM_WORLD, is sufficient for many MPI programs. Some applications, however, create new communicators to logically divide work. Consider a weather forecasting application, for example, that uses different communicators for its ocean and atmospheric models. An application using boss-worker parallelism can create distinct communicators for groups of worker processes assigned to different tasks. Parallel math libraries often create new communicators to keep their messages separate from the parent application. There are many good reasons for using communicators other than MPI_COMM_WORLD, but it is important to realize that the MPI library can only support a finite number of communicators. A parallel math library that is called multiple times during a long-running simulation could potentially exceed the number of available communicators.

Similarly, the MPI_Bsend function requires that extra storage be allocated for the message buffer and attached with MPI_Buffer_attach. In the same way that sequential programs should free dynamically allocated memory when it is no longer needed, MPI programs must be careful to free communicators and detach message buffers (with MPI_Buffer_detach) when they are no longer needed.
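
The discipline mirrors malloc and free: every communicator that is created is eventually freed, and every attached buffer is eventually detached. Here is a minimal sketch, assuming rank holds the result of MPI_Comm_rank and <stdlib.h> is included (MPI_Comm_split is just one of several ways to create a communicator, and the buffer size is arbitrary):

   MPI_Comm subcomm;
   char    *bbuf;
   int      bsize = 1024 + MPI_BSEND_OVERHEAD;

   /* Create a communicator for one phase of the computation */
   MPI_Comm_split (MPI_COMM_WORLD, rank % 2, rank, &subcomm);

   bbuf = (char *) malloc (bsize);
   MPI_Buffer_attach (bbuf, bsize);

   /* ... buffered sends and collective calls on subcomm ... */

   /* Release both resources when the phase is finished */
   MPI_Buffer_detach (&bbuf, &bsize);
   free (bbuf);
   MPI_Comm_free (&subcomm);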

Incomplete Collective Operations

A common misconception about MPI collective functions is that they behave like barriers. In the case of MPI_Bcast, for example, this would mean that no rank in the communicator can proceed until all ranks have completed the broadcast operation. Whether a collective call synchronizes in this way is actually implementation-dependent. (The obvious exception is MPI_Barrier, which always behaves like a barrier.) Some MPI implementations may use synchronous collectives, while others may use asynchronous collectives to achieve better performance. An MPI program must not rely on either behavior; it must be correct regardless of whether the implementation uses synchronous or asynchronous collective operations. The MPI Forum provides an example to illustrate this point (Listing 4):

Listing 4: This program is incorrect because the broadcast operations are executed in reverse order. Assume that MPI_COMM_WORLD has only two MPI processes.

   switch (rank)
   {
      case 0:
         buf1 = 1;
         MPI_Bcast (&buf1, 1, MPI_INT, 0, MPI_COMM_WORLD);
         MPI_Bcast (&buf2, 1, MPI_INT, 1, MPI_COMM_WORLD);
         break;
      case 1:
         buf2 = 2;
         MPI_Bcast (&buf2, 1, MPI_INT, 1, MPI_COMM_WORLD);
         MPI_Bcast (&buf1, 1, MPI_INT, 0, MPI_COMM_WORLD);
         break;
   }

Correctness dictates that all ranks of a communicator must execute collective operations in the same order. Deadlock will occur in Listing 4 if the MPI implementation uses a synchronous broadcast. This problem can escape detection during development if the developer uses an MPI implementation with asynchronous collectives but the end-user's MPI library has synchronous collectives. The MPI Forum provides several good examples of incorrect use of collective communication.
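
The fix is to issue the broadcasts in the same order on every rank. A sketch of a corrected version of Listing 4:

   if (rank == 0) buf1 = 1;
   if (rank == 1) buf2 = 2;

   /* Both ranks call the broadcasts in the same order */
   MPI_Bcast (&buf1, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* root is rank 0 */
   MPI_Bcast (&buf2, 1, MPI_INT, 1, MPI_COMM_WORLD);   /* root is rank 1 */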

Concluding remarks

Though convenience and ease-of-use were not design goals of MPI, it is still the most effective and widely used method for achieving scalable, parallel performance. The MPI Forum has begun working on the MPI 3.0 standard, which could make future MPI programs more fault-tolerant and easier to debug. In the meantime, avoiding common mistakes and pitfalls can go a long way toward making MPI programming easier. If all else fails, several good tools are available for debugging and profiling MPI applications.

References


  • J.B. White III and Steve W. Bova, "Where's the Overlap? An Analysis of Popular MPI Implementations," MPI Developers Conference, 1999.

  • Joe Landman, "MPI in Thirty Minutes," Linux Magazine, April 2008.

  • William Gropp, Ewing Lusk, and Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface (2nd Edition), The MIT Press, 1999.

  • Rusty Lusk and Bill Gropp, "Are We Stuck with MPI Forever?" ClusterWorld, March 2005.
