HPC Guru Don Becker: Why MPI Is Inadequate

To date, the Message Passing Interface has been instrumental in simplifying application development for clusters. But as clusters change to embrace multiple cores, multiple platforms, and multiple advanced interconnects, MPI is no longer adequate. What can replace it? Donald Becker asks, “How about Unix?

The well-known Message Passing Interface (MPI) is widely used in parallel programming and was intrumental in simplifying application development for clusters. However, as clustering evolves to leverage virtualization, MPI is becoming irrelevant: MPI doesn’t prevent version skews, cannot handle failovers, and lacks an efficient way for applications to communicate with each other — a critical capability for virtualization to be successful. Fortunately, there are emerging alternatives to MPI that are better suited to meet industry demands for cluster virtualization without adding unnecessary complexity.

A Narrow World View

For all of its power, MPI purveys a static view of cooperating machines. MPI assumes that software is installed once, at the same time, on all machines, and furthermore assumes that all machines are known and named when a cluster is installed or an application is started. By design, the MPI “cluster size” (provided by MPI_Comm_size(MPI_COMM_WORLD)) remains fixed after initialization, making it impossible to either take advantage of new machines added to the cluster or to reduce the number of machines on which an application depends.

The MPI model holds that the world doesn’t evolve — that applications are never modified and improved. In the real world however, application programs, configuration files, and system components are updated (often), and the most common thing to do after an update is to immediately utilize the changes. But because MPI is based on an initialize-compute-terminate model, explicit support for checkpointing, executing as a service, and running “forever” is lost.

MPI doesn’t comprehend failures: there’s no mechanism to detect a failure, no way to report that a process has failed, and no way to recover from a fault. An MPI application may issue undone work to other ranks and complete the actual task, but it cannot terminate cleanly with missing nodes. Even a model with a static view of the working set of machines can include recovery for unexpectedly failed processes — yet MPI programs are susceptible to cascading failures, where a single, failed machine triggers the eventual failure of the whole system.

A better model utilizes a single point for program initiation, where there is a single consistency point, and it’s easy to handle and report initialization failures. But after a distributed program is spread across multiple nodes it shouldn’t depend on a specific node to continue independent operation.

MPI’s strength is collective mathematically-oriented operations, not communication. Although “message passing” may imply it, stream communication is not the focus of the MPI and message passing is unnecessarily complex. Many clustering applications work well with a sockets-based messaging, which is just as functional and far less complex.

While cross-architecture jobs are theoretically supported by MPI, implementation is extremely difficult. Cross-architecture support adds complexity without benefit. Furthermore, application deployment is inconsistent, which leads to versions skews of executables and libraries.

Dynamic cluster programming is a much better model for most applications, but it has its prerequisites. For example, dynamic cluster programming requires information and scheduling interfaces, true dynamic sizing with process creation and termination primitives, and status signaling external to application messages. In turn, these features require comprehensive reporting capabilities and interfaces that can identify and start usable nodes, provide capacities (such as speed, total memory, and so on), and state availability (including current load, available memory, and other system parameters). That information is captured differently on each platform, often using unique libraries or underlying subsystems. Dynamic cluster programming also requires explicit scheduler or mapper interfaces that can utilize an external scheduler or internally create a list of usable compute nodes.

The Need for New Technology

As an API, MPI provides maximum freedom for library implementors. And while recompilation is needed to move from one implementation and platform to another, in a world where most users compile their own codes, recompilation isn’t terribly onerous.

However, compilation isn’t always possible. For some open source packages, a domain expert (a non-computer scientist) is required to build the code. At the other extreme, commercial packages aren’t distributed as source. In the former case, the usefulness of the code is limited; in the latter case, having to recompile the code for each combination of platform and MPI version limits the portability of the software. Indeed, independent software vendors (ISVs) are typically choosing to only test and support one MPI implementation, hence supporting only a limited number of today’s high-speed cluster interconnects and cluster environments.

MPI is also an “all or nothing” solution: an application cannot run single-threaded if it determines that multiple processes aren’t beneficial. An application should be able to use only a subset of the available processors. For example, an application that uses a regular grid might choose to use only 16 of 23 provided nodes, leaving the remaining seven truly unused, so if those nodes crash or are otherwise unexpectedly removed, the disappearance of the resources shouldn’t affect correct, error-free completion. Ideally, processes should never be started on those nodes.

New process creation primitives are needed. In fact, one well-tested model already exists: Unix process management. The only additional element needed is the ability to specify remote nodes when creating processes. Monitoring and handling terminating processes does not need extensions.

BProc (see http://www.linux-mag.com/2005-02/extreme_01.html) supports remote spawning, but the concept of remote_fork() existed long before BProc. HPC programming libraries should use exactly the Unix semantics so that the library need only wrap the actual system calls.

Asynchronous signaling methods are needed, but luckily, Unix signals support this already as well. Signals’ (still modest) complexity represents many years of experience and demonstrates that getting the semantics for handling asynchronous events is difficult.

New interfaces must also avoid cascading failures. Many cluster subsystems have been designed with a flaw that’s glaring after the system is deployed: the failure of one element makes large parts of the system unusable. Some examples are creating diskless nodes using a NFS server, and certain cluster file systems that must essentially crash the machine to do lock recovery.

Potential Solutions

Several vendors have offered to solve these problems – with varying degrees of success – by selling widely-portable MPI implementations that support a wide range of systems, without requiring recompilation or re-linking. However, as clustering moves towards virtualization, most of these MPI implementations are not suited for the level of integration, reliability, and ease of management required to take clustering to the next level — namely to capitalize on virtualization.

Creating an application binary interface (ABI) for MPI is one solution. An ABI would allow applications to run on the widest variety of interconnects and MPI implementations without re-linking or recompiling. An ABI could standardize items that aren’t standardized in the MPI API, thereby increasing application and test portability and improving the quality of MPI implementations.

However, because MPI was originally created as an API and not an ABI, there is considerable work to be done. There is a trend towards modularization that can address this issue, and the notion of an MPI ABI has already gained strong support in the developer community.

As mentioned above, there is always the option of using a sockets-based messaging interface. This would involve compiling an application to use TCP/IP sockets, but sockets are less complex than MPI and just as functional. However, direct sockets are not ideal, as there are still problems with vendor IP stack implementations, including interrupts, user-space/kernel-space context switching, and CPU utilization, among other areas.

The bottom line is that we need a new model for application support on a virtualized cluster environment. Unfortunately, MPI cannot support these changes. The speed with which clusters are moving to virtualization requires that cluster builders and users focus on the issue today if clustering is to keep up with industry demand.

Donald Becker is the chief scientist and founder of Scyld Software (http://www.scyld.com).

Comments are closed.