Clusters of every size experience failures: processors can die, hard disks often crash, and interface cards have been known to produce spurious errors. Of course, software can fail, too, for any number of reasons. Prevention is a necessity, but the next best option is to react and respond to faults as they occur. If you're a cluster developer, Fault Tolerant MPI (FT-MPI) can help keep your compute jobs humming.
Today’s users of high performance computing (HPC) systems have access to larger machines with more processors than ever before. Even discounting systems such as the Earth Simulator, the ASCI-Q machine, or IBM’s Blue Gene system (all of which consist of thousands or even tens of thousands of processors), everyday production clusters can easily consist of hundreds to a few thousand processors. Future systems composed of a hundred thousand processors are already on the drawing board and are expected to be in service within the next few years.
With such large systems, a critical issue is how to deal with hardware and software faults that lead to process failures. For instance, based on current experiences with high-end machines, in particular a model of the Blue Gene system located at the Oak Ridge National Laboratory, a 100,000-processor machine experiences a processor failure every few minutes. Smaller systems fail less often, but they still have failures.
While crashing nodes in earlier, pre-cluster, massively parallel processing systems (MPPs) often led to a crash of the whole system, current cluster architectures are much more robust. Typically, applications utilizing the failed processor have to abort, but the cluster, as an entity, is not affected by the failure. This robustness is the result of improvements in hardware and system software, as well as the very nature of the independent nodes that typically make up a cluster.
But failures aren’t limited to processes dying and hardware failing. In some extreme cases, even networking hardware delivers wrong or corrupted data, leading to anything from an incorrect result to what appears to be a sudden, non-repeatable software error. Fast interconnect cards do have bit errors, and many users switch off the CRC checks (if they exist) to improve performance. But even a very small bit error rate, multiplied by hundreds of megabytes of data transferred and multiplied again by thousands of nodes, quickly leads to an actual error.
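The multiplication in the previous paragraph is easy to work through. The sketch below uses purely illustrative numbers (a bit error rate of 1e-12, 500 MB moved per node, 2,000 nodes); none of these figures comes from a real interconnect, and the Poisson approximation is an assumption made for simplicity.

```python
import math

def expected_bit_errors(bit_error_rate, bytes_per_node, nodes):
    """Expected number of flipped bits across the whole job."""
    total_bits = bytes_per_node * 8 * nodes
    return bit_error_rate * total_bits

def prob_at_least_one(expected):
    """Poisson approximation: P(at least one error) = 1 - e^(-expected)."""
    return 1.0 - math.exp(-expected)

# Illustrative numbers only: 1e-12 error rate, 500 MB per node, 2,000 nodes.
errors = expected_bit_errors(1e-12, 500 * 10**6, 2000)
print(errors, prob_at_least_one(errors))
```

With these assumed inputs, the job should expect about eight corrupted bits, so at least one error is close to certain if nothing checks the data.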
Ultimately, as systems get bigger, the chance of failure increases. And with failures inevitable, the next best option is reacting and responding to faults as they occur. If you’re a cluster developer, Fault-Tolerant MPI (FT-MPI) is like an insurance policy: it’s there in the event of a disaster.
Programming Clusters for Failure
As many clusters are made from single-, dual- or quad-CPU nodes interconnected by some networking infrastructure, the clusters are, in effect, distributed memory architectures (even if made from very small, shared memory machines). This generally has forced the use of message passing as the most effective means of programming nodes for parallel execution, where multiple nodes work on a problem together.
Even though software such as the Parallel Virtual Machine (PVM) is still used occasionally and has many advantages for some workloads, most systems use an implementation of the Message Passing Interface (MPI) standard, the most commonly used implementations being LAM/MPI from Indiana University and MPICH from Argonne National Laboratory (ANL). But current MPI implementations have a problem: they cannot handle a failure. In the current MPI specification (see the sidebar “MPI Terminology” for a review of important MPI concepts), when a task fails, the implementation is supposed to follow the MPI error mode. By default, most systems use “Errors abort.” This means that the implementation simply aborts as soon as an error is detected.
MPI Terminology
* MPI task. A runtime entity that makes up a parallel program. Each task is given a unique address. The standard does not say whether a task is a process, a thread, or neither.
* Rank. A unique integer label used for addressing tasks. Ranks have meaning only when associated with groups or communicators.
* Group. A set of tasks identified by a handle. The number of tasks in a group is fixed once the group is formed. Ranks within a group run from 0 to size-1 for that group.
* Communicator. A special container that consists of a group (meaning it has tasks) and some storage (attributes). Communication can only occur within communicators. When an MPI job starts, all the initial tasks belong to the predefined communicator MPI_COMM_WORLD.
* Object. MPI groups, communicators, and non-blocking communication requests are MPI-managed objects. The user’s application manipulates these through references known as handles.
For example, if you run a weather simulation job on a 1,000-node cluster for 1,000 hours before your MPI implementation detects an error, your one million node-hours of computing simply disappear.
Fortunately, MPI has another error mode: “Errors return.” In this mode, the MPI library passes the MPI error codes back to your application. However, most implementations then allow an application to do only things that lie outside MPI, such as writing output to a log file. When a task leaves (or fails) before MPI_Finalize() is called (the last MPI function you must call to shut down the runtime system), the MPI standard says the communicators containing the failed tasks are now “undefined,” which, in effect, makes the communicators “invalid,” so no more message passing can occur.
The most common method for surviving failures is to periodically checkpoint the application by saving some program state onto reliable media. In this case, when the program fails, the survivors are killed off and the application is restarted from the last consistent and complete set of checkpoint files. Checkpointing works for MPI since there’s never any attempted use of an undefined communicator. When, how, and who starts the checkpointing operation and which data is checkpointed can be varied depending on the target application and cluster design.
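The checkpoint-and-restart cycle described above can be sketched without any MPI at all. The example below is a single-process illustration in plain Python: the function names, the pickle file format, and the toy "sum the integers" workload are all assumptions made for the sketch, not part of any MPI checkpointing system.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint

def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "total": 0}

def run(path, n_iters, every, fail_at=None):
    """Sum 1..n_iters, checkpointing every `every` iterations."""
    state = load_checkpoint(path)
    for i in range(state["iteration"] + 1, n_iters + 1):
        if i == fail_at:
            raise RuntimeError("simulated task failure at iteration %d" % i)
        state["total"] += i
        state["iteration"] = i
        if i % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling run() a second time with the same path resumes from the last complete checkpoint; any work done after that checkpoint and before the failure is recomputed.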
For parallel jobs including MPI programs, having a complete set of checkpoint files is the most complex issue. Several systems support checkpointing for MPI applications. LAM/MPI performs coordinated checkpointing, where all tasks stop message passing then write to disk simultaneously before passing messages again. However, pausing work, then hitting the disk subsystems in parallel can be very expensive, depending on node counts and application size.
Uncoordinated checkpointing solves that potential problem by having nodes checkpoint at different times, retaining any in-flight messages via explicit logging. MPICH-V2/3 and MPICH-CL are examples of such systems. Depending on which message logging scheme is used, there are potential performance losses due to increased latency or reduced bandwidth caused by the collection of the message logs.
Both coordinated and uncoordinated checkpointing persist the task’s entire state, including the application, stack, allocated memory, and so on. Storing all of this data can be very expensive, and even if you control what’s checkpointed, checkpointing on very large clusters may be completely impractical.
Users of earlier message passing systems used on clusters, such as PVM, were actually able to handle failures in a completely different way. In PVM, it was possible to write a message passing program that, when it detected a failure, either just restarted another process on another node and continued, or ignored the failure completely and continued with reduced capacity. Having a choice was very useful, but it also meant that applications had to be developed with fault tolerance in mind, unlike system-level checkpointing, which stores the whole application image and thus requires no changes to an existing application.
The Idea Behind Fault Tolerant MPI
Fault-Tolerant MPI allows an application to take action if a failure occurs (much like PVM). FT-MPI provides an integrated environment for running MPI jobs based on a distributed virtual machine (DVM). In short, you get full job control, job monitoring, I/O redirection, and so on.
FT-MPI doesn’t change any MPI API calls, so any conformant MPI program can run under FT-MPI. FT-MPI provides an efficient, full MPI-1.2 specification implementation, as well as over half of the MPI-2 specification. This means you can link and run most of your current MPI applications unchanged.
By design, FT-MPI performance is comparable to other open source MPI implementations (at least when no errors have occurred, which should be most of the time). And while FT-MPI isn’t an automatic checkpointing MPI, you can use external checkpointing libraries with it.
The FT-MPI Runtime and Architecture
FT-MPI is based on a meta-computing system known as HARNESS. HARNESS was a Department of Energy (DOE) project involving the three original PVM developers: Oak Ridge National Laboratory (ORNL), Emory University, and the Innovative Computing Laboratory (ICL) group at the University of Tennessee, Knoxville. (See the sidebar “The Innovative Computing Laboratory” for more about ICL.)
University of Tennessee’s Innovative Computing Laboratory (ICL)
The University of Tennessee’s Innovative Computing Laboratory was started by Dr. Jack Dongarra in 1989. From its earliest days, it’s been a hotbed of scientific research in a wide range of areas, from numeric scientific libraries to distributed computing. Since its creation, ICL has produced a number of strategic and widely-used open source packages and projects including LAPACK, ScaLAPACK, High Performance Linpack, ATLAS, BLAS, PVM, Netsolve, NHSE, Netlib, and the Top500 list.
ICL currently employs over forty staff and students, and has active collaborations with dozens of leading research universities and institutes worldwide.
The HARNESS system was a modular, lightweight framework that allowed different application-specific modules to be loaded as needed on a per-user or per-run basis. One of these modules, FT-MPI, provided a message passing API.
HARNESS also featured a virtual machine abstraction that allowed each user to build a private DVM and merge them when they needed to cooperate. In many ways, HARNESS looks like and has the same flexibility as PVM without any of its shortcomings.
The FT-MPI runtime is built upon a daemon-based system (similar in concept to that of LAM and MPICH-2). The daemons come in four flavors:
* Startup daemons start jobs, monitor their execution, and collect and redirect output. They also do any signaling needed. Daemons are single threaded (useful before Linux had real threads), and hold very little state, so memory overhead is small. Unlike PVM, the daemons do not know about each other, thus they scale very well and do not create any central point of failure. Startup daemons detect MPI task failures by catching the SIGCHLD signal. In turn, the startup daemons generate death events.
* Name servers hold lists of daemons, plug-ins, and MPI job histories, and can be replicated to match the scale of your cluster. Name servers do not run jobs (they only store information and distribute events), and can run outside your cluster depending on your network configuration.
* Notifiers receive death events from the startup and watchdog daemons, and act as a broadcast medium.
* Watchdogs, if run, monitor the startup daemons using various forms of heartbeat communication. Without a watchdog, if a node goes offline, it might take as long as the TCP timeout to detect that an MPI task has vanished.
The FT-MPI daemons provide only runtime support and don’t provide any MPI features. As usual, the MPI runtime library the application links against performs all the MPI-to-MPI communication. And once an MPI task starts, it doesn’t contact the daemons until it exits, unless it needs to perform some system-level operation, such as spawning another MPI task elsewhere.
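The article does not describe which heartbeat schemes the watchdogs use, so the following is only a generic sketch of timeout-based detection. The function name, the node names, and the five-second timeout are all hypothetical.

```python
def find_silent_nodes(last_heartbeat, now, timeout):
    """Return nodes whose most recent heartbeat is older than `timeout` seconds."""
    return sorted(n for n, seen in last_heartbeat.items() if now - seen > timeout)

# Hypothetical timestamps (seconds) of the last heartbeat seen from each node.
last_heartbeat = {"node01": 100.0, "node02": 97.5, "node03": 88.0}
print(find_silent_nodes(last_heartbeat, now=101.0, timeout=5.0))
```

A watchdog built along these lines would report a silent node after a few seconds instead of waiting for a TCP timeout.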
Using the FT-MPI runtime DVM
Installing and configuring the FT-MPI package for Linux clusters is straightforward and very similar to installing any other package. It creates a standard set of directories such as include, lib, and bin under a single HARNESS package directory. This directory has to be set with the environment variable HARNESS_ROOT. Executables (MPI applications and daemons) are stored in $HARNESS_ROOT/bin/LINUX, and libraries are stored in $HARNESS_ROOT/lib/LINUX.
Once the system is configured, you can start the DVM either manually or via a dedicated software console (yet another idea borrowed from PVM). To start a system manually, you need to start a single name service and notifier, and then start a startup daemon (startup_d) on each node of your cluster. This could be done inside /etc/rc scripts, for example.
However, using the console is simpler, and an example run is shown in Output One. The console automatically starts all daemons and has a history function as well. (The console is complex enough to have its own README file.)
Output One: Using the dedicated HARNESS/FT-MPI console
Welcome to Harness/FT-MPI console.
No host available in “dvm”. Please see “add” command.
Description: Add hosts to virtual machine
Syntax: add <hostname> [hostname1 hostname2 ...]
or add -f <filename>
con> add -f hfile
Found 2 hosts
con> sp -np 2 -o -mpi -s bmtest
leader: doing a handshake then doing benchmark
size 0 bytes time 0.0460 mSec bw 0.0000 MB/Sec
size 1048576 bytes time 15.1836 mSec bw 65.8607 MB/Sec
The console offers a number of other advantages for users of large clusters:
* Each DVM has a unique name, and you can simultaneously control any number of virtual machines from the same console.
* A host can belong to more than one DVM, and you can divide your cluster into different DVMs for different uses, or even have one DVM for hosts using an Ethernet device, and have another, different DVM for the same hosts configured for a different networking device. When you want to run a job using Ethernet, say, run on one DVM. Then use the console to switch devices (using the console’s vmname command) when you want to run a job on the other.
* MPI jobs get a single handle in the form of a job ID (much like that from a batch queuing system). Using this single handle, you can control an entire job, be it one process or ten thousand. You can control individual processes via globally unique IDs (GIDs) that HARNESS allocates to each process.
* The launching of the DVM daemons is performed in a scalable fashion. However, since it involves ssh-ing to multiple hosts, it can still take several seconds. Being a daemon-based system means that this is the highest cost you ever pay for starting anything up. Once started, all MPI jobs are started by an efficient broadcast and then local fork()-exec() calls. MPI application startup is fast.
* Lastly, both the FT-MPI library and HARNESS runtime systems are fully heterogeneous. You can create a DVM across multiple architecture and endian classes, and the system handles the configuration seamlessly.
So, a mixed cluster of 32- and 64-bit Linux boxes, a few other Unix workstations, and even a Windows PC still creates a uniform DVM that correctly runs a single MPI job across all nodes, as long as you have compiled for all in advance.
Using FT-MPI to run MPI programs
Once you have a DVM, you can compile and run an FT-MPI application. Compiling is performed with supplied scripts depending on your application language: ftmpicc for standard C, ftmpiCC for C++, and ftmpif77 for Fortran.
The executable that’s produced should then be moved into the $HARNESS_ROOT/bin/LINUX directory. If your cluster doesn’t use shared file systems, the executable must also be copied to each node that needs it.
Once the executable is in the binary directory, it can be started by the startup daemons. Binaries outside the binary directory are ignored by the daemons as a basic level of security.
To run an FT-MPI application from the command line, you would use:
Host% ftmpirun -np <num of processes> -o <myexe> <arg1> <arg2> …
The -o indicates that you want I/O redirected back to this window. (Detailed information on managing the DVM and running applications is provided with the various README files in the FT-MPI distribution.)
The FT-MPI library is a tuned MPI implementation. A significant amount of effort went into making FT-MPI competitive with other open source implementations. Areas that received special attention were the user-derived data and buffer management routines, the collective (group) communications, and the point-to-point TCP engines.
FT-MPI does all the conversion on the receiver’s side, thus it doesn’t use an intermediate format such as XDR. This greatly improves performance in heterogeneous systems.
Figure One: FT-MPI basic point to point performance on TCP-based gigabit Ethernet
Figure One shows the ping-pong bandwidth of the last FT-MPI release versus LAM/MPI and MPICH 1 and 2 over a TCP-based Gigabit Ethernet interconnect.
What Does FT-MPI Do During Failures?
Handling fault-tolerance typically consists of three steps: failure detection, notification, and recovery. The first two steps are handled by FT-MPI’s runtime system, HARNESS. HARNESS discovers failures (death events) and informs all the remaining processes in the parallel job about those events.
The notification of failed processes is passed to the MPI application through the use of a special error code. As soon as an MPI task has received the notification of a death event through this error code, its general state changes from “no failures” to “failure recognized.” While in this state, the task is only allowed to execute certain actions.
These actions depend on a set of user-defined failure modes.
The recovery procedure itself consists of three more steps: recovering the MPI library (specifically, getting back to a no-failures state), recovering the runtime environment, and recovering the application. The runtime environment automatically handles all bookkeeping about losses of daemons and user processes. Recovering the application is considered to be the responsibility of the application, and two examples are shown below.
Handling MPI During Failures
FT-MPI uses failure modes to decide how to handle MPI objects such as communicators, groups, and messages. These modes control the steps necessary to start the recovery procedure back to no failure and control the status of the MPI objects and messages during and after recovery. The three modes are: recovery mode, communicator mode, and message mode.
Recovery mode defines how the recovery procedure can be started. Currently, three options are defined: automatic recovery mode, where the recovery procedure is started automatically by the MPI library as soon as a failure event has been recognized; manual recovery mode, where the application has to start the recovery procedure through the use of a standard MPI function with specific arguments; and null recovery mode, where the recovery procedure does not have to be initiated at all. However, any communication to failed processes will raise an additional error.
The first option prevents an application from ignoring an error. The second gives a developer more control over when the error is handled, while the third allows the application to keep on going without pausing to recover.
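The three recovery modes can be pictured as a small state machine. The sketch below is a toy model in plain Python, not FT-MPI code: the class, the method names, and the choice of exceptions are hypothetical. It also assumes that in manual mode a task cannot communicate until it has recovered, and that recovery replaces the failed processes.

```python
from enum import Enum

class RecoveryMode(Enum):
    AUTO = "auto"      # library starts recovery as soon as a failure is seen
    MANUAL = "manual"  # application must start recovery itself
    NULL = "null"      # no recovery step; only talking to dead peers fails

class Task:
    def __init__(self, mode):
        self.mode = mode
        self.state = "no failures"
        self.failed = set()
        self.recoveries = 0

    def notify_failure(self, rank):
        """Called when a death event for `rank` reaches this task."""
        self.failed.add(rank)
        if self.mode is RecoveryMode.NULL:
            return  # keep running without any recovery procedure
        self.state = "failure recognized"
        if self.mode is RecoveryMode.AUTO:
            self.recover()

    def recover(self):
        """Return to the no-failures state (assumes failed peers are replaced)."""
        self.failed.clear()
        self.state = "no failures"
        self.recoveries += 1

    def send(self, dest):
        if dest in self.failed:
            raise ConnectionError("rank %d has failed" % dest)
        if self.state != "no failures":
            raise RuntimeError("recovery required before communicating")
        return "sent to %d" % dest
```

In this model an automatic-mode task never stays in the failure-recognized state, a manual-mode task stays there until recover() is called, and a null-mode task only notices the failure when it addresses the dead rank.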
The status of MPI objects after the recovery operation is dependent on whether they contain some global information or not. In MPI-1, the only objects containing global information are groups and communicators. These objects are “destroyed” during the recovery procedure and only the objects available after the MPI_Init() call are re-instantiated by the library (MPI_COMM_WORLD and MPI_COMM_SELF).
Communicators and groups can have different formats after the recovery operation depending on the communicator mode. Failed processes can either be replaced (FTMPI_COMM_MODE_REBUILD) or not. In case the failed processes are not replaced, the user still has two choices: the position of the failed process can be either left empty in groups and communicators (FTMPI_COMM_MODE_BLANK) or the groups and communicators can shrink such that no gap is left (FTMPI_COMM_MODE_SHRINK).
Figure Two shows how the different communicator modes affect communicator membership and MPI ranks.
Figure Two: Communicator modes and how they affect MPI tasks and ranks
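The effect of the three communicator modes on membership and ranks can also be modeled in a few lines. This is a plain Python illustration, not FT-MPI code: the function name and the string mode labels are hypothetical, a communicator is modeled as a list indexed by rank, and a trailing apostrophe marks a replacement process.

```python
def apply_comm_mode(members, failed, mode):
    """Return the communicator membership after recovery.

    members: list of task names indexed by rank
    failed:  set of ranks that died
    mode:    "REBUILD", "BLANK", or "SHRINK"
    """
    if mode == "REBUILD":
        # Failed tasks are replaced; every rank keeps its position.
        return [m + "'" if r in failed else m for r, m in enumerate(members)]
    if mode == "BLANK":
        # Failed positions stay empty; survivors keep their ranks.
        return [None if r in failed else m for r, m in enumerate(members)]
    if mode == "SHRINK":
        # Survivors are renumbered so that no gap is left.
        return [m for r, m in enumerate(members) if r not in failed]
    raise ValueError("unknown communicator mode: %r" % mode)

members = ["A", "B", "C", "D"]
for mode in ("REBUILD", "BLANK", "SHRINK"):
    print(mode, apply_comm_mode(members, {1}, mode))
```

Note that only the shrink mode changes the rank of a survivor: task C moves from rank 2 to rank 1 when rank 1 fails, so any application logic keyed on rank has to be recomputed.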
Finally, FT-MPI handles the state of messages when errors occur based on the message mode. In FTMPI_MSG_MODE_RESET, all messages in transit are cancelled by the system. This mode is mainly useful for applications that, on error, roll back to the last consistent state in the application.
As an example, if an error occurs in iteration 423 and the last consistent state of the application is from iteration 400, then all on-going messages from iteration 423 would just confuse the application after the rollback. The second mode completes the transfer of all messages after the recovery operation, with the exception of the messages to and from the failed processes. This mode requires that the application keep detailed information of the state of each process, minimizing the rollback procedure. Similar modes are available for collective operations, which can either be executed in an atomic or a non-atomic fashion.
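The difference between the two message modes comes down to which in-flight messages survive recovery. The toy model below represents each message as a (source, destination, payload) tuple. The label "CONTINUE" for the second mode is a placeholder chosen for this sketch, since the text above does not name that mode, and the function is not part of FT-MPI.

```python
def messages_after_recovery(in_flight, failed, mode):
    """Return the in-flight messages that are still delivered after recovery."""
    if mode == "RESET":
        # Everything in transit is cancelled; the application rolls back.
        return []
    if mode == "CONTINUE":
        # Everything completes except traffic to or from failed ranks.
        return [m for m in in_flight
                if m[0] not in failed and m[1] not in failed]
    raise ValueError("unknown message mode: %r" % mode)

in_flight = [(0, 1, "x"), (1, 2, "y"), (2, 0, "z")]
print(messages_after_recovery(in_flight, {1}, "RESET"))
print(messages_after_recovery(in_flight, {1}, "CONTINUE"))
```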
Writing Fault Tolerant Applications
Writing efficient and correct message passing applications is non-trivial. Writing a message passing application that handles random failures and correctly recovers is even harder. It’s a very specialized field, and not every application can easily be made fault tolerant, although some classes of applications naturally need very few changes to benefit. For all other applications, either major rewriting is needed, standard checkpointing is needed, or someone else needs to provide some fault tolerant libraries for you to use!
Let’s look quickly at two different classes of fault tolerant applications. The first is a master-slave application of the kind common in parameter sweeps and Monte-Carlo simulations, where state only exists at the master. The second is a closely-coupled, parallel, pre-conditioned conjugate gradient solver, which is a hard, performance-sensitive HPC application.
Master-slave computations, where a single process known as the master farms out work to multiple slaves, are widely used for a large number of problems, including Monte-Carlo simulations, bioinformatics, and so on. This is also the easiest kind of application to make fault tolerant, especially if it already uses dynamic load balancing, in which the master hands out work only when a slave requests it. This allows faster slaves to complete work more quickly and thus ask for more work. For dynamic load balancing, the master has to track what work was given to which slave.
In the case of fault tolerant MPI, we could simply keep track of which slave had which piece of work, and in the event one dies, just reissue its work to another slave by adding it back to the work-to-do list. Any of the communicator modes (rebuild, blank, shrink) would work, with rebuild and blank being the simplest to use.
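The bookkeeping described above is small enough to simulate without MPI. In the sketch below, everything is hypothetical: the function name, the worker names, the squaring "work", and the dies_on table that scripts which worker dies on which task. It only illustrates the idea of putting a dead worker's task back on the work-to-do list.

```python
from collections import deque

def run_master(tasks, workers, dies_on):
    """Farm out tasks; dies_on maps a worker name to the task it dies on."""
    todo = deque(tasks)
    alive = list(workers)
    results = {}
    reissued = []
    while todo:
        if not alive:
            raise RuntimeError("no workers left")
        # Hand one task to each live worker and remember who has what.
        assigned = {}
        for w in alive:
            if not todo:
                break
            assigned[w] = todo.popleft()
        for w, task in assigned.items():
            if dies_on.get(w) == task:
                # Death event: drop the worker, put its task back on the list.
                alive.remove(w)
                todo.append(task)
                reissued.append(task)
            else:
                results[task] = task * task  # stand-in for real work
    return results, reissued

results, reissued = run_master(range(6), ["w0", "w1", "w2"], {"w1": 1})
print(sorted(results.items()), reissued)
```

Dropping the dead worker from the live list corresponds roughly to the blank or shrink modes; with rebuild, a replacement worker would be added back instead.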
Figure Three: State transition diagram for a master/slave fault tolerant application
If we needed the master process to also be fault tolerant, we need only replicate the work lists and partial results. If the main master dies and the replicas are not exact, we’d just reprocess some of the work that had already been completed. Figure Three shows the state diagram for the master task for each of the different communicator modes. (An example application is included in the FT-MPI examples directory, which can be easily adapted to different problems.)
Closely-coupled parallel applications are a harder problem to solve. They usually cannot lose computational nodes; instead, they must use the rebuild communicator mode. They also cannot survive with a loss of some of their working data. The data must be recovered somehow, be that through an explicit checkpoint on file or memory or via a reverse calculation of some kind.
The first step with this type of application is to add user-level checkpointing and then make sure the application can really restart from one of these checkpoints and still produce a correct answer. The developers of FT-MPI have experimented with various methods of checkpointing to both disk and in memory, be that other nodes’ memory or our own using some kind of redundant encoding (erasure codes) such as Reed-Solomon. These methods are effectively RAID 5, but for in-memory working and partial results. An advantage of these methods is that they can be customized to handle any expected number of concurrent failures.
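To make the RAID 5 comparison concrete, here is the simplest possible redundant encoding: a single XOR parity block, which tolerates the loss of any one task's data. It is a stand-in for illustration only. It is not Reed-Solomon, which is what is needed to survive several concurrent failures, and it is not the FT-MPI developers' code. The function names and sample data are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(data_blocks):
    """Parity block held by one extra task."""
    return xor_blocks(data_blocks)

def recover_block(data_blocks, parity, lost):
    """Rebuild the block at index `lost` from the survivors and the parity."""
    survivors = [b for i, b in enumerate(data_blocks) if i != lost]
    return xor_blocks(survivors + [parity])

# Each "task" holds one block of working data; all blocks have equal length.
blocks = [b"task0-data", b"task1-data", b"task2-data"]
parity = make_parity(blocks)
print(recover_block(blocks, parity, lost=1))
```

Because XOR is its own inverse, combining the surviving blocks with the parity block reproduces the missing one exactly.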
Once the MPI application is working with user-level checkpointing, it can then be altered to use the FT-MPI rebuild mode of failure recovery. This spares the application from failing outright and then being restarted completely from a checkpoint. Moreover, the surviving tasks can start loading the last checkpoint immediately, while the FT-MPI runtime system restarts the failed nodes. These nodes then load their own checkpoints and the application continues with no user intervention.
Changing the existing MPI application to support this fault tolerance is assisted by an often-forgotten feature of the MPI API: error handlers. In MPI, an error handler can be attached to a communicator, so that when an error occurs, the MPI library calls back to a user-supplied error routine via a function pointer. The advantage of this system, if combined with long jumps, is that if an application is made of many levels of software, the lowest level layer that calls MPI can be left completely untouched!
Figure Four: Structure of a fault tolerant closely coupled solver
When an error occurs, the MPI library calls the user’s recovery routine without altering any of the underlying structure. After the recovery routine has executed, it can then perform a long jump to a point near the start of the program (outside the main iteration loop) without the user having to worry about passing additional errors up through all the layers of software.
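The control flow of an error handler plus a long jump can be imitated in Python, which has no setjmp/longjmp: an exception raised inside the handler unwinds the stack to a restart point outside the main iteration loop. Everything in this sketch is a stand-in. FakeComm is a hypothetical object that only mimics attaching an error handler to a communicator, and the "solver" just counts.

```python
class Restart(Exception):
    """Plays the role of the long jump back to the restart point."""

class FakeComm:
    """Hypothetical communicator that calls a user handler on error."""
    def __init__(self):
        self.errhandler = None
        self.fail_next = False

    def set_errhandler(self, fn):
        self.errhandler = fn

    def send(self, data):
        if self.fail_next:
            self.fail_next = False
            self.errhandler(self, "peer died")  # library calls back to the user
        return data

def low_level_exchange(comm, x):
    # Deepest software layer: no error checks, no knowledge of failures.
    return comm.send(x) + 1

def solver(comm, n_iters=10, every=5, fail_at=None):
    events = []

    def on_error(comm, err):
        events.append(err)  # recovery work would go here
        raise Restart()     # "long jump" out of every intermediate layer

    comm.set_errhandler(on_error)
    saved = (0, 0)  # user-level checkpoint: (next iteration, value)
    while True:     # restart point, outside the main iteration loop
        try:
            it, value = saved
            while it < n_iters:
                if it == fail_at:
                    comm.fail_next, fail_at = True, None  # inject one failure
                value = low_level_exchange(comm, value)
                it += 1
                if it % every == 0:
                    saved = (it, value)
            return value, events
        except Restart:
            continue  # reload the last checkpoint and carry on
```

The point of the pattern is that low_level_exchange() needs no changes at all: the failure is handled entirely by the handler and the restart loop.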
Figure Four shows the outline of a fault tolerant closely-coupled MPI solver. Figure Five shows the performance of an example, parallel, pre-conditioned conjugate gradient solver developed for various levels of failure resistance.
The solver uses N tasks to perform a calculation and M additional tasks to hold redundant information used to re-compute lost data if one of the original N tasks is lost. To be able to survive the loss of a single task, M needs to be 1. To survive 2 concurrent failures, 2 additional tasks are needed, and so on.
Figure Five: Performance of a fault tolerant closely coupled solver
The test compares the time the application takes to execute in three configurations: without any user-level, in-memory checkpointing (i.e., the original application); with user-level checkpointing but no failures; and with the specified number of concurrent failures at the 500th iteration. A checkpoint is taken every 100 iterations (approximately 250 seconds). The initial matrix is of order 943,695 by 943,695 with 39,297,771 non-zeros.
FT-MPI is an Open Source, high performance MPI 1.2 implementation designed from the ground up to allow for process failures. It offers users a very flexible runtime environment aimed specifically at large clusters, and it has a number of different failure-recovery models to choose from for those wishing to experiment with fault tolerance.
FT-MPI continues to evolve slowly towards the MPI-2 specification, although a significant amount of effort is focused on developing new methods and algorithms for implementing fault tolerant numeric libraries.
Since the start of 2004, a new open source MPI has been in the making. Groups from Indiana University (makers of LAM/MPI), Los Alamos National Laboratory (LANL, makers of LA-MPI) and developers from ICL (FT-MPI) have formed a joint collaboration to create a new MPI based on all of our collective experiences (and mistakes). The new MPI, known as Open MPI, is modular, offers high performance and is very configurable. As you read this article, the fault tolerant recovery schemes from FT-MPI are being componentized for use with Open MPI. Watch this space for future updates on Open MPI!
Graham Fagg is a Research Associate Professor in the Computer Science Department of the University of Tennessee, Knoxville. He developed the first version of FT-MPI and a C-based Harness system. His prior work included working on SNIPE, PVM, PVMPI, MPI_Connect, lots of parallel I/O, and breaking Globus. He can be reached at firstname.lastname@example.org.