Graham E. Fagg Archive

Fault Tolerant MPI
Clusters of every size experience failures: processors can die, hard disks often crash, and interface cards have been known to produce spurious errors. Of course, software can fail, too, for any number of reasons. Prevention is the first line of defense, but the next best option is to detect faults and respond to them as they occur. If you're a cluster developer, Fault Tolerant MPI (FT-MPI) can help keep your compute jobs humming.