Fault Tolerant MPI

Clusters of every size experience failures: processors can die, hard disks often crash, and interface cards have been known to produce spurious errors. Of course, software can fail, too, for any number of reasons. Prevention is a necessity, but the next best option is to react and respond to faults as they occur. If you’re a cluster developer, Fault Tolerant MPI (FT-MPI) can help keep your compute jobs humming.

Today’s users of high-performance computing (HPC) systems have access to larger machines with more processors than ever before. Even discounting systems such as the Earth Simulator, the ASCI-Q machine, or IBM’s Blue Gene system, all of which consist of thousands or even tens of thousands of processors, everyday production clusters can easily consist of hundreds to a few thousand processors. Future systems composed of a hundred thousand processors are already on the drawing board and are expected to be in service within the next few years.

With such large systems, a critical issue is how to deal with hardware and software faults that lead to process failures. For instance, current experience with high-end machines, in particular a model of the Blue Gene system located at the Oak Ridge National Laboratory, suggests that a 100,000-processor machine experiences a processor failure every few minutes. Smaller systems fail less often, but they still have failures.
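
To get a feel for why machine size matters so much, here is a rough back-of-the-envelope sketch in Python. It assumes that processors fail independently and at a constant rate, so the mean time between failures (MTBF) of the whole machine is simply the per-processor MTBF divided by the processor count. The one-year per-processor MTBF is a hypothetical figure chosen for illustration, not a number from the Blue Gene model mentioned above, and the function name is likewise made up for this example.

```python
# Rough estimate of how often *some* processor fails in a large machine.
# Assumption: processors fail independently with a constant failure rate,
# so the system-wide failure rate is the sum of the per-processor rates.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def system_mtbf_minutes(processor_mtbf_minutes: float, processor_count: int) -> float:
    """Return the mean time between failures for the whole machine, in minutes."""
    if processor_count <= 0:
        raise ValueError("processor_count must be positive")
    if processor_mtbf_minutes <= 0:
        raise ValueError("processor_mtbf_minutes must be positive")
    return processor_mtbf_minutes / processor_count


if __name__ == "__main__":
    # Hypothetical per-processor MTBF of one year (an assumption for illustration).
    per_processor_mtbf = MINUTES_PER_YEAR
    for count in (100, 1_000, 100_000):
        mtbf = system_mtbf_minutes(per_processor_mtbf, count)
        print(f"{count:>7} processors: one failure every {mtbf:,.1f} minutes on average")
```

Under these assumptions, a processor that fails on average once a year still yields a failure somewhere in a 100,000-processor machine roughly every five minutes, while a 100-processor cluster sees one only every few days.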

While crashing nodes in earlier, pre-cluster, massively parallel processing systems (MPPs) often led to a crash of the whole system, current cluster architectures are much more robust. Typically, applications utilizing the failed processor have to abort, but the cluster, as an entity, is not affected by the failure. This robustness is the result of improvements in hardware and system software, as well as the very nature of the independent nodes that typically make up a cluster.

But failures aren’t limited to processes dying and hardware failing. In some extreme…
