How does one manage really big clusters? Perhaps nature can give us a clue.
As clusters continue to grow in size and complexity, I often think about the programming and management issues users and administrators must face. There are currently proposals floating around that require five digit node counts. That is a lot of cores, heat, power, cables, floor space, and coffee. If I were setting up a 25,000 node cluster, I would suddenly acquire a renewed interest in statistics. Not just the nodes, but also the networking. As you may know, infrastructure parts are rated in terms of MTBF (Mean Time Between Failure). This is a statistic that is often mis-understood. If a part has a MTBF of 10,000 hours, that does not mean the part will last 10,000 hours. It is a number that is used to determine the failure rate of a collection of similar parts.
An example will help illustrate. A compute node may have a MTBF of 250,000 hours (about 28 years), which sounds impressive. If you are considering 2,500 nodes in a cluster, then you can expect one node will fail every 100 hours (Divide the MTBF for a thingby the number of things and you get the expected time between failures.) If you take it up notch — using hypothetical 25,000 nodes — then you can expect a node to fail every 10 hours. Bummer. I call it the more stuff you have, the more stuff that can break rule.1
The nice thing about a cluster is that node failures are isolated. That is, no one node can cause the failure of any other node (well, it should not, in any case). Of course, if a node fails the job using that node generally fails. The node is usually taken offline, at which point the scheduler then rounds up another batch of nodes and re-runs the job (if configured to do so). Long running jobs often write checkpoint files so they can re-start from the check point and thus not have to re-do the entire program run. This situation is what I would call manageable pain and is what makes clusters so attractive.
What about a switch failure? If a switch (or part of switch) goes down, then all those nodes attached to the switch are not usable. As the cores per node grows, that single interconnect connection becomes more of a critical issue. Indeed, losing a single node, can take out 8 cores, but dropping a switch can easily take 256 or more cores offline. Ouch! Indeed, if switch or switch blade needs to be taken out of service, all nodes that route through the switch are effectively useless. Switches are becoming a much more critical in terms of overall system reliability than in the past.
Redundancy is one method to ensure there is no single point of failure in cluster systems. Adding a second interconnect can provide redundancy, more throughput, lower latency, and additional cost. As clusters continue to grow redundant networks may become a necessity above a certain node count.
Another significant failure point that rears its head with large clusters is job management. If the scheduler crashes, or has difficulties managing the number of jobs, then the cluster will not be able to run effectively. For instance, in our hypothetical 25,000 node cluster, there would be 200,000 cores. If an average job used 64 cores, then the scheduler would have to actively manage/monitor approximately 3,000 jobs at the the same time. Remember, the scheduler needs to manage the loads and job status on each node in real time. Inefficiencies or thrashing in the scheduler could easily leave jobs waiting in the queue or worse cause the scheduler to crash or run out of memory. Which leads me to wonder, what do we do when we hit clusters with nodes approaching 6 figures?
A few years ago, my daughter and I watched a show on ants. Of course, there were not the gentle but pesky ants that live in your backyard. These were South American army ants. Their colonies can reach over one million in size and they pretty much do what they want. (i.e. do not get in their way when they are foraging for food or you may be come a member of the colony — in a truly distributed sense)
I often think about ants, bees, and other colonizing insects when I think about really large clusters. Insect colonies are in the truest sense of the word parallel computers. Think about it. How do you control one million ants. They work together surprisingly well without MPI and Gigabit Ethernet. Entomologists will tell you that ants communicate using chemicals, but much of what they do is preprogrammed and governed by what I would assume are simple rules. An individual ant is useless on its own and expendable, but very good at doing its particular job when working with others.
As we approach the really big threshold, we may need to look at how nature has solved the scale problem. The first thing that hits me is redundancy and fault tolerance. Just like the ant colony, a large cluster and the software running on it will need to adapt to constant node and network failure. As clusters get bigger, nodes and networks are going to fail at increasing rates. Programs and networks may need to manage themselves without any reliance on a single point of control or at least minimizing those points as much as possible. That is, programs may need to find resources on their own and not wait for a central authority to allocate them. There is no one telling ant number 183,234 to go pick up that crumb over there. The colony adapts dynamically to its needs.
Regardless of how well we engineer our cluster, there will always be some areas that are more sensitive to failure than others. Ultimately, it comes down to a cost-benefit type of analysis. The ultimate redundant over-kill design is of course a RAID 1 kind of cluster (build two and mirror the entire cluster). And, like clusters, ant colonies are not immune to a central failure. In my experience, they seem to recover quite well from me stomping on them with my shoe, but take out the queen and it is game over. But queens actually create new queens, thus ensuring against total ant annihilation. (Note: No ants were hurt in the production of this document.)
As we continue to scale, I believe system management will become one of the more important issues. Centrally managing a large number of >anything is difficult, and I would suspect there is some kind of threshold where the amount of information required to manage a big-thing might be approaching the amount of information created by the big-thing. Perhaps at this point, a more dynamic approach would be actually work better. That seems to how the bugs manage things.
1 TMSYHTMSTCBR for short.