Increasing total cost of ownership (TCO) may make throw-away nodes an option in the future
In some of my past columns, I have discussed the reliability of clusters. The statistics tell us that the bigger things get, the more failures we can expect. Indeed, it is not uncommon for very large clusters to have a failure per day, and this is entirely expected. In my past musings on this topic, I have suggested dynamic parallelization and disposable nodes as ways to address the failure issue. I’ll expand on the disposable idea based on some new research projects I have been following.
In my last column, I mentioned that I ran into Chuck Seitz of Myricom. In his talk he mentioned a paper called Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, in which the authors cite a purchase-price-to-three-year Total Cost of Ownership (TCO) ratio of 3.6 to 18.5 for clustered *nix systems. That is, if I spend $100,000 to buy a cluster, I can expect my three-year TCO to be somewhere between $360,000 and $1,850,000.
Their TCO estimates include administration, operations, network management, database management, and user support. Several costs typically associated with TCO were not included: space, power, backup media, communications, HW/SW support contracts, and downtime. In other words, just the person-power required to run these systems is expensive.
The authors suggest that research into Recovery Oriented Computing will help reduce these TCO costs, and they propose that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved. As explained in the paper, by concentrating on Mean Time to Repair (MTTR) rather than Mean Time to Failure (MTTF), ROC reduces recovery time and thus offers higher availability. Since a large portion of system administration is dealing with failures, ROC may also reduce total cost of ownership.
The paper discusses various ways that ROC may be achieved. Another point the paper made was the frustration that IT equipment causes in many situations. TCO often does not include the Total Cost Of Frustration (TCOF). We can all relate to uncooperative computers and probably have at one time or another wanted to try this man’s solution. The term “computer rage” has also been used to describe the frustration of those having to deal with computer failures. TCOF is probably not a measurable quantity, but it is real. Feeling defeated by silicon never helps the situation.
The ROC paper focuses on ways to recover quickly from problems. Big clusters are a good area for the ROC approach because with clusters failure is inevitable. Another important point from the paper is the size of the TCO multipliers. All too often users focus on hardware cost alone as a measure of price-to-performance and ignore the other TCO issues. In particular, when power and cooling are factored in, the TCO ratios can climb significantly above those driven by person-power alone.
I have presented my take on dynamic parallelization, and it certainly shares some ideas with the ROC approach. I have also written about disposable nodes, and I want to expand on that topic a bit.
Recent advances in low-power processors and embedded systems have piqued my interest in using a large number of slower processors rather than a small number of faster processors. My interest is based on the assumption that slow implies cooler and cheaper, while fast implies hotter and more expensive. With frequencies approaching the 2 GHz range, some of the latest low-power chips are actually not as slow as one would expect. The big multi-core chips are faster, but they get most of their performance gain from multiple cores.
Designing a cluster with slower processors means that the node cost can be quite low. Of course there are interconnect issues, but for the sake of my argument, let’s assume we are building a cluster out of small Mini-ITX motherboards with on-board GigE (for instance, the Intel DH57JG). These boards cost on the order of $100. If we assume disk-less nodes and add a processor, memory, and a small case, the node cost may reach $400 (or less). The question I ask myself is: at what price point does a node become too cheap to bother repairing? Of course it might be fixable, but given the TCO costs, the cheapest approach might be to turn off the node and forget about it. For a small number of nodes this may seem extreme, but if you have a cluster with 10,000 nodes, then losing a small amount of computing power may not be that significant.
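The disposable-node argument comes down to simple arithmetic: compare the labor cost of a repair against the replacement cost of the node, and weigh the capacity lost by walking away. Here is a back-of-the-envelope sketch; the $400 node price comes from the estimate above, while the admin hourly rate and the count of dead nodes are hypothetical numbers I picked for illustration.

```python
# Back-of-the-envelope check: is repairing a cheap node worth the labor?
# The $400 node cost is from the column; the other numbers are assumptions.
NODE_COST = 400.0       # replacement cost of a disk-less Mini-ITX node ($)
ADMIN_RATE = 75.0       # fully loaded admin labor cost per hour ($), assumed
CLUSTER_NODES = 10_000  # cluster size from the example above

# Break-even repair time: past this many hours of labor,
# throwing the node away is cheaper than fixing it.
break_even_hours = NODE_COST / ADMIN_RATE
print(f"Break-even repair time: {break_even_hours:.1f} hours")

# Capacity lost by simply abandoning a handful of dead nodes.
dead_nodes = 10  # assumed number of abandoned nodes
lost_fraction = dead_nodes / CLUSTER_NODES
print(f"Capacity lost to {dead_nodes} dead nodes: {lost_fraction:.2%}")
```

With these assumed numbers, the break-even point is around five hours of labor, and abandoning ten dead nodes costs a tenth of a percent of the cluster's capacity, which is the heart of the "turn it off and forget it" case.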
Of course, there may be times when dead nodes can be harvested and sent back to the vendor or replaced with fresh new ones, but again, spending more than an hour or two fixing something that costs $400 may not make sense. A similar situation exists with home electronics: it is cheaper to buy a new DVD player than to fix an old one. The same is true for many home computers. Many people have told me they found it cheaper and less frustrating to buy a new system than to pay someone to clean malware off their old “slow” machine.
I’ll close by mentioning a recent effort to use low-power cluster nodes. The FAWN project (Fast Array Of Wimpy Nodes) is a good example of using small, low-power nodes to achieve a good price/power/performance metric. There are other wimpy-node approaches being studied as well. At some point, increasing TCO may make the nodes in a FAWN disposable, resulting in better ROC and a lower TCOF. That has got to be the most acronyms I have ever used in one sentence.
Douglas Eadline is the Senior HPC Editor for Linux Magazine.