
Disposable HPC

Increasing total cost of ownership (TCO) may make throw-away nodes an option in the future

In some of my past columns, I have discussed the reliability of clusters. The statistics tell us that the bigger things get, the more failures we can expect. Indeed, it is not uncommon for very large clusters to experience a failure per day, and this is entirely expected. In past musings on this topic, I have suggested dynamic parallelization and disposable nodes as ways to address the failure issue. Here I'll expand on the disposable idea based on some new research projects I have been following.

In my last column, I mentioned that I ran into Chuck Seitz of Myricom. In his talk he mentioned a paper called Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, in which the authors cite a purchase-price to three-year Total Cost of Ownership (TCO) ratio of 3.6 to 18.5 for clustered *nix systems. That is, if I spend $100,000 to buy a cluster, I can expect my three-year TCO to be somewhere between $360,000 and $1,850,000.
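The arithmetic behind that range is simple enough to sketch. Here is a quick Python illustration; the 3.6 and 18.5 multipliers come from the ROC paper, while everything else is my own framing:

```python
# Three-year TCO range implied by the ROC paper's purchase-price
# multipliers (3.6x to 18.5x for clustered *nix systems).
def tco_range(purchase_price, low=3.6, high=18.5):
    """Return (low, high) three-year TCO estimates for a given purchase price."""
    return purchase_price * low, purchase_price * high

low, high = tco_range(100_000)
print(f"3-year TCO: ${low:,.0f} to ${high:,.0f}")
```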

Their TCO estimates include administration, operations, network management, database management, and user support. Several costs typically associated with TCO were not included: space, power, backup media, communications, HW/SW support contracts, and downtime. In other words, just the person-power required to run these systems is expensive.

The authors suggest that research into Recovery Oriented Computing will aid in reducing these TCO costs, and they propose that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved. As explained in the paper: "By concentrating on Mean Time to Repair (MTTR) rather than Mean Time to Failure (MTTF), ROC reduces recovery time and thus offers higher availability. Since a large portion of system administration is dealing with failures, ROC may also reduce total cost of ownership."
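The MTTR argument can be made concrete with the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). A minimal sketch; the example numbers are mine, not from the paper:

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time a system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Halving MTTR buys the same availability as doubling MTTF,
# which is the core of the ROC argument.
print(availability(1000, 10))  # baseline
print(availability(1000, 5))   # repair twice as fast
print(availability(2000, 10))  # fail half as often
```

The last two calls give identical results, which is why attacking recovery time can be just as effective as chasing ever-more-reliable hardware.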

The paper discusses various ways that ROC may be achieved. Another point the paper makes is the frustration that IT equipment causes in many situations. TCO often does not include the Total Cost Of Frustration (TCOF). We can all relate to uncooperative computers and have probably, at one time or another, wanted to try this man's solution. The term "computer rage" has also been used to describe the frustration of those who have to deal with computer failures. TCOF is probably not a measurable quantity, but it is real. Feeling defeated by silicon never helps the situation.

The ROC paper focuses on ways to recover quickly from problems. Big clusters are a good area where the ROC approach should be useful, because with clusters failure is inevitable. Another important point from the paper is the size of the TCO multipliers. All too often, users focus on the hardware cost alone as a measure of price-to-performance and ignore the other TCO issues. In particular, when power and cooling are factored in, the TCO ratios can increase significantly beyond those for person-power alone.

I have presented my take on dynamic parallelization, and it certainly shares some ideas with the ROC approach. I have also written about disposable nodes, and I want to expand on this topic a bit.

Recent advances in low-power processors and embedded systems have piqued my interest in using a large number of slower processors versus a small number of faster processors. My interest is based on the assumption that slower implies cooler and cheaper, while faster implies hotter and more expensive. With frequencies approaching the 2 GHz range, some of the latest low-power chips are actually not as slow as one would expect. The big multi-core chips are faster, but they get most of their performance gain from having multiple cores.

Designing a cluster with slower processors means that the node cost can be quite low. Of course there are interconnect issues, but for the sake of my argument, let's assume we are building a cluster out of small Mini-ITX motherboards with on-board GigE (for instance, the Intel DH57JG). These boards cost on the order of $100. If we assume disk-less nodes and add a processor, memory, and a small case, the node cost may reach $400 (or less). The question I ask myself is: at what price point does a node become too cheap to bother repairing? Of course it might be fixable, but given the TCO costs, is it possible that the cheapest approach is to turn off the node and forget about it? For a small number of nodes, this may seem extreme, but if you have a cluster with 10,000 nodes, then losing a small amount of computing power may not be that significant.
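As a back-of-the-envelope check, the repair-or-discard decision comes down to comparing labor cost against replacement cost. A rough sketch; the $75/hour labor rate is my own assumption, not a figure from the column:

```python
def worth_repairing(node_cost, repair_hours, labor_rate=75.0):
    """Repair a node only if the labor costs less than replacing it outright."""
    return repair_hours * labor_rate < node_cost

print(worth_repairing(400, 2))  # two hours of labor: still worth fixing
print(worth_repairing(400, 8))  # a full day of labor: cheaper to discard
```

With numbers like these, the breakeven point for a $400 node arrives after only a few hours of a technician's time, which is the whole disposable-node argument in miniature.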

Of course, there may be times when dead nodes can be harvested and sent back to the vendor or replaced with fresh new ones, but again, spending more than an hour or two fixing something that costs $400 may not make sense. A similar situation exists with home electronics: it is cheaper to buy a new DVD player than to fix an old one. The same is true for many home computers. Many people have told me they found it cheaper and less frustrating to buy a new system than to pay someone to clean malware off their old "slow" machine.

I'll close by mentioning a recent effort to use low-power cluster nodes. The FAWN project (Fast Array Of Wimpy Nodes) is a good example of using small, low-power nodes to achieve a good price/power/performance metric. There are other wimpy-node approaches being studied as well. At some point, increasing TCO may make the nodes in a FAWN disposable, resulting in better ROC and helping to reduce the TCOF. That has got to be the most acronyms I have ever used in one sentence.

Comments on "Disposable HPC"

stevemadere

It is amusing to see the same article quote the enormous implied (and at least partially hidden) TCO of a cluster and then advocate using a larger number of cheaper, less reliable nodes rather than fewer, more expensive nodes with the maximum power possible.

Are you assuming that cost of administration rises super-linearly with CPU power?
If not, your solution is actually making the problem worse!

Additionally, one has to be very clever to forestall Amdahl's law when deploying slower nodes.

I suspect the cost of implementing that cleverness far, far exceeds the savings from using cheaper nodes.
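Amdahl's law quantifies this trade-off: with serial fraction s, the speedup on n nodes is 1 / (s + (1 - s)/n). A small sketch with illustrative numbers of my own choosing:

```python
def amdahl_speedup(n, serial_frac):
    """Amdahl's law: speedup of a job with a fixed serial fraction on n nodes."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

# 100 fast nodes versus 400 nodes at a quarter of the speed, 5% serial code.
fast = amdahl_speedup(100, 0.05)          # speedup relative to one fast node
slow = 0.25 * amdahl_speedup(400, 0.05)   # same, scaled by the 4x-slower clock
print(fast, slow)
```

With any nontrivial serial fraction, quadrupling the node count does not come close to recovering the factor-of-four clock deficit.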

The solution may indeed be disposable nodes, but more likely they are the fastest nodes you can get your hands on, and the big change in mindset is the willingness to treat such an expensive piece of equipment as disposable.

truly64

Having increased the size of our Linux cluster five-fold over the last four years, I see no five-fold increase in TCO. The total cost of ownership has been flat. In other words, in my experience, it costs just as much to own and manage a 100-node cluster as a 1000-node cluster.

First of all, system managers get very good at adding nodes easily, so each additional node is a negligible extra cost or effort.

Secondly, with commodity hardware having a typical 3 year hardware warranty, it makes no sense to put servers onto a costly maintenance contract, as the definition of a cluster means that it can accommodate a node failure by transferring a job to another node.

Thirdly, after three years, if a node fails, it is removed from the cluster, as it is well past obsolete and not worth fixing. Three years ago, I was buying dual dual-core servers; now I can get dual 12-core servers for the same price. Why fix the old one?

So I find the Myricom analysis differs greatly from what I observe in reality. One can create very impressive levels of computing power very cheaply, using an open-source OS and tools, and maintain it on a shoestring without compromising reliability and availability. That is the beauty of a cluster deployment.

markhahn

the numbers in that paper are fairly bizarre: I can't really imagine how it could cost $3m to support 4.1 servers purchased for $160k and serving 4500 users. note that these TCO numbers actually exclude space, power, backup, connectivity, hw/sw support contracts and downtime!

the TCO for HPC is much lower – more like 10% per year, even including power. that puts your $400 node in a rather different light – do you mind spending a half-hour per year replacing a failed ATX PSU, etc?

not to mention that from a green perspective, "disposable" pretty much always means "unsustainable" and "giant carbon footprint".

