Not only does green HPC save money and reduce power usage, it also increases reliability.
One of my favorite computer stories is about my first web server. I had mentioned it previously, but I’ll provide a quick recap here. Back in the day, I put together a Linux web server using a 486-DX2 box I had lying around. Aside from being impressed that this “Linux thing” actually worked, I was impressed with the fact that a lowly x86 box could do a great job serving web pages. Of course, back then web pages were basic HTML 1.0, but compared to other more expensive UNIX servers this system was a real bargain.
The system was a large tower case with the motherboard mounted vertically. The 486-DX2 ran a little warm, so it had a small fan on top of it. I seem to recall that the CPU would draw about 5-6 watts of power. The system ran for at least 3 years without a hiccup. It had already seen a previous life as a workstation, so the new life as a web server was gravy. The up-times were measured in months, as it was only rebooted after it was moved or upgraded. At one point, I decided to open it up to see if there was dust and dirt collecting on the inside. To my surprise, the small fan had fallen off the 486-DX2 and the system was running just fine. Based on the amount of dust on the fan, I figured this had happened some time before I opened the system. I cleaned the dust out of the system, hung the fan on the 486-DX2 (some of the plastic tabs that held the fan in place were broken) and put the machine back together. It ran fine for another year or so. Now that is what I call reliability.
Imagine such a scenario with present-day hardware. Without a cooler, the CPU would cook itself in no time. The lesson here is heat. Over the years CPUs have become not just hot, but really hot. I find it interesting that, in terms of heat density, we passed the “hotter than a hot plate” mark somewhere between the Pentium and Pentium Pro lines. Heat, put simply, is the enemy. It not only costs money to move heat, it also affects the reliability of electronic systems.
Recently I have been reading and studying the Green500 List. The Green500 is like the Top500 List, except that machines are ranked by performance per watt. The list was created by Dr. Wu-chun Feng of Virginia Tech. I’m going to borrow some of Dr. Feng’s arguments for the rest of the column because he makes a very good case for green HPC in terms of reliability.
If you paid attention in chemistry class you may recall a thing called the Arrhenius Equation. Just in case you were asleep that day, let me refresh your memory. The Arrhenius Equation relates the reaction rate to temperature and activation energy. Perhaps you recall this rule of thumb: near room temperature, the reaction rate roughly doubles for every ten-degree Celsius increase in temperature. In other words, it is why your mother told you to use hot water to wash your hands. In general, the hotter things are, the faster the reactions occur. In terms of electronic equipment, the hotter things get, the greater the failure rate. Indeed, one could make a similar argument that for every ten-degree increase in temperature, the failure rate doubles. If you recall, in my last column I talked about MTBF (Mean Time Between Failures) and large clusters. Heat is one of the main reasons things fail. As clusters grow, the statistics of failure rates become more and more important. And the gremlin pushing the failure rates is heat. I have my own simple rule that has no mathematical or scientific credibility whatsoever: the more hot things, the more heat, the more failure. It seems to work for me.
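For those who like to see the numbers, here is a minimal sketch of the ten-degree rule of thumb. The Arrhenius Equation itself is k = A * exp(-Ea / (R * T)); the code below does not evaluate it, it only applies the doubling approximation described above. The function names and the ten-degree default are my own illustrative assumptions, not anything taken from Dr. Feng or a vendor data sheet.

```python
# A rough sketch of the "failure rate doubles every ten degrees" rule of thumb.
# Assumption: the doubling step is exactly 10 degrees Celsius; real hardware
# will not follow the rule this neatly.

def failure_rate_multiplier(delta_t_celsius, doubling_step=10.0):
    """Relative failure rate after a temperature change of delta_t_celsius."""
    return 2.0 ** (delta_t_celsius / doubling_step)


def relative_mtbf(delta_t_celsius, doubling_step=10.0):
    """Relative MTBF, taken as the inverse of the failure rate multiplier."""
    return 1.0 / failure_rate_multiplier(delta_t_celsius, doubling_step)


# Print the relative failure rate and MTBF for a few temperature changes.
for delta in (-10, 0, 10, 20):
    rate = failure_rate_multiplier(delta)
    mtbf = relative_mtbf(delta)
    print(f"{delta:+d} C: failure rate x{rate:.2f}, MTBF x{mtbf:.2f}")
```

Under this approximation, running a part ten degrees cooler halves its failure rate, which is the whole argument for green HPC in one line.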
To reduce failure rates, or to increase reliability, one needs to reduce heat. Of course fans push cool air through clusters, but this “edge of the envelope” heat removal has a few disadvantages. First, moving heat costs money and uses additional power. Second, careful design is very important. Pockets of heat can build up even in the coldest machine rooms. In addition, fans and cooling systems can fail. Many administrators are surprised to learn that, should their air conditioning fail, the time to total meltdown with today’s clusters is usually measured in minutes, and could be shorter than the time it takes to shut down the clusters! And, finally, we all have a responsibility to the planet.
One way to reduce heat is to slow things down. Lower clock rates reduce heat and improve long-term reliability. I have talked about using slower processors before, and here is the bottom line. If we are going to be more green in terms of cluster performance, we need to develop and use good scalable algorithms. If we turn down the clock rates just a bit, then one way to get more performance is to add more cores/processors. The key to using more cores is scalability. So, this is our challenge: instead of using thirty-two 65 Watt high-frequency cores, could you run comparably on sixty-four 25 Watt lower-frequency cores? Running at a lower core frequency increases reliability and reduces cost. That is the goal of the Green500 list: to maximize computation rates while minimizing power usage.
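The arithmetic behind that challenge is easy to check. The sketch below is a back-of-the-envelope comparison only: it counts the per-core wattages quoted above and nothing else (no memory, interconnect, or cooling), and it assumes the application scales perfectly across cores, which is exactly the part that takes work. The function name is hypothetical.

```python
# Back-of-the-envelope power comparison for the two configurations above.
# Assumptions: only the quoted per-core wattage is counted, and the
# application scales perfectly across all cores.

def total_power_watts(cores, watts_per_core):
    """Total power draw of the cores alone."""
    return cores * watts_per_core


fast_power = total_power_watts(32, 65)   # thirty-two 65 Watt cores
slow_power = total_power_watts(64, 25)   # sixty-four 25 Watt cores

# Fraction of core power saved by the lower-frequency configuration.
savings_fraction = (fast_power - slow_power) / fast_power

# With perfect scaling, each slow core must deliver at least this fraction
# of a fast core's throughput for the two systems to break even.
required_relative_speed = 32 / 64

print(f"High-frequency cores: {fast_power} W")
print(f"Low-frequency cores:  {slow_power} W")
print(f"Power saved:          {savings_fraction:.1%}")
print(f"Break-even per-core speed: {required_relative_speed:.0%} of a fast core")
```

In other words, the sixty-four slower cores draw less total power, and they come out ahead on performance as long as each one runs at better than half the speed of a fast core and the code scales well enough to use them all.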
Someday I hope to be writing about opening up a functioning cluster node to find broken fans and a pile of dust. And, after I blow out some of the dust and fix a fan or two, I will just push it back into the rack so it can keep on crunching away at the latest global warming climate model. That is what I call global reliability.