HPC is just beginning to warm up to the idea of green computing. But can green give you a faster, better and cheaper high-performance cluster?
Green is this year’s buzzword. Print your brochures with green ink, add the word green to your product description, and presto, you “got your green on.” However, with HPC, it’s just not that easy. Indeed, in a market where performance rules, everything else seems to take a backseat. Everything, that is, until the electric bill for the data center arrives, or until the data center can no longer get more space, power, or cooling.
Green HPC is really optimized HPC. Unless you have an unlimited budget, price-to-performance is the metric often used to evaluate various hardware solutions. While performance can be easily measured, the price component is often a bit more vague. Often, it includes only the basic hardware procurement cost and totally ignores the operational expenses. In today’s hot and hungry server market, this analysis is shortsighted at best. An example can help illustrate the problem.
The average 1U dual-socket cluster node currently requires around 300 watts of power. Cooling and power delivery inefficiencies can double this node-power requirement to 600 watts. Therefore, on an annual basis, a single cluster node can require 5256 kilowatt hours. At a nominal cost of $0.10 per kilowatt hour, the annual power and cooling cost for a single cluster node is approximately $526.
These numbers are more striking when the cost of the entire cluster is taken into account. Consider a typical cluster purchase in today’s market where the typical node can cost $3500 (including racks, switches, etc). Using standard quad-core technology, a node has two processors and eight total cores. Using our average node price, a typical 128-node cluster (with 256 processors and 1024 cores) costs about $448,000. Based on the above assumptions, the annual power and cooling budget is $67,300. Over a three-year period, this amounts to $201,900, or 45% of the system cost.
Although costs can vary due to market conditions and location, the above analysis illustrates that the three-year data center cost can easily reach 40-50% of the hardware purchase price for a typical commodity cluster.
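The arithmetic above is easy to rerun with your own numbers. The following Python sketch simply restates the assumptions from the example (300 watts per node, a doubling for cooling and power delivery, $0.10 per kilowatt hour, 128 nodes at $3500 each); the function and parameter names are ours, chosen for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours


def annual_node_energy_cost(node_watts=300.0, overhead_factor=2.0,
                            dollars_per_kwh=0.10):
    """Annual power-and-cooling cost for one node, in dollars."""
    kwh_per_year = node_watts * overhead_factor * HOURS_PER_YEAR / 1000.0
    return kwh_per_year * dollars_per_kwh


def cluster_energy_fraction(nodes=128, node_price=3500.0, years=3):
    """Return (hardware cost, energy cost over `years`, energy/hardware ratio)."""
    hardware = nodes * node_price
    energy = nodes * annual_node_energy_cost() * years
    return hardware, energy, energy / hardware


if __name__ == "__main__":
    hardware, energy, ratio = cluster_energy_fraction()
    print(f"hardware ${hardware:,.0f}, "
          f"3-year energy ${energy:,.0f}, ratio {ratio:.0%}")
```

Without intermediate rounding, the three-year figure comes out near $201,800; the $201,900 quoted above results from rounding the annual budget to $67,300 first. Either way the ratio is about 45%.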
A more correct measure of price-to-performance should therefore include infrastructure/operational overhead. Ignoring these costs gives an inflated and somewhat wishful price-to-performance metric. Operational costs are often reflected in a Total Cost of Ownership (TCO) metric, but in HPC we prefer a performance rating in our metric.
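One way to fold operating costs into the metric is sketched below in Python. This is only an illustration of the idea, not a standard formula: the performance figure (EXAMPLE_GFLOPS) is hypothetical and is not taken from any measurement, while the dollar amounts are the ones from the 128-node example.

```python
def price_to_performance(hardware_cost, annual_operating_cost, years, gflops):
    """Dollars per GFLOPS, counting operating costs over the service life."""
    if gflops <= 0:
        raise ValueError("gflops must be positive")
    return (hardware_cost + annual_operating_cost * years) / gflops


# Hypothetical sustained performance, chosen only for illustration.
EXAMPLE_GFLOPS = 5000.0

# Hardware purchase price only (the "wishful" number).
hardware_only = price_to_performance(448_000, 0, 3, EXAMPLE_GFLOPS)

# Same cluster with three years of power and cooling included.
with_operations = price_to_performance(448_000, 67_300, 3, EXAMPLE_GFLOPS)
```

Whatever the real performance number is, the second ratio is always the larger of the two, and anything that trims the annual operating cost pulls it back down.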
Green Is Performance
Based on our above analysis, anything that decreases the power and cooling cost will automatically decrease the price-to-performance ratios (lower is good). This conclusion is rather striking because green computing is often associated with low power (i.e. low clock speed) systems. In HPC, green does not mean slow. It means efficient. For the remainder of this article we’ll take a look at various ways efficiencies can be improved for HPC systems and some new approaches to green HPC.
Greening The Server
Because many of today’s commodity rack-mount servers use similar processors, memory, and hard drives, we won’t devote too much discussion to these components. In other words, we use what the market offers and the vendors are doing their part to make these components more efficient as well. There are some new power-saving features available with new processors, and clusters present a unique environment in which these features are best managed by the job scheduler (see below).
Power Supplies
If you are using standard rack-mount servers, an area you can control is the power supply. Make sure you use Power Factor Correction (PFC) power supplies. A power supply with a power factor over 0.8 is using power efficiently. Unfortunately, an uncorrected power supply may have a power factor of 0.70-0.75 and waste energy. A good PFC power supply will have a power factor of 0.95-0.99. In addition to PFC, the efficiency of power supplies should be noted. In the past, a typical power supply may have been 60-70% efficient. That means that 30-40% of the electricity is lost as heat. Better power supplies have efficiency ratings of over 80%.
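To get a feel for what these percentages mean in watts, here is a small Python sketch. It assumes the components inside the node need a fixed amount of power and asks how much must be drawn from the wall to deliver it; the function names are ours. Apparent power (real power divided by power factor) is included because it determines the current the circuit has to carry.

```python
def wall_power_watts(load_watts, efficiency):
    """Power drawn from the wall to deliver `load_watts` to the components."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return load_watts / efficiency


def heat_loss_watts(load_watts, efficiency):
    """Power lost as heat inside the power supply."""
    return wall_power_watts(load_watts, efficiency) - load_watts


def apparent_power_va(real_watts, power_factor):
    """Apparent power (volt-amperes) for a given real power and power factor."""
    if not 0 < power_factor <= 1:
        raise ValueError("power_factor must be in (0, 1]")
    return real_watts / power_factor
```

Under these assumptions, a 300-watt load behind a 65% efficient supply pulls roughly 460 watts from the wall, while an 85% efficient supply pulls roughly 350 watts. The difference is heat that the cooling system then has to remove.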
Leveraging the Scheduler
As mentioned, most vendors are taking power efficiency very seriously. For instance, both Intel and AMD provide dynamic frequency control on many of their processors using the Intel SpeedStep or AMD Cool’n’Quiet features. Using these features, it is possible to dynamically change the clock frequency of a processor by writing to the files under /sys/devices/system/cpu/cpu_/cpufreq/ on a given node. An idle processor could be throttled down when not in use (check out the command cpufreq-selector). In addition, if you use hard drives on your nodes, modern hard drives provide green modes. Using hdparm -S, it is possible to set an inactivity timer so that the hard drive will enter a power-down mode. You may need to adjust the interval at which the kernel writes the disk cache to the hard drive (set in /proc/sys/vm/dirty_writeback_centisecs). The syslog daemon may need to be reconfigured as well, because it syncs each message directly to disk. It can be told to use the cache by prefixing the log file name in each /etc/syslog.conf entry with a “-” sign.
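As a rough illustration of the frequency control described above, the Python sketch below writes a governor name into the scaling_governor file that the Linux cpufreq subsystem exposes for each core. It assumes a kernel with cpufreq support and root privileges; the function name and the sysfs_root parameter (handy for trying it against a scratch directory) are ours.

```python
import glob
import os

CPU_SYSFS_ROOT = "/sys/devices/system/cpu"


def set_cpu_governor(governor, sysfs_root=CPU_SYSFS_ROOT):
    """Write `governor` to every cpuN/cpufreq/scaling_governor file found.

    Returns the list of files written. On a real node this needs root.
    """
    # cpu[0-9]* matches cpu0, cpu1, ... but not sibling dirs such as cpufreq.
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq",
                           "scaling_governor")
    written = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as handle:
            handle.write(governor)
        written.append(path)
    return written
```

Calling set_cpu_governor("powersave") on an idle node and set_cpu_governor("performance") before a job starts would be the typical use, provided both governors are available in the running kernel.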
The most efficient way to control the above power-saving features is through the scheduler. Basically, the job scheduler knows the state of each node. It would not be too difficult to write pre- and post-job scripts that place the CPU and hard drive in a low-power state when there is no job assigned to the node and in a “performance” state when a job is running. Changing these states incurs no real overhead, and it is a relatively trivial hack to the scheduler.
A further reduction in power can be had by fully powering off the nodes that are not in use. Both Sun Grid Engine and SLURM have some plans to provide this capability. Moab and AFS also have similar features. This capability could be particularly useful with diskless nodes, where rapid booting and node provisioning are done over the network. Typically, a node can be operational in less than 60 seconds after power is applied.
Another approach is to use the scheduler to control physical job placement. The idea is to place or move jobs to the cooler areas of the machine room. A paper written by HP describes some results of this method.
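What such pre- and post-job scripts would decide can be sketched in a few lines of Python. This is a hypothetical helper, not part of any scheduler: it only returns the settings a script might apply, and the device name /dev/sda and the 120 timeout value are assumptions. With hdparm -S, a value of 0 disables the standby timer and values from 1 to 240 are multiples of five seconds, so 120 means ten minutes.

```python
def node_power_settings(job_running, disk="/dev/sda", idle_spindown=120):
    """Return the power settings a pre/post-job script might apply.

    job_running=True  -> called from the pre-job script
    job_running=False -> called from the post-job script
    """
    governor = "performance" if job_running else "powersave"
    # hdparm -S 0 turns the standby timer off while a job is running.
    spindown = 0 if job_running else idle_spindown
    return {
        "governor": governor,
        "hdparm_args": ["hdparm", "-S", str(spindown), disk],
    }
```

A pre-job script would call node_power_settings(True), the post-job script node_power_settings(False), and each would then apply the governor and run the hdparm command on the node.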
Greening the Rack
Better rack-mounting is also an opportunity for greening up your cluster. The typical 1U server has at least 10-12 fans and a power supply. Blade-based systems can make some design changes and share fans and power supplies between nodes. If you think about it, a typical rack filled with 1U servers has forty-two 1.7x18x26 inch channels (wind tunnels), each of which must sustain adequate air flow to cool the server. By consolidating certain fans and power supplies, a large amount of power can be saved. A good example of a green rack is the new iDataPlex from IBM. In addition, there are smaller sub-rack systems (blade systems that fit into a standard rack-mount chassis) that provide a green advantage. Examples include the Supermicro Office Blade, the IBM BladeCenter S-Series, and the Dell M-series.
Another technique pioneered by Rackable Systems is the DC based server. In this design each server and storage system is equipped with a high efficiency -48VDC power supply instead of a standard AC power supply. Each rack cabinet efficiently converts standard AC to DC for use over the entire rack chassis. The lack of AC to DC conversion within the node means a high power efficiency (92%) and less heat to remove from the node.
The Intel Experiment
Every so often someone decides to test a long standing assumption. These assumptions were valid at some point in time, but changing economics or system designs sometimes suggest revisiting established ideas. Recently Intel decided to test a basic idea. That is, your data center needs chilled air to function. Of course, this seems like a valid assumption.
To test this idea, Intel investigated the use of an air economizer to optimize power and cooling in their data centers. Over a ten-month period, a test data center was cooled with a 100% air exchange despite an extreme range in temperature, humidity, and air quality. And, most importantly, they reported no huge increase in component failures.
A little detail will help. Two data centers were set up in a low-humidity environment (Arizona). Each had 448 production blade servers. The control data center used traditional chilled air (the air conditioning approach). The second, or test, data center used outside air. The operational range was set to use air from 65-90F. If the intake air temperature rose above 90F, chilled air was used to keep it at 90F. If the intake air was below 65F, warm exhaust air was recycled to keep the temperature above 65F. Standard household filters were used, and no adjustment was made for humidity. The failure rate was 4.46% in the economizer data center and 2.45% on the control side. It is important to note that the systems were kept under the recommended operating temperature of 98F, as this was not a test of server temperature limits. From the data, the average temperature of the servers on the test side varied between 70-80F, while the control side was set at 68F.
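The intake rule used on the test side boils down to a three-way decision, sketched here in Python. The 65F and 90F thresholds come from the description above; the function name and the mode labels are ours, and this is in no way Intel's actual control logic.

```python
LOW_F = 65.0   # below this, warm exhaust air is recycled
HIGH_F = 90.0  # above this, chilled air is used


def intake_mode(outside_temp_f):
    """Pick the air source for a given outside air temperature (Fahrenheit)."""
    if outside_temp_f > HIGH_F:
        return "chilled"
    if outside_temp_f < LOW_F:
        return "recirculate"
    return "outside"
```

For example, intake_mode(75) selects plain outside air, while a 100F afternoon or a 50F night falls back to chilled or recycled air, respectively.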
The result for this scenario was an estimated data center power savings of 67% and a potential savings of $2.87 million for a 10-MW data center. A Research Brief contains more details. And, it seems this assumption was worth testing. You may hear more about this idea in the future.
The Top500 list is where the world’s fastest computers are ranked (that is, fastest at running one particular benchmark). Long-time followers of the list often inquire about the power requirements for these systems. Indeed, many would like to see a performance-per-watt kind of metric so the infrastructure cost can be factored into the price-to-performance of the system.
Fortunately for these users, there is now the Green500 list that ranks computers in terms of MFLOPS/watt. Recently the third edition of the list was posted. Interestingly, the world’s fastest Top500 computer was the IBM Cell-based Roadrunner from Los Alamos National Laboratory. Roadrunner ranked #3 on the Green500 list, confirming that green and fast are not mutually exclusive. In addition, the top three supercomputers surpassed the 400 MFLOPS/watt milestone for the first time. There was also good news on the commodity front, as the energy efficiency of a commodity system based on Intel’s 45-nm low-power quad-core Xeon is now on par with IBM BlueGene/L machines.
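The metric itself is simple to compute. The Python sketch below shows the unit conversion and a ranking by efficiency; the function names are ours, and any system names and numbers you feed it are up to you (the ones in the usage note are made up).

```python
def mflops_per_watt(sustained_gflops, power_kw):
    """Energy efficiency in MFLOPS/watt from GFLOPS and kilowatts."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    mflops = sustained_gflops * 1000.0
    watts = power_kw * 1000.0
    return mflops / watts


def rank_by_efficiency(systems):
    """Sort (name, gflops, kw) tuples from most to least efficient."""
    return sorted(systems,
                  key=lambda s: mflops_per_watt(s[1], s[2]),
                  reverse=True)
```

A hypothetical system sustaining 1000 GFLOPS while drawing 2.5 kW would score 400 MFLOPS/watt, right at the milestone mentioned above.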
Less Heat, Less Failures
The Intel experiment notwithstanding, the hotter things are, the greater the failure rate. Note that in the Intel experiment, the average temperature was not that much higher than on the control side. A good rule of thumb is that for every ten degree (Celsius) increase in temperature, the failure rate will double. (The Intel experiment seems to support this rule.) As clusters grow, the statistics of failure become more and more important. Clearly, failure rates under 5% may be acceptable for small clusters, but with large clusters the same rate can result in a large number of servers needing replacement. The rule is keep it cool, and it lives longer. Therefore, in addition to saving money and decreasing your price-to-performance ratio, your systems may even last a little longer.
Over the last several years, HPC has become much more efficient (sorry, green). There should be more progress in the coming years as older systems are retired and newer, greener hardware and software are put into place. And, when you do the price-to-performance math, greener means what we all wanted in the first place: faster, better, cheaper.
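The doubling rule of thumb and the effect of cluster size can be put into a few lines of Python. This sketch takes the rule at face value (a factor of two per ten degrees Celsius) and is not a reliability model; the function names are ours.

```python
def relative_failure_rate(delta_celsius, doubling_interval=10.0):
    """Failure-rate multiplier for a temperature rise of `delta_celsius`."""
    return 2.0 ** (delta_celsius / doubling_interval)


def expected_failures(nodes, base_failure_rate, delta_celsius=0.0):
    """Expected number of failed nodes at the given base rate and temperature rise."""
    return nodes * base_failure_rate * relative_failure_rate(delta_celsius)
```

At a 5% failure rate, a 128-node cluster loses about six nodes, which is manageable. The same rate on 1000 nodes means 50 replacements, and running ten degrees Celsius warmer would, by this rule, turn that into 100.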
Douglas Eadline is the Senior HPC Editor for Linux Magazine.