Green HPC: The New Secret To Going Fast

HPC is just beginning to warm up to the idea of green computing. But can green give you a faster, better and cheaper high-performance cluster?

Green is this year’s buzzword. Print your brochures with green ink, add the word green to your product description, and presto, you “got your green on.” With HPC, however, it’s just not that easy. In a market where performance rules, everything else seems to take a backseat. Everything, that is, until the electric bill for the data center arrives, or until the data center can no longer get more space, power, or cooling.

Green HPC is really optimized HPC. Unless you have an unlimited budget, price-to-performance is the metric often used to evaluate various hardware solutions. While performance can be easily measured, the price component is often a bit more vague. It usually covers only the basic hardware procurement cost and ignores operational expenses entirely. In today’s hot and hungry server market, this analysis is shortsighted at best. An example can help illustrate the problem.

The average 1U dual-socket cluster node currently requires around 300 watts of power. Cooling and power delivery inefficiencies can double this node-power requirement to 600 watts. Therefore, on an annual basis, a single cluster node can require 5256 kilowatt hours (0.6 kilowatts for 8760 hours). At a nominal cost of $0.10 per kilowatt hour, the annual power and cooling cost for a single cluster node is approximately $526.

These numbers are more striking when the cost of the entire cluster is taken into account. Consider a typical cluster purchase in today’s market, where a node can cost $3500 (including racks, switches, etc.). Using standard quad-core technology, a node has two processors and eight total cores. Using our average node price, a typical 128-node cluster (with 256 processors and 1024 cores) costs about $448,000. Based on the above assumptions, the annual power and cooling budget is $67,300. Over a three-year period, this amounts to $201,900, or 45% of the system cost.

Although costs can vary due to market conditions and location, the above analysis illustrates that the three-year data center cost can easily reach 40-50% of the hardware purchase price for a typical commodity cluster.
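The arithmetic behind these figures is easy to rerun with your own numbers. The following is a minimal sketch in Python; the function names are illustrative, and the defaults are simply the assumptions stated above (300 watts per node, a 2x overhead for cooling and power delivery, $0.10 per kilowatt hour, $3500 per node).

```python
# Worked version of the cost estimate in the text. All defaults are the
# article's stated assumptions, not measurements.
HOURS_PER_YEAR = 24 * 365  # 8760


def annual_node_cost(node_watts=300, overhead_factor=2.0, dollars_per_kwh=0.10):
    """Return (kWh per year, dollars per year) for one node."""
    kwh = node_watts * overhead_factor * HOURS_PER_YEAR / 1000
    return kwh, kwh * dollars_per_kwh


def cluster_summary(nodes=128, node_price=3500, years=3):
    """Return hardware cost, multi-year power/cooling cost, and their ratio."""
    _, node_cost = annual_node_cost()
    hardware = nodes * node_price
    power = nodes * node_cost * years
    return hardware, power, power / hardware


if __name__ == "__main__":
    kwh, cost = annual_node_cost()
    hardware, power, ratio = cluster_summary()
    print(f"Per node: {kwh:.0f} kWh/year, ${cost:.0f}/year")
    print(f"Cluster: ${hardware:,} hardware, "
          f"${power:,.0f} power and cooling over 3 years ({ratio:.0%})")
```

Because the script does not round the annual figure before multiplying, its three-year total comes out slightly below the rounded number quoted in the text, but the ratio is the same 45%.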

A more correct measure of price-to-performance should therefore include infrastructure and operational overhead. Ignoring these costs gives an inflated and somewhat wishful price-to-performance metric. Operational costs are often reflected in a Total Cost of Ownership (TCO) metric, but in HPC we prefer a performance rating in our metric.

Green Is Performance

Based on the analysis above, anything that decreases the power and cooling cost will automatically decrease the price-to-performance ratio (lower is better). This conclusion is rather striking because green computing is often associated with low-power (i.e., low clock speed) systems. In HPC, green does not mean slow. It means efficient. For the remainder of this article we’ll take a look at various ways efficiency can be improved for HPC systems and at some new approaches to green HPC.

Greening The Server

Because many of today’s commodity rack-mount servers use similar processors, memory, and hard drives, we won’t devote too much discussion to these components. In other words, we use what the market offers, and the vendors are doing their part to make these components more efficient as well. There are some new power-saving features available with new processors, and clusters present a unique environment in which these features are best managed by the job scheduler (see below).

Power Supplies: If you are using standard rack-mount servers, an area you can control is the power supply. Make sure you use Power Factor Correction (PFC) power supplies. A power supply with a power factor over 0.8 is using power efficiently. Unfortunately, an uncorrected power supply may have a power factor of 0.70-0.75 and waste energy. A good PFC power supply will have a power factor of 0.95-0.99. In addition to PFC, the efficiency of the power supply should be noted. In the past, a typical power supply may have been 60-70% efficient, which means that 30-40% of the electricity is lost as heat. Better power supplies have an efficiency rating of over 80%.
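To see what these ratings mean in watts, here is a small Python sketch. It assumes the usual definitions: efficiency is DC power delivered divided by AC power drawn, and power factor is real power divided by apparent power (volt-amperes). The function names and the 210-watt load are made up for illustration.

```python
def wall_watts(load_watts, efficiency):
    """AC power drawn from the wall to deliver load_watts of DC power."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return load_watts / efficiency


def apparent_va(real_watts, power_factor):
    """Apparent power (volt-amperes) the wiring and UPS must carry."""
    if not 0 < power_factor <= 1:
        raise ValueError("power_factor must be in (0, 1]")
    return real_watts / power_factor


if __name__ == "__main__":
    load = 210  # hypothetical DC load of one node, in watts
    for eff in (0.70, 0.80, 0.90):
        drawn = wall_watts(load, eff)
        print(f"{eff:.0%} efficient: {drawn:.0f} W drawn, {drawn - load:.0f} W lost as heat")
    for pf in (0.75, 0.95):
        print(f"power factor {pf}: {apparent_va(300, pf):.0f} VA for 300 W of real power")
```

Note that a low power factor mainly raises the current the electrical infrastructure has to carry, while poor efficiency turns directly into heat inside the node that must then be removed.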

Leveraging the Scheduler: As mentioned, most vendors are taking power efficiency very seriously. For instance, both Intel and AMD provide dynamic frequency control on many of their processors using the Intel SpeedStep or AMD Cool’n’Quiet features. Using these features, it is possible to dynamically change the clock frequency of a processor by writing to the files under /sys/devices/system/cpu/cpu_/cpufreq/ on a given node. An idle processor could be throttled down when not in use (check out the command cpufreq-selector). In addition, if you use hard drives on your nodes, modern hard drives provide green modes. Using hdparm -S, it is possible to set an inactivity timer so that the hard drive will enter a power-down mode. You may need to adjust the interval at which the kernel writes the disk cache to the hard drive (set in /proc/sys/vm/dirty_writeback_centisecs). The syslog daemon may need to be reconfigured as well, because it syncs each message directly to disk. This can be changed in /etc/syslog.conf by prefixing the log file name in each entry with a “-” sign, which tells the daemon not to sync the file after every message.
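As a rough illustration of these two knobs, the Python sketch below writes a governor name to each CPU's scaling_governor file and builds an hdparm -S command. It assumes a Linux node that exposes per-CPU cpufreq directories named cpu0, cpu1, and so on, and that the script runs as root; the function names are hypothetical. It also handles only hdparm -S values 1-240, which hdparm interprets as multiples of 5 seconds.

```python
import glob
import os


def set_cpu_governor(governor, cpu_root="/sys/devices/system/cpu"):
    """Write `governor` to every cpuN/cpufreq/scaling_governor file found.

    Returns the list of files written. Requires root on a real node.
    """
    pattern = os.path.join(cpu_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    written = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write(governor)
        written.append(path)
    return written


def hdparm_standby_args(device, idle_seconds):
    """Build an `hdparm -S` command line for an inactivity timeout.

    hdparm encodes values 1-240 as multiples of 5 seconds (5 s to 20 min);
    this sketch supports only that range.
    """
    value = idle_seconds // 5
    if not 1 <= value <= 240:
        raise ValueError("idle_seconds must be between 5 and 1200")
    return ["hdparm", "-S", str(value), device]
```

For example, set_cpu_governor("powersave") would throttle every core the kernel lets you control, and hdparm_standby_args("/dev/sda", 300) returns the argument list for a five-minute spin-down timer, ready to hand to subprocess.run. Which governors are available depends on the kernel and driver, so check scaling_available_governors on your nodes first.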

The most efficient way to control the above power-saving features is through the scheduler. Basically, the job scheduler knows the state of each node. It would not be too difficult to write pre- and post-job scripts that place the CPU and hard drive in a low-power state when there is no job assigned to the node and in a “performance” state when a job is running. Changing these states incurs no real overhead, and it is a relatively trivial hack to the scheduler.
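What such pre- and post-job scripts might look like is sketched below in Python. This is only an outline under several assumptions: the scheduler can run a script as root before (prolog) and after (epilog) each job, the cpupower utility is installed on the node, and the disk is /dev/sda. The state table and function names are hypothetical.

```python
import subprocess

# Hypothetical mapping from job phase to node power settings.
# hdparm -S 0 disables the standby timer; 60 means 60 x 5 s = 5 minutes.
POWER_STATES = {
    "prolog": {"governor": "performance", "standby": 0},   # job is starting
    "epilog": {"governor": "powersave", "standby": 60},    # node is now idle
}


def commands_for(phase, device="/dev/sda"):
    """Return the commands that put the node into the state for `phase`."""
    state = POWER_STATES[phase]
    return [
        ["cpupower", "frequency-set", "-g", state["governor"]],
        ["hdparm", "-S", str(state["standby"]), device],
    ]


def apply_power_state(phase, runner=subprocess.run):
    """Run each command; `runner` is injectable so the logic can be tested."""
    for cmd in commands_for(phase):
        # check=False: a failed power tweak should not fail the user's job.
        runner(cmd, check=False)
```

A scheduler prolog would call apply_power_state("prolog") and the matching epilog would call apply_power_state("epilog"). How those hooks are registered differs between schedulers, so consult your scheduler's documentation for the details.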

A further reduction in power can be had by fully powering off the nodes that are not in use. Both Sun Grid Engine and SLURM have plans to provide this capability. Moab and AFS also have similar features. This capability could be particularly useful for diskless nodes, where rapid booting and node provisioning are done over the network. Typically, a node can be operational in less than 60 seconds after power is applied.

Another approach is to use the scheduler to control physical job placement. The idea is to place or move jobs to the cooler areas of the machine room. A paper written by HP describes some results of this method.
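The simplest form of this idea can be sketched in a few lines of Python. This is not the method from the HP paper, just an illustration: it assumes you already have an inlet temperature reading for each node (for example from rack sensors) and a list of idle nodes, and it picks the coolest ones. The function name and data layout are hypothetical.

```python
def pick_coolest_nodes(inlet_temps_f, idle_nodes, count):
    """Return `count` idle nodes with the lowest inlet temperature.

    inlet_temps_f maps node name -> inlet temperature in Fahrenheit.
    Idle nodes without a reading are skipped. Ties break by node name.
    """
    candidates = sorted(
        (node for node in idle_nodes if node in inlet_temps_f),
        key=lambda node: (inlet_temps_f[node], node),
    )
    if len(candidates) < count:
        raise ValueError("not enough idle nodes with temperature readings")
    return candidates[:count]


if __name__ == "__main__":
    temps = {"n01": 75.0, "n02": 68.0, "n03": 71.5, "n04": 80.2}
    print(pick_coolest_nodes(temps, ["n01", "n02", "n03", "n04"], 2))
```

A real scheduler plugin would also have to weigh network topology and job size against temperature, which this sketch ignores.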

Greening the Rack

Better rack-mounting is also an opportunity for greening up your cluster. The typical 1U server has at least 10-12 fans and a power supply. Blade-based systems can make some design changes and share fans and power supplies between nodes. If you think about it, a typical rack filled with 1U servers has forty-two 1.7x18x26 inch channels (wind tunnels), each of which must sustain adequate air flow to cool the server. By consolidating certain fans and power supplies, a large amount of power can be saved. A good example of a green rack is the new iDataPlex from IBM. In addition, there are smaller sub-rack systems (blade systems that fit into a standard rack-mount chassis) that provide a green advantage. Examples include the Supermicro Office Blade, the IBM BladeCenter S-Series, and the Dell M-series.

Another technique, pioneered by Rackable Systems, is the DC-based server. In this design, each server and storage system is equipped with a high-efficiency -48VDC power supply instead of a standard AC power supply. Each rack cabinet efficiently converts standard AC to DC for use over the entire rack chassis. The lack of AC-to-DC conversion within the node means high power efficiency (92%) and less heat to remove from the node.

The Intel Experiment

Every so often someone decides to test a long-standing assumption. Such assumptions were valid at some point in time, but changing economics or system designs sometimes suggest revisiting established ideas. Recently Intel decided to test one basic idea: that your data center needs chilled air to function. Of course, this seems like a valid assumption.

To test this idea, Intel investigated the use of an air economizer to optimize power and cooling in its data centers. Over a ten-month period, a test data center was cooled with 100% air exchange despite an extreme range in temperature, humidity, and air quality. Most importantly, Intel reported no huge increase in component failures.

A little detail will help. Two data centers were set up in a low-humidity environment (Arizona). Each had 448 production blade servers. The control data center used traditional chilled air (the air conditioning approach). The second, or test, data center used outside air. The operational range was set to use air from 65-90F. If the intake air temperature rose above 90F, chilled air was used to keep it at 90F. If the intake air was less than 65F, warm exhaust air was recycled to keep the temperature above 65F. Standard household filters were used, and no adjustment was made for humidity. The failure rate was 4.46% in the economizer data center and 2.45% in the control data center. It is important to note that the systems were kept under the recommended operating temperature of 98F, as this was not a test of server temperature limits. From the data, the average temperature of the servers on the test side varied between 70-80F, while the control side was set at 68F.
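The intake rule described above amounts to a simple three-way decision. The Python sketch below restates it using the 65F and 90F limits from the text; the function and mode names are invented for illustration and do not describe Intel's actual control system. The last lines put the two reported failure rates side by side.

```python
def intake_mode(outside_temp_f, low_f=65.0, high_f=90.0):
    """Decide how to condition intake air for a given outside temperature."""
    if outside_temp_f > high_f:
        return "chilled"      # too hot: chill intake air down to the upper limit
    if outside_temp_f < low_f:
        return "recirculate"  # too cold: mix in warm exhaust air
    return "outside"          # within range: use filtered outside air as is


# Failure rates reported in the text, in percent.
ECONOMIZER_FAILURE_PCT = 4.46
CONTROL_FAILURE_PCT = 2.45

if __name__ == "__main__":
    for temp in (50, 72, 95):
        print(f"{temp}F outside -> {intake_mode(temp)}")
    diff = ECONOMIZER_FAILURE_PCT - CONTROL_FAILURE_PCT
    print(f"Failure rate difference: {diff:.2f} percentage points")
```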

The result for this scenario was an estimated data center power savings of 67% and a potential savings of $2.87 million for a 10-MW data center. A