
HPC Turn-Offs: Power Control

Managing power can save money, especially when it comes out of your own wallet

Every time I walk down the rows of servers in a data center, I wonder what percentage of them are doing something and what percentage are idle. For efficiency's sake, turning off unused servers seems like a good idea, because even when doing nothing they use quite a bit of power (Google “leakage current” to understand why). It is also not too difficult to configure some schedulers to shut down nodes when not in use (and restart them when needed). If a cluster is only 75% utilized, then it should be easy to reduce power costs and save money. Of course, if your cluster is almost fully utilized all the time, then there is no money to be saved.

When you look at how some clusters are purchased, you can understand why more attention is not given to this area. In the academic world, clusters are usually purchased with grant money that includes a facilities overhead percentage for the parent institution. That is, the power budget and cluster budget are decoupled, and thus the cluster user/owner has no incentive to reduce power because they have already paid for the electricity and cooling. Indeed, some would argue that researchers who dislike the large overhead slices taken by their respective institutions may actually want to use as much power as they can.

In addition to the fiscal disincentive, there are also some hardware concerns. One issue is that constant power cycling introduces thermal cycling. This may be a problem for some types of hardware, but keeping something at a constant higher temperature also increases the failure rate (a common rule of thumb is that every ten-degree Celsius increase in temperature doubles the failure rate). Although thermal cycling is something to keep in mind, most products/components are designed to undergo average thermal cycling without any problem; i.e., you will not ruin your server by turning it off every night. And if it does ruin your server, then chances are the server was poorly designed.

Perhaps the bigger concern for many system administrators comes from experience. It is well known that turning a bunch of things off and then back on often results in a few systems not responding. The two most common failures are power supplies and hard drives. Speaking from personal experience, I have had more than one power supply stop working just because it was unplugged and plugged back in. This experience is probably why most system administrators don’t like dynamic power strategies for their data center.

The cost of power and the power required for nodes is eventually going to force some changes. I have heard of surcharges for research clusters in data centers to cover the extra power required. There is also an opportunity for better power and thermal management through intelligent job placement (PDF) in the data center. In addition, the use of dynamic provisioning (booting nodes into the exact OS environment you need at run time) almost implies that unprovisioned nodes be kept off when not in use (and then provisioned when booted). All the major schedulers, e.g., SGE, Moab, LSF, and SLURM, have some capability to control the power to nodes in a cluster. The writing, as they say, is on the wall.

There is one other truth about the power issue: if you are paying for the power, you care how it is managed. As stated, many educational institutions decouple the power budgets from the users, but what about other situations? I have had an ongoing interest in building personal clusters (i.e., a cluster where you own the reset switch). For my needs, a personal workstation/cluster must be quiet and must not suck up power when I'm not doing anything. Like any good cluster geek, I eschew pressing power buttons and would much rather have things work automatically. Just recently I managed to achieve this goal with SLURM and some USB-controlled relays. The current version of SLURM has a power saving mode that turns nodes on and off as they are needed (in my case, SLURM turns off individual motherboards).
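For those who want to try this, the power saving mode is driven by a few entries in slurm.conf that define an idle timeout and the programs SLURM should call to suspend and resume nodes. The paths and values below are only an illustrative sketch (the exact parameter set varies a bit between SLURM versions), not a copy of my configuration:

SuspendTime=300                            # seconds a node must sit idle before it is powered down
SuspendProgram=/usr/local/sbin/node_off.sh # script SLURM calls to power nodes down
ResumeProgram=/usr/local/sbin/node_on.sh   # script SLURM calls to power nodes back up
SuspendExcNodes=head                       # keep the main node out of power saving (node name assumed)

SLURM hands the suspend and resume programs the list of affected nodes, so the scripts can do anything from issuing IPMI commands to, as in my case, flipping a USB relay.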

My personal cluster has one main node and three worker nodes, each with a single processor socket. Using the power saving mode in SLURM, the three worker motherboards are powered down when not in use. To make things easy, I am doing diskless provisioning using Perceus with Caos Linux. If the nodes are powered on when a job is submitted to SLURM, the job runs normally. If the nodes are powered off, then SLURM will start them and run the job. When the jobs complete, and after a user-defined delay, SLURM will turn the nodes off.
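The resume and suspend programs themselves can be quite small. The sketch below only shows the general shape such a script might take; relay_ctl is a stand-in for whatever command actually switches a relay channel, and the assumption that node names map directly to relay channels (n0 to channel 0, and so on) is just for illustration:

#!/bin/bash
# node_on.sh -- SLURM calls this with a node list such as "n[0-2]"
# Expand the compressed host list into individual node names
for node in $(scontrol show hostnames "$1"); do
    channel=${node#n}                          # assume n0 maps to relay channel 0, and so on
    relay_ctl --channel "$channel" --state on  # hypothetical relay-switching command
done

The matching node_off.sh would be identical except that it switches the relay off.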

Here is an example. I have a small script that prints the message “hello from node X,” where X is the node name. If I submit the script to three nodes using the SLURM srun command while the nodes are powered on, I get the following result almost instantly:

$time srun -N3 test.sh
hello from node n2
hello from node n0
hello from node n1

real	0m0.070s
user	0m0.004s
sys	0m0.008s
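The script itself is trivial; something along the lines of the following sketch is all that is needed:

#!/bin/bash
# test.sh -- report which node the task landed on
echo "hello from node $(hostname)"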

After a time, SLURM will turn off the nodes. At this point, if I run the same command, I get the following:

$time srun -N3 test.sh
srun: Job step creation temporarily disabled, retrying
srun: Job step created
hello from node n0
hello from node n2
hello from node n1

real	1m44.770s
user	0m0.004s
sys	0m0.012s

The important number is the real, or wall clock, time. The powered-on version takes much less than a second, while the powered-down version takes about 105 seconds. During that time the nodes must boot, then download and start the OS. The SLURM daemons on the nodes must also synchronize with the main node. Once all this happens, the programs are run.

Clearly, for this scheme to make sense, I need to have long-running jobs and/or a queue of work. I can foresee several ways this could be useful for a personal HPC user like myself. First, if there are a large number of jobs and I want to use all the nodes (including the head node), I could queue them up after I finish for the day. The jobs would complete and the extra hardware would be turned off when finished. Another scenario is if I had to run a large simulation but still wanted to use the head node of my personal workstation/cluster to do other work. I could easily submit a job and go about my business, letting the system manage its own resources. Again, I could leave for the day and not have to think about when my job finished or about my electric bill.
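As a concrete sketch of that first scenario, queuing up the day's remaining work is just a loop of batch submissions before walking out the door (sim.sh and the job count here are placeholders):

# Submit a pile of independent jobs; SLURM wakes the worker nodes,
# works through the queue, and powers them back down when done.
for i in $(seq 1 20); do
    sbatch -N1 sim.sh $i
done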

Of course, running a single short job makes no sense. Fortunately, in HPC there are no short single jobs. Come to think of it, 105 seconds to start my cluster jobs from a dead stop is just enough time to grab a cup of coffee and read a few emails. Nothing like seamless multi-tasking.
