The Quad-Cores Have Landed: Putting Them to Work in HPC

As the new Intel Xeons hit the street, our resident HPC expert spends two weeks with sixteen cores.

In my September “rant,” I presented a script for testing dual-core processors. At the end of the article I quipped, “Of course by the time I get this coded, quad-core processors will be ready for testing.” In truth, I knew that the dual-cores were just hitting the market and quad-cores were due some time in 2007.

I am told the universe works in mysterious ways, and by some strange fortune, two dual-socket, quad-core Xeon nodes and some InfiniBand HCAs showed up at my doorstep at the beginning of September. (Keep in mind, I write my columns about two months before the publication date of each issue.) I would have two weeks to test and experiment with sixteen cores!

Not that I’m the bragging type, but I told the suitably impressed delivery person, “You know I’m the first person in the neighborhood to get my hands on these new Intel quad-core Xeons.” Later in the day, my wife was equally impressed as she asked, “What is that noise in the basement, sounds like lots of fans or something.” In my best proud-father voice I replied, “What you hear is two 1U cluster nodes, each with two quad-core Xeons, and 8 GB of memory grinding away on some benchmarks. These babies are brand new, I’m one of the select few that has early access to this hardware. I can’t talk about the performance until after November. Hush-hush you know.” “Just close the basement door when you’re running those things, Mr. Hush-hush,” was all she said.

Now, Where Did I Put That Script?

If you recall, back in September I provided a simple script to test how well a single dual-core processor can run two copies of the same program. The script produced a number between one and two that indicated the amount of speed-up for a given program. If running two copies of the program takes twice as long as a single copy, then the speed-up is one. If two copies run in the same amount of time as a single copy, then the speed-up is two. One way to look at the result is that it tells you how many effective processors you have.

Looking at my script, I realized it needed some improvement, because instead of two cores per node, I now had eight. I actually created three scripts: one each for two, four, and eight cores. I suppose if I were a bit more clever, I could have melded these into one script (a sketch of a merged version appears after the discussion of Listing One). However, I thought my time was better spent exercising the cores than polishing my scripts. (You can download the scripts here.) For the sake of discussion, the eight-core version (smp-mem-test8.sh) is shown in Listing One.

Listing One: A test script to benchmark the quad-core Clovertown

#!/bin/bash
# You can easily edit this for 2 or 4 cores
# Add your own list of programs here; we assume they live in bin/
PROGS="cg.A.1 bt.A.1 ep.A.1 ft.A.1 lu.A.1 is.A.1 sp.A.1 mg.A.1"
echo "8 Way SMP Memory Test" | tee "smp-mem-test-8.out"
echo "`date`" | tee -a "smp-mem-test-8.out"

for TEST in $PROGS
do
  # single sequential reference run
  bin/$TEST >& temp.mem0
  # eight simultaneous copies
  bin/$TEST >& temp.mem1 &
  bin/$TEST >& temp.mem2 &
  bin/$TEST >& temp.mem3 &
  bin/$TEST >& temp.mem4 &
  bin/$TEST >& temp.mem5 &
  bin/$TEST >& temp.mem6 &
  bin/$TEST >& temp.mem7 &
  bin/$TEST >& temp.mem8
  wait

  # pull the reported wall-clock times out of the program output
  S=`grep Time temp.mem0 | gawk '{print $5}'`
  C1=`grep Time temp.mem1 | gawk '{print $5}'`
  C2=`grep Time temp.mem2 | gawk '{print $5}'`
  C3=`grep Time temp.mem3 | gawk '{print $5}'`
  C4=`grep Time temp.mem4 | gawk '{print $5}'`
  C5=`grep Time temp.mem5 | gawk '{print $5}'`
  C6=`grep Time temp.mem6 | gawk '{print $5}'`
  C7=`grep Time temp.mem7 | gawk '{print $5}'`
  C8=`grep Time temp.mem8 | gawk '{print $5}'`
  # sum the eight S/C ratios with dc to get the "effective cores"
  SPEEDUP=`echo "3 k $S $C1 / $S $C2 / $S $C3 / \
  $S $C4 / $S $C5 / $S $C6 / $S $C7 / $S $C8 / \
  + + + + + + + p" | dc`
  echo "8 Way SMP Program Speed-up for $TEST is $SPEEDUP" | \
    tee -a "smp-mem-test-8.out"
done

/bin/rm temp.mem*
echo "`date`" | tee -a "smp-mem-test-8.out"

The script is quite simple. It runs one copy of the program and records the sequential time (S in the script). Next, it runs eight copies of the program at the same time. It then greps the wall time for all the runs (the programs report their own run times).

For each program, I define a “speed-down” as the sequential time divided by the time for one of the simultaneous copies (C1-C8 in the script). The individual “speed-downs” are then added together. A quick analysis shows that if all eight copies run in the same amount of time as the sequential program, each ratio is one and you have a perfect speed-up of eight. If, on the other hand, the eight programs take eight times as long (as they would on a single core), then each “speed-down” is about one eighth, and adding them together gives a number very close to one. Therefore, in the eight-core case, the result falls between one and eight, which indicates the number of “effective cores.” If the script is run on a single-core processor, the result should always be one.
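To make the arithmetic concrete, here is the same dc calculation from the script with made-up numbers: a sequential time of 100 seconds and eight simultaneous copies that each take 200 seconds. Each ratio is 0.5, and the sum is four effective cores.

# Illustrative only -- these times are made up, not measured.
S=100
C=200
echo "3 k $S $C / $S $C / $S $C / $S $C / \
$S $C / $S $C / $S $C / $S $C / \
+ + + + + + + p" | dc
# prints 4.000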

The script is much more fun when you use different types of programs. As I like to beat on systems using the NAS Parallel Benchmark (NPB) suite (http://www.nas.nasa.gov/Resources/Software/npb.html), I chose to compile the eight test programs as single-process versions (no MPI).
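For reference, one way to produce binaries with the names used in the script is to build the NPB 2.4 suite for a single process per benchmark. This is only a sketch under the assumption of an NPB 2.4 source tree with a working config/make.def; it is not necessarily how I built my copies, and a strictly MPI-free build would start from the NPB serial source instead.

# Sketch: build single-process Class A versions of the eight benchmarks
# (assumes an NPB 2.4 source tree with a working config/make.def).
cd NPB2.4
for b in cg bt ep ft lu is sp mg
do
  make $b CLASS=A NPROCS=1
done
# The binaries land in bin/ with names like cg.A.1, bt.A.1, ...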

And as mentioned, the scripts record the total execution time as well (the date stamps at the beginning and end of the output file).
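If you would rather not maintain three nearly identical scripts, here is a minimal sketch of a merged version that takes the core count as an argument. It is an untested illustration, not the script used for the results below, and its output file name differs slightly.

#!/bin/bash
# Sketch: run N simultaneous copies of each program, N given on the
# command line (defaults to 8).
N=${1:-8}
PROGS="cg.A.1 bt.A.1 ep.A.1 ft.A.1 lu.A.1 is.A.1 sp.A.1 mg.A.1"
OUT="smp-mem-test-$N.out"
echo "$N Way SMP Memory Test" | tee "$OUT"
echo "`date`" | tee -a "$OUT"

for TEST in $PROGS
do
  # sequential reference run
  bin/$TEST >& temp.mem0
  # N simultaneous copies
  for i in `seq 1 $N`
  do
    bin/$TEST >& temp.mem$i &
  done
  wait

  S=`grep Time temp.mem0 | gawk '{print $5}'`
  # accumulate the S/C ratios one at a time with dc
  SPEEDUP=0
  for i in `seq 1 $N`
  do
    C=`grep Time temp.mem$i | gawk '{print $5}'`
    SPEEDUP=`echo "3 k $SPEEDUP $S $C / + p" | dc`
  done
  echo "$N Way SMP Program Speed-up for $TEST is $SPEEDUP" | \
    tee -a "$OUT"
done

/bin/rm temp.mem*
echo "`date`" | tee -a "$OUT"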

About The Hardware

Before looking at the results, a word about the hardware is in order. First, since this is not “officially released hardware” at the time of this writing, I want to go on record as saying these results are preliminary and should be considered solely as guidelines. (The standard cluster proviso applies: Your mileage may vary, test your codes, etc.)

The hardware was provided by Appro (http://www.appro.com). Each node used an Intel S5000PAL motherboard with dual CPU sockets. The motherboard features the Intel 5000P chipset, with dual, independent system bus connections running at speeds up to 1333 MHz. Each motherboard can support two Clovertown processors, thus providing eight cores in a single 1U enclosure. Other features include support for 32 GB ECC Fully Buffered DDR2 (8 DIMMs) at 533 or 667 MHz, dual integrated Gigabit LAN controllers, ATI 16 MB video, four PCIe slots, and six SATA drives with SAS and RAID options. Two 10 Gbps dual-port Mellanox InfiniHost III Ex MHEA28-XTC InfiniBand Host Channel Adapter (HCA) cards were also included to connect the nodes through a PCIe interface.

The very first thing I did was plug one node into a Kill-A-Watt (http://www.p3international.com) meter and measure the amount of electricity used by a single node. As I installed and ran software, the node never went much over 400 watts, and it seemed to settle around 375 watts when idle. So far, so good: eight cores, and the rest of the house was still powered up.

One issue I’ve read about is the design of the Clovertown. It is often stated that the Clovertown is not a “true” quad-core, as it is really two Woodcrests “glued” together in a single package. Each two-core side shares a cache, but communication between the two sides occurs over the Front Side Bus (FSB). While those who split hardware hairs may take issue with the design, I find it a rather interesting first step for quad-core HPC systems.
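If you want to see how the cores and caches are arranged on a node like this, the kernel will tell you. A quick check (the exact fields reported vary a bit with kernel version):

# Show how logical processors map to physical packages and cores,
# plus the reported cache size, for all eight cores.
grep -E "processor|physical id|core id|cache size" /proc/cpuinfo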

Until recently, the standard cluster node has been a dual-processor node, each processor of the single-core variety. Communication between processors was over the FSB (or, in the case of AMD, over HyperTransport). This arrangement has worked well for HPC, providing a good trade-off between cost and performance. The new dual-cores from AMD and Intel have changed this so that a “typical” cluster node now has four cores (two processors). So, you can think of a single Clovertown processor as a “shrink” of a “typical” dual-socket Woodcrest cluster node.

In a way, the hardware manufacturers keep folding systems into themselves. Cluster jocks usually have two questions when this happens. First, what effect does this have on memory contention? That is, with the Clovertown, we now have eight cores feeding from the same memory trough and there may be some issues. Second, we also have eight cores running eight MPI processes, all using the same interconnect. Again, contention may be an issue.

The script in Listing One is an attempt to get a feel for the memory contention issue. Of course, real performance depends on what you want to do with your hardware, but some quick tests might help provide some guidance.

The interconnect issue will have to wait a bit. It is harder to test, and it would be better to have more nodes.

Here is the Exclusive!

As I mentioned in my previous column, my memory contention script helps answer an obvious question: How well does a multi-core processor run multiple programs? In HPC, the question is a little more specific, because a typical MPI program runs multiple copies of the same program.

The results of my test are presented in Table One. Eight programs were used. The table lists the speed-up data for two, four, and eight simultaneous copies of each program. Again, one way to look at the speed-up number is as the number of effective cores the programs “see.”

TABLE ONE: Average speed-up data for two, four, and eight copies of the same program. Each test was run five times.
Test   2 Copies   Std Dev   4 Copies   Std Dev   8 Copies   Std Dev
cg        1.7       0.1        2.3       0.1        2.3       0.0
bt        1.5       0.2        2.4       0.0        3.5       0.0
ep        2.0       0.7        3.3       0.2        8.0       0.0
ft        1.7       0.2        3.1       0.1        7.1       0.8
lu        1.7       0.2        3.5       0.8        4.4       0.0
is        1.7       0.2        3.2       0.1        4.6       0.5
sp        1.5       0.2        2.3       0.3        2.8       0.0
mg        1.7       0.2        3.1       0.8        3.1       0.7

As you can see, the number of effective cores for two copies is quite good (based on past tests with dual-core and dual-processor nodes). When the test moves up to four copies, the results tend to spread a bit. Most programs see at least three effective cores, which in the grand scheme of things is still pretty good. Moving on to eight copies yields an even larger range of performance. Indeed, some codes see only three effective cores; others see seven or above.

In the cases of poor performance, I assume memory contention is the issue. There may be ways to mitigate some of it (one possibility is sketched below), but a full treatment is beyond the scope of this month’s column. As for the codes that see seven or more effective cores, I can only say, “Holy cow!” If I run codes like these, I want to be first in line for the Clovertown.
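One common approach worth trying (I have not tested it on these nodes) is to pin each copy to its own core so the scheduler does not bounce processes between caches. A sketch using taskset from util-linux:

# Untested sketch: pin eight copies of one benchmark to cores 0-7.
# Which core numbers share a socket or a cache depends on the BIOS
# and kernel, so check /proc/cpuinfo before choosing pairings.
for i in `seq 0 7`
do
  taskset -c $i bin/cg.A.1 >& temp.mem$(($i + 1)) &
done
wait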

About That Other Column

You may have noticed the extra columns in Table One. These are the standard deviations for the speed-up numbers. When I was running the tests, I noticed some variation in the results (yes, campers, always run benchmarks several times). The variation seemed larger than the normal random variation I see in these kinds of tests. I therefore ran the complete set of tests five times, averaged the results, and computed the standard deviation.
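For the record, the averaging itself is nothing fancy. A small sketch with made-up numbers (not my actual post-processing) does the job with gawk:

# Compute the mean and sample standard deviation of five speed-up
# numbers (the values here are illustrative, not measured).
echo "3.5 3.4 3.6 3.5 3.5" | gawk '{
  n = NF
  for (i = 1; i <= n; i++) sum += $i
  mean = sum / n
  for (i = 1; i <= n; i++) ss += ($i - mean)^2
  printf "mean %.2f  stddev %.2f\n", mean, sqrt(ss / (n - 1))
}'
# prints: mean 3.50  stddev 0.07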

After looking at these results, I have no explanation as to why some tests have such a large range (in some cases almost plus or minus a full effective core) while others show almost no variation. Some aspects of the hardware are pre-production, so perhaps the final release will show less deviation and better scaling. The good news, however, is that some codes should be able to use the Clovertown right out of the box.

More Cores at the Door

I had to send the servers off to a trade show for a few weeks, but as I write this, the systems have just returned for more testing. In addition, I received a pile of Woodcrest processors to test. Both the Woodcrests and the Clovertowns work in the same motherboard.

I plan to start MPI testing any day now. Of course, by the time I get all this done, octa-core processors will be all the rage.

Doug Eadline has finally enlisted enough monkeys to help him randomly type a book on clusters. A preview is available at http://www.clustermonkey.net/content/view/128/53/.
