
Exercising Multi-core

An opportunity to run some simple yet telling tests on a 12-core Intel Gulftown server presents itself

Next week, I’ll be revealing some cool new hardware I am reviewing. Right now, I’m in the middle of running some tests and I am impressed with the amount of computing power I have in my basement. One component of the hardware I’m reviewing is a dual Intel Xeon Gulftown server (6 cores per socket, 12 cores total). Thus, I have a chance to see how well memory bandwidth holds up on the latest Intel multi-core processors.

I have always been puzzled as to why people focus on a single thread or core when evaluating the performance of a multi-core processor. The “multi” part means more than one core, and to me that means running more than one program on a single machine at the same time. Seems simple enough, but I have rarely seen these kinds of numbers for multi-core systems.

I created a short test script that is described in a previous column. The idea is simple: if a single program/thread runs on a single core in X seconds, then Y copies should run in the same amount of time, provided Y is no greater than the number of cores and there is perfect memory sharing (i.e., no memory contention). If the collection of copies takes longer to run than a single copy, then the number of “effective cores” is reduced. I created a series of scripts that work on 2, 4, 8, 12, and 16 cores. (Note: With more time I suppose I could write a single script that takes the core count as a command-line argument, but I’m both lazy and short of time, and I don’t run these scripts all that often. A sketch of what such a script might look like appears after the four-core listing below.)

To make the test interesting, I use the NAS Parallel Benchmark suite compiled for a single processor or core (i.e., it is not parallel). The NAS suite is a set of eight kernels that represent different aerodynamic application types. Each kernel is self-checking, reports its run time, and exercises a different memory access pattern. The script can be easily modified for other programs. If you want to use the NAS suite, you may find it helpful to download the Beowulf Performance Suite, which includes the run_suite script that automates running the NAS suite. The four-core script is given below.


#!/bin/bash

PROGS="cg.A.1 bt.A.1 ep.A.1 ft.A.1 lu.A.1 is.A.1 sp.A.1 mg.A.1"
NPBPATH="../npb/"
echo "4 Way SMP Memory Test" | tee "smp-mem-test-4.out"
echo "`date`" | tee -a "smp-mem-test-4.out"
# if needed, generate single-cpu codes; change -c for a different compiler
# just check for the last program
if [ ! -e "$NPBPATH/bin/mg.A.1" ]
then
    pushd $NPBPATH
    ./run_suite -n 1 -t A -m dummy -c gnu4 -o
    popd
fi

for TEST in $PROGS
do
    # reference run: a single copy with the machine otherwise idle
    $NPBPATH/bin/$TEST >& temp.mem0
    # four copies running concurrently (three in the background,
    # the fourth in the foreground)
    $NPBPATH/bin/$TEST >& temp.mem1 &
    $NPBPATH/bin/$TEST >& temp.mem2 &
    $NPBPATH/bin/$TEST >& temp.mem3 &
    $NPBPATH/bin/$TEST >& temp.mem4
    wait
    # NPB kernels report "Time in seconds = NN.NN"; field 5 is the time
    S=`grep Time temp.mem0 | gawk '{print $5}'`
    C1=`grep Time temp.mem1 | gawk '{print $5}'`
    C2=`grep Time temp.mem2 | gawk '{print $5}'`
    C3=`grep Time temp.mem3 | gawk '{print $5}'`
    C4=`grep Time temp.mem4 | gawk '{print $5}'`
    # effective cores = S/C1 + S/C2 + S/C3 + S/C4 (RPN via dc, 3 digits)
    SPEEDUP=`echo "3 k $S $C1 / $S $C2 / $S $C3 / $S $C4 / + + + p" | dc`
    echo "4 Way SMP Program Speed-up for $TEST is $SPEEDUP" |\
    tee -a "smp-mem-test-4.out"
done
/bin/rm temp.mem*
echo "`date`" | tee -a "smp-mem-test-4.out"
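
As promised above, here is a minimal, untested sketch of what a single parameterized version might look like. It takes the number of copies as a command-line argument and assumes the same NPB binaries and directory layout as the script above; treat it as an illustration, not a drop-in replacement.

#!/bin/bash
# hypothetical parameterized variant
# usage: ./smp-mem-test.sh <ncopies>
N=${1:?usage: $0 ncopies}
PROGS="cg.A.1 bt.A.1 ep.A.1 ft.A.1 lu.A.1 is.A.1 sp.A.1 mg.A.1"
NPBPATH="../npb/"
OUT="smp-mem-test-$N.out"
echo "$N Way SMP Memory Test" | tee "$OUT"

for TEST in $PROGS
do
    # reference run: a single copy with the machine otherwise idle
    $NPBPATH/bin/$TEST >& temp.mem0
    # launch N concurrent copies
    for i in `seq 1 $N`
    do
        $NPBPATH/bin/$TEST >& temp.mem$i &
    done
    wait
    S=`grep Time temp.mem0 | gawk '{print $5}'`
    # accumulate effective cores: the sum of S/Ci over all N copies
    SPEEDUP=0
    for i in `seq 1 $N`
    do
        C=`grep Time temp.mem$i | gawk '{print $5}'`
        SPEEDUP=`echo "3 k $SPEEDUP $S $C / + p" | dc`
    done
    echo "$N Way SMP Program Speed-up for $TEST is $SPEEDUP" | tee -a "$OUT"
done
/bin/rm temp.mem*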

One should note that I don’t really care about individual program performance at this point. This series of tests measures how well a server scales as more programs are run. Of course, the test could be considered a “worst case” scenario because multiple copies of the same program (i.e., the same memory access pattern) run at the same time. Perhaps, but in the case of an MPI program run on a multi-core machine this is exactly what happens. The results are reported in Table One below. Note that even though there are 12 cores, I ran up to 16 copies of each program.

Test    2 copies    4 copies    8 copies    12 copies    16 copies
cg         2.0         3.4         5.7          6.6          7.7
bt         2.0         3.2         4.6          4.8          4.9
ep         2.0         3.9         7.8         11.8         12.7
ft         2.0         3.8         7.1          8.9         11.0
is         2.0         3.9         6.5          6.1          6.7
lu         2.0         4.0         7.8         11.2         14.8
sp         2.0         3.7         5.1          5.4          5.7
mg         2.0         3.8         6.4          6.6          9.1
Ave        2.0         3.7         6.4          7.7          9.1

Table One: Effective Cores for a 12-way Intel Xeon (Gulftown) SMP server running the NAS suite

One way to interpret the results is in terms of “effective cores,” that is, how many cores are actually utilized on the processor. The underutilization is due to memory contention. In the table above, results are pretty good across the board up to 8 copies. At 12 copies we start to see performance level off for some programs while others keep improving. At 16 copies, most programs still see some improvement, but this is probably the limit of effective cores for this system.
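
For a concrete feel for the numbers, the dc calculation the script performs can be run by hand with made-up times: suppose a single copy finishes in 100 seconds and each of four concurrent copies takes 125 seconds. The effective cores figure is then 4 × (100/125) = 3.2.

$ echo "3 k 100 125 / 100 125 / 100 125 / 100 125 / + + + p" | dc
3.200

In other words, the four copies together got the work of about 3.2 dedicated cores.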

My interest in these numbers started back in the dual-processor days (not dual-core). I wondered how well two separate processors (two sockets) shared memory. The results were interesting and indicated that these simple tests were worthwhile. Indeed, at one point I tested a dual-socket, quad-core Intel Harpertown system (8 cores total) and was not impressed. By the way, the Clovertown was less impressive still. The Harpertown results are in Table Two below.

Test    8 copies
cg         2.5
bt         2.1
ep         8.0
ft         4.8
is         3.2
lu         4.9
sp         2.2
mg         2.2
Ave        3.7

Table Two: Effective Cores for an 8-way Intel Xeon (Harpertown) SMP server running the NAS suite

As you can see, things have improved quite a bit since then. The take-away from this data depends on your needs. If you are interested in running MPI codes on multi-core systems, I would start with these benchmarks and then move on to other MPI tests. (Yes, run MPI codes on multi-core.) You should also run multi-node MPI benchmarks to evaluate the interconnect. My concern is that as more cores are added with each new generation of processor, some applications may not be “memory friendly” and thus show reduced utilization. If I have time, I’ll run more tests, but there is something else in this box that is even more interesting. I’ll have more to say about that next week.
