Exercising Multi-core

An opportunity to run some simple yet telling tests on a 12-core Intel Gulftown server presents itself

Next week, I’ll be revealing some cool new hardware I am reviewing. Right now, I’m in the middle of running some tests and I am impressed with the amount of computing power I have in my basement. One component of the hardware I’m reviewing is a dual Intel Xeon Gulftown server (6 cores each, 12 cores total). Thus, I have a chance to see how well the latest Intel multi-core memory bandwidth is holding up.

I have always been puzzled as to why people focus on a single thread or core when evaluating performance on a multi-core processor. The “multi” part means more than one core, and to me that means running more than one program at a time on a single machine. It seems simple enough, but I have rarely seen these types of numbers for multi-core systems.

I created a short test script that is described in a previous column. The idea is simple. If a single program/thread runs on a single core in X seconds, then Y copies started at the same time should each finish in that same X seconds, provided Y does not exceed the number of cores and there is perfect memory sharing (i.e., there is no memory contention). If the copies take longer to run than a single copy, then the number of “effective cores” is reduced. I created a series of scripts that run 2, 4, 8, 12, and 16 copies. (Note: If I spent more time I suppose I could make a single script that would use a command line argument, but I’m both lazy and short of time, plus I don’t run these scripts all that often.)

To make the test interesting, I use the NAS Parallel Benchmark Suite compiled for a single processor or core (i.e., it is not parallel). The NAS suite is a set of eight kernels that represent different aerodynamic application types. Each kernel is self-checking, reports its run time, and offers a different memory access pattern. The script can be easily modified for other programs. If you want to use the NAS suite, you may find it helpful to download the Beowulf Performance Suite, which has the run_suite script that automates running the NAS suite. An example of the four-core script is given below.


#!/bin/bash
PROGS="cg.A.1 bt.A.1 ep.A.1 ft.A.1 lu.A.1 is.A.1 sp.A.1 mg.A.1"
echo "4 Way SMP Memory Test" | tee "smp-mem-test-4.out"
echo "`date`" | tee -a "smp-mem-test-4.out"
# if needed, generate single cpu codes; change -c for a different compiler
# just check for last program
if [ ! -e "$NPBPATH/bin/mg.A.1" ]; then
    pushd $NPBPATH
    ./run_suite -n 1 -t A -m dummy -c gnu4 -o
    popd
fi

for TEST in $PROGS
do
    # baseline: time a single copy running alone
    $NPBPATH/bin/$TEST >& temp.mem0
    # then run four copies at once; wait so every report is complete
    $NPBPATH/bin/$TEST >& temp.mem1 &
    $NPBPATH/bin/$TEST >& temp.mem2 &
    $NPBPATH/bin/$TEST >& temp.mem3 &
    $NPBPATH/bin/$TEST >& temp.mem4
    wait
    # pull the run time (fifth field of the "Time" line) from each report
    S=`grep Time temp.mem0 | gawk '{print $5}'`
    C1=`grep Time temp.mem1 | gawk '{print $5}'`
    C2=`grep Time temp.mem2 | gawk '{print $5}'`
    C3=`grep Time temp.mem3 | gawk '{print $5}'`
    C4=`grep Time temp.mem4 | gawk '{print $5}'`
    # effective cores = S/C1 + S/C2 + S/C3 + S/C4, to 3 decimal places
    SPEEDUP=`echo "3 k $S $C1 / $S $C2 / $S $C3 / $S $C4 / + + + p" | dc`
    echo "4 Way SMP Program Speed-up for $TEST is $SPEEDUP" |\
        tee -a "smp-mem-test-4.out"
    /bin/rm temp.mem*
done
echo "`date`" | tee -a "smp-mem-test-4.out"
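The single script with a command line argument mentioned earlier might look roughly like the sketch below. It is not the script used for the results in this column. The function names (effective_cores, run_copies, smp_test) are made up for illustration, and it assumes, as the four-way script does, that each program prints a "Time" line with the run time in the fifth field. It uses awk instead of dc for the sum, so the last digit can differ slightly.

```shell
#!/bin/bash
# Sketch of an N-way version of the memory test.
# All function names here are hypothetical, not part of any suite.

# effective_cores S C1 C2 ... : print S/C1 + S/C2 + ... with three decimals,
# the same sum the dc expression in the four-way script computes.
effective_cores() {
    local single=$1
    shift
    printf '%s\n' "$@" |
        awk -v s="$single" '{ sum += s / $1 } END { printf "%.3f\n", sum }'
}

# run_copies N CMD [ARGS...] : start N copies of CMD at once, send each
# copy's output to temp.mem1 .. temp.memN, and wait for all of them.
run_copies() {
    local n=$1 i=1
    shift
    while [ "$i" -le "$n" ]; do
        "$@" > "temp.mem$i" 2>&1 &
        i=$((i + 1))
    done
    wait
}

# smp_test N PROGRAM : time one copy alone, then N copies together,
# and report the effective cores for PROGRAM.
smp_test() {
    local n=$1 prog=$2 single times i=1
    "$prog" > temp.mem0 2>&1
    single=$(grep Time temp.mem0 | awk '{print $5}')
    run_copies "$n" "$prog"
    times=""
    while [ "$i" -le "$n" ]; do
        times="$times $(grep Time "temp.mem$i" | awk '{print $5}')"
        i=$((i + 1))
    done
    # $times is left unquoted on purpose so each run time is one argument
    echo "$n Way SMP Program Speed-up for $(basename "$prog") is $(effective_cores "$single" $times)"
    rm -f temp.mem*
}
```

With these functions loaded, something like `smp_test 12 $NPBPATH/bin/cg.A.1` would stand in for a separate 12-way script.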

One should note that I don’t really care about individual program performance at this point. The series of tests measures how well a server scales as more programs are run. Of course, this test could be considered a “worst” case scenario because multiple copies of the same program (and thus the same memory access pattern) are run at the same time. Perhaps, but in the case of an MPI program run on multi-core this is exactly what happens. The results are reported in Table One below. Note that even though there are 12 cores, I ran up to 16 copies of each program.

Test   2 copies   4 copies   8 copies   12 copies   16 copies
cg        2.0        3.4        5.7         6.6         7.7
bt        2.0        3.2        4.6         4.8         4.9
ep        2.0        3.9        7.8        11.8        12.7
ft        2.0        3.8        7.1         8.9        11.0
is        2.0        3.9        6.5         6.1         6.7
lu        2.0        4.0        7.8        11.2        14.8
sp        2.0        3.7        5.1         5.4         5.7
mg        2.0        3.8        6.4         6.6         9.1
Ave       2.0        3.7        6.4         7.7         9.1

Table One: Effective Cores for a 12-way Intel Xeon (Gulftown) SMP server running the NAS suite

One way to interpret the results is as “effective cores,” that is, how many cores actually get utilized. Any shortfall from the number of copies is due to memory contention. In the table above, results are pretty good across the board up to 8 copies. At 12 copies, performance levels off for some programs (bt, is, sp, and mg) while others (ep and lu) keep improving. At 16 copies, most programs see some improvement, but this is probably the limit of effective cores for this system.
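To put a number on “pretty good,” the effective cores can be divided by the number of copies. The snippet below is only a sketch of that arithmetic using the Ave row of Table One; the efficiency function is a made-up helper. The 16-copy column is left out because it runs more copies than there are physical cores, so the same ratio would not mean quite the same thing.

```shell
#!/bin/bash
# efficiency COPIES EFFECTIVE : effective cores as a percentage of the
# number of copies run (hypothetical helper, for illustration only).
efficiency() {
    awk -v n="$1" -v e="$2" 'BEGIN { printf "%.1f%%\n", 100 * e / n }'
}

# "Ave" row of Table One, written as copies:effective-cores
for pair in 2:2.0 4:3.7 8:6.4 12:7.7; do
    copies=${pair%%:*}
    effective=${pair##*:}
    echo "$copies copies: $(efficiency "$copies" "$effective")"
done
```

On average, the figure falls from 100.0% at 2 copies to 80.0% at 8 copies and 64.2% at 12 copies.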

My interest in these numbers started back in the dual-processor days (not dual-core). I wondered how well two separate processors (2 sockets) shared memory. The results were interesting and indicated that these simple tests were worthwhile. Indeed, at one point I tested a dual-socket quad-core Intel Harpertown (8 cores total) and was not impressed. By the way, the Clovertown was even less impressive. The Harpertown results are in Table Two below.

Test   8 copies
cg        2.5
bt        2.1
ep        8.0
ft        4.8
is        3.2
lu        4.9
sp        2.2
mg        2.2
Ave       3.7

Table Two: Effective Cores for an 8-way Intel Xeon (Harpertown) SMP server running the NAS suite
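The 8-copy column is the one place the two tables overlap, so a side-by-side comparison is easy to sketch. The snippet below simply divides the Gulftown value from Table One by the Harpertown value from Table Two for each kernel. It is a comparison of the table values only (compare_8way is a made-up name), and it says nothing about why the two servers differ.

```shell
#!/bin/bash
# compare_8way : print, per kernel, the 8-copy effective cores on
# Harpertown (Table Two) and Gulftown (Table One) and their ratio.
compare_8way() {
    awk '{ printf "%-3s %.1f -> %.1f  (%.1fx)\n", $1, $2, $3, $3 / $2 }' <<'EOF'
cg 2.5 5.7
bt 2.1 4.6
ep 8.0 7.8
ft 4.8 7.1
is 3.2 6.5
lu 4.9 7.8
sp 2.2 5.1
mg 2.2 6.4
EOF
}

compare_8way
```

The columns in the embedded data are kernel, Harpertown, and Gulftown, in that order.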

As you can see, things have improved quite a bit. The takeaway from this data depends on your needs. If you are interested in running MPI codes on multi-core systems, I would start with these benchmarks and move on to other MPI tests. (Yes, run MPI codes on multi-core.) You should also try running multi-node MPI benchmarks to evaluate the interconnect. My concern is that as more cores are added to each new generation of processor, some applications may not be “memory friendly” and thus show reduced utilization. If I have time, I’ll run more tests, but there is something else in this box that is even more interesting. I’ll have more to say about that next week.
