Jonathan vs. The Roman Beauty

When comparing CPUs, you have to compare apples to apples. Doug Eadline compares the performance of AMD Opteron and Intel multicore processors to see which CPUs provide the best performance per core.

In the past, I’ve put Intel quad-cores through some very simple tests. Namely, I spent quite a bit of time trying to understand the results of my “effective processor” script. Now it’s AMD’s turn.

I’ve been testing an 8-core (four-socket) Opteron machine from Appro International (http://www.appro.com). The machine can be configured as a 3U cabinet or as a deskside tower. More important, however, I now have an AMD box with 8 cores (four dual-core, 2.8 GHz Opteron processors) to compare to an Intel 8-core box (two quad-core 2.67 GHz Clovertowns) — a very apples-to-apples comparison.

However, exact parity in clock speed is not essential for my comparison. I am interested in how well the system scales when running multiple copies of the same program, so the actual clock speed is normalized out of the results. For each system, the Linux kernel finds eight processors and that is at the core of all the tests (obligatory core pun when writing about apples).

Before showing the results, I’d like to present my conclusions: In the taxonomy of apples, there are two kinds that seem to represent the results. The first is like the Jonathan — deep red, mildly tart, rich in flavor, versatile, and excellent for snacking or baking. The other is more like the Roman Beauty, slightly tart, best for baking. Of course, I will let you try and figure out which is which.

The Contenders

I configured the two systems to be as similar as possible. Table One compares the systems’ specifications. I am most interested in the number of cores. The difference in kernels will have little effect, as the programs are crunching numbers. The compilers could almost be considered identical. The compiler options in both cases were -O3 -ffast-math, and the -march option was set to the respective processor family (nocona or opteron).

(As an aside, gcc/g77 3.X are part of both the RHEL4 and SLES9 distributions. gcc versions 4.0 and later include gfortran, which is Fortran 95 compliant, instead of g77. The absence of the gcc 4+ toolset has made RHEL4/SLES9 users a little envious, because they’re stuck with g77. Not to worry: you can build the 4.X series yourself, or you can head over to AMD and grab a set of gcc/gfortran 4.1.2 RPMs (http://developer.amd.com/gcc.jsp). The new version of gcc/gfortran can coexist with the older compilers as well.)

TABLE ONE:

The Intel and AMD specifications

Option               Intel Platform   AMD Platform
Number of CPUs       2                4
Number of Cores      8                8
Clock Speed (GHz)    2.67             2.8
CPU Model            5355             8220
Memory Type          FBDIMM           DDR2
Memory Amount (GB)   8                16
Motherboard          S5000XAL         Appro
OS                   Fedora 6         RHEL
Kernel               2.6.18.7         2.6.9-42
gcc/gfortran         4.1.1            4.1.2

In past columns, I presented a script that determined what I call “effective processors,” or how many processors your application actually sees when it is using all the cores. The script simply measures how long it takes to run one copy of your program, then how long it takes to run eight copies. If the times are the same, you have eight effective processors. If the second time is eight times longer than the single copy, you have one effective processor.
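One formula consistent with both endpoints described above is the number of copies multiplied by the single-copy time and divided by the multi-copy time. The following Python sketch assumes that formula; the function name and the timings are illustrative and are not taken from the actual script.

```python
def effective_processors(t_single: float, t_multi: float, copies: int) -> float:
    """Estimate effective processors from two wall-clock timings.

    t_single: seconds to run one copy of the program alone
    t_multi:  seconds for `copies` simultaneous copies to finish
    Assumed formula: copies * t_single / t_multi.
    """
    if t_single <= 0 or t_multi <= 0 or copies < 1:
        raise ValueError("timings must be positive and copies must be >= 1")
    return copies * t_single / t_multi


# Illustrative timings only (not measurements from this column).
print(effective_processors(100.0, 100.0, 8))  # same time as a single copy
print(effective_processors(100.0, 800.0, 8))  # eight times longer
```

With equal times the sketch returns 8.0, and with a time eight times longer it returns 1.0, matching the two cases in the text.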

The actual programs are part of the NAS Parallel test suite. The suite contains eight programs that represent real application programs. The programs were run in single-process mode for the purposes of the test script; that is, each program was a single process running on one core. See http://www.linux-mag.com/id/2868/ for the actual scripts and NAS suite.

Round One

For reference, Table Two provides the previously reported results for the Intel platform. While running the tests, I found the results would vary quite a bit, so I ran them five times and reported the average and the standard deviation. The tests indicate the number of “effective cores” you can achieve when running 2, 4, and 8 copies of a given program. As you can see, the variation in performance could almost amount to a whole core in some cases.
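Averaging repeated runs can be sketched in a few lines of Python. The column does not say whether the sample or the population standard deviation was used, so the sample form is an assumption here, and the five values are hypothetical rather than data from the tables.

```python
import statistics


def summarize(runs):
    """Return (mean, sample standard deviation), each rounded to one decimal."""
    return round(statistics.mean(runs), 1), round(statistics.stdev(runs), 1)


# Hypothetical effective-core values from five runs (not data from the tables).
runs = [3.0, 4.0, 5.0, 4.0, 4.0]
mean, std_dev = summarize(runs)
print(mean, std_dev)
```

A standard deviation close to 1.0 would mean that individual runs commonly differ from the average by about one effective core.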

The Opteron platform results are given in Table Three. In comparison to the Intel platform, the number of effective cores is much better. In the four-copy tests, the number of effective cores scales almost perfectly with the number of jobs. When the number of copies is eight, there is some drop-off, but nothing lower than about five effective cores (sp, at 5.3). The standard deviation is also much better, which means process placement is not as critical as on the Clovertown.

TABLE TWO:

Previously reported average speed-up data (effective processors) for 2, 4, and 8 copies of the same program on an eight-way Intel Clovertown system. Each test was run five times.

Test   2 Copies   Std Dev   4 Copies   Std Dev   8 Copies   Std Dev
bt     1.5        0.2       2.4        0.0       3.5        0.0
cg     1.7        0.1       2.3        0.1       2.3        0.0
ep     2.0        0.7       3.3        0.2       8.0        0.0
ft     1.7        0.2       3.1        0.1       7.1        0.8
is     1.7        0.2       3.2        0.1       4.6        0.5
lu     1.7        0.2       3.5        0.8       4.4        0.0
mg     1.7        0.2       3.1        0.8       3.1        0.7
sp     1.5        0.2       2.3        0.3       2.8        0.0

TABLE THREE:

Average speed-up data (effective processors) for 2, 4, and 8 copies of the same program on an eight-way AMD Opteron system. Each test was run five times.

Test   2 Copies   Std Dev   4 Copies   Std Dev   8 Copies   Std Dev
bt     1.5        0.0       3.8        0.0       6.1        0.0
cg     1.9        0.0       3.9        0.1       7.4        0.0
ep     2.0        0.0       4.0        0.0       8.0        0.0
ft     1.9        0.1       3.9        0.0       7.5        0.0
is     2.0        0.1       4.0        0.0       7.5        0.0
lu     1.8        0.2       3.9        0.1       6.5        0.1
mg     1.7        0.2       3.8        0.0       6.1        0.0
sp     1.7        0.4       3.7        0.0       5.3        0.0

Round Two

In addition to the simple “effective cores” script, I also ran the NAS suite in parallel to see what kind of real scalability the systems could provide. The NAS suite uses the Message Passing Interface (MPI). LAM/MPI was used for all the tests. Table Four shows the results of the Clovertown. Table Five shows the results of the Opteron.

Most MPI libraries work by creating a process for each parallel part of the program and then setting up the communication between processes. In the case of a cluster, this communication is often over Gigabit Ethernet, Myrinet, or InfiniBand. On an SMP (shared memory) node, LAM/MPI uses shared memory to pass messages between processes.

Presumably, the parallel results should mirror the results in Tables Two and Three because the same programs are running — only now they are communicating via shared memory. If there is a difference, it should be due to communication over shared memory. (There is one exception: bt and sp will not run as an 8 way parallel program.)

If you look at the 8-core numbers for the Intel Clovertown, you see that the results pretty much follow the effective processors. However, two results stand out.

* The ft benchmark does well in the effective processor test, but does poorly in the parallel tests. If you recall, the Clovertown uses the same Front Side Bus (FSB) for all cores. In addition, the ft test produces a large amount of communication, so memory contention on the bus is probably slowing down the benchmark.

* The lu benchmark actually does better than the effective core results. Indeed, if you look at the 4-core results for lu, you will see what is called hyper-speedup, which means the parallel results have exceeded linear scalability. This effect is largely due to cache. As the program and data are broken into smaller pieces when run in parallel, the smaller parts can use the cache more effectively. Hyper-speedup is often a pleasant surprise to the HPC practitioner.
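The hyper-speedup test can be expressed as a simple comparison: the speed-up exceeds the number of cores, or equivalently the parallel efficiency (speed-up divided by cores) exceeds 1.0. The Python sketch below applies that check to the lu and ft speed-ups from Table Four; the function names are illustrative.

```python
def efficiency(speedup: float, cores: int) -> float:
    """Parallel efficiency: the fraction of ideal linear scaling achieved."""
    return speedup / cores


def is_hyper_speedup(speedup: float, cores: int) -> bool:
    """True when the speed-up exceeds linear scaling for the core count."""
    return speedup > cores


# Speed-ups for lu and ft on the Clovertown platform, copied from Table Four.
table_four = {
    "lu": {4: 5.1, 8: 6.5},
    "ft": {4: 1.8, 8: 2.0},
}

for test, results in table_four.items():
    for cores, speedup in results.items():
        print(
            f"{test} on {cores} cores: efficiency {efficiency(speedup, cores):.2f}, "
            f"hyper-speedup: {is_hyper_speedup(speedup, cores)}"
        )
```

Only the 4-core lu result passes the check. The ft efficiencies fall well below 1.0, which is consistent with the bus contention described above.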

Moving to Table Five, the data shows that the Opteron platform does quite well. Indeed, there are two remarkable cases of hyper-speedup, cg and lu, whose eight-core results are well over eight times faster than the single process. Again, this is largely due to cache effects. The excellent performance of the AMD platform is largely due to the use of HyperTransport to share memory between processors. Unlike a single FSB, HyperTransport allows more total memory bandwidth throughout the system, so there is less contention for memory.

TABLE FOUR:

NAS Parallel Suite results for Clovertown platform

Test   Single Process MFLOPS   4-Core Speed-up   8-Core Speed-up
bt     725.9                   1.8               not run
cg     335.9                   2.2               2.2
ep     12.7                    3.8               7.6
ft     1026.6                  1.8               2.0
is     41.3                    2.5               3.0
lu     821.6                   5.1               6.5
mg     959.7                   2.1               2.1
sp     563.0                   2.0               not run

TABLE FIVE:

NAS Parallel Suite results for Opteron platform

Test   Single Process MFLOPS   4-Core Speed-up   8-Core Speed-up
bt     718.1                   3.8               not run
cg     235.2                   4.6               9.3
ep     10.8                    4.0               7.9
ft     698.0                   3.0               5.1
is     44.3                    4.2               6.1
lu     887.2                   3.8               9.9
mg     775.6                   3.0               7.2
sp     581.9                   3.8               not run

The Best Apple

As with all benchmark discussions, your application suite is what really matters. In these tests I tried to measure the basic behavior of two 8-core systems in an apples-to-apples comparison.

The Intel platform did well in some tests, but not in all of them. The single-core NAS performance (the MFLOPS column in Table Four) was quite good. Now that the Clovertown is available at a 3 GHz clock speed, these numbers should be even better. The limits of the FSB, however, can be seen in the scalability numbers. The single process ft benchmark on the Intel platform is a whopping 1.4 times better than the Opteron (1027 vs. 698). However, the eight-way performance of the Opteron platform on the same benchmark is 1.2 times better than the Intel platform (2106 vs. 1841).
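Aggregate parallel throughput can be estimated by multiplying the single-process MFLOPS by the speed-up. The Python sketch below does this for the ft rows of Tables Four and Five; the function name is illustrative. The table speed-ups are rounded to one decimal place, so the products are approximate. Computed this way, the four-core products come out near the 1841 and 2106 figures quoted above.

```python
def aggregate_mflops(single_mflops: float, speedup: float) -> float:
    """Estimated parallel throughput: single-process MFLOPS times speed-up."""
    return single_mflops * speedup


# ft rows copied from Table Four (Clovertown) and Table Five (Opteron).
ft = {
    "Clovertown": {"single": 1026.6, 4: 1.8, 8: 2.0},
    "Opteron": {"single": 698.0, 4: 3.0, 8: 5.1},
}

for platform, row in ft.items():
    for cores in (4, 8):
        total = aggregate_mflops(row["single"], row[cores])
        print(f"{platform} ft, {cores} cores: about {total:.0f} MFLOPS")
```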

The conclusion is quite simple: Pick the apple that suits your needs. In all likelihood, these tests are not a predictor of how well your application will run on these platforms, so test your code.

And remember, apples are good for you, some are better for baking than eating, and some just taste great no matter what you do to them.
