When I use the script, I prefer the NAS Parallel Benchmark Suite (Version 2.3) compiled for a single processor, or in this case, one core. (Download the suite from http://www.nas.nasa.gov/Resources/Software/npb.html.) The NAS suite is a group of programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications.
Since what the programs actually do isn't as important as measuring their relative performance, I won't go into what each program is designed to accomplish. I can say, however, that for the most part they reflect real computations that one might do on a cluster. If you're curious as to their purpose, consult the previously mentioned Web site.
One important aspect of the benchmarks is that they're self-checking. A self check is missing from many benchmark efforts. All benchmark results should be checked for correctness. Fast is one thing, and the right answer is another. In HPC, you need both.
As mentioned, I used a Pentium D 940 running at 3.2 GHz. The motherboard was an Intel Model SE7230NH1 (based on the Intel E7230 chipset). The software environment was Fedora Core 4 with the GNU 4.0.2 compilers. GNU 4.x now uses gfortran instead of g77.
The NAS benchmarks are available in various sizes. For these tests, I used the "A" class, which ran for about 30 minutes using the script. The results are shown in Table One.
TABLE ONE: Results for NAS benchmarks
For comparison, Table One includes some older results for dual Xeon processors (2.2 GHz) on an Intel SE7500WV2 (Intel E7500 chipset) motherboard, and dual Athlon MP 1600+ processors (1.4 GHz) using a Tyan 2762 motherboard (AMD 760 chipset). Both of the latter systems used the GNU 2.96 compilers (gcc and g77).
Looking at the results, there are several things to notice:
* First, only one program had a perfect speed-up of two. Thus, you may assume this to be the exception rather than the rule for most dual-core systems.
* Second, the results for the Pentium D are actually not too bad; however, program sp.A.1 lost about one third of its potential speed-up to memory contention.
* Finally, looking at the older results, you can see that in the worst case, dual Xeons running cg.A.1, the effect of a second processor was almost nil. Remember that the Pentium D results are for a dual-core processor, while the Xeon and Athlon results are for two separate processors.
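To make the "speed-up of two" figure concrete, here is a minimal Python sketch of the arithmetic. It assumes that speed-up is defined as the time needed to run the copies one after another divided by the time needed to run them all at once. The function name and the timings are hypothetical and are not taken from Table One.

```python
def speedup(t_single: float, t_concurrent: float, copies: int = 2) -> float:
    """Speed-up from running `copies` at once instead of one after another.

    t_single     -- wall-clock time of one copy running alone
    t_concurrent -- wall-clock time for all copies running simultaneously
    """
    if t_single <= 0 or t_concurrent <= 0:
        raise ValueError("times must be positive")
    # Running the copies back to back would take copies * t_single.
    return copies * t_single / t_concurrent


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    print(speedup(100.0, 100.0))  # no contention: 2.0
    print(speedup(100.0, 200.0))  # second core adds nothing: 1.0
```

Under this definition, any value between 1 and 2 indicates that the two copies slowed each other down, most likely through contention for memory.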
The good news about dual-core is the price-to-performance ratio. Because dual-core processors are selling at about the same price as single-core processors, you're almost getting a second core for free. If I need to run two copies of a program with different parameters, a dual-core machine could offer great value.
By running the test script, one can start to develop an envelope of performance for specific processor/motherboard combinations. If you write your codes with minimal communication, you can expect to achieve performance close to that predicted by the test script. If you have a large amount of MPI communication (memory copying), then you can expect to get less speed-up than what the script predicts. The same can be said for threaded codes, but instead of MPI memory copying, memory sharing may increase contention issues.
Another interesting test is to try different compilers and re-run the script. In previous tests, I've found that the better the compiler, the worse the memory contention. While this result sounds counterintuitive, the reason is actually quite simple: Optimized code performs memory accesses faster than non-optimized code and thus increases contention issues.
The kernel may play a role as well. Generally, you don't have control of which core a program executes on. The program could (and does) flip back and forth between cores. Flipping between cores can cause cache misses, but the kernel scheduler tries to minimize these issues. There may be cases where pinning a code to a particular core provides better performance. The Portable Linux Processor Affinity (PLPA, http://www.open-mpi.org/software/plpa/) project is designed to help with this.
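As an illustration of pinning, the sketch below restricts the current process to a single core. It does not use PLPA. Instead, it assumes a Linux system and relies on Python's os.sched_setaffinity and os.sched_getaffinity, which wrap the kernel's own affinity calls. The helper name pin_to_core is my own.

```python
import os


def pin_to_core(core: int) -> set:
    """Restrict the calling process to one core and return the new CPU set.

    Assumes Linux; os.sched_setaffinity does not exist on every platform.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is not available on this platform")
    os.sched_setaffinity(0, {core})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)  # cores this process may currently use
    target = min(allowed)              # pick a core we are allowed to run on
    print("pinned to:", pin_to_core(target))
    os.sched_setaffinity(0, allowed)   # restore the original affinity mask
```

Child processes inherit the affinity mask, so a benchmark launched after the call stays on the chosen core. Whether pinning actually helps is something to measure rather than assume.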
While there are always options with which to tinker, running the simple test script can help set realistic expectation levels for your codes. Taking an hour or so to run a script may also reveal hidden issues with your hardware and reduce the amount of hair pulling and head smacking that is required to write parallel codes.
One limitation of the test is that it only tests programs against themselves. While this feature makes for a simple test, it may not be as close to the real world as one might like. Running different programs may produce some further insights, but requires a more robust methodology. In particular, all of the programs have different wall clock times. If a short run-time program is executed while a long run-time program is executing, the effect on the short program can be seen, but this may not necessarily be true for the longer program. Creating a script to handle this kind of heterogeneity might get a little complicated.
A more interesting approach might be to generate a random ordering of the programs. Using a batch scheduler, the programs could be run one at a time on a single processor core; next, the same programs could be run two at a time on the same processor. If the time to empty the queue were measured in each case, you could develop an average number for memory contention.
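The random-ordering idea can be sketched in Python without a real batch scheduler. Everything here is an assumption made for illustration: a thread pool stands in for the scheduler, the function names are hypothetical, and the placeholder jobs are trivial Python commands rather than the NAS binaries. Note that this sketch leaves core placement to the kernel.

```python
import random
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def drain_queue(commands, workers):
    """Run every command, at most `workers` at a time; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every job and re-raises the first failure, if any.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return time.perf_counter() - start


def queue_speedup(commands, seed=None):
    """Shuffle the jobs, drain the queue one and two at a time, compare times."""
    order = list(commands)
    random.Random(seed).shuffle(order)  # same random order for both passes
    t_one = drain_queue(order, workers=1)
    t_two = drain_queue(order, workers=2)
    return t_one / t_two  # 2.0 would mean no contention at all


if __name__ == "__main__":
    # Placeholder jobs; substitute the real benchmark command lines.
    jobs = [[sys.executable, "-c", "sum(range(10**6))"] for _ in range(4)]
    print(f"speed-up with two at a time: {queue_speedup(jobs, seed=1):.2f}")
```

Repeating the measurement with several seeds and averaging the results would give the kind of average contention number described above.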
Of course, by the time I get this coded, quad-core processors will be ready for testing.