Blowing The Doors Off HPC Speed-up Numbers

A recent paper by Intel provides some hard data on CPU vs GP-GPU speed-up numbers.

The expression “blow your doors off” is often used to indicate fast performance. I believe the expression references a car going so fast that it “blew the doors off” the slower vehicle. I have never seen such and occurrence and find it hard to imagine what if anything, except for adroit placement of explosives, could actually remove a closed car door. I suspect bad physics in movies makes us a bit more susceptible to such images.

As silly as the expression is, my experience tells me that it is often used when something is much faster than something else. As a trained scientist, however, I want to know quantitatively how much faster. In the past, I have often replied to such proclamations on the Beowulf Mailing list, with “could you please quantify your observation?” As I recall, I don’t think I ever got a response. Because HPC is about performance numbers, I find it difficult to make any determinations about performance without a set of carefully prepared benchmark data. Running benchmarks in a willy-nilly fashion often just confuses the issue or worse puts wrong ideas in peoples heads.

My interest in benchmarking has a practical side for HPC. As everyone is well aware, the Linux HPC ecosystem changes rapidly with updated kernels (drivers), libraries (MPI), and applications. Early on, I would ask myself this questions, “What influence do these updates have on performance?” It is naive to think that all updates always lead to better performance. In order to quantitatively measure these changes, I put together what I called the Beowulf Performance Suite. My intention was not to create a benchmark, but rather a way to measure the effects of change.

Updating the Beowulf Performance Suite is on my to-do list as it does not account for things like multi-core, GP-GPUs, or different networks, to name a few. As it seems to be these days, I have not had the time to improve the tests, but it is still a good first measure of how well things are working on a cluster.

In recent years, the use of GP-GPUs for HPC has sparked quite a bit of interest. The concept is sound and the idea makes good sense from a commodity standpoint (i.e. let the larger video market help fund the HPC market). While there were more entries in the field (Cell and Larabee), the GP-GPU market has tuned into a two horse race with NVidia and AMD/ATI leading the way. From an HPC perspective, the GP-GPU can be considered a SIMD parallel co-processor (Single Instruction Multiple Data) Indeed, graphic processing is by its nature a SIMD process and it makes sense to borrow the hardware for other SIMD applications like those in HPC.

Even without seeing any numbers, the advantage of dedicated SIMD hardware (GP-GPU) vs a general purpose processor (CPU) makes sense. The SIMD hardware takes advantage of the regularity in a problem and can solve certain problems (like graphics) very fast. The CPU, by its general nature can solve similar problems, albeit slower, and at the same time rapidly solve problems that would bring SIMD hardware to a craw. Thus, I was not surprised when the first numbers for GP-GPU results began showing up on the web. I even had some first hand knowledge of the good speed-ups people were seeing on NVidia video cards vs their PC.

Initially, there were claims of 20-30X faster. These numbers were not totally unexpected based on my knowledge of good SIMD hardware. Then there were claims of 100X and even some wild claims in excess of 1000X speed up. Time for my inner scientist to step forward. At some point in my education, I learned that an order of magnitude improvement beyond the leading edge was good, two orders of magnitude was very interesting and invited further analysis, but three orders of magnitude was often a bunch of crap. Probably a bit of an over generalization, but as a scientist it often pays to be a bit skeptical.

Whenever I hear claims of GP-GPU speed up, I often ask “X times better than what?” And, here is where the devil lives. To be a meaningful comparison, the CPU type, speed, memory architecture, compiler, how many cores, precision level, etc. need to be specified. I found many of the performance claims missing much of this key data, which in my world means the speed-up numbers are worthless. It is not that I doubted the applications ran fast, intuitively I knew if they were SIMD in nature they should, I was not so sure the doors were being blown off in many of these cases.

Just recently there was a paper submitted by a team of Intel engineers at the International Symposium On Computer Architecture (ISCA). The paper, which I suggest reading, is entitled Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. At first, I thought a bunch of Intel guys writing a performance paper about NVidia, it must be more marketing hype, but the paper has something that makes me smile — benchmarks and numbers. The paper is thorough and compares an Intel I7 (quad-core running at 3.2 GHz) with an NVidia GTX280 (240 Stream Processors at 1.3 GHz). The team used fourteen kernels (problem types) optimized for the each hardware platform and found that the average speed-up was 2.5 times. That number tends to keep the doors on some popular CPUs.

There are some things to consider. First, with most SIMD problems, GP-GPUs are faster than CPUs and Intel provides data to support this conclusion. Looking at a sampling of HPC type kernels (see paper for details) we find that SGEMM is 3.9 times faster, FFT is 3.0 times faster, SAXPY is 5.3 times faster, and SpMV is 1.9 times faster. Those results are impressive. Of course they are not headline grabbing 100 times faster, but in HPC even doubling performance is a good thing.

Second, I applaud this work and hope that it leads to more controlled and open benchmarking on all sides of the performance front. Ultimately overselling a technology never helps and in the end when I don’t get my 100X speed-up I might be inclined to think “this stuff does not work.” There is also a more subtle point to be made here. In both cases, the applications were optimized (parallelized) for their respective hardware. The same source code was not compiled for two different processors to see which one was faster. In the past, I suspect comparing a newly minted CUDA version of a program with the original sequential C version running on an older processor produced great speed-ups. It is understandable that after you spent the time to create your CUDA masterpiece that you don’t really want to go back to square one and recode you original program for a new multi-core processor just to see if you could do better. I mean you blew the doors off of something, that has to be worth talking about. Of course, we all need to know what that door-less something really is.

Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/linux-mag.com/site/www/htdocs/wp-content/themes/linuxmag/single.php on line 62