Blowing The Doors Off HPC Speed-up Numbers

A recent paper by Intel provides some hard data on CPU vs GP-GPU speed-up numbers.

The expression “blow your doors off” is often used to indicate fast performance. I believe the expression references a car going so fast that it “blew the doors off” the slower vehicle. I have never seen such an occurrence and find it hard to imagine what, if anything, short of an adroit placement of explosives, could actually remove a closed car door. I suspect bad physics in movies makes us a bit more susceptible to such images.

As silly as the expression is, my experience tells me that it is often used when something is much faster than something else. As a trained scientist, however, I want to know quantitatively how much faster. In the past, I have often replied to such proclamations on the Beowulf mailing list with “could you please quantify your observation?” As I recall, I never got a response. Because HPC is about performance numbers, I find it difficult to make any determinations about performance without a set of carefully prepared benchmark data. Running benchmarks in a willy-nilly fashion often just confuses the issue or, worse, puts wrong ideas in people’s heads.

My interest in benchmarking has a practical side for HPC. As everyone is well aware, the Linux HPC ecosystem changes rapidly with updated kernels (drivers), libraries (MPI), and applications. Early on, I would ask myself this question: “What influence do these updates have on performance?” It is naive to think that all updates always lead to better performance. In order to quantitatively measure these changes, I put together what I called the Beowulf Performance Suite. My intention was not to create a benchmark, but rather a way to measure the effects of change.

Updating the Beowulf Performance Suite is on my to-do list, as it does not account for things like multi-core, GP-GPUs, or different networks, to name a few. As seems to be the case these days, I have not had the time to improve the tests, but it is still a good first measure of how well things are working on a cluster.

In recent years, the use of GP-GPUs for HPC has sparked quite a bit of interest. The concept is sound and the idea makes good sense from a commodity standpoint (i.e. let the larger video market help fund the HPC market). While there were more entries in the field (Cell and Larrabee), the GP-GPU market has turned into a two-horse race with NVidia and AMD/ATI leading the way. From an HPC perspective, the GP-GPU can be considered a SIMD (Single Instruction Multiple Data) parallel co-processor. Indeed, graphics processing is by its nature a SIMD process and it makes sense to borrow the hardware for other SIMD applications like those in HPC.

Even without seeing any numbers, the advantage of dedicated SIMD hardware (GP-GPU) vs a general purpose processor (CPU) makes sense. The SIMD hardware takes advantage of the regularity in a problem and can solve certain problems (like graphics) very fast. The CPU, by its general nature, can solve similar problems, albeit slower, and at the same time rapidly solve problems that would bring SIMD hardware to a crawl. Thus, I was not surprised when the first numbers for GP-GPU results began showing up on the web. I even had some first-hand knowledge of the good speed-ups people were seeing on NVidia video cards vs their PC.

Initially, there were claims of 20-30X faster. These numbers were not totally unexpected based on my knowledge of good SIMD hardware. Then there were claims of 100X and even some wild claims in excess of 1000X speed-up. Time for my inner scientist to step forward. At some point in my education, I learned that an order of magnitude improvement beyond the leading edge was good, two orders of magnitude was very interesting and invited further analysis, but three orders of magnitude was often a bunch of crap. Probably a bit of an overgeneralization, but as a scientist it often pays to be a bit skeptical.

Whenever I hear claims of GP-GPU speed-up, I often ask “X times better than what?” And here is where the devil lives. To be a meaningful comparison, the CPU type, speed, memory architecture, compiler, number of cores, precision level, etc. need to be specified. I found many of the performance claims missing much of this key data, which in my world means the speed-up numbers are worthless. It is not that I doubted the applications ran fast. Intuitively, I knew that if they were SIMD in nature they should. I was just not so sure the doors were being blown off in many of these cases.
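To make the “better than what?” question concrete, here is a minimal sketch in Python. The `BenchmarkRun` record, the `speedup` helper, and every timing below are hypothetical and invented purely for illustration; none of them come from a real measurement. The point is only that one and the same GPU result yields very different X values depending on the baseline it is divided into.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    """One timed run plus the context needed to interpret it (hypothetical record)."""
    hardware: str
    cores: int
    compiler: str
    precision: str
    seconds: float


def speedup(baseline: BenchmarkRun, candidate: BenchmarkRun) -> float:
    """Return how many times faster `candidate` ran than `baseline`."""
    if baseline.precision != candidate.precision:
        raise ValueError("runs use different precision; the ratio would mislead")
    return baseline.seconds / candidate.seconds


# All timings are made up for illustration only.
gpu = BenchmarkRun("GPU, CUDA port", 240, "nvcc", "single", 2.0)
old_cpu = BenchmarkRun("older CPU, sequential C", 1, "gcc -O0", "single", 200.0)
new_cpu = BenchmarkRun("current CPU, tuned and threaded", 4, "gcc -O3", "single", 6.0)

for base in (old_cpu, new_cpu):
    print(f"{speedup(base, gpu):.1f}x vs {base.hardware} "
          f"({base.cores} core(s), {base.compiler})")
```

With these invented numbers the same GPU run is “100x” against the unoptimized single-core baseline and “3x” against the tuned multi-core one, which is why a speed-up quoted without its baseline says very little.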

Just recently a team of Intel engineers presented a paper at the International Symposium on Computer Architecture (ISCA). The paper, which I suggest reading, is entitled “Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU.” At first I thought: a bunch of Intel guys writing a performance paper about NVidia, it must be more marketing hype. But the paper has something that makes me smile: benchmarks and numbers. The paper is thorough and compares an Intel Core i7 (quad-core running at 3.2 GHz) with an NVidia GTX280 (240 Stream Processors at 1.3 GHz). The team used fourteen kernels (problem types) optimized for each hardware platform and found that the average speed-up was 2.5 times. That number tends to keep the doors on some popular CPUs.

There are some things to consider. First, with most SIMD problems, GP-GPUs are faster than CPUs, and Intel provides data to support this conclusion. Looking at a sampling of HPC-type kernels (see the paper for details), we find that SGEMM is 3.9 times faster, FFT is 3.0 times faster, SAXPY is 5.3 times faster, and SpMV is 1.9 times faster. Those results are impressive. Of course, they are not a headline-grabbing 100 times faster, but in HPC even doubling performance is a good thing.
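As a quick sanity check on the four figures quoted above, the following Python sketch summarizes them with an arithmetic and a geometric mean. It covers only these four kernels, not the paper's full set of fourteen, so its result is not expected to reproduce the 2.5 times average; the helper names are my own, and which kind of average the paper reports is not assumed here.

```python
import math

# Speed-ups quoted above (GTX280 relative to the Core i7).
# Only four of the paper's fourteen kernels are listed in this article.
kernel_speedups = {"SGEMM": 3.9, "FFT": 3.0, "SAXPY": 5.3, "SpMV": 1.9}


def arithmetic_mean(values):
    """Plain average of the values."""
    values = list(values)
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, a common way to summarize ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


am = arithmetic_mean(kernel_speedups.values())
gm = geometric_mean(kernel_speedups.values())
print(f"arithmetic mean of the four kernels: {am:.2f}x")
print(f"geometric mean of the four kernels:  {gm:.2f}x")
```

Either way, the summary for this subset lands in the low single digits, far from 100.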

Second, I applaud this work and hope that it leads to more controlled and open benchmarking on all sides of the performance front. Ultimately, overselling a technology never helps, and in the end, when I don’t get my 100X speed-up, I might be inclined to think “this stuff does not work.” There is also a more subtle point to be made here. In both cases, the applications were optimized (parallelized) for their respective hardware. The same source code was not compiled for two different processors to see which one was faster. In the past, I suspect comparing a newly minted CUDA version of a program with the original sequential C version running on an older processor produced great speed-ups. It is understandable that after you have spent the time to create your CUDA masterpiece, you don’t really want to go back to square one and recode your original program for a new multi-core processor just to see if you could do better. I mean, you blew the doors off of something, and that has to be worth talking about. Of course, we all need to know what that door-less something really is.

Comments on "Blowing The Doors Off HPC Speed-up Numbers"


A critical thing to remember is that even if you speed up 50% of an application by 100x, the application is only faster by 2x.
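The arithmetic behind this comment is Amdahl's law: if a fraction p of the runtime is made s times faster, the overall speed-up is 1 / ((1 - p) + p / s). A minimal Python sketch, with a function name chosen here for illustration:

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speed-up when `accelerated_fraction` of the runtime
    is made `factor` times faster and the rest is unchanged."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / factor)


# The case from the comment: half the application, 100x faster.
overall = amdahl_speedup(0.5, 100.0)
print(f"{overall:.2f}x")  # prints 1.98x
```

Even an infinitely fast accelerator applied to half of the application could not exceed 2x overall, since the untouched half still takes its original time.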


There is a discussion on this paper on CUDA forums: http://forums.nvidia.com/index.php?showtopic=172350. I’d maybe add following bit: if you look at CUDA zone, you’ll see that most of 3-orders-of-magnitude reported speed-ups are from various academic papers, very rarely is such speed-up reported for a commercial application. Which kind of confirm my belief that most of papers written these days (and certainly not only in the HPC area, but all domains of science) is utter crap. However, NVIDIA accepted this kind of reports without any further checks, and kind of used it as promotional material, so they are certainly more than deserving this debunking from Intel guys. That said, I will also note that I would choose to implement an algorithm on GPU, rather than on CPU, any day: modern CPUs, and particularly these architected by Intel/AMD are real pain to program for, and even more for the optimization, and performance measurements – these things are just darn over-complicated. On the other side, GPUs have nice and understandable programming model (don’t have to mess with data in multiple of fours is relief in itself, and this is just beginning of the story), and I’d say it’s much easier to come up with an optimized version of some code for the GPU than for the CPU. And this is also a factor of great importance in the comparison, it’s not only about speed-ups and nothing else…


Blow the doors off?
Doug, I am shocked at your lack of knowledge of classic British films.
In The Italian Job the robbers practice blowing open an armoured car somewhere in a quarry in England. The vehicle ends up as a smoking wreck – cue Michael Caine saying “You’re only supposed to blow the bloody doors off”


We shall draw a quiet veil over the Hollywood remake.
