How fast your application will run on multi-core systems all depends. And it may not depend on you.
In the past, being fast was a bit simpler. Of course, I’m talking about computers, clusters, and HPC, not runners, race cars, or anything else that moves. If you wanted to know whether computer A was faster than computer B, you ran the same program on both and compared the results. In the same sense, if you wanted to know who was the fastest runner, you got a stopwatch and said “go.” The fastest time is still the fastest time, but how you got there seems to matter more these days. In the case of the runner, we are now required to test for performance-enhancing drugs. In the case of computers, we need to be more diligent as well. I’m not talking about overclocking, which in a sense is like performance-enhancing drugs for computers: pushing a system to its limits and risking damage. At least overclockers brag about their accomplishments.
Defining fastest in the multi-core age is where things get a little hazy. Back in the day, when you had a single processor, memory, and a hard drive, comparing apples to apples was a bit easier, although there was, and still is, a fair share of apples-to-oranges comparisons. Take, for example, the question “Which processor is fastest?” I equate such questions with “How tall is a building?” More information is needed. Indeed, multi-core has made this question almost worthless.
As I write this column, I am actually running some benchmarks on a new motherboard using a quad-core processor (Intel Q6600). I am curious whether the new chipset on the motherboard helps with performance (as the glossy data sheet seems to indicate). And therein lies the question: how do I give the new chipset the thumbs-up (or down)? How do I know it is better?
In terms of clusters, there are lots of things that can bottleneck performance. Of course, it all starts on the node, and that is where the processor, memory, and chipset live. When considering the processor, most people look at clock speed, which in the past was a “kind of good” measure of performance, but not really. Within a given processor family, a faster clock usually means faster performance. Between different processor families, all bets are off. Multi-core further complicates the issue. As I mentioned, I’m currently running some single-core benchmarks (i.e., there is one process running on just one core). The large amount of cache associated with the core (4MB) will certainly ensure some nice results. However, these results are for the most part useless, unless I plan to run exclusively a single process on my quad-core processor. It makes no sense to rely on this measurement as an indicator of overall performance, because in all likelihood there will be four processes running when I am actually using this system. Why do I run the single-process tests? I am part curious and part interested to see how, if at all, the performance changes when I run other processes at the same time. I actually wrote a simple script to test how well multi-core processors handle multiple copies of the same program running at the same time. I have since enhanced the script to work with four and eight cores.
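A minimal sketch of that kind of test, not the script mentioned above, might look like the following in Python. It starts one, two, and four copies of the same command at once and reports how much longer the batch takes than a single copy. The names `DEFAULT_CMD`, `time_concurrent_copies`, and `scaling_report` are hypothetical, and the small Python loop is only a stand-in for a real benchmark binary.

```python
import subprocess
import sys
import time

# Assumption: a small Python loop stands in for a real benchmark program.
DEFAULT_CMD = [sys.executable, "-c", "sum(i * i for i in range(2_000_000))"]


def time_concurrent_copies(cmd, copies):
    """Start `copies` instances of cmd together; return seconds until all finish."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(copies)]
    # Wait for every copy before checking exit codes, so none are left running.
    codes = [proc.wait() for proc in procs]
    elapsed = time.perf_counter() - start
    if any(code != 0 for code in codes):
        raise RuntimeError(f"a benchmark copy failed, exit codes: {codes}")
    return elapsed


def scaling_report(cmd, copy_counts=(1, 2, 4)):
    """Return (copies, seconds, slowdown) rows; slowdown is relative to the first count."""
    baseline = None
    rows = []
    for copies in copy_counts:
        elapsed = time_concurrent_copies(cmd, copies)
        if baseline is None:
            baseline = elapsed
        rows.append((copies, elapsed, elapsed / baseline))
    return rows


if __name__ == "__main__":
    for copies, elapsed, slowdown in scaling_report(DEFAULT_CMD):
        print(f"{copies} copies: {elapsed:.2f} s ({slowdown:.2f}x the first run)")
```

On an otherwise idle quad-core machine, a slowdown close to 1.0 for four copies would suggest the copies are not competing much for shared resources, while a larger value points to contention for cache or memory bandwidth. The sketch does not pin processes to cores, so the operating system scheduler decides where each copy runs.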
Here is today’s first take-away point: The performance of your application on a multi-core processor depends on what else is running on the other cores. If your application is using all the cores, then things are a little clearer but, as we will see, not entirely clear. In any case, this rule leads us to memory performance. The actual memory bandwidth seen by a core depends on many things: the needs of the other cores, the memory system design, the amount of cache, the type of memory, the phase of the moon, and some other things I probably forgot to mention.
Let’s suppose you have a program that runs great using a single core. If you are in a shared environment, then you have no control over what else is happening on the other cores. The best you can do is to have exclusive use of the whole socket (processor), but you usually don’t. The speed of your program will vary, and the best you can talk about is average performance. If you use all the cores, then you can make some statements about top speed. But wait, if your program can use four cores, then it is parallel, and if it is parallel, it may run more efficiently spread across two, three, or four nodes than it does on a single node. This situation may be the case if your application requires a large amount of memory bandwidth. I have seen this effect in real tests, by the way. But wait, there is more! If you are spreading your application across nodes, then you are not using all the cores on a given node, and now you are back to possibly sharing a processor with other applications. We are back to where we started: no clear way to say how fast my program is running (other than averaging many runs). The same can be said for a shared interconnect as well.
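Averaging many runs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `time_once` and `summarize_runs` are hypothetical names, and the command being timed is a placeholder for your own application. Reporting the spread alongside the mean shows how much the other tenants of the machine are moving your numbers around.

```python
import statistics
import subprocess
import sys
import time


def time_once(cmd):
    """Run cmd to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a failed run
    return time.perf_counter() - start


def summarize_runs(cmd, runs=5):
    """Time cmd `runs` times and return the mean, spread, and extremes."""
    times = [time_once(cmd) for _ in range(runs)]
    return {
        "runs": runs,
        "mean": statistics.mean(times),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(times) if runs > 1 else 0.0,
        "min": min(times),
        "max": max(times),
    }


if __name__ == "__main__":
    # Assumption: a short Python loop stands in for the real application.
    cmd = [sys.executable, "-c", "sum(i * i for i in range(1_000_000))"]
    result = summarize_runs(cmd, runs=5)
    print(f"mean {result['mean']:.3f} s, stdev {result['stdev']:.3f} s, "
          f"min {result['min']:.3f} s, max {result['max']:.3f} s")
```

A large standard deviation relative to the mean is a hint that something else on the node, or on the interconnect, is competing with the program being measured.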
Personally, I find this situation somewhat disconcerting. I mean, if I move an application from one cluster to another, will it run efficiently? Who knows? Of course, all the MPI pipes will connect, but how the application behaves on a new cluster is anyone’s guess. You could spend a lot of time finding the right combination of cores, memory, nodes, interconnect, and phases of the moon to get your best time. However, I doubt that anyone other than hardware vendors will go to such lengths. In this case, my second take-away point is: Good enough will have to do.
Now there is a happy thought. How about some T-shirts: High Performance Computing: Where “good enough” is the best we got. Well, my benchmarks have finished. They look pretty good, so far.