HPC's version of the tortoise and hare.
Sit down, take a deep breath and relax. Let your mind wander for a bit. I’m going to talk about something that runs counter to one of the basic tenets of computer performance. Something you may believe to be true for ever and always. Last time I talked about how concurrent parts of a program are those that can be computed independently. I also explained how the parallel parts of a program are those concurrent parts that should be executed at the same time in order to increase the speed of a program. As I discussed, the differentiator is overhead. It is now time to take the next step. By the end of this column, I hope to convince you that the following is sometimes true:
In a parallel computer, faster processors are not always better than slower processors.
This statement is kind of a trick because it makes no mention of how the processors are connected. And that is the issue. The interconnect contributes to the overhead and thus limits the scalability, or how many processors you can add to your program before it stops getting faster. Scalability is also determined by Amdahl’s Law, which states that the overall performance of your parallel program will be limited by the amount of serial content, that is, the parts that cannot be executed in parallel. These parts are eventually going to limit how fast your program can run. We will come back to that later, but for now let’s talk about scaling parallel parts.
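For readers who like to see numbers, here is a minimal sketch of Amdahl’s Law in its usual textbook form, speedup = 1 / (s + (1 - s) / N), where s is the serial fraction and N is the number of processors. The function name and the 10% serial fraction are my own choices for illustration, and the model deliberately ignores interconnect overhead.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Ideal speedup under Amdahl's Law (no communication overhead modeled)."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


# Illustrative assumption: 10% of the run time is serial.
for n in (1, 8, 32, 1024):
    print(f"{n:5d} processors -> speedup {amdahl_speedup(0.10, n):.2f}")
```

With a 10% serial part, the speedup can never exceed 10X, no matter how many processors you add.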
Let’s use a simple Gedanken-experiment (thought experiment) to see how slower processors may help in some situations. Assume we have a program that has a large concurrent loop with a certain amount of overhead. Let’s assume we have eight very fast processors and a medium speed network (i.e., the network is slower than the processors’ ability to eat and crunch on data). What will happen with our program? It doesn’t take too much Gedanken to figure out that the network is going to limit the ability of the shiny new processors to do work (i.e., they will be waiting on the data). At some point, our program will stop scaling and we will see less than an 8X speedup.
In a second experiment, let’s replace those eight processors with thirty-two processors that are 4 times slower, but are matched to the medium speed network (i.e., the processors do not wait on any data). In this case, our program scales perfectly and we will have reduced the time of a single slow processor by a factor of 32. Since these processors are 4 times slower, the time for our 32 processors is equal to what a perfect 8X scaling would give on our 8 fast processors. Recall that we were unable to get an 8X scaling with the fast processors. QED.
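The arithmetic of the two experiments can be sketched in a few lines. This is a toy model, not a measurement: the amount of work (3200 units), the processor speeds, and the 25 seconds the fast processors spend stalled on the network are all numbers I made up for illustration. The only things taken from the experiments are the 8 versus 32 processors, the 4X speed difference, and the assumption that the slow processors never wait.

```python
def parallel_time(work, n_procs, proc_speed, wait_time):
    """Wall-clock time: each processor's share of the work plus time stalled on the network."""
    return work / (n_procs * proc_speed) + wait_time


WORK = 3200.0      # hypothetical units of work in the concurrent loop
FAST_SPEED = 4.0   # work units per second for a fast processor (assumed)
SLOW_SPEED = 1.0   # 4 times slower than the fast processor
FAST_WAIT = 25.0   # seconds the fast processors sit waiting on data (assumed)

single_fast = parallel_time(WORK, 1, FAST_SPEED, 0.0)
eight_fast = parallel_time(WORK, 8, FAST_SPEED, FAST_WAIT)
thirty_two_slow = parallel_time(WORK, 32, SLOW_SPEED, 0.0)  # matched: no waiting

print(f"1 fast processor:   {single_fast:.1f} s")
print(f"8 fast processors:  {eight_fast:.1f} s "
      f"(speedup {single_fast / eight_fast:.1f}X)")
print(f"32 slow processors: {thirty_two_slow:.1f} s "
      f"(equal to {single_fast / thirty_two_slow:.1f}X on the fast ones)")
```

Under these assumptions the 32 slow processors finish in the time a perfect 8X scaling would have given, while the 8 fast processors fall short of it. Change `FAST_WAIT` to zero and the two machines tie, which is the balance point.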
If you understand this concept, you now understand why the interconnect is so important for many cluster applications. It is also why many people building clusters spend good money on a fast interconnect. Feeding processors is important. The latest and fastest processors need the latest and fastest interconnects. Unfortunately, the rate at which processors increase in speed often exceeds the rate at which the available interconnects improve. Multi-core has changed this a bit, but the issue is still the same. Messaging rate and throughput have become important because there are multiple cores sharing the interconnect.
The “slow is better” approach is also employed in two commercial supercomputers. The venerable IBM Blue Gene, which consistently beats up any contender for the top spot on the Top500 list, uses 700 MHz PowerPC 440 processors (plus a floating point unit) and a fast, balanced interconnect. Another balanced system approach is from SiCortex, which uses 5832 500 MHz MIPS processors and a very fast, balanced interconnect. The key word is balance. There is also a huge reduction in energy costs for these two systems. Something to keep in mind in the age of green.
Of course, applications vary. Some are not as sensitive to the interconnect speed and do see benefits from using faster processors. Multi-core will certainly change this behavior, which is why benchmarking is important. As often quoted on the Beowulf mailing list, “It all depends on the application. Benchmarking your application(s) is the best measure. YMMV (Your Mileage May Vary).” Words well worth remembering.
I will conclude this discussion with one more data point, which comes to you first hand. Back in the day, a co-worker of mine, Anatoly Dedkov, and I were intrigued by this idea as it applies to disk-based I/O, because getting data off of spinning disks always seems to keep processors waiting. We actually wrote and presented a paper at the Parallel and Distributed Processing Techniques and Applications conference in 1995 (PDPTA’95). We shamelessly proposed the Eadline-Dedkov Law, which states:
For two given parallel computers with the same cumulative CPU performance index, the one which has slower processors (and probably a correspondingly slower inter-processor communication network) will have better performance for I/O-dominant applications.
The whole idea struck us as counter-intuitive at first, which is why we investigated it further. Next time I’ll talk more about Amdahl’s Law. In the meantime, please feel free to enlighten your colleagues about the slow processors idea. It makes for great heated arguments and fist fights.