Every time I talk about multi-core, I seem to start out with something like "back in the day" or "when things were much simpler," or some such lament. Now prepare yourself for a stunning bit of insight. Cue music.
I have decided to make it easier on myself, and hopefully on you too, by referring to two epochs in the commodity cluster computing world. The first I will call SC, for single core, and the second MC, for multi-core. That stunning insight makes me a genuine master of the obvious. I am going to add a little bit more, though.
The SC epoch is when most clusters had one or two single core processors in the same computing node. For the most part, a dual processor node was treated like two nodes; that is, each processor would run its own Message Passing Interface (MPI) process (SMP process migration notwithstanding). The choice of which processors and which nodes to use was often left to the scheduler, but conceptually the programmer was thinking about a processor (a core in today’s parlance), local memory, and an interconnect. The user, through the scheduler, could decide how to lay out the program, either tightly packed (two processes per node) or spread out (one per node), depending on performance (one would hope, anyway).
In effect, dual socket (single processor/core) nodes were really an economical way to get more processors in the same space (admit it, you’re a cheapskate at heart). Little consideration was given to the shared nature of the node from the programmer’s standpoint. If you really looked at it, a single socket, single processor/core node would have been the preferred platform for MPI applications. In summary, in the now-waning SC epoch, dual processor/core nodes were kind of a kludge: they worked, they were economical, and the shared nature of the node could be swept under the proverbial rug. Besides, everyone else was doing it that way.
Moving on to the emerging MC epoch, the sharing thing is not going to be so easy to sweep away. A typical compute node now has four or eight cores, all sharing the same resources (memory, interconnect, disk drive) in some capacity. Intel and AMD, for example, take very different approaches to shared memory. There are trade-offs to each approach, and in the end the programmer may need to think about how these schemes influence performance. Sharing carries a higher cost that was largely ignored in the SC epoch. In the MC epoch, we will not have the luxury of ignorance.
For example, in the SC epoch (which, by the way, is still where many clusters live today) a common practice is to record micro-benchmarks for each node. Two popular benchmark utilities are Stream, which measures memory bandwidth, and Netpipe, which measures node-to-node communication performance. The goal of such measurements is to get a feel for the hardware, because nodes with fast memory and low latency are the ideal cluster building block.
Translating micro-performance into macro-performance (i.e. application performance) is always tricky, and basically a guess. One can rest assured, however, that poor memory and interconnect performance will not magically improve when running your application(s). Conversely, there is no guarantee that stellar micro-benchmarks will translate into stellar application performance. Ah, but we had the luxury of ignorance in the SC epoch, and micro-benchmarks often helped guide our decisions about cluster design and performance.
In the MC epoch we are not so fortunate. Suppose you want to measure the latency of an interconnect between two dual socket, dual core systems (i.e. four cores per node). Simple enough. You just pull out Netpipe and run it between the two nodes. In the MC epoch, however, we are forced to remember that this is the latency between a single process on one node and a single process on the other. When you are actually running a code on these two nodes, there are probably four processes on each system communicating over the interconnect. A single Netpipe run probably does not suffice to tell the whole story in this case.
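To make the idea concrete, here is a minimal sketch of the measurement pattern: run one ping-pong pair, then run several pairs at the same time, and compare the per-pair latency. This is not Netpipe and it does not touch a real interconnect. As a loudly stated assumption, it uses Python threads talking over local socket pairs as a stand-in, so the numbers it prints say nothing about your hardware. The function names (`concurrent_pingpong` and friends) are hypothetical. On a real cluster the same pattern would be run with one MPI process per core on each of the two nodes.

```python
import socket
import threading
import time


def _recv_exact(sock, size):
    """Read exactly `size` bytes from the socket."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def _echo(sock, n_iters, size):
    """The 'pong' side: send every message straight back."""
    for _ in range(n_iters):
        sock.sendall(_recv_exact(sock, size))


def _ping(sock, n_iters, size, results, idx, barrier):
    """The 'ping' side: time n_iters round trips for one pair."""
    payload = b"x" * size
    barrier.wait()  # start all pairs together so they overlap
    t0 = time.perf_counter()
    for _ in range(n_iters):
        sock.sendall(payload)
        _recv_exact(sock, size)
    elapsed = time.perf_counter() - t0
    # Half the round-trip time is the usual ping-pong latency estimate.
    results[idx] = elapsed / (2 * n_iters)


def concurrent_pingpong(n_pairs, n_iters=1000, size=8):
    """Run n_pairs ping-pong pairs at once; return per-pair latency in seconds."""
    results = [None] * n_pairs
    barrier = threading.Barrier(n_pairs)
    threads, socks = [], []
    for i in range(n_pairs):
        a, b = socket.socketpair()  # local stand-in for the interconnect
        socks += [a, b]
        threads.append(threading.Thread(target=_echo, args=(b, n_iters, size)))
        threads.append(
            threading.Thread(target=_ping, args=(a, n_iters, size, results, i, barrier))
        )
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for s in socks:
        s.close()
    return results


if __name__ == "__main__":
    for pairs in (1, 4):  # one pair, then one pair per core on a four core node
        worst = max(concurrent_pingpong(pairs))
        print(f"{pairs} pair(s): worst per-pair latency {worst * 1e6:.1f} us")
```

The single pair run corresponds to the lone Netpipe measurement. The four pair run is closer to what an application sees when every core on the node is talking at once, and the gap between the two is the part of the story a single run leaves out.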
The same can be said for benchmarking a single application on a single core. You may want to ask yourself if you’re going to run a single application at a time on your new eight core system. If so, then go ahead and believe the benchmark. If you are planning on running more than one process, then maybe you had better rethink your benchmark strategy.
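One way to rethink that strategy, sketched under some assumptions: time the benchmark alone, then time as many simultaneous copies as you intend to run per node, and look at the ratio. The helper names (`time_batch`, `slowdown`) and the toy `WORKLOAD` command below are hypothetical placeholders. You would substitute your own benchmark command.

```python
import subprocess
import sys
import time

# Hypothetical stand-in workload; replace with your real benchmark command.
WORKLOAD = [sys.executable, "-c", "sum(i * i for i in range(200_000))"]


def time_batch(cmd, n_copies):
    """Launch n_copies of cmd at once; return seconds until all have finished."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(n_copies)]
    for p in procs:
        if p.wait() != 0:
            raise RuntimeError("a benchmark copy exited with an error")
    return time.perf_counter() - start


def slowdown(cmd, n_copies):
    """Ratio of the time for n_copies run together to the time for one copy alone."""
    alone = time_batch(cmd, 1)
    together = time_batch(cmd, n_copies)
    return together / alone


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(f"{n} copies: {slowdown(WORKLOAD, n):.2f}x the single-copy time")
```

If the cores did not interfere with one another, the ratio would stay near 1 up to the number of cores in the node. How far it climbs above that is a rough measure of the sharing cost for that particular workload, and it will differ from one application to the next.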
Here is the requisite sports analogy.
Trusting single core/interconnect benchmarks on multi-core systems is like believing that a great quarterback is all you need for a successful football team.
(That would be US football in this case. Substitute an appropriate local sports analogy as you see fit.) You might have a quarterback with great numbers, but if you don’t have blockers, receivers, or running backs, the performance of your star player is going to suffer.
We no longer have the luxury of ignorance in the MC epoch. We need to realize that single core benchmarks tell us even less than they did before. Of course, the obvious conclusion is that we need better multi-core aware benchmarks. Better still, running real applications is the true way to make intelligent decisions about clusters. Sage advice that transcends the epochs.