Multi-core processors may come in varying and odd-numbered sizes. Once again, software needs to adapt.
For those outside the computer business, the “power of two” thing can be quite mystifying. When we build clusters we might start at 16 nodes, but then expand to 32 nodes, or even 64. Building a cluster with 128 nodes may baffle the casual base-10 observer. Why not 100 nodes or 130 nodes? Why is everything divisible by two?
This power of two rule has been applied to cores as well. We started at 2, then went to 4. But now that rule has changed. AMD has been shipping 3-core Phenoms for desktops. And just this week Intel introduced a new 6-core Xeon processor. At least six is a multiple of two. So now what are we going to do? This move messes up the whole power of two thing we had going. Let’s see, if we have a cluster of 64 dual-socket nodes with 4-core processors, that is 512 total cores (64x4x2). A nice strong power of two number. Now if we have 3-core processors, then we get 384 cores (64x3x2). Oh the despair of it all.
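The arithmetic above is easy to check in a few lines. This is only a sketch: the function names `total_cores` and `is_power_of_two` are made up for illustration, and the 6-core case simply assumes the same 64 dual-socket nodes.

```python
def total_cores(nodes: int, sockets_per_node: int, cores_per_socket: int) -> int:
    """Total core count for a cluster of identical nodes."""
    return nodes * sockets_per_node * cores_per_socket


def is_power_of_two(n: int) -> bool:
    """True when n is 1, 2, 4, 8, ... (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


if __name__ == "__main__":
    # 64 dual-socket nodes, as in the example in the text.
    for cores_per_socket in (4, 3, 6):
        total = total_cores(64, 2, cores_per_socket)
        print(cores_per_socket, "cores per socket:", total,
              "total cores, power of two:", is_power_of_two(total))
```

Only the 4-core configuration lands on a power of two; the 3-core and 6-core totals do not.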
All joking aside, the interesting thing about the AMD approach is that the 3-core processors are probably salvaged 4-core processors. While some may balk at this idea, I think it is actually quite economical and may be an indication of things to come. As more cores end up on the die, the yields will never be 100%. Therefore, processors with anywhere from 2 to 8 cores may become available. Even odd numbers of cores. Egad. Of course, each will be available at a different price point, so you can kind of dial up your number of cores.
Wondering about odd numbers of cores got me thinking about another issue. As more and more cores end up on the die, can we power off cores we don’t need? Kind of like turning the lights out in an empty room. For example, maybe you have an 8-core node and only 6 cores are running a job. The extra two cores are not needed at the moment. From an energy standpoint it would be nice to be able to tell a waiting node, “I need 6 cores right now, let the other two sleep,” with the caveat that if I need them I can wake them up. This situation presumes that the application requires 6 cores on each node (maybe 6 cores on each of 12 nodes) and that it works best using 6 cores per node.
Whenever I write about multi-core, I always come back to software at some point. We have reached that point. It is conceivable that we may have weird core counts in the future. That is, one cluster might have 6-core nodes (two 3-core processors) and another cluster might have 8-core nodes (two 4-core processors). Our software will need to adapt to this situation.
Just like hardware engineers, software engineers have had the power of two pounded into their heads. With good reason, by the way. Allocating memory in power of two chunks made sense from an optimization standpoint. Some algorithms also work best with a power of two assumption. Indeed, parallel algorithms often break data up by using some power or multiple of two. When it comes to mapping work to cores, I’m not so sure we need to hold on tight to the power of two rule. Of course, this calls for fancier algorithms or even dynamic approaches to program execution. If you have read some of my past posts you know I’m a big fan of the dynamic approach to parallel software. As the market progresses, the “multi” in multi-core will mean one size may not fit all, and software will be required to take up the slack in this area.
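To make the point about mapping work to cores concrete, here is a minimal sketch of splitting a data set across any number of cores, power of two or not. It is the simplest static scheme, not the dynamic approach mentioned above, and the name `split_evenly` is hypothetical.

```python
from typing import List


def split_evenly(n_items: int, n_workers: int) -> List[range]:
    """Split n_items into n_workers contiguous index ranges.

    Range sizes differ by at most one, so the split works for any
    worker count, including odd ones such as 3 or 6.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    base, extra = divmod(n_items, n_workers)
    chunks = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers each take one leftover item.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    # 1000 items on a 6-core node (two 3-core processors).
    for worker, chunk in enumerate(split_evenly(1000, 6)):
        print("core", worker, "gets", len(chunk), "items starting at", chunk.start)
```

A dynamic scheme would instead hand out small chunks from a shared queue as cores become free, which copes better when chunks take unequal time.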
Where the core lives will also be important. Once Intel’s new QuickPath becomes standard on the Intel side of things, all cores are going to have three classes of neighbors:
- Cores that share the same cache and memory
- Cores that share memory over a fast interconnect (AMD HyperTransport or Intel QuickPath)
- Cores that send messages over a high-speed interconnect
A typical cache read will take 20-30 nanoseconds. A small message sent over an interconnect can take 10 microseconds. That is a difference of two to three orders of magnitude. Depending on the application, programmers may or may not need to worry about this disparity, but it is still there. If I am a core running happily along and the data I need is in cache, I can get right to work on it. If, on the other hand, the data I need is in the memory of another core on another node, then I have to wait several hundred times longer.
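The three classes of neighbors can be sketched as a small classification function. Everything here is an assumption made for illustration: the `CoreLocation` type, the `neighbor_class` name, and the simplification that all cores on one socket share a cache, which is not true of every processor.

```python
from typing import NamedTuple


class CoreLocation(NamedTuple):
    """Hypothetical address of a core: which node, socket, and core."""
    node: int
    socket: int
    core: int


def neighbor_class(a: CoreLocation, b: CoreLocation) -> str:
    """Classify how two cores reach each other's data."""
    if a.node != b.node:
        # Different nodes: data travels as messages over the cluster interconnect.
        return "messages over cluster interconnect"
    if a.socket != b.socket:
        # Same node, different socket: shared memory over HyperTransport or QuickPath.
        return "shared memory over processor interconnect"
    # Same socket: assumed here to share cache and local memory.
    return "shared cache and memory"


if __name__ == "__main__":
    me = CoreLocation(node=0, socket=0, core=0)
    for other in (CoreLocation(0, 0, 1), CoreLocation(0, 1, 0), CoreLocation(5, 0, 0)):
        print(other, "->", neighbor_class(me, other))
```

A scheduler or message-passing library could use a test like this to keep chatty processes on the closest class of neighbor.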
It used to be simple. One, maybe two, processors (single core), one memory bank, and one interconnect port. All that really mattered was getting data to and from each node. We now may have any number of cores on a number of nodes with a wide range of communication speeds. Not the best situation for clusters, but that is the commodity approach. Low cost, some contortion on the software end, but in general it works.