Scaling Bandwidth

SMPs may have cores, but clusters have bandwidth.

In my last two columns (Small HPC and HPC Hopscotch) I have been talking about multi-core, memory, and HPC programming. The recent release of the AMD six-core Opteron got me thinking about this topic. It will soon be possible to buy a 12-core workstation (or even a 48-core version!). I recall the days when a 32-processor cluster (16 nodes with dual single-core processors) was a nice addition to any lab or even computing center.

I also talked about memory locality and how multi-core has introduced a new hierarchy: near, near-by, and non-local memory. In the past, an MPI programmer really thought about local memory and non-local (distant) memory on another node. Distant memory was only changed by sending a distant process a message using MPI. Near-by or SMP memory with a bunch of cores attached represents a different (although not all that new) programming paradigm for HPC users. You can still run MPI programs on SMP systems, but a threaded shared memory programming model is also attractive because you don’t need any of that “MPI stuff.” It seems like the cluster may become obsolete for all but those really big jobs.

Not so fast. Two things need to be considered. First, I invite you to read what I had to say back in 2006 about the programming issue. As we Wind On Down The Road there are two paths you can go by: message passing and threads. You may have to make a choice at some point. Realize that this is not a trivial decision. It is going to matter.
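
To make that fork in the road concrete, here is a toy sketch (my own, not code from any of those columns) of the same array sum done both ways. The first version is the message passing path: each MPI rank owns a slice of the work and the only way results meet is through an explicit message. The second is the threaded path: OpenMP threads share one memory image, so no messages are needed. File names and the problem size are just placeholders.

```c
/* --- mpi_sum.c : the message-passing path (build with mpicc, run with
 *     mpirun -np 4 ./mpi_sum) --- */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0;
    for (int i = rank; i < 1000000; i += size)  /* each rank owns a slice */
        local += (double)i;

    double total = 0.0;                         /* results meet via a message */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %.0f\n", total);
    MPI_Finalize();
    return 0;
}

/* --- omp_sum.c : the threaded path (build with gcc -fopenmp) --- */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double total = 0.0;
    /* all threads see the same memory; the runtime combines the partial sums */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 1000000; i++)
        total += (double)i;

    printf("sum = %.0f\n", total);
    return 0;
}
```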

The second point I want to make is about memory bandwidth. Basically, it does not matter how many cores you have if you cannot keep them busy. There are limits to how fast memory can transmit data to and from cores. The placement of memory controllers on the processor allows parallel (simultaneous) memory access on an SMP motherboard. That is, instead of one memory controller for all the cores, there is now one for each processor socket, and each memory controller is in charge of its own bank of memory. In this way, if a core in one socket is using local memory, it feels no effect when a core in the other socket is using its local memory (and vice versa). In effect, memory access has been parallelized. Note, if memory from another controller is needed, then QuickPath or HyperTransport steps in.
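
For a sense of what “local” buys you in practice, here is a rough sketch of the standard first-touch approach under OpenMP (my illustration, not anything from a vendor manual): the operating system places a page on the memory controller of the core that first writes it, so initializing the arrays in parallel, with the same static schedule used later for the work, keeps each core on its own bank. I am assuming the threads are pinned to cores (for example via OMP_PROC_BIND=true or numactl); the snippet does not do that itself.

```c
/* first_touch.c -- sketch of NUMA-friendly "first touch" placement.
 * Build with: gcc -O2 -fopenmp first_touch.c -o first_touch           */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (32 * 1024 * 1024)    /* 256 MB per array */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));

    /* Parallel first touch: thread t faults in its own chunk of pages,
     * so those pages land behind the socket running thread t.          */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    /* The same static schedule keeps each thread working on the memory
     * it touched above -- local accesses, little QuickPath/HyperTransport
     * traffic between sockets.                                          */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[0] = %f\n", a[0]);
    free(a);
    free(b);
    return 0;
}
```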

This memory controller parallelization is why clusters are so powerful. If you consider that each node has at least one memory controller with an associated bandwidth, then N nodes have N times the memory bandwidth of a single node. Some numbers may help.

Consider the latest six-core Opteron (Istanbul). I have read that a 24-core system (four processor sockets) has demonstrated a Stream bandwidth of 41 GBytes/sec. (Note: this is a vast improvement over a 16-core Shanghai system, which gave 25 GBytes/sec. Google “HT Assist” for more information.) That means we have 1.7 GBytes/sec of bandwidth per core.

Now let’s turn back the clock a bit. I dug up some old Stream numbers for dual-core Opterons. A four-core Opteron system (two dual-core processors) was able to deliver 12 GBytes/sec, or 3 GBytes/sec per core. If you were to run a memory-bound MPI application on 24 cores, which would you rather use: a single 24-core SMP workstation (total memory bandwidth of 41 GBytes/sec) or six dual-socket, dual-core nodes (total combined memory bandwidth of 72 GBytes/sec)?
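
A quick back-of-the-envelope check, using only the Stream numbers quoted above, makes the comparison plain (the little program below is just the arithmetic, nothing more):

```c
/* bw_compare.c -- per-core and aggregate bandwidth for the two
 * 24-core options discussed above (numbers from the quoted Stream runs). */
#include <stdio.h>

int main(void)
{
    double smp_bw  = 41.0;   /* GBytes/sec, one 24-core Istanbul box        */
    double node_bw = 12.0;   /* GBytes/sec, one dual-socket dual-core node  */
    int    nodes   = 6;      /* six such nodes give 24 cores total          */

    printf("SMP:     %.1f GB/s total, %.2f GB/s per core\n",
           smp_bw, smp_bw / 24.0);
    printf("Cluster: %.1f GB/s total, %.2f GB/s per core\n",
           nodes * node_bw, nodes * node_bw / 24.0);
    return 0;
}
```

Same core count, but the cluster delivers almost twice the aggregate bandwidth.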

In general, the more SMP cores you add to the mix, the lower the memory bandwidth per core. The more cluster nodes you add, the more the memory bandwidth scales. Note, I did not report processor or memory speeds, or address the cost/power of six servers vs. one workstation, because I am making a “ballpark” argument to illustrate a point about clusters. While multi-core keeps packing more and more cores into a processor, memory bandwidth becomes the limiting factor. You may be able to pack 24 or even 48 cores into an SMP workstation, but the total memory bandwidth may never be as good as that of three or six 8-core cluster nodes.
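
If you want to see this ceiling on your own hardware, a crude Stream-triad-style probe like the sketch below will do (it is not the official STREAM benchmark, just a rough stand-in). Build it with gcc -O2 -fopenmp, run it with OMP_NUM_THREADS set to 1, 2, 4, and so on, and watch the total GBytes/sec flatten out while the per-core number drops.

```c
/* triad_probe.c -- rough memory bandwidth probe, Stream-triad style. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (32 * 1024 * 1024)    /* big enough to blow past the caches */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    #pragma omp parallel for schedule(static)   /* parallel first touch */
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];               /* triad touches 3 arrays */
    double t1 = omp_get_wtime();

    int threads = omp_get_max_threads();
    double gbytes = 3.0 * N * sizeof(double) / 1e9;  /* bytes moved, roughly */
    printf("%d threads: %.2f GB/s total, %.2f GB/s per core\n",
           threads, gbytes / (t1 - t0), gbytes / (t1 - t0) / threads);

    free(a); free(b); free(c);
    return 0;
}
```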

Finally, if you follow my arguments, the best cluster node might be a single core with lots of memory and a fast interconnect. Given that single-core processors are hard to find, the next best thing might be a single-socket dual-core server node, while they are still available. Now, you could use a bunch of single-socket desktop motherboards and build a cluster, but who would ever do anything like that.

I now have 71 people following me on Twitter. I’m impressed because I don’t say that much. A man of few, but choice, words I suppose. I even figured out how to tweet from my new phone, and still I don’t say that much. I just tweeted about writing (in this column) about not tweeting. Careful, a self-referential infinite loop is developing.
