Data locality is the key to efficient code.
In my last column, I discussed a possible future for HPC programming. Briefly, I assumed that as cores per socket continue to grow, HPC programming models would diverge. I assume a small thread-based model would be used on multi-core systems (single memory domain) and a large model based on the current message passing (MPI) methodology used in clusters. It seems that I was not alone in thinking about this issue.
Joe Landman over at Scalability.org wrote about The coming bi(tri?)furcation in HPC, part 1 and John Leidel at InsideHPC.com even enshrined my name in his The Eadline Split: The Growing Chasm of Multicore Programming. I believe it is a discussion worth continuing. In the past, I have talked about the goodness that is MPI and at the same time I find it a bit low level for most programmers. Some may even call it HPC assembly language programming. My quest is to find a programming language that keeps me close to my problem and further away from the underlying hardware. Now the multi-core is in the mix, this goal seems even more distant.
The basic issue is memory. Or perhaps where your data is in memory — data locality. If I am core in cluster, my data can live in my local memory (memory directly accessible to the controller on my processor), near-by memory (memory that is accessed across HyperTransport or QuickPath), and non-local memory (memory on another node). For the sake of completeness, I’ll include a fourth type of memory called attached memory that covers things like GP-GPU co-processors. Keep in mind that the access times for these memory locations can vary by a factor of 100! The fastest is local and the slowest is non-local. This range of access times is one of the difficult issues for automated compilers and is often left to programmer dudes like myself. Trying to create a programming model that accounts for this range of access times is a bit difficult. Making sure the data is in the right place at the right time is not an easy problem.
The difficulty comes from “grain size”. The faster the memory access time the smaller you can dice up your program and still see a speed-up. Thus, finer grained parallelism works better on a single shared memory multi-core system. Large “chunkier” parallelism works best across nodes. These rules are generally true although there are some exceptions (and it always seems to be your code).
The situation is actually worse than I make it out to be. Some very good programmers can work within the memory rainbow. MPI programs can be crafted to used threads (local and near-by memory) or attached processors (attached memory) on inner loops and message passing calls (non local memory) in outer loops. It usually takes some work, but results are not impossible. Where things go down hill is portability. There are often some mighty big assumptions in these types of code. If the code is moved to different parallel environment (maybe a different interconnect or node type), the past efficiencies may be lost or even produce retrograde performance. Ouch! This situation is akin to the old days of assembly programming and the reason why Fortran became very popular — it was portable between machines because the compiler managed all the differences. And, a good Fortran compiler does not just make things portable across architectures, it optimizes as well.
There is no compiler that can optimize a code for an entire cluster of multi-core processors. Indeed, optimizing a program for minimal data movement is a very difficult problem. The issue is important in HPC. Open MPI has run-time options that allow both processor (keep a process on a core) and data affinity (keep data in a particular memory area) to be used on multi-core systems, but this choice is up to the user.
One of the things I did not get to mention previously is that multi-core processors are not without issues. That is, you cannot just keep dumping cores in to a processor and making parallel things go faster. A quick lesson in ccNUMA (cache coherent Non-uniform Memory Access) designs may help. Today’s x86 multi-cores are like mini clusters in a sense. Each processor socket has it’s own local memory and cache. It also has a way to use memory from other sockets over a fast interconnect (HyperTransport or QuickPath). Cache is the problem because it is very private memory for each core.
If a core changes a value in cache, the memory location associated with the cache becomes “dirty” and not usable by other cores. Thus the other caches must snoop around to find the correct value. On chip shared cache can do this rather quickly, but off chip (i.e.another socket) snooping must take place over the processor interconnect. More processors/sockets means more snooping is required. Add the fact that using near-by memory may be twice as slow as local memory and you can start to see why data locality is important.
Couple these issues with your typical MPI level programming and you see why I tend to get overwhelmed thinking about HPC. I like to think of it as a game of high tech hopscotch — landing in the right place at the right time. For those who really want to dig into the topic of memory, may I suggest Ulrich Drepper’s What every programmer should know about memory (pdf). It is long and detailed. I’m told that is were the performance devil lives.
I mentioned in the past, I’m tweeting on twitter. Plus I even got a new phone with a tiny keyboard so now I can actually post messages when I am not in front of my desktop computer. Which is, uhm, when I’m sleeping or watching the Real House Wives of NJ.