Aside from a new processor for cluster vendors to sell, the Nehalem represents a more subtle change in the market.
Nehalem, erh-ahh, the Intel Xeon 5500 series has arrived. First let me say, from what I have seen and read, Intel has done a stellar job with the new micro-architecture (the tock step in their “tick-tock” strategy). In addition to improved memory bandwidth there are many nice performance, power saving, and virtualization features as well. All hail Intel, nice job. That said, I have some observations that may deflate the Nehalem balloon a bit. Well, someone has to do it.
Much of the 2-3x performance increase Intel has been touting is based on comparing Nehalem to Harpertown. Fair enough, but keep in mind that Harpertown was not all that stellar a performer for some applications in the first place. The poor performance was largely due to the use of a single memory controller for all eight cores on a typical motherboard. This difference was clearly illustrated in an article I wrote comparing the Intel Clovertown to the AMD Opteron. Harpertown was an improvement over Clovertown, but the same single memory controller was still present in all Harpertown designs. Although the hardware in the article is now old, the point is still valid. Until Nehalem, Intel was hurting in the memory bandwidth department. How bad, you ask? Well, let’s look at some data I collected on Harpertown.
Let’s start with my favorite benchmark, the NAS Benchmark Suite (NAS is a self-verifying set of aerodynamic benchmark kernels developed by NASA). At the time, I had two Harpertown nodes connected via InfiniBand. I wanted to ask a simple question: what is the difference between running an 8-way version of the NAS Class B benchmark on one 8-core node (8×1) and running the same test on two nodes using only 4 cores on each (4×2)? The results are in the table below.
[Table: NAS Class B results. Columns: 8×1 (One Node), 4×2 (Two Nodes). Values not preserved.]
Now that is interesting. A code runs better over IB than over shared memory. Conventional wisdom would dictate that running anything over shared memory rather than distributed memory should give a better result. In my tests I found the opposite to be true in some cases, by over a factor of two! This was a prime example of how a memory bottleneck can affect performance. Note that not all codes improved, which shows how application dependent memory bandwidth effects are.
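One rough way to see why the two-node run can win: if all eight cores on a node share one memory controller, spreading the same eight processes over two nodes doubles each process's share of memory bandwidth. The sketch below is back-of-the-envelope arithmetic, not a measurement. The 10 GB/s controller figure and the function name are hypothetical, chosen only for illustration, and the model ignores the cost of communicating over InfiniBand.

```python
def bandwidth_per_process(controller_gbs, procs_per_node):
    """Evenly split one node's memory-controller bandwidth among its processes."""
    return controller_gbs / procs_per_node


# Hypothetical figure: assume each node's single memory controller sustains 10 GB/s.
CONTROLLER_GBS = 10.0

one_node = bandwidth_per_process(CONTROLLER_GBS, 8)   # 8x1: eight processes on one node
two_nodes = bandwidth_per_process(CONTROLLER_GBS, 4)  # 4x2: four processes on each of two nodes

print(f"8x1: {one_node:.2f} GB/s per process")
print(f"4x2: {two_nodes:.2f} GB/s per process")
print(f"ratio: {two_nodes / one_node:.1f}x")
```

A kernel that is bound by memory bandwidth would see something close to that doubling, minus whatever it pays for communication. A kernel that lives mostly in cache would see little change, which fits the mixed results.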
Those who want to argue about running MPI on a shared memory node may wish to consult some of my experiences with OpenMP and MPI. Admittedly, these types of tests should be run again and with other compilers, but the results do not diminish my point: improving on the previous Intel memory architecture is not that hard because it was not all that great in the first place. There, I said it. Yes, it is water under the HPC bridge because Nehalem is here, but nonetheless, the previous Intel quads had some shortcomings for some applications.
The preceding was a long argument to reach the point I wish to make: the AMD and now Intel multi-core processors are a cluster architecture. AMD pioneered the idea and should be given credit for doing it right in the first place with the introduction of HyperTransport. Intel has just confirmed the approach with the QuickPath interconnect. If you think about it, modern multi-core processors have a core processor, memory controller, and memory connected to other similar cores over a high speed interconnect. Sounds like a cluster to me. The multi-core processors take it a step further, however, and have done what some cluster technologists have wanted for years: shared memory. In both designs all cores can see all the memory on the motherboard. They cannot, however, access all memory at the same speed. Local memory access is faster than access through the interconnect. This type of architecture is called Non-Uniform Memory Access (NUMA).
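To put a number on "non-uniform", here is a minimal sketch of how the average memory latency shifts with the share of accesses that stay local. The 60 ns and 100 ns latencies are hypothetical values picked for illustration, not measurements of any AMD or Intel system, and the function name is my own.

```python
def average_access_ns(local_fraction, local_ns, remote_ns):
    """Weighted average memory latency for a given share of local accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, for illustration only (not measured on real hardware).
LOCAL_NS = 60.0
REMOTE_NS = 100.0

for fraction in (1.0, 0.5, 0.0):
    avg = average_access_ns(fraction, LOCAL_NS, REMOTE_NS)
    print(f"{fraction:.0%} local -> {avg:.0f} ns average")
```

The same program can land anywhere between the two extremes depending on where its data ends up relative to the core that runs it.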
To the programmer, it all looks the same. To the end user, variation or even poor run times can result from a NUMA architecture if the data placement is non-optimal. In such a tightly coupled situation there is another issue as well: CPU cache. Recall that cache is used to compensate for the speed mismatch between CPU and memory. Keeping often used memory items in high speed cache memory helps to mitigate this problem, but there are cases where cache does not work, either due to poor program design or algorithm requirements. To address cache on multi-core architectures, designers have introduced cache coherency (cc) into the architectures. This architecture is, of course, called ccNUMA. The trick is to keep all the caches in sync with each other. For instance, if core A has modified a memory location that is held in its cache, the copy in actual memory is stale (the cached line is “dirty”) and must not be used by any other core. If core B needs to read that same memory location, then the caches must be coherent so that the correct value is used. Cache coherency is tricky and takes special hardware.
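The core A and core B scenario can be sketched with a toy write-invalidate model. This is a simplified illustration for a single memory location with three line states (modified, shared, invalid). It is not how Intel or AMD hardware implements coherency, and the class and method names are invented for the example.

```python
class ToyCoherentSystem:
    """Toy write-invalidate coherence for one memory location (MSI-like states)."""

    def __init__(self, n_cores, initial_value=0):
        self.memory = initial_value
        # Per-core cache line state: "M" (modified), "S" (shared), "I" (invalid).
        self.state = ["I"] * n_cores
        self.value = [None] * n_cores

    def _write_back(self):
        """Flush any modified copy to memory and downgrade it to shared."""
        for core, line_state in enumerate(self.state):
            if line_state == "M":
                self.memory = self.value[core]
                self.state[core] = "S"

    def read(self, core):
        """Return the current value, fetching it first if this core's copy is invalid."""
        if self.state[core] == "I":
            self._write_back()              # make sure memory holds the latest value
            self.value[core] = self.memory
            self.state[core] = "S"
        return self.value[core]

    def write(self, core, new_value):
        """Write through this core's cache, invalidating every other cached copy."""
        self._write_back()
        for other in range(len(self.state)):
            if other != core:
                self.state[other] = "I"
                self.value[other] = None
        self.value[core] = new_value
        self.state[core] = "M"              # memory is now stale until written back


system = ToyCoherentSystem(n_cores=2, initial_value=1)
system.read(0)                  # core A caches the value
system.write(0, 42)             # core A modifies its cached copy
stale = system.memory           # main memory has not been updated yet
seen_by_b = system.read(1)      # core B's read forces the write-back first
print(f"memory before B reads: {stale}, value B sees: {seen_by_b}")
```

Even in this toy, a read by core B triggers extra work on core A's cache line. On real hardware that traffic crosses the interconnect, which is one reason run times can vary.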
Like NUMA, ccNUMA is transparent to the programmer. To the end user, however, variation in run times can be the norm, as cache behavior depends on what is in the cache and what else is running on the cores. The OS will often try to keep a process on the same core to help reduce cache data movement, but the OS also tries to keep the cores balanced and thus some process movement may be necessary. With Linux there are methods to “pin” a process to a core, which override the ability of the OS to move a process at will.
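As a minimal sketch of pinning on Linux, Python's standard library exposes the scheduler affinity calls as os.sched_setaffinity and os.sched_getaffinity. The helper name pin_to_core is my own, and the example restores the original affinity mask at the end so it leaves the process as it found it. These calls are not available on every platform, hence the guard.

```python
import os


def pin_to_core(core_id, pid=0):
    """Restrict a process (pid 0 means the calling process) to one core."""
    os.sched_setaffinity(pid, {core_id})
    return os.sched_getaffinity(pid)


if hasattr(os, "sched_getaffinity"):
    # Cores the scheduler may currently use for this process.
    allowed = sorted(os.sched_getaffinity(0))
    target = allowed[0]

    now_allowed = pin_to_core(target)
    print(f"pinned to core {target}; affinity is now {sorted(now_allowed)}")

    # Put the original mask back so the scheduler is free to move us again.
    os.sched_setaffinity(0, allowed)
else:
    print("CPU affinity calls are not available on this platform")
```

From the shell, the taskset utility does the same job for an existing command. Note that this pins only the CPU; it does not by itself control which memory bank the data lands in.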
There were companies that sold ccNUMA machines before the advent of multi-core. Companies like SGI and Convex developed scalable high performance machines with ccNUMA capability. They met with some market success. What pushed them to a small corner of the market? That would be MPI on commodity hardware. That is right, the cluster. Many problems worked just as well in a distributed memory environment as in a ccNUMA environment. Some may argue that programming in a ccNUMA environment is “easier” than MPI programming; however, that did not seem to stop people from using clusters. Indeed, the low cost of commodity hardware compared with the high end ccNUMA designs made it worth considering the MPI approach.
Rest assured, MPI applications run just fine on ccNUMA architectures. In some cases they may not run optimally, but binary/source code portability is always a nice feature. The converse is not necessarily true. An OpenMP (threaded) application that runs well on a ccNUMA machine will not, in general, run on a cluster. Indeed, pinning a group of MPI processes to separate cores and their associated local memory may have performance advantages over running a threaded shared memory application. That is a topic I will be investigating further in the future.
The moral of the story? Introduced by AMD and now confirmed by Intel, ccNUMA is the way forward for multi-core. Ignore the shared memory features and you have what looks very much like a cluster on a chip. In my somewhat twisted logic, clusters win, again.