The eventual move from multi-core to many-core is on the horizon and it looks to be a real doozy.
Recently, there has been some discussion about 1000-core processors, or as I like to call them, many-core processors. Indeed, a research group has proclaimed they have created a 1000-core processor using an FPGA (Field Programmable Gate Array). Such discussions and research stunts are good because they generate ideas, spark debate, and set milestones, or in this case Big Hairy Audacious Goals (BHAGs). The reality of a 1000-core processor, however, is going to require a huge, fundamental (and probably painful) change in the way computing is done.
Surprisingly, the issue is not really how many cores you can pack on a die. The issue is memory bandwidth. I have been around long enough to know that anyone who promotes massively parallel processors and does not include a discussion of memory bandwidth probably, how should I say this politely, believes in Santa Claus.
The issue with scalability is memory bandwidth. As processor clocks got faster, affordable memory did not keep pace. Memory comes in two general types, static and dynamic. Static memory (SRAM) is faster and does not need to be refreshed, but it is expensive. Dynamic memory (DRAM) needs to be periodically refreshed, is slower, and costs less than static memory. In general, the speed difference between a processor and DRAM is a factor of 50 (more or less). That means if a core needs something directly from memory, it must wait 50 cycles until the data is available. Ouch.
The solution that improves performance while using lower-cost DRAM is to place a small amount of faster SRAM between the core and the DRAM. Data is copied from the slow DRAM to the fast SRAM when it is needed. As you probably know, this SRAM is the cache memory found on all processors. Cache memory holds portions of DRAM that are often reused by subsequent instructions; thus, the local SRAM can keep up with the processing core. Adroit use of cache memory can make for huge improvements in program performance, and HPC programmers are known to spend time making sure applications are tuned for cache sizes. Cache generally speeds up all programs due to the spatial and temporal locality of how data is used, but it is not commonly known that some programs can be written in such a way that the cache has no benefit. The technical name for these programs is “really really slow.”
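To make locality concrete, here is a minimal C sketch (the matrix size and the magnitude of the effect are illustrative, and will vary with the actual cache): both functions compute the same sum, but the first walks memory in the order it is laid out, while the second strides across cache lines.

```c
#include <stddef.h>

#define N 1024

/* Row-major traversal: consecutive addresses, so every cache line
 * fetched from DRAM is fully used before it is evicted. */
long sum_row_major(int a[N][N]) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same data: each access jumps
 * N * sizeof(int) bytes, touching a new cache line nearly every
 * time. Same answer, but on large matrices it can run several
 * times slower -- the cache provides little benefit. */
long sum_col_major(int a[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```

Swapping two loop lines changes nothing about correctness, only how often the core has to pay that 50-cycle trip to DRAM.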
Cache technology works quite well with one core. With two or more cores, there can be issues. In a multi-core processor, each core has its own exclusive cache. Since data can be in two places at the same time, in the SRAM cache and in DRAM, there is the possibility that the two values could be different. The core may have modified the value in its cache, which means the copy in DRAM is stale and the cache line is “dirty.” Suppose another core that shares the memory needs that value: how does it know whether the DRAM copy is stale, and how does it get the correct value? That is what cache coherency is all about. Cache coherency is extra circuitry that allows multiple cores to notify each other as to the state of dirty memory. It adds overhead and, here it comes, is not scalable to large numbers of cores.
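A small sketch of why that overhead hurts, using POSIX threads: the two counters below are updated by different threads and the program is perfectly correct, but because the counters live in the same cache line, every write dirties the line and the coherency hardware must invalidate the other core's copy. (The 64-byte line size and the padding trick are common assumptions, not universal.)

```c
#include <stddef.h>
#include <pthread.h>

enum { ITERS = 1000000 };

/* Two counters packed into one cache line. Each thread writes only
 * its own counter, yet every write forces the coherency circuitry to
 * invalidate the line in the other core's cache -- the line
 * "ping-pongs" between caches for no logical reason. */
struct { long a, b; } pair;

/* The usual fix: pad the counters onto separate lines (64 bytes is a
 * typical, but not universal, line size) so each core keeps its own
 * line clean in its own cache. Shown for illustration only. */
struct { long a; char pad[64 - sizeof(long)]; long b; } padded_pair;

static void *bump_a(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        pair.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        pair.b++;
    return NULL;
}

void run_counters(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```

This “false sharing” is coherency overhead in miniature; now imagine the notification traffic among hundreds of cores.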
Of course, as an HPC maven you know how to scale: use messages. Intel’s 48-core research chip (the Single-chip Cloud Computer) reflects this idea. Cache coherency had to go. Instead of cache coherency, Intel implemented a 16 Kbyte Message Passing Buffer (MPB). Each core can transfer data directly from its MPB to another core’s MPB. Once the data is sent, however, it is removed from the sender, so that there is only one owner of any data at any time. Data never leaves the processor and never travels through main memory. One way to describe this processor is a “cluster on a chip.”
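The single-owner rule can be sketched in a few lines of C. This is a toy model, not Intel's actual API: `mpb_send` and `struct mpb` are hypothetical names, and the buffer here is 16 bytes rather than 16 Kbytes.

```c
#include <string.h>

#define MPB_SIZE 16   /* toy size; the real buffer is 16 Kbytes */

/* A per-core message buffer. 'full' marks whether this "core"
 * currently owns a message. */
struct mpb {
    char data[MPB_SIZE];
    int  full;
};

/* Move a message from one core's buffer to another's. The data is
 * erased from the sender, so exactly one buffer owns any value at
 * any time -- there is no stale second copy to keep coherent. */
int mpb_send(struct mpb *src, struct mpb *dst) {
    if (!src->full || dst->full)
        return -1;                      /* nothing to send, or receiver busy */
    memcpy(dst->data, src->data, MPB_SIZE);
    memset(src->data, 0, MPB_SIZE);     /* sender gives up ownership */
    src->full = 0;
    dst->full = 1;
    return 0;
}
```

Because ownership moves with the data, no coherency circuitry is needed: there is never a dirty copy for another core to ask about.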
The current batch of multi-core systems is starting to look like clusters as well. Both HyperTransport and Intel QuickPath are high-speed interconnects between processors that provide a cache-coherent “SMP experience” across multiple cores and memory controllers. In the future, as we move toward many-cores, this approach may not work as well.
Now that I have convinced you there are going to be messages in everyone’s future, here is the challenge: very few programs use messages. The message passing HPC codes are dust in the desert of computer programs. In my opinion, we have not figured out this parallel programming thing just yet. The HPC community “lives with it,” but we are a unique bunch. How is the rest of the world going to deal with a problem that the rocket scientists still have not figured out? I have no idea. Finally, before you run out and sell your new Magny-Cours system, remember that these ideas are going to take a long time to hit the streets. The “Hairy Audacious” part of the goal is significant. You may want to keep those MPI skills sharp, however.
Douglas Eadline is the Senior HPC Editor for Linux Magazine.