Intel recently released an experimental processor that resembles a cluster on a chip.
Intel has a new 48-core experimental processor. I want to discuss the technology, but first I need to rant a bit. The PR-contrived headline calls it a “Single-chip Cloud Computer.” Arrgh, they are even using an acronym, calling it the SCC chip. Where to begin? First, when this project was started, I doubt the hardware engineers at Intel said, “Hey, let’s build a processor for the cloud. You know, that nebulous concept that is years away.” Second, I am sure they have good technical reasons for designing this chip, but sorry, Intel PR geniuses, I doubt it was for “The Cloud.” Yes, “The Cloud,” that vague but ever so trendy name for timeshare/grid/Internet computing that gets tacked on to every technology news story I read. Let me try to help you out here. Computers can run almost anywhere: in an office, a house, a car, even on an airplane flying through a cloud. But you don’t “run a cloud on a processor” unless you are simulating one, which, by the way, some HPC people are apt to do. Please stop using the word “Cloud” to grab headlines. Intel makes cool stuff; let that be the story. There, now I feel better.
Let’s move on to the real issue: parallel computing. In case you have not noticed, processors have more cores than they used to. In some cases, eight or sixteen times more cores. If this did not surprise you, then maybe this will. Designers cannot continue adding cores the way they have in the past, which is in an SMP, cache-coherent kind of way.
In today’s multi-core CPUs there is special hardware that lets each core know when data in its cache has been invalidated by another core. For instance, if two cores are working with the same value of X and one of the cores changes X, then the other core must be told that the value of X in its cache is no good, and the copy of X in main memory is now stale as well (the modified cache line is often called “dirty”). These hardware mechanisms work transparently, and the programmer does not need to be concerned about cache coherency (although a poorly written program can be made to run slowly due to coherency issues). Thus, we have things like Symmetric Multiprocessing (SMP), which allows multiple programs, possibly threaded, to run at the same time on multi-core systems.
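The X example above can be sketched as a toy write-invalidate model. This is only an illustration: the class and method names are made up for this sketch, and it uses write-through to main memory to stay short, whereas real hardware protocols (MESI and its relatives) track dirty lines and are considerably more involved.

```python
class ToySystem:
    """Toy write-invalidate model: one shared memory, one private cache per core."""

    def __init__(self, num_cores):
        self.memory = {}                                   # main memory: name -> value
        self.caches = [dict() for _ in range(num_cores)]   # private cache per core
        self.invalidations = 0                             # invalidation messages sent

    def read(self, core, name):
        cache = self.caches[core]
        if name not in cache:            # cache miss: fetch from main memory
            cache[name] = self.memory[name]
        return cache[name]

    def write(self, core, name, value):
        # Every other core holding a copy must be told its copy is no good.
        for other, cache in enumerate(self.caches):
            if other != core and name in cache:
                del cache[name]
                self.invalidations += 1
        self.caches[core][name] = value
        self.memory[name] = value        # write-through (a simplification)


system = ToySystem(num_cores=2)
system.memory["X"] = 1
system.read(0, "X")           # core 0 caches X
system.read(1, "X")           # core 1 caches X
system.write(0, "X", 2)       # core 0 changes X; core 1's copy is invalidated
print(system.read(1, "X"))    # core 1 misses and re-fetches: prints 2
print(system.invalidations)   # prints 1
```

Note that the invalidation loop in `write` touches every other cache. That per-write bookkeeping is the part that gets harder as the number of cores grows.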
Cache coherency limits scaling. The more cores, the more caches, and the more difficult it becomes to keep track of everything. This is one reason why MPI programs on clusters scale so well. The memory of each node is private and there are no coherency issues between nodes. In essence, memory is copied from node to node through the MPI message-passing protocol. We’ll come back to this in a minute, but first let’s take a closer look at the SCC, sorry, the Intel Parallel Processor.
The chip is composed of 24 two-core tiles. Each tile has two IA-32 (that is, 32-bit) cores, cache, and a router for inter-tile communication. The router provides 64 GB/s of interconnect bandwidth to the other tiles, which are configured as a mesh as shown in the figure below. There are four memory controllers and an I/O controller connected to the router network. You can find more details and figures in this short technical paper (pdf).
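To get a feel for what a mesh of tiles means for communication distance, here is a small sketch that counts router hops between cores. It rests on assumptions the text above does not state: a 6-by-4 arrangement of the 24 tiles, a made-up core numbering (cores 0 and 1 on tile 0, and so on), and dimension-ordered routing, where a message travels along one axis and then the other.

```python
MESH_COLS, MESH_ROWS = 6, 4      # assumed layout: 6 * 4 = 24 tiles
CORES_PER_TILE = 2               # two cores per tile, as described above


def tile_of(core):
    """Tile index for a core, under the hypothetical numbering of this sketch."""
    return core // CORES_PER_TILE


def tile_xy(tile):
    """(column, row) position of a tile in the assumed mesh."""
    return tile % MESH_COLS, tile // MESH_COLS


def hops(core_a, core_b):
    """Router-to-router hops between two cores (Manhattan distance on the mesh)."""
    ax, ay = tile_xy(tile_of(core_a))
    bx, by = tile_xy(tile_of(core_b))
    return abs(ax - bx) + abs(ay - by)


print(hops(0, 1))    # same tile: prints 0
print(hops(0, 47))   # opposite corners of the mesh: prints 8
```

Under these assumptions the worst case is eight hops, corner to corner, and two cores on the same tile never leave their own router.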
There are two interesting design features that are unique to this processor. The first is the cache coherency protocol and the second is the power management. I will not say much about the power management other than that it is important and seems to work quite well. My concern is the cache coherency protocol, or lack thereof. The designers realized that it was not feasible to include circuitry for 48-way cache coherency, so they did what any hardware designer does: push it off to the software people. Thus, each core runs independently and has its own private cache; there is no coherency hardware.
While this is not always the best move because it may break existing software, it does seem like an idea worth trying. The mechanism works as follows. In addition to the standard cache, each tile has a 16 Kbyte Message Passing Buffer (MPB). Each core can transfer data directly from its MPB to another core’s MPB. Indeed, once data is sent, it is removed from the sender so that there is only one owner of any data at any time. Data never leave the processor and do not travel through main memory.
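The single-owner behavior described above can be modeled in a few lines. This is a conceptual sketch only: the class, its methods, and the tag-based addressing are invented for illustration and say nothing about how the real hardware or its software interface is laid out. The only figure taken from the text is the 16 Kbyte buffer size.

```python
MPB_BYTES = 16 * 1024   # 16 Kbyte buffer per tile, as described above


class MessagePassingBuffer:
    """Toy model of a message passing buffer with single ownership of data."""

    def __init__(self, capacity=MPB_BYTES):
        self.capacity = capacity
        self.slots = {}              # hypothetical tag -> bytes mapping

    def used(self):
        return sum(len(data) for data in self.slots.values())

    def put(self, tag, data):
        # Replacing an existing tag frees its old bytes first.
        needed = self.used() - len(self.slots.get(tag, b"")) + len(data)
        if needed > self.capacity:
            raise MemoryError("message passing buffer is full")
        self.slots[tag] = data

    def send(self, tag, dest):
        data = self.slots[tag]
        dest.put(tag, data)          # deliver first, so a full receiver loses nothing
        del self.slots[tag]          # sender gives up its copy: one owner at a time


tile_a = MessagePassingBuffer()
tile_b = MessagePassingBuffer()
tile_a.put("x", b"hello")
tile_a.send("x", tile_b)
print("x" in tile_a.slots)   # prints False: the sender no longer holds the data
print(tile_b.slots["x"])     # prints b'hello'
```

Because a value lives in exactly one buffer at a time, there is never a second copy to invalidate, which is why this scheme needs no coherency hardware.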
The explicit use of the MPB by software is unusual, but it should have MPI programmers jumping for joy. The MPB model is basically designed for message passing. Recall that message passing assumes all processors have independent cache. Porting MPI programs to these processors will be almost trivial. Of course, memory access is always an issue, but we can push that back on the hardware guys. It is also possible to implement things like OpenMP over the MPB, but this will require a little more software finesse under the hood because OpenMP is a threaded model that relies on cache-coherent SMP architectures. End users may not need to change anything, however. There is even a modified version of Linux available. By the way, a separate Linux kernel runs on each core. A single kernel cannot run across the whole processor because the processor does not support cache coherency.
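For readers who have not written message-passing code, the style looks roughly like the sketch below. It is not MPI and not SCC code: Python threads and queues stand in for ranks and their inboxes, and every name here is made up for the illustration. Each rank works only on its own data and shares results solely by sending a message, which is the discipline that makes such programs indifferent to the lack of cache coherency.

```python
import queue
import threading

NUM_RANKS = 4
# One inbox per rank; these queues stand in for the message-passing layer.
inboxes = [queue.Queue() for _ in range(NUM_RANKS)]


def rank_main(rank, results):
    # Each rank owns a private slice of the data and never reads another's.
    local = list(range(rank * 10, rank * 10 + 10))
    partial = sum(local)
    if rank != 0:
        inboxes[0].put(partial)              # "send" the partial sum to rank 0
    else:
        total = partial
        for _ in range(NUM_RANKS - 1):
            total += inboxes[0].get()        # "receive" one message per other rank
        results["total"] = total


results = {}
threads = [threading.Thread(target=rank_main, args=(r, results))
           for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results["total"])   # sum of 0..39: prints 780
```

In a real MPI program the `put` and `get` calls would be `MPI_Send` and `MPI_Recv`, or the whole exchange would collapse into a single `MPI_Reduce`.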
Here is the really cool thing about this idea. If you want to write software for this new gizmo, use MPI. Oh, that is right, you already do. It is comforting to know that your investment in cluster software will keep paying dividends. Cluster computing is scalable parallel computing, which is the only way we can keep pushing the performance curve both at the server level and now at the processor level. You will have to excuse me if I don’t call it the “Single-chip Cloud Computer,” but rather the “Single-chip Cluster Computer.” The acronym is the same, and somehow I have to think it is closer to what the designers had in mind.