I was there watching Happy Days when The Fonz jumped the shark. I remember thinking, this show is getting weird. Not that I was a big fan; I just happened to be watching it that night and decided it was time to call it quits. Thus was born the term “jumping the shark,” which does not necessarily mean the end. It just means that the show will never be as good as it was. At least that is my interpretation.
Let’s move on to my overly provocative headline. At times I have thought about the direction of HPC clusters, and I must confess things seem a little weird. I tend toward the fundamental side when it comes to parallel computing: a single core (CPU), lots of memory, and a fast interconnect are what I prefer. In my opinion, this design makes programming easier. I can assume that the communication behavior between CPUs is consistent, as is local memory access. Today’s designs have multiple cores, multiple memory banks, and a fast network. Locality has become important, and in programming terms locality is a bitch.
Given the commodity nature of cluster HPC, I am perfectly willing to support using what the market offers to get the FLOPS we need. In recent years, GP-GPU computing has also changed the way we think about HPC. I consider GP-GPUs a kind of SIMD array processor attached to a multi-core node. When the problem fits, GP-GPU is great, except that there is the software issue. What was hard (parallel programming) just seems to be getting harder (parallel programming with multiple nodes, cores, and GP-GPUs).
In my last column I wrote about using low power Atom processors. The comparison calls for more data, but I believe there is something to be said for the low power/many processor approach. It seems I am not alone. Not long after I posted the column, SeaMicro introduced 512 Intel Atom CPUs (1.6 GHz) running in a 10U rack-mount unit. (There are pictures of the motherboard at SemiAccurate.) The target market is the “server” industry, but any HPC geek can see that this design is very similar to the idealized design I mention above. Of course, the Atom processors are not as powerful as their big brother Xeons, but as I pointed out last time, the Nehalem Xeon runs 1.8 times faster, generates 7.3 times as much heat, and costs 22 times as much as the D510 Atom. The Xeon performance is 7.7 times faster, but when you factor in price-to-performance, the Atom is 3 times better than the Xeon solution (dual core Atom vs. quad core Xeon). There is something to think about here.
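The price-to-performance claim above is simple arithmetic, and a rough sketch makes it easy to check. The snippet uses only the two ratios quoted in this column (7.7 times the performance, 22 times the cost); the function and variable names are made up for illustration, and the real comparison still calls for more data.

```python
# Ratios quoted in the column, Xeon relative to the D510 Atom.
# These are the column's figures, not fresh measurements.
XEON_SPEEDUP = 7.7       # Xeon performance / Atom performance
XEON_COST_RATIO = 22.0   # Xeon cost / Atom cost


def atom_price_performance_advantage(xeon_speedup, xeon_cost_ratio):
    """Return how many times more performance per dollar the Atom delivers.

    Performance per dollar for the Xeon, relative to the Atom, is
    xeon_speedup / xeon_cost_ratio; the Atom's advantage is the inverse.
    """
    return xeon_cost_ratio / xeon_speedup


advantage = atom_price_performance_advantage(XEON_SPEEDUP, XEON_COST_RATIO)
print(f"Atom price-to-performance advantage: about {advantage:.1f}x")
```

With these inputs the ratio comes out a little under 3, which is where the "3 times better" figure comes from.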
Some of you may want to close your eyes for what follows. I am going to take a standard 8 core x86 server with a GP-GPU board and do something nasty. First, I am going to take my very sharp saw and cut up the processors into individual cores. Next, I’m going to cut the GP-GPU into eight separate pieces. The parallel stream architecture makes this a bit easier than cutting up the multi-core CPUs. Then I’m going to take the main memory and the GP-GPU memory, combine them, and divide by eight. At the end of my octasection I have eight piles that each include a single core, some stream processors, and some memory. Because I have some skill with a soldering iron, I am going to put all these back together and make eight little nodes on Mini ITX motherboards. Each node is a complete system, and if I use Ethernet the new nodes resemble my ideal parallel computer with the addition of a small array processor on each node. Programming is still not trivial, but it is much more manageable.
In my reassembly, I may have slowed some things down, but by golly this thing is much easier to program. Fortunately for me, some other people had a similar idea, but for different reasons. Take a look at the AMD Fusion pages. They are sure going to save me a lot of cutting and soldering. The AMD Fusion processors will combine a GP-GPU (an array processor or SIMD engine) and a multi-core CPU. There will be one batch of memory, and communication between the GP-GPU, CPU, and memory will be over a high speed bus. AMD is calling this new design an Accelerated Processing Unit (APU). If you want to learn more, there is a Fusion White Paper that has some details.
The initial target for the APU technology is the “client” (desktop, notebook), where graphics processing and power budget are important. The HPC cluster situation may get more interesting when the Fusion processors arrive in 2011. It may be possible to build more efficient and easier to program clusters with low cost/low power Fusion processors. What then becomes of the big brother server processors? AMD has stated that Fusion is coming to servers, but not just yet. I take this to mean “years away.”
Intel is working on its own brand of fusion, but NVidia does not, at this point, build CPUs and thus may not be able to put its much-liked graphics core inside the processor. I find this whole concept very interesting. If the integrated CPU/GP-GPU “client” commodity processors are more like my idealized, easier to program cluster processor, then maybe if we look down we can see the shark.