
The End Of The HPC Server (As We Know It)

Has the big multi-core Xeon and Opteron server jumped the HPC shark?

I was there watching Happy Days when The Fonz jumped the shark. I remember thinking, this show is getting weird. Not that I was a big fan; I just happened to be watching that night and decided it was time to call it quits. Thus was born the term “jumping the shark,” which does not necessarily mean the end; it just means things will never be as good as they were, at least that is my interpretation.

Let’s move on to my overly provocative headline. At times I have thought about the direction of HPC clusters, and I must confess things seem a little weird. I tend toward the fundamental side when it comes to parallel computing: a single core (CPU), lots of memory, and a fast interconnect are what I prefer. In my opinion, this design makes programming easier. I can assume that the communication behavior between CPUs is consistent, as is the local memory access. In today’s modern designs there are multiple cores, multiple memory banks, and a fast network. Locality has become important, and in programming terms locality is a bitch.
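
To see what I mean about locality, consider a minimal sketch of my own (not from any vendor documentation): on a multi-socket Linux box, memory lands on the bank belonging to the core that first writes it, so even a simple array sum depends on who touched the data first. The array size below is arbitrary.

/* Locality sketch: under Linux's first-touch policy, a page is placed
 * on the memory bank of the core that first writes it. Initializing
 * in parallel with the same static schedule as the compute loop keeps
 * each thread's data on its local bank. Sizes are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)  /* 16M doubles, enough to span memory banks */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    if (!a) return 1;

    /* Parallel first-touch: each thread writes the pages it will
     * later read. A serial loop here would park every page on one
     * bank and make remote cores pay for each access. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = (double)i;

    double t = omp_get_wtime();
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];
    printf("sum = %g in %f seconds\n", sum, omp_get_wtime() - t);

    free(a);
    return 0;
}

Swap the parallel initialization for a serial loop and the timing on a two-socket machine will usually tell the story.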

Given the commodity nature of cluster HPC, I am perfectly willing to use what the market offers to get the FLOPS we need. In recent years, GP-GPU computing has also changed the way we think about HPC. I consider GP-GPUs a kind of SIMD array processor attached to a multi-core node. When the problem fits, GP-GPU computing is great, except for the software issue. What was hard (parallel programming) just seems to be getting harder (parallel programming with multiple nodes, cores, and GP-GPUs).

In my last column I wrote about using low power Atom processors. The comparison calls for more data, but I believe there is something to be said for the low power/many processor approach. It seems I am not alone. Not long after I posted the column, SeaMicro introduced a 10U rack-mount unit containing 512 Intel Atom CPUs (1.6 GHz). (There are pictures of the motherboard at SemiAccurate.) The target market is the “server” industry, but any HPC geek can see that this design is very similar to the idealized design I mention above. Of course, the Atom processors are not as powerful as their big brother Xeons, but as I pointed out last time, the Nehalem Xeon runs at a 1.8 times higher clock speed, generates 7.3 times as much heat, and costs 22 times as much as the D510 Atom (dual-core Atom vs. quad-core Xeon). The Xeon performance is 7.7 times faster, but when you factor in price-to-performance, the Atom is about 3 times better than the Xeon solution. There is something to think about here.
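
A quick back-of-the-envelope check of those numbers (the relative figures come from my last column; the code is just the arithmetic):

/* Price-to-performance check: the Xeon is about 7.7x faster but
 * about 22x the cost of the D510 Atom, so per dollar the Atom wins
 * by roughly 22/7.7, or about 3x. */
#include <stdio.h>

int main(void)
{
    double xeon_perf = 7.7;   /* performance relative to the Atom */
    double xeon_cost = 22.0;  /* price relative to the Atom */

    double atom_advantage = xeon_cost / xeon_perf;
    printf("Atom price-to-performance advantage: %.1fx\n", atom_advantage);
    return 0;
}

This prints 2.9x, which I round to the “3 times better” quoted above.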

Some of you may want to close your eyes for what follows. I am going to take a standard 8-core x86 server with a GP-GPU board and do something nasty. First, I am going to take my very sharp saw and cut the processors into individual cores. Next, I am going to cut the GP-GPU into eight separate pieces; the parallel stream architecture makes this a bit easier than cutting up the multi-core CPUs. Then I am going to take the main memory and the GP-GPU memory, combine them, and divide the total by eight. At the end of my octasection I have eight piles, each with a single core, some stream processors, and some memory. Because I have some skill with a soldering iron, I am going to put these back together as eight little nodes on Mini-ITX motherboards. Each node is a complete system, and if I use Ethernet, the new nodes resemble my ideal parallel computer with the addition of a small array processor on each node. Programming is still not trivial, but it is much more manageable.
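
For the curious, the programming model my eight little nodes invite is plain message passing with a local offload step. The sketch below is my own illustration; offload_to_simd() is a hypothetical stand-in for whatever the on-node stream processors would run (a plain loop here, so the example actually compiles with mpicc and runs).

/* Eight-little-nodes sketch: one MPI rank per node, each rank owning
 * its core, its memory, and its slice of stream processors. */
#include <stdio.h>
#include <mpi.h>

#define CHUNK 1024

/* Hypothetical on-node array processor call; a loop stands in. */
static void offload_to_simd(double *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] * x[i];
}

int main(int argc, char **argv)
{
    int rank, size;
    double local[CHUNK], local_sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* All data is node-local: no shared banks, no locality puzzles. */
    for (int i = 0; i < CHUNK; i++)
        local[i] = rank + i * 0.001;

    offload_to_simd(local, CHUNK);

    for (int i = 0; i < CHUNK; i++)
        local_sum += local[i];

    /* The only remote traffic is one explicit, predictable message. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f across %d node(s)\n", total, size);

    MPI_Finalize();
    return 0;
}

Everything a rank touches is local; communication happens only where you can see it.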

In my reassembly, I may have slowed some things down, but by golly this thing is much easier to program. Fortunately for me, some other people had a similar idea, but for different reasons. Take a look at the AMD Fusion pages. They are sure going to save me a lot of cutting and soldering. The AMD Fusion processors will combine a GP-GPU (an array processor or SIMD engine) and a multi-core CPU. There will be one pool of memory, and communication between the GP-GPU, the CPU, and memory will travel over a high-speed bus. AMD is calling this new design an Accelerated Processing Unit (APU). If you want to learn more, there is a Fusion White Paper that has some details.
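
To make the payoff concrete, here is a sketch of the difference a shared pool of memory could make. The gpu_* steps shown in comments and the apu_kernel() function are hypothetical placeholders of my own, not AMD's API; the point is simply that the explicit copies across the bus disappear when the CPU and SIMD engine share one memory.

/* Discrete GP-GPU vs. APU sketch. With a discrete board, data must
 * be staged across the PCIe bus in both directions; with shared
 * memory, the "kernel" can work on the CPU's buffer directly. */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical APU-side kernel; a plain loop stands in for the
 * SIMD engine so this example compiles and runs anywhere. */
static void apu_kernel(float *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= 2.0f;
}

int main(void)
{
    int n = 1024;
    float *data = malloc(n * sizeof(float));
    if (!data) return 1;
    for (int i = 0; i < n; i++)
        data[i] = (float)i;

    /* Discrete GP-GPU path (pseudo-steps, comments only):
     *   d = gpu_alloc(bytes);
     *   gpu_copy_in(d, data, bytes);   <- across the bus
     *   gpu_kernel(d, n);
     *   gpu_copy_out(data, d, bytes);  <- back across the bus
     *
     * APU path with one pool of memory: hand over the pointer. */
    apu_kernel(data, n);

    printf("data[10] = %f\n", data[10]);
    free(data);
    return 0;
}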

The initial target for the APU technology is the “client” market (desktops, notebooks), where graphics processing and power budget are important. The HPC cluster situation may get more interesting when the Fusion processors arrive in 2011. It may be possible to build more efficient and easier-to-program clusters with low cost/low power Fusion processors. What then becomes of the big brother server processors? AMD has stated that Fusion is coming to servers, but not just yet. I take this to mean “years away.”

Intel is working on its own brand of fusion, but NVidia does not, at this point, build CPUs and thus may not be able to put its much-liked graphics cores inside a processor. I find this whole concept very interesting. If the integrated CPU/GP-GPU “client” commodity processors are more like my idealized, easier-to-program cluster processor, then maybe if we look down we can see the shark.

Comments on "The End Of The HPC Server (As We Know It)"

laytonjb

Doug,

The last several columns have been great! I really enjoy the examination of the traditional higher-power CPUs and the new “cloud” oriented lower-power processors. Really cool stuff.

I hope your wife understands all the noise when you start cutting up the chips. I recommend a Dremel tool and good work gloves (don’t forget to ground yourself).

Keep them coming!

Thanks!

Jeff

kalloyd

Doug,

The traditional configuration of CPU-exclusive nodes with combinations of Ethernet and InfiniBand can be “repurposed” to handle management and distribution functions to and between the compute nodes. These “sergeant” nodes become intermediaries between the head and compute nodes.

The idea is to balance high power consumption with lower duty cycles given the various throughput bandwidths. In other words, it is a balancing act.

Oh, and have a good time tweaking the hardware, Sparky!

Ken

margravezakhur

Hmmm,

You might want to look at the Cell Broadband Engine. It is too expensive in its current incarnations for most independent developers (now that the OS of the PS3 no longer allows a second OS), but 18 cores with division of labor and 2.5 MB of cache per core is interesting, especially since access to main memory has been moved to the software side (where it may belong for purposes of efficiency). The design seems to work rather well, and some HPC clusters were built from a single worldly node and a bunch of PS3s not too long ago.

x95tobos

Trying hard not to be rude, but having fuzzy memories on VLSI topics and CPU design, and without ANY pictures of the process, this seems like complete BS to me. OK, it is a good gedanken experiment, in the spirit of Albert Einstein, but I’ll have to see you using a “sharp saw” to cut a chip with submillimeter features like vias and so on in order to believe it, no matter how good your soldering skills may be.

markhahn

isn’t the question really about programming models? the CUDA model is pretty clumsy, and I’m guessing Fusion will still have separate x86 and GPU universes. to me, the goal should be to provide a reasonable programming model (x86_64 isn’t bad, since no one really pays attention to instruction encoding anymore) that leverages the kind of stackless mini-cores that GPUs provide. in other words, CPU architects should be asking “how can I add a bunch of lighter-weight threads to _extend_ our conventional architecture?” I don’t think there is anything magic about how GPUs are implemented except that they omit all the fancy dynamic scheduling, OoO execution, renaming, etc. that conventional CPUs use to make non-streaming code go fast. even if there are sticking points (say, the CPU and streaming sides differing in how they want to do caching), it seems like there could be fairly simple workarounds (2MB streaming pages?)

