My next software nightmare: the convergence of cloud, GP-GPU, and multi-core.
Before I begin this week's diatribe, I want to invite you to take the 2009 HPC Market Health micro-poll. It should be just below the "Community Tools" box to the right. As I write this piece we have 77 respondents. I don't want to comment on the results just yet as I would like to get the number of respondents over 100. And, while I am on my promotion soap box, take a look at this week's HPC Smackdown on clustered file systems. Jump in and pile drive those comments.
Now back to our regularly scheduled program. I just read about the AMD 1000 GPU supercomputer. Golly. It will use the new AMD/ATI Radeon HD 4870. Let's do the press release math. The card is stated to have a peak performance of 1.2 TFLOPS (I assume single precision). Multiply by 1000 and you have over a PFLOP (Peta) of peak performance. Incidentally, if I have this right, each video card has two GP-GPU chips, with a total of 1440 stream processors per board. Again, multiply by 1000 and you get 1.44 million stream processors.
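For anyone who wants to check the press release math themselves, the whole calculation fits in a trivial C program (the inputs are the vendor's peak numbers, not anything I have measured):

#include <stdio.h>

int main(void)
{
    /* Numbers straight from the press release; treat them as
       marketing peak, not sustained performance. */
    const double tflops_per_card  = 1.2;   /* single precision peak */
    const int    cards            = 1000;
    const int    streams_per_card = 1440;  /* two GPU chips per board */

    printf("Peak: %.1f TFLOPS (%.2f PFLOPS)\n",
           tflops_per_card * cards, tflops_per_card * cards / 1000.0);
    printf("Stream processors: %d\n", streams_per_card * cards);
    return 0;
}

Run it and you get 1200 TFLOPS (1.20 PFLOPS) and 1,440,000 stream processors. Press release math is easy; the software to use it is not.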
Let's give those numbers some perspective. Back in the day, when massively parallel was used to describe machines that used large numbers of processors, the MasPar (short for Massively Parallel) MP-2 provided 16,384 "processors" (more like arithmetic units actually, but I'll call them processors). The Thinking Machines CM-2 had 3132 floating-point numeric co-processors, the CM-5 could have up to 1024 nodes each with dual vector units, and the nCube 2 could have up to 1024 processors. The AMD system is massively parallel on a massive scale. I suspect there will be more of these beasts coming to a cloud near you.
The part I find interesting about the GP-GPU clusters is that, for the most part, there is no way to cluster video cards. Of course you can put them in the same physical space, but getting them all to dance together requires some software contortions. I should know because I just finished a piece on GP-GPU computing. There is no shortage of software tools, but with the exception of OpenCL there is no standard way to program GPUs and multi-cores that live on the same motherboard. And, from what I have read, OpenCL can be a rather low-level (i.e. tedious) approach to parallel computing.
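To give you a feel for just how low level "low level" is, here is a minimal sketch of the host-side ritual OpenCL requires merely to add two arrays on a GPU. I have left out the error checking that a real program needs after every single call:

/* Minimal OpenCL vector add: kernel source plus host boilerplate.
   Error checking omitted for brevity; real code needs it everywhere. */
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c)\n"
    "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }\n";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* Find a platform and a GPU device, then build a context and queue. */
    cl_platform_id plat;  clGetPlatformIDs(1, &plat, NULL);
    cl_device_id   dev;   clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context     ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Compile the kernel source at run time. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    /* Move the data to the card by hand. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

    printf("c[10] = %g (expect 30)\n", c[10]);
    return 0;
}

That is roughly forty lines of host code for one line of actual arithmetic. Tedious is the right word.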
Again, we are faced with a software issue. If the video vendors keep pumping out GP-GPUs (and start shacking them up with the multi-core processors) then the multi-core software situation is going to go from bad to worse. Instead of the stifling problem of coding for homogeneous multi-cores across nodes, now one will have to take into account the GP-GPU capabilities of each processor on the node. Yikes. I cannot even think about that.
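As a sketch of the extra bookkeeping, here is roughly what a program would have to do on each node just to find out what mix of CPU and GP-GPU devices it is sitting on, using the OpenCL device query calls (the fixed-size arrays are only for brevity):

/* Discover the mix of CPU and GPU compute devices on this node.
   A heterogeneous scheduler would need this before placing any work. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_uint nplat = 0;
    clGetPlatformIDs(0, NULL, &nplat);
    cl_platform_id plats[8];
    clGetPlatformIDs(nplat > 8 ? 8 : nplat, plats, NULL);

    for (cl_uint p = 0; p < nplat; p++) {
        cl_uint ndev = 0;
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 0, NULL, &ndev);
        cl_device_id devs[16];
        clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL,
                       ndev > 16 ? 16 : ndev, devs, NULL);

        for (cl_uint d = 0; d < ndev; d++) {
            char name[128];
            cl_device_type type;
            cl_uint units;
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof name, name, NULL);
            clGetDeviceInfo(devs[d], CL_DEVICE_TYPE, sizeof type, &type, NULL);
            clGetDeviceInfo(devs[d], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof units, &units, NULL);
            printf("%s: %s, %u compute units\n",
                   (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "CPU", name, units);
        }
    }
    return 0;
}

And that is just the inventory. Deciding which kernel runs where, and moving the data around, is where the real pain starts.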
While I'm on the verge of another multi-core breakdown, let's talk about the intended use of this cluster. According to this article the PFLOPS supercomputer will be available for gamers, 3D animators, and, I would assume, anyone who lies in the middle of that continuum. The system is called the Fusion Render Cloud and will be used to deliver video games, PC applications and other graphically-intensive applications through the Internet "cloud" to virtually any type of mobile device with a web browser. There is mention of software by a company called OTOY, which allows you to render 3D visuals in your browser via streaming compressed on-line data. Nice website, by the way. Which leads me to the question: if there is a website in the forest with nothing on it, do the trees really care?
While I'm all for new and exciting killer apps for HPC, I really was not thinking gamers. Anything is possible if you can sell it, I suppose. And, I really don't get the whole cloud thing. I think it has a purpose, like grid, but before we start living in the clouds there is a whole lot of reality that needs some attention. Some of these issues are mentioned in Joe Landman's Computing In The Clouds: Setting Expectations article.
Cloud as a service for turn-key applications seems to be a workable concept, but to me it is all about where the data live and who has access and control. Take for example the biggest cloud application on the planet — Google. Similar to the Fusion Render Cloud, the Google Search Cloud will perform a supercomputer-sized search and provide the results to virtually any type of mobile device with a web browser. Great, I use Google every day, except Google has a record of my searches. My search patterns are in the cloud somewhere. In addition, office applications that work in the cloud may hold my data as well. Call me old school, but I just get nervous about that kind of thing.
At this point I have no conclusions or insights into what the future holds. We are in an era where the number of possible computing cores now runs to seven figures, while the number of software solutions is still one figure (and that includes zero, by the way).
Comments on "Massive Clouds"
I still don't understand the big deal with these video cards. They are designed for SIMD applications. Most HPC computing is MIMD. With the exception of processing sound or video, what can you use a video card for? Also, the "peak" performance that AMD quotes should be taken with a grain of salt, and to take that number and multiply by the number of cards to get a "total peak" performance is ludicrous. Other than playing WOW at 10,000 x 10,000 resolution, can you name a single (useful) application that is naively parallel (since communication between these vector units is all but impossible), which is SIMD, and which could take advantage of all those processing units at once?
I see potentially an application in molecular dynamics (e.g. lattice Boltzmann, or cellular automata) simulations. The same operation must be performed on many identical simulated particles at every time step. Still, some communication between the units is needed.
npast, the communication is what makes it MIMD. Based on the values of fields on neighboring processors, different actions are taken. The same thing is not being done on each processor. Slightly different actions are being performed based on what the neighbors just did. This is where things like graphics cards are no good. The "processors" on video cards are basically DSPs and are designed to do things like apply a filter to an input stream. They might be good at problems that are naively parallelized, but generally do rather poorly at problems like solving PDEs. While the algorithm may be the same on each node, the values one uses change from node to node based on conditions at that node as well as the current and former conditions on neighboring nodes.
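To make that concrete, here is a toy 1-D update in plain C (my own illustration, not from any particular code). The branch chosen by the neighbors' values is exactly the data-dependent behavior that strict SIMD hardware handles badly, since every lane ends up paying for both sides of the branch:

/* Toy 1-D update where the action at each cell depends on what the
   neighbors did last step -- data-dependent branching that maps
   poorly onto strict SIMD hardware. */
#include <stdio.h>

#define N 16

int main(void)
{
    double cur[N], next[N];
    for (int i = 0; i < N; i++) cur[i] = (i % 3 == 0) ? 1.0 : 0.0;

    for (int i = 1; i < N - 1; i++) {
        /* Different code paths per cell, chosen by neighbor values.
           On SIMD hardware every lane executes both paths, masked. */
        if (cur[i - 1] > cur[i + 1])
            next[i] = cur[i] + 0.5 * (cur[i - 1] - cur[i + 1]);
        else
            next[i] = 0.25 * (cur[i - 1] + 2.0 * cur[i] + cur[i + 1]);
    }
    next[0] = cur[0]; next[N - 1] = cur[N - 1];

    for (int i = 0; i < N; i++) printf("%.2f ", next[i]);
    printf("\n");
    return 0;
}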