Back in the late ’90s there was a time when clustering was all the rage. A few boxes, a fast Ethernet switch, some Linux software, and presto, you had a cluster. Like every new technology, there were misconceptions, hype, and detractors. Eventually, the price-performance numbers ushered in a new HPC paradigm. In some cases people cobbled together clusters just to build them; in others they built clusters and ran actual HPC programs on them. Indeed, there seemed to be story after story about a new “off the shelf system” built at low cost or from some systems that were “just lying around.” They would get an MPI or PVM program working and find that they now had their own personal supercomputer. There seemed to be several enabling factors that ran through the success stories. These included:
- The performance is better than the existing system I was sharing with everyone else.
- The hardware cost is really low (the performance per dollar is high).
- I did not have to buy any software.
- I can do things I could never do before on my desktop.
- I own this resource and have exclusive use.
It was an empowering exercise for most people. In some cases, you, a mere drone, could create a system that embarrassed the shiny new expensive computer system down the hall. The financial barrier to entry was low and the possibility of success was high. In order to get in the game, there was a small (or no) hardware cost, and the software was freely available. You played at home or in the corner of your lab or office. After you ran your codes and were convinced it worked, you could sense it: a liberating change was on its way, the complete disruption of how things were done. You were the master of your domain.
Skipping ahead, Linux HPC clusters now dominate the market. We often hear about the “big ones,” but many small individual or group clusters still exist. I believe the rapid and disruptive growth was due to the low barriers to entry. You could play with “it” before you committed big resources to a larger system. A simple and effective “try before you buy” proposition.
Last week I was at NVidia attending an HPC Editors day. NVidia is well known for its graphics products and perhaps less well known as an HPC vendor. It turns out that high performance graphics processors use parallel computing methods similar to those in HPC — check out Pixels to PetaFLOPS: How GPUs are Pushing the HPC Envelope for background on GPUs in HPC. NVidia recognized this overlap and began to design its video cards for dual use. With the introduction of the 8-series graphics processor, each graphics chip had multiple general purpose processors. In addition, NVidia recognized that a new programming API was needed, since programming a video card using graphics tools did not really seem like the best idea. Hence the introduction of CUDA, a new parallel programming API.
At the meeting, we were told about the new 10-series processor (more below) and introduced to people who were using the current products (8-series) for their HPC needs. The performance and scope of applications are quite remarkable. If you head over to the CUDA Zone, you can get a feel for the types of applications and speed-ups that are possible. Keep in mind this is all from the last 12 months. While the hardware and software aspects are very interesting, something else grabbed my interest, something familiar. In talking with some of the end users, a common theme developed.
In most cases, someone became curious about the power that might exist in their video card. They set out on their own, at home or in their office, and ported a portion of their code to CUDA. After what they described as minimal effort, they started to see speed-ups of 5-10 times (maybe more). Convinced they were on to something, they did more porting and in some cases managed to achieve remarkable speed-ups. At this point, they were able to convince their colleagues that GPU processing actually worked.
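What does such a port look like? As a minimal sketch (the kernel and function names here are illustrative, not taken from any of the codes mentioned above), a typical first step is moving a data-parallel inner loop into a CUDA kernel, where each GPU thread handles one array element:

```cuda
#include <cuda_runtime.h>

/* Kernel: each thread scales one element, replacing a serial CPU loop. */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

/* Host side: copy data to the GPU, launch the kernel, copy results back. */
void scale_on_gpu(float *host, int n, float factor)
{
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;                       /* threads per block */
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, factor, n);

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

The host code stays ordinary C; only the hot loop changes, which is why the users above could report results after what they called minimal effort.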
All of this sounds very similar to the great cluster disruption. Take a look at the enabling factors list above. Like clusters, the cost to get in the game is minimal. There are over 70 million CUDA-enabled GPUs sitting in workstations out there. If you don’t have one, a basic GeForce video card costs less than $100. As for the software, it is freely available. NVidia, quite wisely, makes the CUDA C compiler available at no cost (and with no registration hassles). It is essentially the same cluster recipe: a low (or no) cost of entry, a possible big pay-off, and some spare time.
Should you sell your clusters on eBay and buy video cards? Maybe not yet. We could be at the beginning of another paradigm shift, but we must keep our heads about us. First, GPU computing is best classified as “predictable” computing, where the computational flow is well known in advance (e.g., a matrix multiplication). Not all applications fit into this area. Second, current GPU processors use single precision (32-bit) arithmetic. This level of precision can work for many applications, but it has kept other applications in the feasibility category. The recently announced 10-series GPU does include support for double precision, as will all future NVidia GPU hardware. Third, most HPC users do not want to port code, particularly if it is for a specific hardware architecture. Fortunately, CUDA can be applied in a top-down fashion and does not normally require a huge rewrite. Finally, scaling beyond a single GPU is the gleam in many HPC users’ eyes. As always, it all comes down to software. Programmers have found CUDA to be a good abstraction layer for HPC, and if it grows beyond the GPU and scales across multiple nodes, things may get really interesting.
There is plenty more to talk about. In the future, I’ll be taking a closer look at CUDA and the new NVidia hardware. Just to pique your interest, the new 10-series (Tesla) cards will provide up to 4 GB of on-board RAM, one TFLOPS of single precision performance, double precision support, and 240 on-board processors. Combine this new hardware with the army of existing CUDA developers out there and one might even get a sense of “Deja Vu all over again.” Can you feel it?
Next week I’ll get back on track, continuing my discussion of parallel programming, but now I have to answer the door. The delivery guy has my new GeForce video card. Funny thing is, I’m putting it in a system that does not have a monitor. Go figure.