As impressive as the NVidia hardware is, much of the success of NVidia GP-GPU computing can be attributed to the CUDA programming language. CUDA (which originally stood for Compute Unified Device Architecture) is a programming environment designed for GP-GPU computing. It is based on the C programming language and provides a high-level abstraction layer, so the end user does not have to worry about managing threads across the hardware. The language is well thought out: it does not support recursion due to memory limitations on the video card, but it does present an easy learning curve for most C programmers. In addition, there are plenty of pre-existing libraries (as source code) available to programmers. Some of these are listed below:
Parallel bitonic sort
Performance profiling using timers
1D DWT using Haar wavelet
CUDA BLAS and FFT library usage examples
CPU-GPU C- and C++-code integration
Binomial Option Pricing
Black-Scholes Option Pricing
Monte-Carlo Option Pricing
The CUDA toolkit also includes optimized FFT and BLAS libraries. NVidia has also committed to supporting the recent OpenCL standard.
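To give a feel for why an algorithm such as the parallel bitonic sort in the list above suits GP-GPU hardware, here is a minimal CPU-side sketch in Python. It is not the CUDA sample code and makes no claim about how that sample is written; the function name `bitonic_sort` is hypothetical, and the sketch assumes the input length is a power of two. The point to notice is that every compare-and-swap within one pass touches a different pair of elements, so each could in principle be given to its own GPU thread.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic sorting network."""
    n = len(values)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)              # work on a copy, leave the input untouched
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance between the two compared elements
            # The iterations of this loop touch disjoint pairs, so they are
            # independent; on a GPU each index i could map to one thread.
            for i in range(n):
                partner = i ^ j
                if partner > i:   # handle each pair once
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # prints [1, 2, 3, 4, 5, 6, 7, 8]
```

The two outer loops run in sequence, while the innermost loop is the part that a data-parallel device could spread across many threads.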
While many of the GP-GPU headlines have been about NVidia-based systems, AMD/ATI has not been sitting idle. The new AMD FireStream 9270 is expected to launch in the first quarter of 2009 and boasts over a TFLOP of single precision performance. Double precision is expected to be 240 GFLOPS. While the peak performance numbers seem impressive, remember that your application's results may vary and the only true test is your own code. The 9270 is available as a PCIe card. There is no rack-mount solution at this time. The following table summarizes the soon-to-be-released FireStream processor from AMD/ATI.
Number of GPUs
Peak compute rate: 1.2 TFLOPS (single), 240 GFLOPS (double)
Floating point formats: IEEE single & double precision
GPU local memory: 2GB GDDR5 SDRAM
256-bit @ 850 MHz
Peak memory bandwidth
PCIe x16 Gen 2
160 watts typical, <220 watts peak
Number of PCIe Slots
Table Two: AMD/ATI FireStream 9270 Specification
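As a rough illustration of how a peak memory bandwidth figure relates to the memory interface numbers in the table, the following Python sketch does the arithmetic. The 256-bit width and 850 MHz clock come from the table; the factor of four transfers per clock is an assumption about how GDDR5 signalling works, not a number from the table, and the function name is hypothetical. Treat the result as a back-of-the-envelope estimate, not a published specification.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, clock_mhz, transfers_per_clock):
    """Estimate peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# 256-bit @ 850 MHz is taken from the table above.
# ASSUMPTION: GDDR5 moves 4 bits per pin per memory clock.
print(peak_bandwidth_gb_per_s(256, 850, 4))  # about 108.8 under that assumption
```

As with the compute figures, this is a theoretical ceiling; what a real application sees depends on its memory access patterns.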
Unlike NVidia, AMD/ATI did not develop their own programming model. Instead they have adopted and enhanced the BrookGPU stream processor language. The current AMD/ATI version is called Brook+. Like CUDA, it is based on the C language. A good overview of BrookGPU can be found here. It is also worth mentioning that the original Brook project supports both ATI and NVidia video cards (from the web page: ATI 9700 (R300) and above and Nvidia 5200 (NV3X) and above should work). AMD/ATI also makes available GP-GPU library functions like BLAS, SGEMM and DGEMM. For those interested in historical completeness, ATI had developed an API called Close To Metal, a low-level programming interface for GP-GPU computing. The project was short-lived, however. AMD/ATI has committed to supporting the new OpenCL standard as well.
One cannot talk about HPC without considering Intel. In terms of GP-GPU processing, Intel has taken a different approach than NVidia and AMD/ATI. Intel has announced the Larrabee processor, which consists of 32 x86 (Pentium, P54C) processor cores (later versions are expected to include 48 cores). Larrabee is expected to compete with GP-GPU products from NVidia and AMD/ATI and to use the x86 instruction set with Larrabee-specific extensions. Some interesting aspects of Larrabee include cache coherency across all its cores and the absence of specialized graphics hardware. However, recent comments by Intel suggest that Larrabee will not be positioned as a GP-GPU/HPC solution.
Programming Models Galore
There are a number of ways to program GP-GPU computers. Unfortunately, until recently there has not been one method that was portable across all hardware. Also keep in mind that the programming models are designed to work on a single motherboard/GP-GPU system. None of these are distributed programming languages, as they all work within the confines of a single compute node (with the exception of the PGI products, see below). Cluster mavens considering GP-GPU programming may find it helpful to look at Comparison of MPI, OpenMP, and Stream Processing for a comparison of various programming methodologies. While a full discussion of the various programming languages and methodologies would constitute another article (or two), the following table should serve as a brief snapshot of the current state of affairs. Be aware that GP-GPU computing is a rapidly moving market. Indeed, the recent OpenCL spec took less than 6 months to become a standard. OpenCL (not to be confused with OpenGL) is a low-level specification for GP-GPUs and multi-core CPUs.
CUDA: Very nice programming model. Abstracts away thread management from the user. Similar to C, but no recursion. Fortran and C++ support is coming. Works on NVidia hardware only. Free binary available, no registration required.
BrookGPU: The Stanford University Graphics group’s compiler and run-time implementation of the Brook stream programming language. Still considered “beta” (although it has been around for a while). Supports both NVidia and AMD/ATI hardware. BSD license.
Brook+: Available from AMD/ATI, based on the Stanford University Graphics group’s compiler and run-time implementation of the Brook stream programming language. Based on C, but no recursion. AMD/ATI hardware only.
MOG: Currently a research project, but it allows MIMD On GP-GPUs (MOG).
Table Three: Programming Tools for GP-GPU hardware
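Several of the tools in the table share the same basic idea: the programmer writes a small function (a kernel) that handles one data element, and the runtime takes care of running it across many threads. The Python sketch below imitates that idea on the CPU. It is not CUDA or Brook code and does not use any GPU API; `saxpy_kernel` and `launch` are hypothetical names, and the sequential `launch` loop merely stands in for what the hardware would do in parallel.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Work done by one logical thread: compute a single output element."""
    i = block_idx * block_dim + thread_idx    # global element index
    if i < len(x):                            # threads past the end do nothing
        out[i] = a * x[i] + y[i]


def launch(kernel, num_elements, block_dim, *args):
    """Hypothetical stand-in for a GPU runtime: run every logical thread in turn."""
    num_blocks = (num_elements + block_dim - 1) // block_dim   # round up
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 4, 2.0, x, y, out)
print(out)  # prints [12.0, 24.0, 36.0, 48.0, 60.0]
```

Note that the kernel contains no loop over the data and no thread bookkeeping beyond computing its own index. That is the sense in which thread management is abstracted away from the user.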
Some Final Words
The most interesting thing about GP-GPU processing is what I call the low barrier of experimentation. Many of the early successes in GP-GPU computing came about because someone became curious about the processing power that might exist in their video card. They set out on their own, at home or in their office, and ported a portion of their code to CUDA or BrookGPU. After what they described as minimal effort, they started to see speed-ups of 5-10 times (maybe more). Convinced they were on to something, they did more porting and in some cases managed to achieve remarkable speed-ups. At this point, they were able to convince their colleagues that GPU processing actually worked.
All of this sounds very similar to how cluster HPC came into the market. Like clusters, the cost to get in the game is minimal. For example, there are over 70 million CUDA-enabled GPUs sitting in workstations out there. If you don’t have one, the cost of a basic GeForce video card is less than $100. As for the software, it is freely available. NVidia and AMD/ATI, quite wisely, make the tools available at no cost (and with no registration hassles). It is essentially the same cluster recipe: a low (or no) cost of entry, a possible big pay-off, and some spare time.
It is no wonder people are “playing” with this new technology.
If you are serious about GP-GPU computing (i.e. you want to use it in a production environment), then consider the NVidia Tesla or AMD/ATI FireStream lines of hardware. Although the same GP-GPU may be used in less expensive video cards, the HPC-optimized solutions are more than clever re-packaging. The HPC versions are designed for continuous operation, which requires higher quality parts and attention to cooling. While a typical video card can reach the processing levels of an HPC application, most video processing happens in bursts. There are times when the GP-GPU is waiting for user input, like when you are about to yank that guy out of a car and start your crime spree.
The HPC-enabled GP-GPUs are making some serious inroads. There seems to be some healthy competition and plenty of success stories in the market. A low barrier to entry plus a spectrum of development tools make this a possible addition to your HPC arsenal.