Table Two: AMD/ATI 9720 Specification
Unlike NVidia, AMD/ATI did not develop their own programming model. Instead, they have adopted and enhanced the BrookGPU stream processor language. The current AMD/ATI version is called Brook+. Like CUDA, it is based on the C language. A good overview of BrookGPU can be found here. It is also worth mentioning that the original Brook project supports both ATI and NVidia video cards (from the web page: ATI 9700 (R300) and above and Nvidia 5200 (NV3X) and above should work). AMD/ATI also makes available GP-GPU library functions such as the BLAS routines SGEMM and DGEMM. For those interested in historical completeness, ATI had developed an API called Close To Metal, a low-level programming interface for GP-GPU computing. The project was short-lived, however. AMD/ATI has committed to supporting the new OpenCL standard as well.
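For readers who have not met the GEMM routines, here is a minimal plain C sketch of what SGEMM computes: the single-precision update C = alpha*A*B + beta*C. The function name `sgemm_ref` and its argument list are hypothetical and chosen for illustration. The sketch assumes row-major storage and omits the transpose flags and leading-dimension arguments of the real BLAS interface. It is a slow reference loop, not how a GP-GPU library implements the routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not the BLAS signature):
   C = alpha * A * B + beta * C
   A is m x k, B is k x n, C is m x n, all stored row-major. */
void sgemm_ref(size_t m, size_t n, size_t k,
               float alpha, const float *A, const float *B,
               float beta, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);

    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* Dot product of row i of A with column j of B. */
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];

            /* Scale the product and blend in the old value of C. */
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

Every element of C can be computed independently of the others, which is why this kind of routine maps well onto hardware with many parallel units. DGEMM is the same operation in double precision.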
One cannot talk about HPC without considering Intel. In terms of GP-GPU processing, Intel has taken a different approach from NVidia and AMD/ATI. Intel has announced the Larrabee processor, which consists of 32 x86 (Pentium, P54C) processor cores (later versions are expected to include 48 cores). Larrabee is expected to compete with GP-GPU products from NVidia and AMD/ATI and to use the x86 instruction set with Larrabee-specific extensions. Some interesting aspects of Larrabee include cache coherency across all its cores and the absence of specialized graphics hardware. However, recent comments by Intel suggest that Larrabee will not be positioned as a GP-GPU/HPC solution.
Programming Models Galore
There are a number of ways to program GP-GPU computers. Unfortunately, until recently there has not been one method that was portable across all hardware. Also keep in mind that the programming models are designed to work on a single motherboard/GP-GPU system. None of these are distributed programming languages, as they all work within the confines of a single compute node (with the exception of the PGI products, see below). Cluster mavens considering GP-GPU programming may find it helpful to look at Comparison of MPI, OpenMP, and Stream Processing for a comparison of various programming methodologies. While a full discussion of the various programming languages and methodologies would constitute another article (or two), the following table should serve as a brief snapshot of the current state of affairs. Be aware that GP-GPU computing is a rapidly moving market. Indeed, the recent OpenCL spec took less than six months to become a standard. OpenCL (not to be confused with OpenGL) is a low-level specification for GP-GPUs and multi-core CPUs.
||Standard API for GP-GPU and multi-core. Large amount of vendor support. Powerful, but low level
||Very nice programming model. Abstracts away thread management from the user. Similar to C, but no recursion. Fortran and C++ support is coming. Works on NVidia hardware only. Free binary available, no registration required.
||The Stanford University Graphics group’s compiler and run-time implementation of the Brook stream programming language. Still considered “beta” (although it has been around for a while). Supports both NVidia and AMD/ATI hardware. BSD license
||Available from AMD/ATI, based on the Stanford University Graphics group’s compiler and run-time implementation of the Brook stream programming language. Based on C, but no recursion. AMD/ATI hardware only.
||Currently a research language. Based on C/C++. Designed to support hundreds to thousands of hardware threads. Supports Intel hardware.
||Portable API for multi-core, GP-GPU, and Cell processors. Commercial product for C++, non-standard.
||Partial support for automatic use of CUDA primitives by compilers. May be an attractive solution for existing codes.
||Currently a research project, but it allows MIMD On GP-GPUs (MOG)
Table Three: Programming Tools for GP-GPU hardware
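Most of the tools in the table share one idea: the programmer writes a small C-like function (a "kernel") that handles a single data element, and the runtime applies it across the whole data set in parallel. The sketch below illustrates that idea in plain C, not in CUDA or Brook+. The names `saxpy_kernel` and `saxpy_launch` are hypothetical, and an ordinary loop stands in for the hardware threads that a real GP-GPU runtime would manage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "kernel": the work for ONE element. It reads and writes
   only index i, so no element depends on any other. */
static void saxpy_kernel(size_t i, float a, const float *x, float *y)
{
    y[i] = a * x[i] + y[i];
}

/* Stand-in for a kernel launch. On a GP-GPU the runtime would hand each
   index to its own hardware thread; here a sequential loop plays that
   role. */
void saxpy_launch(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    for (size_t i = 0; i < n; ++i)
        saxpy_kernel(i, a, x, y);
}
```

Because the kernel touches only its own index, the iterations can run in any order or all at once. That independence is what lets these tools hide thread management from the user. Note that the kernel needs no recursion, a restriction the table lists for both CUDA and Brook+.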
Some Final Words
The most interesting thing about GP-GPU processing is what I call the low barrier of experimentation. Many of the early successes in GP-GPU computing came about because someone became curious about the processing power that might exist in their video card. They set out on their own, at home or in their office, and ported a portion of their code to CUDA or BrookGPU. After what they described as minimal effort, they started to see speed-ups of 5-10 times (maybe more). Convinced they were on to something, they did more porting and in some cases managed to achieve remarkable speed-ups. At this point, they were able to convince their colleagues that GPU processing actually worked.
All of this sounds very similar to how cluster HPC came into the market. Like clusters, the cost to get in the game is minimal. For example, there are over 70 million CUDA-enabled GPUs sitting in workstations out there. If you don’t have one, the cost of a basic GeForce video card is less than $100. As for the software, it is freely available. NVidia and AMD/ATI, quite wisely, make the tools available at no cost (and with no registration hassles). It is essentially the same cluster recipe: a low (or no) cost of entry, a possible big pay-off, and some spare time.
It is no wonder people are “playing” with this new technology.
If you are serious about GP-GPU computing (i.e. you want to use it in a production environment), then consider the NVidia Tesla or AMD/ATI FireStream lines of hardware. Although the same GP-GPU may be used in less expensive video cards, the HPC-optimized solutions are more than clever re-packaging. The HPC versions are designed for continuous operation, which requires higher quality parts and attention to cooling. While a typical video card can reach the processing levels of an HPC application, most video processing happens in bursts. There are times when the GP-GPU is waiting for user input, like when you are about to yank that guy out of a car and start your crime spree.
The HPC-enabled GP-GPUs are making some serious inroads. There seems to be some healthy competition and plenty of success stories in the market. A low barrier to entry plus a spectrum of development tools makes this a possible addition to your HPC arsenal.