The challenges of multi-core CPUs in HPC, combined with a pronounced lack of recent clock speed increases, have led many users on a quest for alternative commodity hardware that can be used to increase performance. Enter the modern day Graphics Processing Unit (GPU).
Everyone knows the clock speed increases in commodity CPUs have slowed down to a crawl. There are several reasons for this, but the bottom line is that we aren’t likely to see increasing performance with each new processor as we have in the past. At the same time Moore’s Law is still holding, so the number of transistors per processor is increasing. CPU companies have taken the extra transistors and used them to create extra cores and cache, leading us into the multi-core era of computing. Cores are great, but there are some challenges facing many programmers as they convert to multi-core systems.
First, some applications are driven by single core performance. They can run over multiple cores using threads, OpenMP, or MPI, but in the end the overall code performance is driven by the single limiting process on a single core. Second, the performance of the memory and I/O buses is not increasing as rapidly as the number of cores. This situation means that the amount of memory bandwidth per core is decreasing, reducing performance. Also, the number of cores per network interface (NIC) is rapidly increasing, possibly reducing performance, or at the very least requiring high performance networks.
The challenges with multi-core CPUs in HPC can be overcome to a certain degree, but people are still looking for more performance from their systems. This quest has led people to look around for other commodity hardware that can be used to increase performance. Enter the modern day Graphics Processing Unit (GPU).
History Repeats Itself
Let’s step back a few years and recall how clusters evolved. Clusters arose because the traditional HPC systems were expensive and processing time was limited (HPC in the cathedral, as it were). At the same time, PC components were becoming faster. At NASA Goddard, Tom Sterling and Don Becker demonstrated the Beowulf concept using commodity PC parts and open source software. Their target cost was $50,000. Such a system would drastically reduce the entry cost for a local HPC resource. About that time, commodity parts (x86 hardware) were becoming price/performance competitive with the fastest workstations. In addition, Fast Ethernet was making its entry into the market. (Open source and the Internet came along as well, but that is another story.) The first Beowulf was a huge success and forged a path for true commodity-based HPC computing. This trend continues today, where the HPC practitioner continues to leverage high performance at commodity prices.
As a rule, HPC users are always looking to gain more performance. One of the options is a hybrid scheme that adds a co-processor to the system. The co-processor(s) are used for computations on the node. There are several co-processor options being pursued, including FPGAs (Field Programmable Gate Arrays) and specialized co-processors such as Clearspeed. But these are clearly not commodity-based solutions and consequently have a higher cost. There is, however, a commodity co-processor component that has shown continued growth in specialized performance: modern day graphics cards.
Thanks in part to the PC gaming industry, today’s graphics cards have very high-performance GPUs (Graphics Processing Units) as computational engines. These GPUs are programmable, so you can write code for them. In essence, you are using your high-end graphics card not for slaying monsters and aliens, but for computing.
Many people get excited when a CPU is released at a higher GHz. A bit more performance, plug and play, life is great. GPUs on the other hand have been quietly, at least to the HPC world, taking giant leaps and bounds in performance. Figure One below illustrates the trend in the theoretical performance of ATI and NVIDIA GPUs from about 2003 to 2005 along with the performance of Pentium 4 CPUs.
Figure One: Theoretical performance of ATI, NVIDIA GPUs and Pentium 4 CPUs (Used with permission from Ian Buck, Stanford University)
Note the huge increase in performance of the GPUs relative to the CPUs. At the end of the chart, in mid-2005, the performance of these GPUs is several times the performance of the CPUs. For example, Intel’s desktop quad core product, the Core 2 Extreme QX6700, runs at 2.66 GHz and has a peak floating point performance of 50 GFLOPS, while NVIDIA’s latest GPU, the G80, has a peak floating point performance of 330 GFLOPS. ATI’s Radeon HD 2900XT peak performance is about 475 GFLOPS. Using Crossfire, two ATI Radeon cards, and a little overclocking, the peak performance has reached approximately 1 TFLOPS! Granted, these numbers are single precision at the moment and for specialized computations, but the potential performance is huge and the hardware is available now.
Why is the GPU performance increasing so quickly? The PC gaming sector is large and growing, and is currently considered a multi-billion dollar market. The economic motivation for developing faster GPUs is obvious. In addition, it may be easier to take extra transistors and use them for GPU computations rather than just cache because of the nature of GPU calculations. (CPUs are using extra transistors for more cores, not bigger cores.) Moreover, if you haven’t noticed, the release cycle for GPUs is faster than for CPUs due to their less complex designs. So, with a high potential performance at perhaps a cheaper price point, you can see why GPUs are being considered as a co-processor or even a main processor.
GPUs are Different
GPUs are fundamentally different from regular CPUs. Currently, GPUs are designed to take a 3D model of a scene and pass it through various steps in the rasterization-based rendering process to create an image that is displayed. To increase the performance of this process, modern GPUs are designed to pipeline the data. Within the rasterization process, there are a number of stages that are performed in series using the GPUs and code fragments to process the data.
For computations, rather than rasterization, the GPUs can be thought of as Stream Processors. Stream processing takes a series of compute-intensive operations (called kernels) and applies them to every element in a set of data that is called a stream. In general, GPUs apply one kernel at a time to a particular stream, but multiple streams can be processed at the same time (they are independent of one another). The kernel is not an entire program, but rather just a part of the algorithm, so to compute an entire algorithm, several kernels are applied in series to the stream. However, since the kernel has to be applied to the entire stream, there are things you can do on CPUs that you can’t do on GPUs, such as reading and writing at the same time.
Consequently you have to think of the data and the program in a different way when writing for GPUs. You have to think of the algorithm in terms of processing streams of data using a kernel that does the same operation on the entire stream. You can chain together kernels to perform series of operations but each kernel is restricted to apply the same operations on the entire stream. The benefits of stream processing are potentially huge! In particular:
- There is explicit parallelism in streams (data and task)
- No communication between stream elements (simplifies operations)
- Kernels cannot write to the input streams
- Stream elements are independent
- Parallelism hides memory access (latency)
While it may not be obvious how these items contribute to performance, if your code follows the stream model of computing, it can give you a huge boost in performance. But you have to live within the stream computing model and its requirements (sorry, no free lunch).
The concept of an attached processor is not new. In the days of big iron HPC, there were specialized devices called “array processors” that were attached to HPC computer systems. These devices, much like the GPUs of today, were designed to do specific tasks faster than the host CPU. It seems that the commodity graphics market has allowed an old idea to become new again.
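To make the stream model a little more concrete, here is a minimal sketch in plain Python. It runs on the CPU, not on a GPU, and the names (run_kernel, run_chain, scale, offset) are invented purely for illustration. It only mimics the rules described above: one kernel is applied to every element, each element is handled independently, the input stream is never written to, and kernels are chained in series.

```python
# CPU-only sketch of the stream model. All names are illustrative and do
# not belong to any GPU toolkit.

def run_kernel(kernel, stream):
    """Apply one kernel to every element of a stream.

    Each element is processed independently of the others, and the results
    go to a new output stream; the input stream is never modified.
    """
    return tuple(kernel(x) for x in stream)


def scale(x):
    # Kernel 1: multiply each element by 2.
    return 2.0 * x


def offset(x):
    # Kernel 2: add 1 to each element.
    return x + 1.0


def run_chain(kernels, stream):
    """Apply several kernels in series; each pass reads the previous output."""
    for kernel in kernels:
        stream = run_kernel(kernel, stream)
    return stream


data = (1.0, 2.0, 3.0, 4.0)
result = run_chain([scale, offset], data)
print(result)
```

Because no element depends on any other, a real stream processor is free to work on many elements at once; the loop inside run_kernel is simply the serial stand-in for that parallelism.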
Some of the requirements or limitations of stream computing, in addition to the ones previously discussed, are:
- No stack or heap
- No integer or bit-wise operations
- No scatter operations (a[i] = b)
- No reduction operations (max, min, sum)
These features are missing because games haven’t asked for them (yet) and it’s still early in GPGPU (General-Purpose computation on Graphics Processing Units) development; in other words, they may come as demand warrants. But if you can get around these limitations and follow the stream programming model, the potential for a huge gain in performance at low cost is out there.
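As an illustration of what getting around these limitations can look like, the sketch below shows two commonly described workarounds, again in plain Python on the CPU with invented names. A scatter is rewritten as a gather, so that every output element decides for itself which input to read, and a reduction is done as a series of passes that each halve the stream. It is a sketch under those assumptions, not the way any particular GPU library does it.

```python
# CPU-only sketch of two workarounds; all names are illustrative.

def gather(source, indices):
    """Gather: output element i reads source[indices[i]].

    Every output position is produced by its own element, so this fits
    the stream model (unlike a scatter such as a[i] = b).
    """
    return [source[j] for j in indices]


def invert_permutation(scatter_indices):
    """Turn scatter targets into gather sources (done once, up front).

    Assumes scatter_indices is a permutation of 0..n-1.
    """
    inverse = [0] * len(scatter_indices)
    for src, dst in enumerate(scatter_indices):
        inverse[dst] = src
    return inverse


def reduce_in_passes(stream, op):
    """Reduce a non-empty stream by repeatedly halving it.

    Each pass behaves like an ordinary kernel: output element i combines
    input elements 2*i and 2*i + 1. After roughly log2(n) passes a single
    value is left.
    """
    while len(stream) > 1:
        half = (len(stream) + 1) // 2
        next_stream = []
        for i in range(half):
            left = stream[2 * i]
            if 2 * i + 1 < len(stream):
                next_stream.append(op(left, stream[2 * i + 1]))
            else:
                next_stream.append(left)  # odd element carried forward
        stream = next_stream
    return stream[0]


values = [3.0, 9.0, 1.0, 7.0, 5.0]
total = reduce_in_passes(values, lambda a, b: a + b)
largest = reduce_in_passes(values, max)

# The scatter "out[targets[i]] = values[i]" rewritten as a gather.
targets = [2, 0, 4, 1, 3]
scattered = gather(values, invert_permutation(targets))
print(total, largest, scattered)
```

The price of the multi-pass reduction is extra passes over the data, which is one reason a missing reduction operation counts as a real limitation rather than a mere inconvenience.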
There are two major programming APIs for GPUs: OpenGL and DirectX. This means you have to translate the algorithm into OpenGL or DirectX functions. This is not an easy task and you have to understand both the algorithm and the code you are writing and also how to program a GPU. Even using the stream processing concepts, it is difficult to write code using the GPU APIs.
For example, instead of talking about storage (memory) you have to talk about textures. Instead of talking about a computational kernel, you have to talk about fragment programs. And instead of talking about a foreach loop over a data stream, you have to talk about a render pass. There are few people who understand both computational algorithms and GPU coding. The obvious solution is to develop high-level languages for writing GPU code.
Higher level languages abstract away many of the low-level GPU coding details that are needed for expressing algorithms. This mapping makes it easier for people who are not graphics coders to write code that runs on GPUs. Several higher level languages (or standard languages with extensions) are available for GPUs, among them Brook and CUDA.
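The change of vocabulary can be sketched as follows. This is ordinary Python that imitates the graphics terms on the CPU; render_pass and add_fragment are invented names, not OpenGL or DirectX calls. Two "textures" hold the input arrays, a "fragment program" computes one output value, and a "render pass" plays the role of the foreach loop.

```python
# CPU emulation of the graphics vocabulary; names are illustrative only.

def render_pass(fragment_program, textures, width, height):
    """Run a "fragment program" once per output pixel.

    textures: input 2D arrays (lists of rows), only read during the pass.
    Returns a new 2D array, playing the role of the render target.
    """
    return [
        [fragment_program(textures, x, y) for x in range(width)]
        for y in range(height)
    ]


def add_fragment(textures, x, y):
    # The computational kernel: c[y][x] = a[y][x] + b[y][x].
    a, b = textures
    return a[y][x] + b[y][x]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[10.0, 20.0], [30.0, 40.0]]
c = render_pass(add_fragment, (a, b), width=2, height=2)
print(c)
```

Even for something as simple as adding two arrays, the programmer has to think in terms of pixels and passes, which is the burden the higher level languages are meant to remove.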
The languages vary but most of them look something like C and C++ and incorporate stream concepts. For example, Brook treats the GPU as a streaming processor, and the language looks like C with some stream extensions. At one time there was also a derivative of Brook that was being developed called Brooktran that was close to Fortran yet based on Brook. Brook works with ATI and NVIDIA GPUs, OpenGL and DirectX, as well as Windows and Linux.
CUDA is very interesting because NVIDIA created a new model for programming general purpose computations on GPUs. It is data parallel computing using thousands of threads with a Parallel Data Cache to help increase arithmetic intensity for large performance boosts (arithmetic intensity is the ratio of arithmetic operations to memory operations in the code). The nice thing is that you can program in C and then use extensions to program for the GPUs. This feature allows you to target certain portions of the code for execution on the GPU and the rest to run on the CPU. CUDA has four sets of extensions:
- Function type qualifiers (define whether a function runs on, and can be called from, the CPU or the GPU)
- Variable type qualifiers (specify memory location of a variable on a GPU)
- A directive to specify how a kernel is executed on the GPU from the CPU
- Four variables that specify the grid and block dimensions from the CPU (this describes the problem so the run-time can put the threads in the appropriate location)
As you can see, the extensions are fairly minor. You then take your code and build it with NVCC (NVIDIA’s compiler that comes with CUDA), which splits the code into two parts: the GPU-specific code, and the CPU-specific code. The CPU-specific code is output so you can compile it outside of NVCC. The GPU code is compiled into a GPU binary form.
CUDA has a run-time library that runs on the CPU (the host) and provides functions to access GPUs (possibly more than one) and to control them. It also has a component that runs on the GPUs and provides GPU-specific functions, and a “common component” that has built-in vector data types (remember to think of the GPUs as stream processors) and a subset of the Standard C Library that runs on both the CPU and the GPUs. It also comes with pre-built BLAS libraries (Basic Linear Algebra Subprograms) and FFT libraries (Fast Fourier Transforms). One nice thing about CUDA is that it allows for combined CPU and GPU programming. It is currently specific to NVIDIA GPUs and it is freely available.
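To give a feel for how the grid and block variables are used, here is a CPU emulation in plain Python. It is not CUDA code: launch is a hypothetical stand-in for a kernel launch, and the arguments grid_dim, block_dim, block_idx, and thread_idx merely mimic the roles of CUDA's built-in gridDim, blockDim, blockIdx, and threadIdx for a one-dimensional problem. Each "thread" works out which array element it owns from its block and thread numbers.

```python
# CPU emulation of CUDA-style indexing for a 1D problem. This is ordinary
# Python, not CUDA; launch() is a hypothetical stand-in for a kernel launch.

def saxpy_kernel(grid_dim, block_dim, block_idx, thread_idx, n, a, x, y, out):
    """Body executed by one thread: out[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx   # global element index
    if i < n:                                # guard threads past the end
        out[i] = a * x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel once per (block, thread) pair.

    A GPU would run these threads in parallel; here they run one by one.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(grid_dim, block_dim, block_idx, thread_idx, *args)


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover n
launch(saxpy_kernel, grid_dim, block_dim, n, 2.0, x, y, out)
print(grid_dim, out)
```

With 10 elements and 4 threads per block, three blocks are launched and the last two threads of the final block do nothing, which is why the bounds check is there.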
GPU Computing Resources
There are some GPU resources available around the Internet and fortunately there seems to be one central GPU website that links to papers, tools, workshops, etc. It is a great place to keep up with new developments on GPUs, search for materials, or learn more about the many ways GPUs are being used. It also includes a Wiki that has plenty of great information on getting started with GPUs. More importantly, there are a number of tutorials linked from the website that provide an introduction to GPUs. Alongside the tutorials there is also a link to developer resources which provides basic developer tutorials using lower level toolkits such as Cg. There are also some sample codes and a couple of utility codes. If you are working your way through the tutorials and have questions or if you have questions in general, the website also has a Forums site where you can post questions.
There are other resources as well, but some of them are specific to a particular graphics vendor. However, the materials are usually freely available. NVIDIA has a very extensive developer section that supports programming their GPUs for either graphics (games) or GPU programming. The primary discussion about GPU computing is on the CUDA website, however. It has links to downloads, documentation, samples from the SDK, and a forum for CUDA.
ATI also has a website that has information on programming its GPUs (from a graphics perspective). The company also has a page devoted to Stream Computing that serves as a launch point for ATI’s efforts in Stream Computing in general. ATI also has a site to describe its current stream computing processor, called the AMD Stream Processor. You can also search the ATI website for CTM (Close to Metal), which is a thin interface layer above the GPU hardware. While it’s not a compiler, it allows other developers to either interface their existing tools to ATI’s GPUs or to develop new tools.
There is a relatively new company, called RapidMind, that’s developing programming technology that can be used in a normal C++ environment (compilers and IDEs). The RapidMind Platform is embedded in the application and transparently manages massively parallel computations. The general process to convert your code to use RapidMind is to identify the parts of the code you want to run on the GPU (RapidMind also supports the Cell Processor) and then perform the following steps:
- Convert the data types to use RapidMind’s data types (integer and floating point)
- Capture the computations (RapidMind captures the numerical operations when the code is run; they are recorded and dynamically compiled to a program object)
- Stream Execution (the RapidMind run-time manages the execution of objects on the GPU)
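The three steps above can be imitated with a toy record-and-replay scheme. The sketch below is plain Python and does not use RapidMind's API; Value, capture, and execute are invented names. A special data type records the arithmetic performed on it (steps one and two), and the recorded program is then applied to every element of the input streams (step three).

```python
# Toy record-and-replay sketch. None of these names come from RapidMind.

class Value:
    """Special data type: arithmetic on it is recorded, not computed."""

    def __init__(self, evaluate):
        self.evaluate = evaluate   # function: dict of inputs -> number

    @staticmethod
    def input(name):
        return Value(lambda env: env[name])

    @staticmethod
    def wrap(other):
        if isinstance(other, Value):
            return other
        return Value(lambda env: other)   # plain number becomes a constant

    def __add__(self, other):
        other = Value.wrap(other)
        return Value(lambda env: self.evaluate(env) + other.evaluate(env))

    __radd__ = __add__   # addition is commutative

    def __mul__(self, other):
        other = Value.wrap(other)
        return Value(lambda env: self.evaluate(env) * other.evaluate(env))

    __rmul__ = __mul__   # multiplication is commutative


def capture(function, *names):
    """Run the function once on placeholder values to record it."""
    recorded = Value.wrap(function(*(Value.input(n) for n in names)))
    return names, recorded


def execute(program, *streams):
    """Replay the recorded program on every element of the streams."""
    names, recorded = program
    return [recorded.evaluate(dict(zip(names, element)))
            for element in zip(*streams)]


def axpy(a, x, y):
    # Ordinary-looking numerical code; it works on Value objects too.
    return a * x + y


program = capture(axpy, "a", "x", "y")
result = execute(program, [2.0, 2.0, 2.0], [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
```

The appeal of this style is that axpy still looks like normal numerical code; only the data types changed. A real platform would compile the recorded operations for the GPU instead of replaying them in the interpreter.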
RapidMind has some case studies that show how it works and what kind of results you can expect. For example, there is a paper that compares the performance of the BLAS SGEMM routine (single precision matrix multiply), a single precision FFT, and a European option pricing model (Black-Scholes) on a GPU versus the performance on a CPU with highly-optimized code. RapidMind compared the GPU routines written with RapidMind to the optimized CPU routines using both an ATI X1900 XT card and the NVIDIA 7900 GTX, against an HP workstation running Intel Woodcrest chips and an AMD Opteron at 2.6 GHz. The researchers found that the GPU code for SGEMM was 2.4 times faster than the CPU code; the GPU FFT code was 2.7 times faster; and the Black-Scholes GPU code was 32.2 times faster than the CPU code.
There have been several success stories about using GPUs in computation. The first one is a project called Folding@Home. The Folding@Home application is downloaded and run on the user’s desktop as a voluntary effort to help researchers that have protein folding problems to solve. The application runs a protein folding code to help with their research. On Sept. 26, 2006, ATI announced that they had ported the Folding@Home application to use ATI GPUs. ATI estimated that an ATI X1900 graphics card could get 20X to 40X more performance than a standard CPU. The company also estimated that using ATI graphics cards it would take only one month to finish what previously took three years to solve.
Another success story is the Massachusetts General Hospital. At the hospital, they take low-power X-rays that can be combined to form a real-time image through a process called Digital Tomosynthesis. In the past the hospital had been using a 35-node cluster to do the computations. Using NVIDIA G80 GPUs they saw a performance improvement of about 100 times compared to standard CPUs. This increase means the same task can be performed on a workstation, moving the computational part of the process to the imaging unit.
These are just two success stories about GPUs. There are others. For example, some researchers have come up with several LU (matrix factorization) routines that run much faster on GPUs than CPUs. The LU routine is used by many CAE (Computer Aided Engineering) codes to solve problems.
GPUs for Clusters
So far I’ve talked about efforts to run code on GPUs on a single cluster node. GPU codes can be used in conjunction with the CPU, if the CPU is running a parallel MPI job. Expect more in this area in the future. In the meantime, GPUs have been used in clusters in other ways.
GPU clusters, or clusters where the GPU is the primary focus, are currently being used as tiled displays. An example of this is the package called Chromium, which performs parallel rendering. You can use Chromium to drive multiple displays or to drive a single display using multiple compute nodes equipped with GPUs.
Mike Houston at Stanford has been working with GPU clusters for some time. He worked with a team on a parallel GPU version of a protein search code that uses HMM (Hidden Markov Models). The team called its version ClawHMMer. The HMM code was rewritten to run on GPUs and then modified again so that the database searches are divided among multiple nodes in a GPU cluster. While in general it’s an embarrassingly parallel algorithm because the searches are independent of one another, the overall search itself isn’t parallel. The per node performance is quite good (about 10-40 times the performance of a single CPU) but more important, the parallel code scales very well (over 90 percent scalability at 16 nodes).
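The idea of dividing independent database searches among nodes can be sketched as follows. This is plain Python with invented names, and score is a trivial placeholder rather than a real HMM search. The last lines work out what the scaling figure implies, assuming "90 percent scalability" means parallel efficiency, that is, speedup divided by the number of nodes.

```python
# Illustrative sketch only: score() is a stand-in, not a real HMM search.

def split_database(sequences, num_nodes):
    """Deal the database sequences out to the nodes round-robin."""
    return [sequences[node::num_nodes] for node in range(num_nodes)]


def score(sequence, motif):
    # Placeholder scoring: count occurrences of a motif in the sequence.
    return sequence.count(motif)


def search_on_node(chunk, motif):
    """Work done by one node; it needs no data from any other node."""
    return [(seq, score(seq, motif)) for seq in chunk]


def parallel_search(sequences, motif, num_nodes):
    """Run every node's search (serially here) and merge the hits."""
    hits = []
    for chunk in split_database(sequences, num_nodes):
        hits.extend(search_on_node(chunk, motif))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def parallel_efficiency(speedup, num_nodes):
    """Efficiency is the speedup divided by the number of nodes."""
    return speedup / num_nodes


database = ["ACGTAC", "TTTT", "ACACAC", "GGAC"]
ranked = parallel_search(database, "AC", num_nodes=2)

# Assumption: "90 percent scalability at 16 nodes" read as efficiency.
min_speedup = 0.90 * 16
print(ranked[0], min_speedup)
```

Under that reading, 16 nodes deliver at least about 14.4 times the throughput of one node, on top of the per-node gain from the GPU itself.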
The Visualization Lab at SUNY Stony Brook has been working with GPU clusters for several years. The lab does research in both distributed graphics as well as distributed computation using the GPUs. In 2004, the Visualization Lab presented a paper about developing a Lattice Boltzmann Method (LBM) on a GPU cluster to simulate the dispersion of airborne contaminants in the Times Square area of New York City. The team added graphics cards (GPUs) to nodes in a cluster and rewrote the LBM code to use GPUs as well as the network. Since the GPUs don’t have direct access to networking interfaces, they had to transfer the data that is to be passed (either sent or received) to the CPU, which then transfers it to the required node, which then transfers it to the GPU on the node. Using this method, the programmers were able to get a speedup of 4.6X compared to just using the CPUs.
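The three-hop data path described above can be modeled with a toy example. Plain Python lists stand in for GPU memory, host memory, and the network, and every name is invented for illustration; no real GPU or MPI calls are involved. The point is simply that only the host buffers ever touch the network.

```python
# Toy model of staging GPU data through the CPU. Plain Python lists stand
# in for GPU memory, host memory, and the network; all names are invented.

class Node:
    def __init__(self, gpu_data):
        self.gpu_memory = list(gpu_data)   # data resident on the GPU
        self.host_buffer = []              # staging area in CPU memory

    def download_boundary(self, count):
        """GPU -> CPU: copy the last `count` cells into host memory."""
        self.host_buffer = self.gpu_memory[-count:]

    def upload_halo(self):
        """CPU -> GPU: prepend the received cells as a halo region."""
        self.gpu_memory = self.host_buffer + self.gpu_memory
        self.host_buffer = []


def network_send(sender, receiver):
    """CPU -> CPU: only host buffers ever touch the network."""
    receiver.host_buffer = list(sender.host_buffer)
    sender.host_buffer = []


left = Node([1.0, 2.0, 3.0, 4.0])
right = Node([5.0, 6.0, 7.0, 8.0])

left.download_boundary(1)     # hop 1: GPU to CPU on the sending node
network_send(left, right)     # hop 2: CPU to CPU across the network
right.upload_halo()           # hop 3: CPU to GPU on the receiving node
print(right.gpu_memory)
```

Each hop costs time, so codes of this kind generally try to send only the boundary data that neighboring nodes actually need rather than whole arrays.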
Based on their experience with the LBM code as well as others, the Visualization Lab has recently developed an object-oriented middleware for clustered GPU programming called ZippyGPU. It sits above the low-level GPU APIs such as OpenGL and Cg, as well as the communication API, MPI. It encapsulates and manages the objects and functions provided by these APIs, making coding much easier. Your code makes use of ZippyGPU’s objects to provide access to the GPU and to the network. ZippyGPU is designed to support visualization and image compositing, as well as GPU based computations. The lab has shown how to use ZippyGPU to write something like the Global Arrays (GA) toolkit for combining the memory of the GPUs within the MPI run for large data arrays and computing with them (or using them for visualization).
Acceleware is a company that is developing GPU hardware and software. They market Accelerator Boards that are plugged into PCI-e slots to provide GPU (Stream) processing for single machines, primarily workstations. They also market workstations that use two or even four Accelerator Boards in a single machine. As an example, the company has written a conjugate gradient solver for sparse systems of equations. On a 32-bit problem, the code achieved a 3X performance increase over current CPUs. They are also developing electromagnetic problem solvers that use GPUs.
GPUs offer great hope in increasing the performance of HPC codes. While currently difficult to program and limited to “the right problem”, the performance gains are too large to ignore. Better tools and help are on the way. And for all the gamers out there, next time you are considering a new graphics card, think FLOPS instead of frags.
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Fry’s enjoying the coffee and waiting for sales (but never during working hours).