Learn how graphics processing units are pushing the HPC envelope.
Everyone knows the clock speed increases in commodity CPUs have slowed to a crawl. There are several reasons for this, but the bottom line is that number crunchers aren’t likely to see increasing performance with each new processor as has happened in the past. At the same time, Moore’s Law still holds, so the number of transistors per processor is still increasing. CPU companies have taken the extra transistors and used them to create extra cores and cache, ushering in the multi-core era of computing.
Cores are great, but there are some challenges facing many programmers as they convert to multi-core systems.
Some applications are driven by single core performance. The application can run over multiple cores using threads, OpenMP, or MPI, but in the end, the overall code performance is driven by the single limiting process on a single core.
The performance of the memory and I/O buses isn’t increasing as rapidly as the number of cores. Thus, the amount of memory bandwidth per core is decreasing, reducing performance. Also, the number of cores per network interface card (NIC) is rapidly increasing, possibly reducing performance, or at the very least requiring high-performance networks.
The challenges with multi-core CPUs in HPC can be overcome to a certain degree, but the quest for more performance continues. The hunt has led people to look for other commodity hardware that can be used to increase performance. Enter the modern day Graphics Processing Unit (GPU).
History Repeats Itself
Let’s step back a few years and recall how clusters evolved. Clusters arose because traditional HPC systems were expensive and processing time was limited: HPC in the cathedral, as it were. At the same time, PC components were becoming faster. At NASA Goddard, Tom Sterling and Don Becker demonstrated the Beowulf (http://www.beowulf.org) concept using commodity PC parts and open source software. The pair’s target cost was $50,000. Such a system would drastically reduce the entry cost for a local HPC resource. About that time, commodity parts (x86 hardware) were becoming price/performance competitive with the fastest workstations, and Fast Ethernet was making its entry into the market. (Open source and the Internet came along as well, but that is another story.) The first Beowulf was a huge success and forged a path for true commodity-based HPC computing. This trend continues today, where the HPC practitioner continues to leverage high performance at commodity prices.
As a rule, HPC users are always looking to gain more performance. One of the options is a hybrid scheme that adds a co-processor to the system. The co-processor (one or more) is used for computations on the node. There are several co-processor options being pursued, including Field Programmable Gate Arrays (FPGA, http://en.wikipedia.org/wiki/FPGA) and specialized co-processors such as Clearspeed (http://www.clearspeed.com). But neither is a commodity-based solution, and both consequently cost more.
There is, however, a commodity co-processor component that has shown continued growth in specialized performance: modern day graphics cards.
Thanks in part to the PC gaming industry, today’s graphics cards have very high-performance GPUs as computational engines. These GPUs are programmable, so you can write code for them. In essence, you’re using your high-end graphics card not for slaying monsters and aliens, but for computing.
Many people get excited when a CPU is released at a higher GHz. A bit more performance, plug and play, life is great. GPUs on the other hand have been quietly, at least to the HPC world, taking giant leaps and bounds in performance.
Figure One below illustrates the trend in the theoretical performance of ATI and NVIDIA GPUs from about 2003 to 2005 along with the performance of Pentium 4 CPUs.
FIGURE ONE: Theoretical performance of ATI and NVIDIA GPUs and Pentium 4 CPUs (Used with permission from Ian Buck, Stanford University)
The GPUs show a huge increase in performance as compared to the CPUs. At the end of the chart, in mid-2005, the performance of these GPUs is several times the performance of the CPUs. For example, Intel’s desktop quad core product, the Core 2 Extreme QX6700, runs at 2.6 GHz and has a peak floating point performance of 50 GFLOPS, while NVIDIA’s latest GPU, the G80, has a peak floating point performance of 330 GFLOPS. ATI’s Radeon HD 2900XT peak performance is about 475 GFLOPS. Using Crossfire, two ATI Radeon cards, and a little overclocking, the peak performance has reached approximately 1 TFLOPS! Granted, these numbers are single precision at the moment and for specialized computations, but the potential performance is huge and the hardware is available now.
Why is the GPU performance increasing so quickly? The PC gaming sector is large and growing, and is currently considered a multi-billion dollar market. The economic motivation for developing faster GPUs is obvious. In addition, it may be easier to take extra transistors and use them for GPU computations rather than just cache because of the nature of GPU calculations. (CPUs are using extra transistors for more cores, not bigger cores.)
Moreover, if you haven’t noticed, the release cycle for GPUs is faster than for CPUs due to their less complex designs. So, with a high potential performance at perhaps a cheaper price point, you can see why GPUs are being considered as a co-processor or even a main processor.
GPUs are Different
GPUs are fundamentally different from regular CPUs. Currently, GPUs are designed to take a 3D model of a scene and pass it through various steps in the rasterization-based rendering process to create an image to display. To increase the performance of this process, modern GPUs are designed to pipeline the data (see http://en.wikipedia.org/wiki/Graphics_pipeline). Within the rasterization process, there are a number of stages that are performed in series using the GPUs and code fragments to process the data.
For computations, rather than rasterization, the GPUs can be thought of as stream processors (http://en.wikipedia.org/wiki/Stream_processing). Stream processing takes a series of compute-intensive operations (called kernels) and applies them to every element in a set of data that is called a stream. In general, GPUs apply one kernel at a time to a particular stream, but multiple streams can be processed at the same time (they are independent of one another).
But the kernel is not an entire program; it’s just a part of the algorithm. So to compute an entire algorithm, several kernels are applied in series to the stream. However, since the kernel has to be applied to the entire stream, there are things that you can’t do in GPUs such as reading and writing at the same time.
Consequently, you have to think of the data and the program in a different way when writing for GPUs. You have to think of the algorithm in terms of processing streams of data using a kernel that does the same operation on the entire stream. You can chain together kernels to perform a series of operations, but each kernel is restricted to applying the same operations to the entire stream.
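The stream model can be sketched in a few lines of ordinary Python. This is only an illustration that runs on the CPU, not GPU code: the names run_kernel, run_chain, scale, and offset are made up for this example. It shows a kernel being applied to every element of a stream, with each pass reading its input stream and writing a separate output stream, and kernels chained in series.

```python
from typing import Callable, List, Sequence

# A kernel here is assumed to be a function of one stream element.
Kernel = Callable[[float], float]


def run_kernel(kernel: Kernel, stream: Sequence[float]) -> List[float]:
    """Apply one kernel to every element of a stream.

    The input stream is only read; results go to a new output stream,
    mirroring the rule that a kernel cannot write to its input.
    """
    return [kernel(x) for x in stream]


def run_chain(kernels: Sequence[Kernel], stream: Sequence[float]) -> List[float]:
    """Chain kernels: the output stream of one pass feeds the next pass."""
    current = list(stream)
    for kernel in kernels:
        current = run_kernel(kernel, current)
    return current


def scale(x: float) -> float:
    """Hypothetical kernel: double each element."""
    return 2.0 * x


def offset(x: float) -> float:
    """Hypothetical kernel: add one to each element."""
    return x + 1.0


data = [0.0, 1.0, 2.0, 3.0]
result = run_chain([scale, offset], data)
print(result)
```

Because no element depends on any other, each list comprehension could in principle be evaluated for all elements at once, which is the parallelism a GPU exploits.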
The benefits of stream processing are potentially huge. In particular:
There is explicit parallelism in streams (data and task)
No communication between stream elements (simplifies operations)
Kernels cannot write to the input streams
Stream elements are independent
Parallelism hides memory access (latency)
While it may not be obvious how these items contribute to performance, if your code follows the stream model of computing, it can give you a huge boost in performance. But you have to live within the stream computing model and its requirements. (Sorry, there’s no free lunch.)
The concept of an attached processor is not new. In the days of big iron HPC, there were specialized devices called “array processors” that were attached to HPC computer systems. These devices, much like the GPUs of today, were designed to do specific tasks faster than the host CPU. It seems that the commodity graphics market has allowed an old idea to become new again.
Some of the requirements or limitations of stream computing, in addition to the ones previously discussed, are:
No stack or heap
No integer or bit-wise operations
No scatter operations (a[i]=b)
No reduction operations (
These features are missing because games haven’t asked for them (yet), and it’s still early in “GPGPU” (General-Purpose computation on Graphics Processing Units) development. Demand may warrant additions. Yet if you can get around these limitations and follow the stream programming model, the potential for a huge gain in performance at low cost is out there.
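Two common ways of getting around the missing scatter and reduction operations can be sketched in plain Python. This is a CPU-only illustration under stated assumptions, not code for any particular GPU API: the function names are invented for the example, the scatter rewrite assumes each destination index is used at most once, and the inverse index map is assumed to be built on the host.

```python
from typing import List, Sequence


def scatter_as_gather(values: Sequence[float], dest: Sequence[int], size: int) -> List[float]:
    """Emulate the scatter out[dest[i]] = values[i] using only gathers.

    Each output element looks up which input (if any) targets it, so every
    kernel invocation writes only to its own output position.
    Assumes each destination index appears at most once.
    """
    source_of = {d: i for i, d in enumerate(dest)}  # inverse map, built on the host
    return [values[source_of[j]] if j in source_of else 0.0 for j in range(size)]


def reduce_sum_multipass(stream: Sequence[float]) -> float:
    """Sum a stream with repeated passes, each halving the stream length.

    Every pass is an ordinary kernel: output element j reads inputs 2j and
    2j+1 and writes only position j.
    """
    current = list(stream)
    if not current:
        return 0.0
    while len(current) > 1:
        if len(current) % 2:  # pad odd-length streams with the additive identity
            current.append(0.0)
        current = [current[2 * j] + current[2 * j + 1] for j in range(len(current) // 2)]
    return current[0]


scattered = scatter_as_gather([10.0, 20.0, 30.0], dest=[2, 0, 3], size=5)
total = reduce_sum_multipass([1.0, 2.0, 3.0, 4.0, 5.0])
print(scattered, total)
```

The reduction needs roughly log2(n) passes instead of one, which is the price of staying inside the stream model.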
There are two major programming APIs for GPUs: OpenGL (http://www.opengl.org) and DirectX (http://en.wikipedia.org/wiki/DirectX). To run your algorithm on GPUs, you must translate it into OpenGL or DirectX functions. This is not an easy task. You have to understand the algorithm and how to program a GPU. Even using the stream processing concepts, it is difficult to write code using the GPU APIs.
For example, instead of talking about storage (memory) you have to talk about textures. Instead of talking about a computational kernel, you have to talk about fragment programs. And instead of talking about a foreach loop over a data stream, you have to talk about a render pass. There are few people who understand both computational algorithms and GPU coding. The obvious solution is to develop high-level languages for writing GPU code.
Higher level languages abstract many of the fundamental GPU codings that are needed for expressing algorithms. This mapping makes it easier to write code that runs on GPUs if you are not a graphics coder. Some of the higher level languages (or standard languages with extensions) for GPUs are Sh (http://libsh.org/), Brook (http://graphics.stanford.edu/projects/brookgpu), Shallows (http://shallows.sourceforge.net/), CUDA (Compute Unified Device Architecture, http://www.NVIDIA.com/object/cuda.html), and RapidMind (http://www.rapidmind.net/).
The languages vary in style, but most of them look something like C and C++ and incorporate stream concepts. For example, Brook treats the GPU as a streaming processor, and the language looks like C with some stream extensions. At one time, there was also a derivative of Brook named Brooktran that was close to Fortran yet based on Brook. Brook works with ATI and NVIDIA GPUs, OpenGL and DirectX, as well as Windows and Linux.
CUDA is very interesting because NVIDIA created a new model for programming general purpose computations on GPUs. CUDA is data parallel computing, using thousands of threads with a Parallel Data Cache to help increase arithmetic intensity for large performance boosts (arithmetic intensity is the ratio of arithmetic operations to memory operations in the code). With CUDA, you can program in C and then use extensions to program for the GPUs. This feature allows you to target certain portions of the code for execution on the GPU and the rest to run on the CPU.
CUDA has four sets of extensions:
Function type qualifiers define whether a function runs on, and can be called from, the CPU or the GPU
Variable type qualifiers specify memory location of a variable on a GPU
A directive to specify how a kernel is executed on the GPU from the CPU
Four built-in variables specify the grid and block dimensions and the block and thread indices. The variables describe the problem so the run-time can put the threads in the appropriate location
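To give a feel for how the grid and block variables are used, here is a sequential Python emulation of a one-dimensional launch. It is not CUDA code and does not call any CUDA API: the launch function is a hypothetical stand-in for the execution directive, and the parameters grid_dim, block_idx, block_dim, and thread_idx merely mirror the roles of CUDA's built-in variables gridDim, blockIdx, blockDim, and threadIdx.

```python
from typing import Callable, List


def launch(kernel: Callable[..., None], grid_dim: int, block_dim: int, *args) -> None:
    """Emulate a 1D launch by visiting every (block, thread) pair in turn.

    On a GPU these invocations would run in parallel; here they run
    one after another.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(grid_dim, block_idx, block_dim, thread_idx, *args)


def vector_add(grid_dim: int, block_idx: int, block_dim: int, thread_idx: int,
               a: List[float], b: List[float], out: List[float]) -> None:
    """Each emulated thread computes one output element."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(out):                        # guard: the last block may be partly idle
        out[i] = a[i] + b[i]


n = 10
a = [float(i) for i in range(n)]
b = [10.0 * i for i in range(n)]
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover all n elements
launch(vector_add, grid_dim, block_dim, a, b, out)
print(out)
```

The bounds check matters because the number of emulated threads (grid_dim times block_dim) is rounded up and can exceed the problem size.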
As you can see, the extensions are fairly minor. You take your code and build it with NVCC (NVIDIA’s compiler that comes with CUDA), which splits the code into two parts: the GPU-specific code, and the CPU-specific code. The CPU-specific code is output so you can compile it outside of NVCC. The GPU code is compiled into a GPU binary form.
CUDA has a run-time library that runs on the CPU (the host) and provides functions to access and control GPUs (possibly more than one). It also has a component that runs on the GPUs to provide specific GPU functions, and a “common component” that has built-in vector data types (remember to think of the GPUs as stream processors) and a subset of the Standard C Library that runs on both the CPU and the GPUs. CUDA also comes with pre-built Basic Linear Algebra Subprograms (BLAS) libraries and Fast Fourier Transforms (FFT) libraries. One nice thing about CUDA is that it allows for combined CPU and GPU programming. It is currently specific to NVIDIA GPUs and it is freely available.
GPU Computing Resources
There are some GPU resources available around the Internet and one central GPU website, http://www.gpgpu.org, that links to papers, tools, workshops, and more. GPGPU is a great place to keep up with new developments on GPUs, search for materials, or learn more about the many ways GPUs are being used.
The GPGPU site also includes a wiki (http://www.gpgpu.org/w/index.php/Main_Page) with plenty of great information on getting started with GPUs, a number of tutorials (http://www.gpgpu.org/sc2006), and a list of developer resources (http://www.gpgpu.org/developer). There are also some sample codes and a couple of utility codes. The website also has a forum for questions at http://www.gpgpu.org/forums/.
There are other resources as well, but some of them are specific to a particular graphics vendor. However, the materials are usually freely available. For instance, the NVIDIA website has a very extensive developer section (http://developer.NVIDIA.com/page/home.html) with instructions on how to program the company’s GPUs for either graphics or general computation.
ATI also has a website (http://ati.amd.com/developer/index.html) with information on programming its GPUs (largely for games). Additionally, the company has a page devoted to stream computing (http://ati.amd.com/technology/streamcomputing/index.html), which serves as a launch point for ATI’s efforts in stream computing in general.
You can read more about the ATI stream processor, called the AMD Stream Processor (ATI is now owned by AMD, hence the name of the product), at http://ati.amd.com/products/streamprocessor/index.html. Or search the ATI website for “CTM” (Close to Metal), a thin interface above the GPU hardware. While CTM is not a compiler, it allows other developers to either adapt existing tools to ATI’s GPUs or to develop new tools.
The primary discussion about GPU computing is on the CUDA website at http://developer.nvidia.com/object/cuda.html. It has links to downloads, documentation, samples from the SDK, and a forum for CUDA.
There is a relatively new company called RapidMind (http://www.rapidmind.net) that’s developing GPU programming technology that can be used in a normal C++ environment (compilers and IDEs). The RapidMind Platform is embedded in the application and transparently manages massively parallel computations.
To convert your code to use RapidMind, you identify the parts of the code you want to run on the GPU (RapidMind also supports the Cell Processor) and then perform the following steps:
Convert the data types to use RapidMind’s data types (integer and floating point).
Capture the computations (RapidMind can capture numerical operations when the code is run which are then recorded and dynamically compiled to a program object).
Stream execution. (The RapidMind run-time manages the execution of objects on the GPU).
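The capture idea in the steps above can be illustrated with a toy recorder in Python. This is emphatically not the RapidMind API: the Value class and the capture and execute functions are hypothetical names invented for this sketch, which handles only addition and multiplication by constants. It shows the general technique of running user code once on a special data type that records operations, then replaying the recorded program over a stream.

```python
class Value:
    """Hypothetical recording data type: arithmetic on it logs operations."""

    def __init__(self, ops=None):
        self.ops = list(ops or [])

    def __add__(self, c):
        return Value(self.ops + [("add", float(c))])

    def __mul__(self, c):
        return Value(self.ops + [("mul", float(c))])

    # Support constants on the left-hand side, e.g. 3 * x
    __radd__ = __add__
    __rmul__ = __mul__


def capture(fn):
    """Run fn once on a recording Value and return the recorded program."""
    return fn(Value()).ops


def execute(program, stream):
    """Replay the recorded program on every element of the stream."""
    results = []
    for x in stream:
        for op, c in program:
            x = x + c if op == "add" else x * c
        results.append(x)
    return results


def user_code(x):
    """Ordinary-looking numerical code written against the recording type."""
    return 3 * x + 1


program = capture(user_code)
output = execute(program, [0.0, 1.0, 2.0])
print(program, output)
```

In a real platform the recorded program would be compiled for the GPU rather than interpreted in a loop; the sketch only shows why converting the data types is the first step.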
RapidMind has some case studies (http://rapidmind.net/case-studies.php) that show how it works and what kind of results you can expect. For example, there is a paper (http://rapidmind.net/case-gpu.php) that compares the performance of the BLAS SGEMM routine (single precision matrix multiply), a single precision FFT, and a European option pricing model (Black-Scholes) on a GPU versus the performance on a CPU with highly-optimized code. RapidMind compared GPU routines written with its platform to the optimized CPU routines, using both an ATI x1900 XT card and the NVIDIA 7900 GTX, against an HP workstation running Intel Woodcrest chips and an AMD Opteron at 2.6 GHz. The researchers found that the GPU code for SGEMM was 2.4 times faster than the CPU code; the GPU FFT code was 2.7 times faster; and the Black-Scholes GPU code was 32.2 times faster than the CPU code.
There have been several success stories about using GPUs in computation. The first one is a project called Folding@Home (http://folding.stanford.edu). The Folding@Home application is free, runs on any desktop, and folds proteins to help researchers. (Think SETI@Home for proteins.)
On Sept. 26, 2006, ATI announced that it had ported the Folding@Home application to use ATI GPUs. ATI estimated that an ATI X1900 graphics card could get 20X to 40X more performance than a standard CPU. The company also estimated that using ATI graphics cards it would take only one month to finish what previously took three years to solve.
Another success story is Massachusetts General Hospital. The hospital combines low-power X-rays to form a real-time image through a process called digital tomosynthesis (http://www.mghbreastimaging.org/NewFiles/about_us.html). In the past, the hospital used a 35-node cluster to perform the computations. Using NVIDIA G80 GPUs, the hospital gained a 100X performance boost as compared to standard CPUs. Now, the same task can be performed on a workstation, moving the computational part of the process to the imaging unit.
These are just two success stories about GPUs. There are others. For example, some researchers have come up with several LU (matrix factorization) routines that run much faster on GPUs than CPUs. The LU routine is used by many computer-aided engineering codes to solve problems.
GPUs for Clusters
A GPU can assume work from the CPU on a single cluster node, but can also work cooperatively with the CPU if the CPU is running a parallel MPI job. Expect more in this area in the future. In the meantime, GPUs have been used in clusters in other ways.
GPU clusters, or clusters where the GPU is the primary focus, are currently being used as tiled displays. An example of this is the package called Chromium (http://chromium.sourceforge.net), which performs parallel rendering. You can use Chromium to drive multiple displays, or to drive a single display using multiple compute nodes equipped with GPUs.
Mike Houston at Stanford University (http://graphics.stanford.edu/~mhouston) has been working with GPU clusters for some time. He worked with a team on a parallel GPU version of a protein search code that uses hidden Markov models (HMM). The team called its version ClawHMMer (graphics.stanford.edu/papers/clawhmmer/hmmer.pdf). The HMM code was rewritten to run on GPUs and then modified again so that the database searches are divided among multiple nodes in a GPU cluster. While in general it’s an embarrassingly parallel algorithm because the searches are independent of one another, the overall search itself isn’t parallel. The per node performance is quite good (about 10-40 times the performance of a single CPU), but more important, the parallel code scales very well, with over 90 percent scalability at 16 nodes.
The Visualization Lab at SUNY Stony Brook (http://www.cs.sunysb.edu/~vislab/) has been working with GPU clusters for several years. The lab does research in both distributed graphics and distributed computation using the GPUs.
In 2004, the Visualization Lab presented a paper (http://www.cs.sunysb.edu/~vislab/projects/gpgpu) about developing a Lattice Boltzmann Method (LBM) on a GPU cluster to simulate the dispersion of airborne contaminants in the Times Square area of New York City. The team added graphics cards (GPUs) to nodes in a cluster and rewrote the LBM code to use GPUs as well as the network. Since the GPUs don’t have direct access to networking interfaces, they had to transfer the data that is to be passed (either sent or received) to the CPU, which then transfers it to the required node, which then transfers it to the GPU on the node. Using this method, the programmers were able to get a speedup of 4.6X compared to just using the CPUs.
Based on their experience with the LBM code as well as others, the Visualization Lab has recently developed an object-oriented middleware for clustered GPU programming called ZippyGPU (http://www.gpgpu.org/sc2006/workshop). It sits above the low-level GPU APIs such as OpenGL, as well as the communication API, MPI. It encapsulates and manages the objects and functions provided by these APIs, making coding much easier. Your code makes use of ZippyGPU’s objects to provide access to the GPU and to the network.
ZippyGPU is designed to support visualization, image compositing, and GPU-based computations. The lab has shown how to use ZippyGPU to write something like the Global Arrays (GA) toolkit, combining the memory of the GPUs within an MPI run to hold large data arrays and compute with them (or use them for visualization).
The company Acceleware (http://www.acceleware.com/) develops GPU hardware and software, and markets its Accelerator Boards to plug into PCI-e slots to provide GPU (stream) processing for single machines, primarily workstations. The company also markets a workstation that uses two or even four Accelerator Boards. As an example, the company has written a conjugate gradient solver for sparse systems of equations (see http://www.gpgpu.org/sc2006/workshop). On a 32-bit problem, the code achieved a 3X performance increase over current CPUs. Acceleware is also developing electromagnetic problem solvers that use GPUs.
A Fast Future
GPUs offer great hope in increasing the performance of HPC codes. While currently difficult to program and limited to “the right problem,” the performance of GPUs is simply too great to ignore. Better tools and help are on the way. And for all the gamers out there, next time you are considering a new graphics card, think FLOPS instead of frags.