If pork is the other white meat, GPUs might just be the other "other silicon." Want to write code for your graphics processor? This introduction to NVIDIA's CUDA will show you how to get started.
General Purpose computing on Graphics Processing Units, or GPGPU, is one of the hottest up-and-coming trends in software development. By writing code specifically to run on the massively parallel stream processors found on today’s high-end 3-D graphics cards, programmers can speed up an array of algorithms — and not just in high-performance computing (HPC) applications. One of the most popular GPGPU architectures available for Linux is NVIDIA’s “Compute Unified Device Architecture” (CUDA). If you are looking to get started, CUDA allows you to write GPU-optimized code in C, using only a few language extensions.
Background: Mapping Parallel Programming onto Graphics Card Hardware
CUDA was introduced in 2006, and subsequent releases have expanded its capabilities to keep up with newer NVIDIA hardware, while maintaining a backwards-compatible API. It does require an NVIDIA card from the GeForce 8 series or newer, although the toolchain can build emulation-mode code for testing purposes.
Understanding the GPU hardware is the key to understanding when and how CUDA can accelerate your program. With the GeForce 8 series, NVIDIA switched from designing GPUs with dedicated units for discrete tasks such as vertex shading and pixel shading to a more flexible “unified shading architecture” in which the GPU consists of a set of identical, multithreaded “streaming multiprocessors” (SMs) that can be programmed for any task. Entry-level GeForce 8 GPUs included just two SMs, while top-of-the-line models today contain as many as 30. The company also makes workstation-class GPUs under the “Quadro” label, and dedicated GPGPU models under the “Tesla” label.
Each SM consists of eight processing cores, an instruction unit, and shared memory. The eight cores execute concurrently in separate threads, which is where the parallelism begins. Still, there are several limitations — although GPUs can execute many times more operations per second than CPUs, they have very little cache, very limited flow control, and in general are slower at memory operations. Just as importantly, each SM has multiple concurrent cores, but those cores execute the same instructions, on different chunks of data. CUDA includes a variety of ways to cope with these limitations, including breaking data sets into blocks that can be easily divided up among the SMs.
CUDA can provide orders of magnitude more performance than a CPU, if the problem is parallelizable and contains a lot of arithmetic. Because other code is best executed on a CPU, however, the system makes a clear separation between “host code” that runs on the CPU and “device code” that runs on the GPU. Finally, in addition to its multithreaded cores, CUDA repurposes the GPU’s texture memory to serve as a fast, shared memory space accessible to threads without waiting for the slower system memory bus.
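To make the host/device split concrete, here is a minimal, hypothetical sketch of a CUDA vector-addition program. The kernel (marked `__global__`) is device code that runs on the GPU, one thread per array element; everything in `main()` is host code that runs on the CPU. All names here are illustrative, and error checking is omitted for brevity:

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Device code: each GPU thread adds one pair of elements. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               /* guard: the last block may be partially full */
        c[i] = a[i] + b[i];
}

/* Host code: allocates memory, copies data, and launches the kernel. */
int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* The data set is broken into blocks that CUDA distributes
       across however many SMs the GPU happens to have. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Note that the launch configuration is computed from the problem size, not from the hardware; this is what makes the same binary scale across GPUs with different SM counts.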
The techniques CUDA uses are also automatically scalable over the number of SMs in your GPU — you do not need to re-write anything to take better advantage of a high-end GeForce 200 or Tesla card over an older model. Multiple CUDA-capable cards in a single system can also be used together, further expanding the possibilities.
Setting Up the Toolchain
NVIDIA’s basic CUDA toolchain consists of three parts: the NVIDIA graphics card drivers, the CUDA Toolkit, and the CUDA SDK. The drivers are available through most Linux distributions, but are also provided directly by NVIDIA. CUDA will not work with the open source nv or nouveau drivers.
Both of the latter components are available from the company’s developer Web site; no registration is required to download them. The Toolkit includes NVIDIA’s CUDA compiler driver nvcc, libraries, headers, and other tools necessary to build and compile CUDA applications. It is available for 32-bit or 64-bit processors, and is provided in distribution-specific packages — currently Fedora, Red Hat Enterprise Linux, OpenSUSE, SUSE Linux Enterprise Desktop, and Ubuntu. The SDK contains sample projects and templates.
You will also need to have a working GCC toolchain already installed; nvcc compiles CUDA GPU-executable “device code” itself, and calls GCC to compile and link “host code.” GCC 3.4 and 4.x are supported. The NVIDIA site also includes the standard documentation, references, and getting started guides, plus CUDA-based libraries for linear algebra, Fast Fourier Transforms, and image processing.
Both the Toolkit and SDK are provided as binary installers that must be run as root from a shell prompt. The Toolkit installs to /usr/local/cuda, while the SDK installs to ~/NVIDIA_CUDA_SDK. You will need to add /usr/local/cuda/bin to your PATH, and /usr/local/cuda/lib to your LD_LIBRARY_PATH — for example, by adding export PATH=/usr/local/cuda/bin:$PATH and export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH to your shell profile.
If you have a CUDA-compatible GPU, you can immediately begin compiling sample code from the SDK with the standard GNU make. If you do not, however, you can build the examples for software emulation instead by running make emu=1, which produces binaries that run entirely on the CPU.
The SDK includes two utilities you can use to verify your system’s compatibility. Run ./deviceQuery from the SDK’s C/bin/linux/release directory, and it will return a detailed report on the GPU hardware installed on the system, including the number of SMs and cores, memory and texture information, and CUDA version compatibility. Newer releases of CUDA sometimes add capabilities to better take advantage of new GPUs, but all hardware is backwards-compatible. You can run nvcc -V to verify which version of the CUDA Toolkit you have installed. Finally, the bandwidthTest utility tests communications between the host and the GPU device, and returns a simple pass/fail report.
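You can also query the same information from your own programs: the CUDA runtime API exposes the device properties that deviceQuery prints. A minimal sketch, with error handling omitted for brevity, might look like this (compile it with nvcc, e.g. nvcc -o query query.cu):

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices found: %d\n", count);

    for (int dev = 0; dev < count; dev++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("  Global memory:         %lu MB\n",
               (unsigned long)(prop.totalGlobalMem >> 20));
    }
    return 0;
}
```

The compute capability reported here is what determines which CUDA features the card supports, which is useful when you need more detail than bandwidthTest’s pass/fail answer.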
NEXT: Writing Basic CUDA Code