If pork is the other white meat, GPUs might just be the other "other silicon." Want to get started writing code for your graphics processor? Use this intro to NVIDIA's CUDA to get started.
Writing Basic CUDA Code
Although CUDA includes a very low-level “driver API” that exposes CUDA fundamentals, it requires the developer to manually manage processes, threads, and other objects. The preferred programming interface — particularly for beginners — is called “C for CUDA,” and allows the developer to write CUDA code using a handful of C extensions. A runtime library called cudart handles initialization and management of processes and threads on the GPU device.
C for CUDA’s principal extension is a special type of function called a “kernel” that, when called, runs in parallel threads on the GPU device, rather than in serial fashion on the CPU. Kernels are defined with the __global__ specifier, and are called with a special triple angle-bracket syntax, e.g.
myKernel<<< ... >>>(). Within the angle brackets you specify how many threads will be created at runtime. The syntax is formally known as the execution configuration, and is flexible enough to support CUDA’s hierarchy of threads, blocks of threads, and grids of blocks, but for simple cases a full understanding of that hierarchy’s nuances is not required.
For example, the canonical choice for parallel computation is vector addition:
__global__ void VectorAdd(float* A, float* B, float* C)
{
    int myindex = threadIdx.x;
    C[myindex] = A[myindex] + B[myindex];
}
declares the kernel, and it is called within the program as:
VectorAdd<<< 1, N >>>(A, B, C);
With this invocation, the kernel is executed N times by N different threads. This group of threads is collectively referred to as a thread block. Each thread receives its own threadIdx, and thus computes one unique value in the final vector C.
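The kernel itself operates on GPU device memory, so a complete program must allocate device buffers and copy the inputs over using cudart runtime calls before the launch. A minimal host-side sketch of the VectorAdd example (the vector length and variable names here are illustrative, not from the article):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void VectorAdd(float* A, float* B, float* C)
{
    int myindex = threadIdx.x;
    C[myindex] = A[myindex] + B[myindex];
}

int main(void)
{
    const int N = 256;
    size_t size = N * sizeof(float);
    float hostA[N], hostB[N], hostC[N];
    for (int i = 0; i < N; i++) { hostA[i] = i; hostB[i] = 2 * i; }

    /* Allocate device memory and copy the inputs to the GPU */
    float *devA, *devB, *devC;
    cudaMalloc((void**)&devA, size);
    cudaMalloc((void**)&devB, size);
    cudaMalloc((void**)&devC, size);
    cudaMemcpy(devA, hostA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, hostB, size, cudaMemcpyHostToDevice);

    /* Launch one block of N threads, as in the example above */
    VectorAdd<<< 1, N >>>(devA, devB, devC);

    /* Copy the result back to host memory and clean up */
    cudaMemcpy(hostC, devC, size, cudaMemcpyDeviceToHost);
    cudaFree(devA); cudaFree(devB); cudaFree(devC);

    printf("C[10] = %f\n", hostC[10]); /* expect 30.0 */
    return 0;
}
```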
You can also treat thread blocks as two- or three-dimensional arrays. Notice that the counter is assigned the x “component” of threadIdx; for the sake of convenience, threadIdx can be addressed as a one-dimensional index (as makes sense for a vector), or as a two- or three-dimensional one, with threadIdx.y and threadIdx.z components as well. The dimensions and size of the thread block are defined in the kernel’s execution configuration; in this example, it is a vector of length N.
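As a sketch of the multi-dimensional form, a hypothetical matrix-addition kernel (the names here are illustrative) could use both the x and y components of threadIdx and flatten them into a single array offset:

```cuda
__global__ void MatrixAdd(float* A, float* B, float* C, int width)
{
    /* Flatten the 2-D thread index into a 1-D offset into the arrays */
    int col = threadIdx.x;
    int row = threadIdx.y;
    int i = row * width + col;
    C[i] = A[i] + B[i];
}

/* Launched with a two-dimensional thread block, e.g. for an
   8-by-8 matrix:
   dim3 dimBlock(8, 8);
   MatrixAdd<<< 1, dimBlock >>>(A, B, C, 8);                  */
```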
At runtime, each thread block is distributed to one of the available SMs. Larger-scale problems can be broken into multiple grids of blocks — each grid is assigned to a GPU.
The actual syntax for a kernel execution configuration is <<< GridDimensions, BlockDimensions >>>. GridDimensions tells CUDA the number and size of the grid of blocks — in the example above, there is only one block, so the constant 1 suffices. For a grid of multiple blocks, you can lay out your blocks as a one dimensional vector or a two-dimensional array, and would pass the vector or array as the GridDimensions argument. Similarly, BlockDimensions tells CUDA the number and arrangement of threads within each block.
For example, using an 8-by-16 array of threads would require declaring a variable of C for CUDA’s dim3 data type, then passing it in the kernel execution configuration:
dim3 dimBlock(8, 16);
MyFunction<<< 1, dimBlock >>>(A, B, C);
Within a kernel declaration, the blockIdx variable indexes the current block, just as the threadIdx variable indexes the current thread.
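To illustrate, a variant of the vector-addition kernel can combine blockIdx with the built-in blockDim variable (which gives the number of threads per block) to compute a globally unique index. The kernel name and launch sizes below are illustrative:

```cuda
__global__ void VectorAddLarge(float* A, float* B, float* C)
{
    /* blockIdx selects the current block, blockDim gives its size;
       together with threadIdx they yield a unique global index */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

/* Four blocks of 256 threads each cover a 1024-element vector:
   VectorAddLarge<<< 4, 256 >>>(A, B, C);                       */
```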
This may appear confusing at first glance, but the approach corresponds to the architecture of the hardware device: a thread is mapped to a processor core, a block (composed of multiple threads) is mapped to an SM (multiple cores), and, finally, a grid (multiple blocks) is mapped to a GPU (multiple SMs). This abstraction allows C for CUDA code to scale transparently on more powerful hardware without recompilation; the more SMs and cores are available, the more will be done in parallel.
Other Language Additions
In addition to parallel execution, threads within a particular block can both synchronize and access shared memory on the SM. The intrinsic synchronization function __syncthreads() causes each thread to pause and wait until all of the other threads in its block have caught up before proceeding.
Using the shared memory on an SM is much faster than waiting for the system memory bus, particularly when dealing with large numbers of threads. Shared memory on an SM is allocated with the __shared__ qualifier, and C for CUDA provides special functions to copy data between the host computer’s memory and shared memory on the GPU device.
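As a sketch of both features, a hypothetical kernel (the name and buffer size are illustrative) might stage a block’s worth of data in shared memory, synchronize, and then read it back out in reverse order:

```cuda
__global__ void ReverseBlock(float* in, float* out, int n)
{
    /* One shared buffer per block, visible to all of its threads */
    __shared__ float buf[256];

    int t = threadIdx.x;
    buf[t] = in[t];

    /* Wait until every thread has written its element before any
       thread reads a (potentially different) element back out */
    __syncthreads();

    out[t] = buf[n - 1 - t];
}

/* Launched as, e.g.: ReverseBlock<<< 1, 256 >>>(in, out, 256); */
```

Without the __syncthreads() call, a fast thread could read an element of buf before the slower thread responsible for writing it had finished.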
C for CUDA also provides techniques to take advantage of the GPU’s specialized texture memory. Unlike an SM’s shared memory, texture memory is read-only, and can only be accessed by kernels through a special operation called a texture fetch. To optimize for the parallel jobs typical of C for CUDA, texture memory can be accessed as integer, floating-point, or vector data types, and can be addressed as one-, two-, or three-dimensional arrays.
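As an illustrative sketch using the texture reference API of this generation of CUDA (the names here are assumptions, not from the article):

```cuda
/* Texture references must be declared at file scope */
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void ScaleFromTexture(float* out, float factor)
{
    int i = threadIdx.x;
    /* A texture fetch is the only way a kernel can read texture
       memory; the data is read-only from the kernel's perspective */
    out[i] = factor * tex1Dfetch(texRef, i);
}

/* On the host, a device buffer is bound to the reference before
   the launch, e.g.:
   cudaBindTexture(0, texRef, devIn, N * sizeof(float));
   ScaleFromTexture<<< 1, N >>>(devOut, 2.0f);                  */
```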
CUDA includes other extensions to C, such as the ability to page-lock host memory to speed up execution, to interact with OpenGL and Direct3D code, special vector data types, and mathematical functions optimized to run on SM cores, all of which can be used to more fully take advantage of the GPU’s special architecture. A full reference to the language extensions, as well as the low-level driver API and examples, is available in the CUDA Programming Guide PDF on the NVIDIA web site.
Open source adoption and alternatives
GPGPU techniques like CUDA originally were the domain of high-performance computing, large-scale simulations, and the like. CUDA was the system used by a security research team in 2009 to calculate the GSM encryption codebook, for example. But GPGPU is increasingly coming to desktop applications, particularly where image and video processing are concerned. Several 3-D renderers for Blender are working on GPGPU support. The Nona stitcher tool used by the popular open source panoramic photo editor Hugin has a GPU backend, and the Gimp has experimented with writing a GPU backend for its GEGL image-processing library.
For Linux users, the major problems with CUDA are that it is proprietary software, an NVIDIA-only standard, and requires running the closed source, binary-only NVIDIA drivers. This makes it impossible for many distributions to include, on top of limiting the developer to writing code that runs only on a single manufacturer’s hardware.
ATI had a competing closed GPGPU system called Close To Metal (CTM) designed for its GPUs in 2007. In 2008, however, ATI announced that it would instead support the open standard OpenCL (for Open Computing Language) on its hardware in future releases. OpenCL was originally developed at Apple, but is now managed by the Khronos Group, which also maintains the OpenGL standard.
Since ATI’s OpenCL support was announced, NVIDIA has begun work on OpenCL for its GPUs. Beta releases started to appear at the end of 2009. Like its CUDA releases, though, NVIDIA’s OpenCL tools for Linux require the binary NVIDIA drivers. Developers who are looking for a completely free software GPGPU framework may have to wait for the open source Gallium3D project.
In the meantime, though, if you want to get started writing GPU-utilizing code in Linux, the CUDA architecture is the fastest and most mature solution available.