Developing for GPUs, Cell, and Multi-core CPUs Using a Unified Programming Model

Processor architectures have recently added several forms of explicit parallelism, most prominently multiple cores. In fact, due to the exponential increase in transistor density predicted by Moore’s Law and the difficulties with using a large number of transistors efficiently in individual cores, we can expect the number of cores to grow exponentially over time.

Unfortunately, most development tools (and the skill sets of most developers) are designed around serial computational models. The need to suddenly start developing parallel software poses a severe challenge to the entire software industry.

The development of parallel software is riddled with traps for the unwary, including race conditions, deadlock, and non-determinism. In addition, new parallel algorithms and approaches to programming are required to achieve performance that will scale with additional cores. The challenge is great, but so is the opportunity: serial processors have been deep in the territory of diminishing returns for many years. A shift to explicit parallelism exposes opportunities for much higher performance even in the short term.

It is important to realize that processors also use many other forms of explicit and implicit parallelism, including parallel operations on the contents of a single register, pipelining, memory prefetch, single-core simultaneous multithreading (“hyperthreading”), and superscalar instruction issue. New processor options have also emerged, such as the Cell BE processor and GPUs, that are even more aggressive in their use of parallelism and offer the potential for massive improvements in performance per watt.

In developing the RapidMind platform, we wanted to address the multi-core programming challenge while also enabling developers to take advantage of these other opportunities for radically increased performance. Our system enables the rapid development of high-performance software for multi-core processors, including GPUs and the Cell BE.

It supports a general-purpose programming model and can be used for any kind of application. Applications written using the RapidMind platform have achieved, in externally validated benchmarks, performance improvements of over 30 times without significantly increased programmer effort. In fact, use of the RapidMind approach often resulted in code that was simpler and more portable than the baseline code.



Figure 1: Peak CPU Performance



Figure 2: Cell BE Performance Comparison



Figure 3: GPU Performance

The GPU and CPU benchmarks used the same baseline code, independently optimized by researchers from Hewlett Packard’s High Performance Computing Division. In the case of the GPU, a speedup of over 30 times was observed. In the case of the CPU, a quad-core Intel Core 2 processor, a speedup of nearly 8 times over the single-core performance of the baseline code was observed.

Our approach is based on familiar concepts and tools, in order to minimize disruption to the current software development process and allow fast adoption. A developer continues to use their existing C++ development environments (IDEs) and even compilers, and can write code using simple extensions of the existing C++ concepts of functions and arrays.

The resulting applications can run on a variety of processors and will automatically scale to future processors with additional cores, cores with heterogeneous instruction set extensions, and even accelerators such as GPUs. An emphasis on portability means that not only can a developer use different processors today, they will be able to use the same source to target future processors.

Portability is important not only for migrating between currently available processors and accelerator configurations, but also for future-proofing code against upcoming architectural developments, including heterogeneous multi-core processors.

For concreteness, this article begins with a simple tutorial. We will then discuss the SPMD stream processing model, the parallel computational model used by our platform.

This computational model is based on a general and scalable form of data-parallelism that is applicable to a broad range of applications, but avoids the dangers and complexities of explicit thread-based parallel programming. In particular, it avoids the problems of race conditions and deadlock by construction, and also makes it possible to construct deterministic parallel programs.

Basics

The interface to the RapidMind platform is embedded in C++, and uses only ISO standard C++ for its syntax. You use your existing compiler, such as gcc, but include a header file that defines a small set of types. Normally, to perform a computation in C++, you declare variables of the appropriate type, such as float or int, and then specify a sequence of operations on them.

To perform a computation with the RapidMind platform, you do the same thing, but with RapidMind types. The difference is that operations on RapidMind types can either be executed immediately in the usual way or recorded. A recorded sequence of computations is compiled dynamically by the platform into machine language for the target co-processor. It can then be executed later as a function, and can run in parallel over a set of co-processors.

In addition to making it easy to specify and execute parallel computations, RapidMind makes it possible to easily write programs that build custom code dynamically. This allows you to use all the modularity constructs of C++ to organize your computation while avoiding their runtime overhead.

Since only operations on RapidMind types are recognized by the RapidMind API, the cost of any other C++ operations (following pointers around data structures, invoking virtual functions) can be compiled out. Furthermore, the remaining computational “trace” processed by RapidMind has semantics that allow the RapidMind code generator to perform extremely aggressive parallel execution, targeting multiple parallel execution mechanisms in the target processors and leading to outstanding performance.

There are three basic types in RapidMind: the Value, the Array, and the Program. The Value type represents numbers (and fixed-length tuples of numbers), the Array multidimensional collections, and the Program functions. By recording sequences of operations on instances of the Value type you can create a Program, and by applying a Program to an Array you can invoke a parallel computation.

Values

The Value<N,T> template type represents a tuple of N numbers of type T, where T can be a float, an int, a bool, or a similar built-in basic type. The template is provided so that generic code can be written; however, much of the time, and in the examples to follow, we will use a shorthand: a Value3f is equivalent to Value<3,float>, a Value2i is equivalent to Value<2,int32_t>, a Value1ui is equivalent to Value<1,uint32_t>, and a Value4bool is equivalent to Value<4,bool>.

Operations on Value types always take place component by component. For example, if a and b are instances of type Value3f, then a*b computes a new Value3f whose components are the products of the corresponding components of a and b. Scalar promotion is also supported: if one of the operands to an arithmetic operation is a scalar, that is, only has one component, then that component is automatically duplicated so it is the same length as the other operand.

It is also generally possible to perform arithmetic between instances of Value<N,T> types and ordinary C++ variables and constants of type T, and to assign instances of type T to Value<N,T> types, in which case scalar promotion also takes place. However, the type must match exactly; type promotion never takes place automatically. This affects how constants are written. For example, if a is a Value3f, then the expression 2*a will not compile. It must be written 2.0f * a.

Constructors for Value types can be used to create vector constants or to convert types. For instance, Value2f(4.5,3.4) is a constant two-tuple. Scalar promotion is always done here, as is type conversion. By default, instances of Value types are not initialized when they are created. However, due to the above rules, the generic expression Value<N,T>(0) always works to initialize a Value of any numeric component type to zero.
Finally, the standard math functions (sin, cos, exp, and the like) are defined over Value types, and on tuples with more than one element act component by component as well. These use the same names as the C math functions; generally, a Value1f can substitute for a float in most circumstances.
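
To make these rules concrete, here is a short sketch (illustrative only, using the types and operations described above) of component-wise arithmetic, scalar promotion, constructors, and the math functions:

#include <rapidmind/platform.hpp>
using namespace rapidmind;

Value3f a = Value3f(1.0f, 2.0f, 3.0f);  // constructor builds a three-tuple constant
Value3f b = Value<3,float>(0);          // generic zero initialization, as noted above

Value3f prod = a * b;                   // component-by-component multiply
Value3f scaled = 2.0f * a;              // scalar promotion; plain 2 * a would not compile
Value3f s = sin(a) + exp(b);            // math functions act component by component
Value1f x = 0.5f;                       // a one-tuple can stand in for a float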

Arrays

Arrays are regular multidimensional collections with Value types as elements. They are declared using the syntax Array<D,V>, where D is the dimensionality (and can be 1, 2, or 3) and V is a Value type. Note that the size of an Array is not part of its type, but is part of its runtime state. Instances of an Array type can also be assigned to one another in constant time.

Unlike C++ arrays, they use by-value semantics, making them more consistent with the basic types like float and int that are also assigned by value. This makes them more convenient to use and improves program structure and modularity. For instance, it is reasonable to declare an instance of an Array inside a function and then return it from that function without explicitly allocating memory. The RapidMind platform includes automatic memory management that also optimizes away any unnecessary copies of data.

Instances of an Array<D,V> type support random access with the [] operator, which requires a Value<D,int> as its argument. To get data into or out of an Array instance, several mechanisms are provided. The read_data() and write_data() methods provide pointers to data allocated by the Array itself, while adopt_data() and access_data() allow the user to allocate the memory and then provide it.
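
As a sketch of how these pieces fit together (illustrative only; the packed float layout behind write_data() and the Value1i index shorthand are assumptions here), an Array can be created inside a function, filled through write_data(), returned by value, and then indexed:

#include <rapidmind/platform.hpp>
using namespace rapidmind;

Array<1,Value3f> make_zeros(int n) {
  Array<1,Value3f> r(n);
  // write_data() provides a pointer to storage allocated by the Array itself
  // (assuming the three-tuples are laid out as packed floats).
  float* data = r.write_data();
  for (int i = 0; i < n * 3; ++i) data[i] = 0.0f;
  // By-value semantics: the Array can simply be returned, and unnecessary
  // copies are optimized away by the platform.
  return r;
}

int main() {
  rapidmind::init();
  Array<1,Value3f> A = make_zeros(1024);
  Array<1,Value3f> B = A;          // constant-time assignment
  Value3f first = A[Value1i(0)];   // random access takes a Value<D,int> index
  return 0;
}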

Programs

So far, the RapidMind interface just looks like a standard vector and collection library. This is intentional, but behind this simple facade is some sophisticated machinery for recording sequences of operations on instances of Value and Array types and compiling them into machine language. This machinery is invoked by creating an instance of a Program object, which is best illustrated by example:

Value1f f;
Program p = RM_BEGIN {
  In<Value3f> a, b;
  Out<Value3f> c;
  Value3f d = f * exp(a) * sin(b);
  c = d + a * 2.0f;
} RM_END;

When this code is executed, the computation between the RM_BEGIN and RM_END macros is stored in the Program object p. Suppose A and B are instances of Array<D,Value3f>, with A and B having the same number of elements. Then the following expression invokes a parallel computation on every element of A and B and places the result back in A:
A = p(A,B);

On the Cell BE, this computation will run in parallel on the SPE co-processors; if a GPU accelerator has been selected as the target, the computation will be processed there; if a multi-core CPU is selected as the target, then the computation will run on other cores in the same machine.

There are two interesting points to note here. First, the array A is both an input and an output. When this is done, the output is always placed in separate memory from the input, and this memory is swapped with the original memory when the computation is done. The inputs to a program are always the “old” values, and the assignment takes place as if it were done simultaneously and instantaneously. This means that data dependencies between inputs and outputs are eliminated, avoiding the read-write ordering hazards that are a major cause of race conditions.

Second, note the dependence of the program p on the value f. This works just like a non-local reference in a regular function. If the value of f is updated, which can be done with a simple assignment, and the program run again, the program will use the new value, even if the actual execution of the program is on a separate processor (this only works if the non-local variable is a RapidMind type; if it is an ordinary C++ type, its value will get “baked into” the program when it is built).
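
For instance (a sketch, continuing with the program p and the arrays A and B from above), updating f between invocations changes the behavior of the next run:

f = 2.0f;     // a simple assignment to the non-local Value1f
A = p(A, B);  // this invocation uses the new value of f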

Generally, program objects act like C++ functions, with a few exceptions. First, inputs are always passed by value, never by reference. As noted, this is important for eliminating unnecessary data dependencies. Second, programs may have multiple outputs (there is a syntax for multiple value return we do not go into here).

Third, they cannot have side effects, that is, change the value of an external variable that is not an input or output. Assignment to a non-local variable is permitted but only changes the value of that variable inside the program. Again, this is necessary to eliminate race conditions.

This gives the basics of the interface. There are a few more concepts needed for fully general programming, but first let’s look at a simple example.

Loop Conversion Example

Consider the following program, which does some simple math to average and scale two images:

#include <cmath>

float f;
float a[512][512][3];
float b[512][512][3];

inline float process(float x, float y) {
  return f * (x + y);
}

int main (int argc, char* argv[]) {
  <read in data here>
  for (int i = 0; i < 512; i++) {
    for (int j = 0; j < 512; j++) {
      for (int k = 0; k < 3; k++) {
        a[i][j][k] = process(a[i][j][k], b[i][j][k]);
      }
    }
  }
  <write out results here>
}

This computation is obviously easy to parallelize. We will parallelize the inner loop using component-by-component operations on Value types, and the outer two loops by applying a program in parallel over two-dimensional arrays. Along the way we will also mention a few things of practical importance.

First, we will include the header file and invoke the rapidmind namespace:

#include <rapidmind/platform.hpp>
using namespace rapidmind;

Then, we will change the types of the global variables to use RapidMind types. Note that rather than use a three-dimensional array (which we could have done), we are using two-dimensional arrays of three-tuples:

Value1f f;
Array<2,Value3f> a(512,512);
Array<2,Value3f> b(512,512);

We’ll make the function that performs the per-element computation a template. By using a template, it’s still possible to use the process function on its original data type if necessary:

template <typename T>
inline T process(T x, T y) {
  return f * (x + y);
}

It’s a good idea to create programs separately from where they are used for processing, to avoid accidentally creating them over and over. So, we’ll have another function that creates the Program object prog. Of course, we could have written the computation directly here rather than calling process, but we also wanted to show that it’s possible to use ordinary C++ functions like process to describe RapidMind computations:

static Program prog;
void init_prog() {
  prog = RM_BEGIN {
    In<Value3f> r, s;
    Out<Value3f> q;
    q = process(r,s);
  } RM_END;
}

To finish this example, in the main function we can initialize the RapidMind platform, create the program, and then apply it to some data. Of course, in a more practical application, we would create the Program object prog once and then apply it numerous times.

int main (int argc, char* argv[]) {
  rapidmind::init();
  init_prog();

  <read in data here>
  a = prog(a,b);
  <write out results here>
}

Some Additional Details

Now that we’ve seen a basic example, we can discuss some more advanced topics. Support for control flow in programs allows the system to avoid doing unnecessary work and support applications with irregularity in execution time. Accessors provide flexibility in how data is read and written. Collective operations such as reduction provide alternative patterns of parallelism and communication.

The basic parallel function application pattern combined with an appropriate set of collective operations results in a general programming model capable of expressing any parallel algorithm. There are a few other more advanced topics, including swizzling, grids, facilities for program manipulation and drill-down mechanisms for particular processors. We also only briefly mention local arrays and dicing. Due to limited space, we refer the interested reader to our Web site for further information on these.

Control Flow

The RapidMind platform supports control flow inside programs. Ordinary C++ control flow can be used, but the control conditions can only be based on information known at the time the program is created. Such control flow actually manipulates how the code is generated.

For example, an ordinary for statement results in explicit loop unrolling, useful for creating high-performance code. Control flow depending on data computed when the program actually runs requires some special syntax, as in the following example:

Program p = RM_BEGIN {
  In<Value3f> a, b;
  Out<Value3f> c;

  Value3f d = f * exp(a) * sin(b);
  RM_IF (all(a > 0.0f)) {
     c = d + a * 2.0f;
  } RM_ELSE {
     c = d - a * 2.0f;
  } RM_ENDIF;
} RM_END;

Note that in addition to being in upper case, the RapidMind control constructs also require ending keywords like RM_ENDIF. In addition to RM_IF, loops are supported with RM_WHILE/RM_ENDWHILE and RM_FOR/RM_ENDFOR. The control argument to a control flow keyword must always be a Value1bool. The comparison operators return RapidMind Booleans, which are tuples if the input values are tuples. The all and any functions perform “horizontal” AND and OR operations, respectively, collapsing Boolean tuples into single values.
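
To illustrate the first kind of control flow described above, the following sketch (illustrative only) places an ordinary C++ for loop inside a program; because the bound is a plain int known when the program is created, the loop is unrolled into straight-line code:

Program poly = RM_BEGIN {
  In<Value3f> x;
  Out<Value3f> y;
  Value3f sum = Value3f(1.0f);   // scalar promotion in the constructor
  // Ordinary C++ control flow: this loop is unrolled at program-creation
  // time into three multiply-adds, a Horner evaluation of 1 + x + x^2 + x^3.
  for (int i = 0; i < 3; i++) {
    sum = sum * x + 1.0f;
  }
  y = sum;
} RM_END;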

Accessors

Accessors are functions that select sub-regions of arrays using affine (scale and translate) operations on input and output indices. For example, we can modify a program application so that it reads from every second element of its first argument, takes only the first 256 elements of its second argument, and then writes the result into elements 256 to 511 of its output array (in this case, the size of the output array should already have been set):

slice(C,256,511) = p(stride(A,2),take(B,256));

Accessor functions actually return a type that refers to the elements of the array by reference, which allows them to be used on the left-hand side of an assignment. However, this type can also be assigned to an instance of an Array, which causes a by-value assignment. The platform still optimizes away copies whenever possible.
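
For instance (a sketch, assuming A is a one-dimensional array of Value3f), capturing an accessor result in a new Array produces a by-value copy:

Array<1,Value3f> evens = stride(A, 2);  // every second element of A, copied by value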

Reductions

It is very common to want to combine all the elements of an array using some pairwise operator. If the operator is associative, it is possible to do this combination efficiently in parallel, and this is called a reduction. The reduce function builds and performs a reduction using some program object p, which should have two inputs and one output, and should also be associative:

a = reduce(p,B);
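
For instance, a suitable p could be defined as follows (a sketch; componentwise addition is chosen because it is associative, and with this operator the reduction is equivalent to the built-in sum_reduce shown next):

Program p = RM_BEGIN {
  In<Value3f> x, y;   // the two values to combine
  Out<Value3f> z;     // their combination
  z = x + y;          // componentwise addition is associative
} RM_END;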

Some common reductions, such as summation, are built in:

a = sum_reduce(B);

Other forms of collective operations are built in. It is also possible to do random reads (gathers) on non-local arrays and random writes (scatters) to output arrays. This allows the expression of arbitrary patterns of data dependency.
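
As an illustration of a gather (the exact form here is an assumption, combining the non-local reference rule with the [] operator described earlier), a program can read from an array that is not one of its explicit inputs:

Array<1,Value3f> table(1024);   // a non-local array referenced from inside the program

Program gather = RM_BEGIN {
  In<Value1i> index;            // an index computed elsewhere
  Out<Value3f> result;
  result = table[index];        // random read (gather) from the non-local array
} RM_END;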

Local Arrays and Programs

So far we have only shown programs applied to global arrays. In fact, arrays can be declared inside programs, programs can be called from inside other programs, and inputs and outputs can also be arrays rather than single tuples. A dice operation is also provided that can chop up a large input array into a set of tiles (represented as an array of accessors), and programs can then be applied over this set, possibly writing into another set of tiles.

Local arrays and dicing are especially important since they allow the direct expression of memory locality. Memory bandwidth is often a primary limiting factor in high-performance computation. A scalable parallel algorithm can often be designed around the principle of breaking a large problem into smaller problems, solving each of the smaller problems in a smaller and faster local memory, and then recombining the results. It should also be noted that the platform automatically prefetches data when it is used as input and output to programs, which is another way to exploit memory locality.

SPMD Programming Model

The RapidMind platform uses a data-parallel programming model. However, it is important to realize that since control flow is supported, this programming model is an instance of the SPMD (Single Program Multiple Data) programming model, not the more limited SIMD (Single Instruction Multiple Data) programming model. The SPMD model is capable of expressing execution-time irregularity and can in fact be used to express task parallelism, but in a structured way that avoids the need for explicit synchronization.

When control flow is present, the platform uses dynamic load balancing automatically. When it is not, program application is in fact an expression of pure SIMD data parallelism and more efficient static scheduling can be used. Sometimes this form of parallel computation is referred to as stream processing, since there is an emphasis on coherent streaming of data into an intense computational kernel and streaming it back out again. This is a good match to the need for modern processors to have good memory locality in order to achieve maximum performance.

When we refer to the SPMD model, we are only referring to a single Program object’s invocation; of course, in a given application there might be several such invocations with various forms of dependency on each other. The platform is capable of executing program invocations asynchronously, allowing the host program to continue to execute in parallel with any computation specified with RapidMind types. However, looking at the output produced by a program invocation causes any pending computation to complete.

The end result is that the developer can construct an application using serial semantics, as a simple sequence of operations. Application of functions (in the form of RapidMind Program objects) to arrays is a natural way to express parallelism. The RapidMind platform can then exploit the explicit parallelism of program applications across arrays and tuples, as well as the implicit parallelism given by data dependencies.

Conclusion

The rise of multi-core processors has created a need for the development of parallel applications. This has many challenges, but in fact it can be taken as an opportunity to dramatically improve performance by exploiting other forms of parallelism in processors and even accelerators. The SPMD stream processing model is a structured way of expressing parallelism that not only addresses the multi-core challenge but also can be used to exploit these additional opportunities, leading to radical improvements in application performance.
