Developing for GPUs, Cell, and Multi-core CPUs Using a Unified Programming Model

The need to suddenly start developing parallel software poses a severe challenge to the entire software industry.

Processor architectures have recently added several forms of explicit parallelism, most prominently multiple cores. In fact, due to the exponential increase in transistor density predicted by Moore’s Law and the difficulties with using a large number of transistors efficiently in individual cores, we can expect the number of cores to grow exponentially over time.

Unfortunately, most development tools, and the skill sets of most developers, are designed around serial computational models. The need to suddenly start developing parallel software poses a severe challenge to the entire software industry.

The development of parallel software is riddled with traps for the unwary, including race conditions, deadlock, and non-determinism. In addition, new parallel algorithms and approaches to programming are required to achieve performance that will scale with additional cores. The challenge is great, but so is the opportunity: serial processors have been deep in the domain of diminishing returns for many years. A shift to explicit parallelism exposes opportunities for much higher performance, even in the short term.

It is important to realize that processors also use many other forms of explicit and implicit parallelism, including parallel operations on the contents of a single register, pipelining, memory prefetch, single-core simultaneous multithreading (“hyperthreading”) and superscalar instruction issue. Some new processor options have also emerged, such as the Cell BE processor and GPUs, that are even more aggressive in their use of parallelism, but have the potential for massive improvements in performance per watt.

In developing the RapidMind platform, we wanted to address the multi-core programming challenge, but at the same time we wanted to enable developers to take advantage of these other opportunities for radically increased performance. Our system enables the rapid development of high-performance software for multi-core processors and accelerators, including GPUs and the Cell BE.

It supports a general-purpose programming model and can be used for any kind of application. Applications written using the RapidMind platform have achieved, in externally validated benchmarks, performance improvements of over 30 times without significantly increased programmer effort. In fact, use of the RapidMind approach often resulted in code that was simpler and more portable than the baseline code.

Figure 1: Peak CPU Performance

Figure 2: Cell BE Performance Comparison

Figure 3: GPU Performance

The GPU and CPU benchmarks used the same baseline code, independently optimized by researchers from Hewlett-Packard's High Performance Computing Division. In the case of the GPU, a speedup of over 30 times was observed. In the case of the CPU, a 4-core Intel Core 2 system, a speedup of nearly 8 times was observed over the single-core performance of the baseline code.

Our approach is based on familiar concepts and tools, in order to minimize disruption to the current software development process and allow fast adoption. A developer continues to use their existing C++ development environments (IDEs) and even compilers, and can write code using simple extensions of the existing C++ concepts of functions and arrays.

The resulting applications can run on a variety of processors and will automatically scale to future processors with additional cores, cores with heterogeneous instruction set extensions, and even accelerators such as GPUs. An emphasis on portability means that not only can a developer use different processors today, they will be able to use the same source to target future processors.

Portability is important not only for migrating between currently available processors and accelerator configurations, but also for future-proofing code against future developments, including the development of heterogeneous multi-core processors.

For concreteness, this article begins with a simple tutorial. We will then discuss the SPMD stream processing model, the parallel computational model used by our platform.

This computational model is based on a general and scalable form of data-parallelism that is applicable to a broad range of applications, but avoids the dangers and complexities of explicit thread-based parallel programming. In particular, it avoids the problems of race conditions and deadlock by construction, and also makes it possible to construct deterministic parallel programs.

The RapidMind Interface

The interface to the RapidMind platform is embedded in C++, and uses only ISO standard C++ for its syntax. You use your existing compiler, such as gcc, but include a header file that defines a small set of types. Normally, to perform a computation in C++, you declare variables of the appropriate type, such as float or int, and then specify a sequence of operations on them.

To perform a computation with the RapidMind platform, you do the same thing, but with RapidMind types. The difference is that operations on RapidMind types can either be executed immediately in the usual way or recorded. A recorded sequence of computations is compiled dynamically by the platform into machine language for the target co-processor, creating new code. This code can then be executed later as a function, and can run in parallel over a set of co-processors.

In addition to making it easy to specify and execute parallel computations, RapidMind makes it possible to easily write programs that build custom code dynamically. This allows you to use all the modularity constructs of C++ to organize your computation while avoiding their usual run-time overhead.

Since only operations on RapidMind types are recognized by the RapidMind API, the cost of any other C++ operations (following pointers around data structures, invoking virtual functions) can be compiled out. Furthermore, the remaining computational “trace” processed by RapidMind has semantics that allow the RapidMind code generator to perform extremely aggressive parallelization, targeting multiple parallel execution mechanisms in the target processors and leading to outstanding performance.

There are three basic types in RapidMind: the Value, the Array, and the Program. The Value type represents numbers (and fixed-length tuples of numbers), the Array multidimensional collections, and the Program functions. By recording sequences of operations on instances of the Value type you can create a Program, and by applying a Program to an Array you can invoke a parallel computation.
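In outline, the three types combine like this (a minimal sketch; each piece is explained in the sections that follow):

Program scale = RM_BEGIN {     // record operations on Value types...
  In<Value1f> x;
  Out<Value1f> y;
  y = 2.0f * x;
} RM_END;

Array<1,Value1f> data(1000);
data = scale(data);            // ...then apply them in parallel over an Array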

Value Types

The Value<N,T> template type represents a tuple of N numbers of type T, where T can be a float, an int, a bool, or a similar built-in basic type. The template is provided so that generic code can be written; however, much of the time, and in the examples to follow, we will use a shorthand: a Value3f is equivalent to Value<3,float>, a Value2i is equivalent to Value<2,int32_t>, a Value1ui is equivalent to Value<1,uint32_t>, and a Value4bool is equivalent to Value<4,bool>.

Operations on Value types always take place component by component. For example, if a and b are instances of type Value3f, then a*b computes a new Value3f whose components are the products of the corresponding components of a and b. Scalar promotion is also supported: if one of the operands to an arithmetic operation is a scalar, that is, only has one component, then that component is automatically duplicated so it is the same length as the other operand.

It is also possible generally to perform arithmetic between instances of Value<N,T> types and ordinary C++ variables and constants of type T, and to assign instances of type T to Value<N,T> types, in which case scalar promotion also takes place. However, the type must match exactly; type promotion never takes place automatically. This affects how constants are written. For example, if a is a Value3f, then the expression 2*a will not compile. It must be written 2.0f * a.
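For example (a small sketch; the numeric values are only illustrative):

Value3f a(1.0f, 2.0f, 3.0f);
Value3f b(4.0f, 5.0f, 6.0f);
Value3f c = a * b;       // component-wise product: (4, 10, 18)
Value1f s = 2.0f;
Value3f d = s * a;       // scalar promotion: (2, 4, 6)
Value3f e = 2.0f * a;    // compiles: the literal is a float, matching Value3f
// Value3f bad = 2 * a;  // would not compile: int does not match float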

Constructors for Value types can be used to create vector constants or to convert types. For instance, Value2f(4.5,3.4) is a constant two-tuple. Scalar promotion is always done here, as is type conversion. By default, instances of Value types are not initialized when they are created. However, due to the above rules, the generic expression Value<N,T>(0) always works to initialize a Value of any numeric component type to zero.
Finally, the standard math functions (sin, cos, exp, and the like) are defined over Value types, and on tuples with more than one element act component by component as well. These use the same names as the C math functions; generally, a Value1f can substitute for a float in most circumstances.
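Putting these rules together (again a sketch):

Value2f v(4.5, 3.4);           // constant two-tuple; the doubles are converted
Value3f zero = Value3f(0);     // scalar promotion and conversion: (0, 0, 0)
Value3f angles(0.0f, 0.5f, 1.0f);
Value3f sines = sin(angles);   // standard math functions act component-wise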

Arrays

Arrays are regular multidimensional collections with Value types as elements. They are declared using the syntax Array<D,V>, where D is the dimensionality (and can be 1, 2, or 3) and V is a Value type. Note that the size of an Array is not part of its type, but is part of its runtime state. Instances of an Array type can also be assigned to one another in constant time.

Unlike C++ arrays, they use by-value semantics, making them more consistent with the basic types like float and int that are also assigned by value. This makes them more convenient to use and improves program structure and modularity. For instance, it is reasonable to declare an instance of an Array inside a function and then return it from that function without explicitly allocating memory. The RapidMind platform includes automatic memory management that also optimizes away any unnecessary copies of data.

Instances of an Array<D,T> type support random access with the [] operator, which requires a Value<D,int> as its argument. To get data into or out of an Array instance, several mechanisms are provided. The read_data() and write_data() methods provide pointers to data allocated by the Array itself, while adopt_data() and access_data() allow the user to allocate the memory and then provide it.
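As a sketch of these mechanisms (assuming, for illustration, that write_data() exposes the elements as tightly packed floats):

Array<2,Value3f> image(512, 512);

// Fill storage allocated by the Array itself.
float* data = image.write_data();
for (int i = 0; i < 512 * 512 * 3; ++i)
  data[i] = 0.0f;

// Random access requires a Value<D,int> index.
Value3f texel = image[Value2i(10, 20)];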

Programs

So far, the RapidMind interface just looks like a standard vector and collection library. This is intentional, but behind this simple facade is some sophisticated machinery for recording sequences of operations on instances of Value and Array types and compiling them into machine language. This machinery is invoked by creating an instance of a Program object, which is best illustrated by example:

Value1f f;
Program p = RM_BEGIN {
  In<Value3f> a, b;
  Out<Value3f> c;
  Value3f d = f * exp(a) * sin(b);
  c = d + a * 2.0f;
} RM_END;

When this code is executed, the computation between the RM_BEGIN and RM_END macros is stored in the Program object p. Suppose A and B are instances of Array<D,Value3f>, with A and B having the same number of elements. Then the following expression invokes a parallel computation on every element of A and B and places the result back in A:
A = p(A,B);

On the Cell BE, this computation will run in parallel on the SPE co-processors; if a GPU accelerator has been selected as the target, the computation will be processed there; if a multi-core CPU is selected as the target, then the computation will run on other cores in the same machine.

There are two interesting points to note here. First, the array A is both an input and an output. When this is done, the output is always placed in separate memory from the input, and this memory is swapped for the original memory when the computation is done. The inputs to a program are always the “old” values, and the assignment takes place as if it were done simultaneously and instantaneously. This means that data dependencies between inputs and outputs are eliminated, avoiding read-write ordering hazards, which are a major cause of race conditions.

Second, note the dependence of the program p on the value f. This works just like a non-local reference in a regular function. If the value of f is updated, which can be done with a simple assignment, and the program is run again, the program will use the new value, even if the actual execution of the program is on a separate processor. (This only works if the non-local variable is a RapidMind type; if it is an ordinary C++ type, its value will be “baked into” the program when it is built.)
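For example, continuing with the program p defined above (a sketch):

f = 2.0f;      // immediate-mode assignment to the non-local Value1f
A = p(A,B);    // this invocation uses f == 2.0f
f = 3.0f;
A = p(A,B);    // the same compiled program now sees the updated value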

Generally, program objects act like C++ functions, with a few exceptions. First, inputs are always passed by value, never by reference. As noted, this is important for eliminating unnecessary data dependencies. Second, programs may have multiple outputs (there is a syntax for multiple value return we do not go into here).

Third, they cannot have side effects, that is, change the value of an external variable that is not an input or output. Assignment to a non-local variable is permitted but only changes the value of that variable inside the program. Again, this is necessary to eliminate race conditions.

This gives the basics of the interface. There are a few more concepts needed for fully general programming, but first let’s look at a simple example.

Loop Conversion Example

Consider the following program, which does some simple math to average and scale two images:

#include <cmath>

float f;
float a[512][512][3];
float b[512][512][3];

inline float process(float x, float y) {
  return f * (x + y);
}

int main(int argc, char* argv[]) {
  <read in data here>
  for (int i = 0; i < 512; i++) {
    for (int j = 0; j < 512; j++) {
      for (int k = 0; k < 3; k++) {
        a[i][j][k] = process(a[i][j][k], b[i][j][k]);
      }
    }
  }
  <write out results here>
}

This computation is obviously easy to parallelize. We are going to parallelize the inner loop using component-by-component operations on Value types and are going to parallelize the outer two loops by doing a parallel program application over two-dimensional arrays. Along the way we will also mention a few things of practical importance.

First, we will include the header file and invoke the rapidmind namespace:

#include <rapidmind/platform.hpp>
using namespace rapidmind;

Then, we will change the types of the global variables to use RapidMind types. Note that rather than use a three-dimensional array (which we could have done), we are using two-dimensional arrays of three-tuples:

Value1f f;
Array<2,Value3f> a(512,512);
Array<2,Value3f> b(512,512);

We’ll template the function giving the per-element computation. By using a template, it’s still possible to use the process function on its original data type if necessary:

template <typename T>
inline T process(T x, T y) {
  return f * (x + y);
}

It’s a good idea to create programs separately from where they are used for processing, to avoid accidentally creating them over and over. So, we’ll have another function that creates the Program object p. Of course, we could have put the computation in the process function here directly, but we also wanted to show that it’s possible to use ordinary C++ functions like process to describe RapidMind computations:

static Program prog;
void init_prog() {
 prog = RM_BEGIN {
   In<Value3f> r, s;
   Out<Value3f> q;
   q = process(r,s);
  } RM_END;
}

To finish this example, in the main function we can initialize the RapidMind platform, create the program, and then apply it to some data. Of course, in a more practical application, we would create the Program object prog once and then apply it numerous times.

int main(int argc, char* argv[]) {
  // initialize the RapidMind platform here, then create the program
  init_prog();
  <read in data here>
  a = prog(a,b);
  <write out results here>
}

Some Additional Details

Now that we’ve seen a basic example, we can discuss some more advanced topics. Support for control flow in programs allows the system to avoid doing unnecessary work and support applications with irregularity in execution time. Accessors provide flexibility in how data is read and written. Collective operations such as reduction provide alternative patterns of parallelism and communication.

The basic parallel function application pattern, combined with an appropriate set of collective operations, results in a general programming model capable of expressing any parallel algorithm. There are a few other more advanced topics, including swizzling, grids, facilities for program manipulation, and drill-down mechanisms for particular processors. We also only briefly mention local arrays and dicing. Due to limited space, we refer the interested reader to our Web site for further information on these.

Control Flow

The RapidMind platform supports control flow inside programs. Ordinary C++ control flow can be used, but the control conditions can only be based on information known at the time the program is created. Such control flow actually manipulates how the code is generated.

For example, an ordinary for statement results in explicit loop unrolling, which is useful for creating high-performance code.
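For instance (a sketch, where coeff is a hypothetical non-local array of Value1f holding polynomial coefficients):

Value1f coeff[4];   // hypothetical non-local coefficients

Program horner = RM_BEGIN {
  In<Value1f> x;
  Out<Value1f> y;
  y = coeff[3];
  // This C++ loop runs when the program is created, so the three
  // multiply-add steps it records are unrolled into the generated code.
  for (int i = 2; i >= 0; --i)
    y = y * x + coeff[i];
} RM_END;

Control flow that depends on data computed when the program actually runs requires some special syntax, as in the following example: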

Program p = RM_BEGIN {
  In<Value3f> a, b;
  Out<Value3f> c;

  Value3f d = f * exp(a) * sin(b);
  RM_IF (all(a > 0.0f)) {
     c = d + a * 2.0f;
  } RM_ELSE {
     c = d - a * 2.0f;
  } RM_ENDIF;
} RM_END;

Note that in addition to being in upper case, the RapidMind control constructs require ending keywords like RM_ENDIF. In addition to RM_IF, loops are supported with RM_WHILE/RM_ENDWHILE and RM_FOR/RM_ENDFOR. The control argument to these keywords must always be a Value1bool. The comparison operators return RapidMind Booleans, which are tuples if the input values are tuples. The any and all functions perform “horizontal” OR and AND operations, respectively, collapsing Boolean tuples into single values.
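As a sketch of a data-dependent loop, here is a program that iterates until a convergence test is satisfied (the threshold, and the assumption that the input is at least 1.0, are only for illustration):

Program newton_sqrt = RM_BEGIN {
  In<Value1f> x;     // assumed here to be >= 1.0f
  Out<Value1f> y;
  y = x;             // initial guess, at or above the true root
  // The loop condition is computed at run time and must be a Value1bool.
  RM_WHILE (y * y - x > 1e-6f) {
    y = 0.5f * (y + x / y);   // Newton step for sqrt(x)
  } RM_ENDWHILE;
} RM_END;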

Accessors

Accessors are functions that select sub-regions of arrays using affine (scale and translate) operations on input and output indices. For example, we can modify a program application so that it reads from every second element of its first argument, takes only the first 256 elements of its second argument, and writes the result into elements 256 to 511 of its output array (in this case, the size of the output array should already have been set):

slice(C,256,511) = p(stride(A,2),take(B,256));

Accessor functions actually return a type which refers to the elements of the array by reference, which allows them to be used on the left hand side of an assignment. However, this type can be assigned to an instance of an Array, which causes a by-value assignment. The platform still optimizes away copies whenever possible.

Reductions and Collective Operations

It is very common to want to combine all the elements of an array using some pairwise operator. If the operator is associative, it is possible to perform this combination efficiently in parallel; this is called a reduction. The reduce function builds and performs a reduction using a program object p, which should have two inputs and one output and should compute an associative operation:

a = reduce(p,B);
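For example, here is a sketch of a maximum reduction, assuming a max function is defined component-wise over Value types like the other standard math functions, and that reducing an entire array yields a single value:

Program maxp = RM_BEGIN {
  In<Value3f> x, y;
  Out<Value3f> m;
  m = max(x, y);     // component-wise maximum, which is associative
} RM_END;

Array<1,Value3f> B(1024);
Value3f mx = reduce(maxp, B);   // combines all elements of B in parallel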

Some common reductions, such as summation, are built in:

a = sum_reduce(B);

Other forms of collective operations are built in. It is also possible to do random reads (gathers) on non-local arrays and random writes (scatters) to output arrays. This allows the expression of arbitrary patterns of data dependency.

Local Arrays and Programs

So far we have only shown programs applied to global arrays. In fact, arrays can be declared inside programs, programs can be called from inside other programs, and inputs and outputs can also be arrays rather than single tuples. A dice operation is also provided that can chop up a large input array into a set of tiles (represented as an array of accessors), and programs can then be applied over this set, possibly writing into another set of tiles.

Local arrays and dicing are especially important since they allow the direct expression of memory locality. Memory bandwidth is often a primary limiting factor in high-performance computation. A scalable parallel algorithm can often be designed around the principle of breaking a large problem into smaller problems, solving each of the smaller problems in a smaller and faster local memory, and then recombining the results. It should also be noted that the platform automatically prefetches data when it is used as input and output to programs, which is another way to exploit memory locality.

SPMD Programming Model

The RapidMind platform uses a data-parallel programming model. However, it is important to realize that since control flow is supported, this programming model is an instance of the SPMD (Single Program Multiple Data) programming model, not the more limited SIMD (Single Instruction Multiple Data) programming model. The SPMD model is capable of expressing execution-time irregularity and can in fact be used to express task parallelism, but in a structured way that avoids the need for explicit synchronization.

When control flow is present, the platform uses dynamic load balancing automatically. When it is not, program application is in fact an expression of pure SIMD data parallelism and more efficient static scheduling can be used. Sometimes this form of parallel computation is referred to as stream processing, since there is an emphasis on coherent streaming of data into an intense computational kernel and streaming it back out again. This is a good match to the need for modern processors to have good memory locality in order to achieve maximum performance.

When we refer to the SPMD model, we are only referring to a single Program object’s invocation; of course, in a given application there might be several such invocations with various forms of dependency on each other. The platform is capable of executing program invocations asynchronously, allowing the host program to continue to execute in parallel with any computation specified with RapidMind types. However, looking at the output produced by a program invocation causes any pending computation to complete.

The end result is that the developer can construct an application using serial semantics, using a simple sequence of operations. Application of functions (in the form of RapidMind Program objects) to arrays is a natural way to express parallelism. The RapidMind platform can then exploit the explicit parallelism of program applications across arrays and tuples, as well as the implicit parallelism given by data dependencies.

Conclusion

The rise of multi-core processors has created a need for the development of parallel applications. This has many challenges, but in fact it can be taken as an opportunity to dramatically improve performance by exploiting other forms of parallelism in processors and even accelerators. The SPMD stream processing model is a structured way of expressing parallelism that not only addresses the multi-core challenge but also can be used to exploit these additional opportunities, leading to radical improvements in application performance.

Comments on "Developing for GPUs, Cell, and Multi-core CPUs Using a Unified Programming Model"


The question I have is, are younger programmers using C++?


csk317 – 5-year-olds are building virtual instruments and doing parallel programming using NXT-G, a simplified version of Labview. And, yes, they’ve no fear of C, C++, Java, Smalltalk (they get to play with that on their OLPC’s) or numerous other “power tools”.

Challenge, yes… severe, no.

While I liked this article, I don’t agree with the byline – the challenge is there but it’s not severe.

Sure, there is now a call for toolkits and libraries which blend inter-core communications solutions with “unlooping” and other means of deconstructing monolithic problem spaces – as the author says, “expressing parallelism”.

RapidMind seems to be doing a respectable job of that; I’m sure there are other solutions as well. It would be nice to see a table comparing/contrasting them; maybe LinuxWorld’s up to the challenge?

Thanks, Doc McCool – keep up the good work!



There is currently a meme going around that C/C++ are { archaic | obsolete | superseded | pointless to learn | old fart languages | etc } as compared to Java, Ruby, Python, C#, and the other currently popular programming languages. This goes hand in hand with the meme that says that personal computers are obsolete and everything will be done “in the cloud,” that soon people will not need desktops or perhaps even laptops, that everything will be done on handheld devices, and so on. In the extreme case, this is interpreted to mean that people should only learn web backend scripting languages that give fast results. Damn the torpedoes, full speed ahead!

No doubt there are kernels of fact to some aspects of these memes, but considering them prophetic predictions of the future is pretty idiotic. Contemporary culture is made up of sound bites and fast tempos, great expectations and tight budgets, quick thinking and quick execution. C++ programming requires thoughtful engineering, and is usually applied to critical software infrastructure like operating systems, device drivers, sophisticated software packages, etc., on top of which reside things like web apps and huge Java constructions, among others.

The patience needed for that sort of thing has always been in short supply, that’s nothing new. Today’s new twist in the C++ world is having to deal with ubiquitous multi-core and multi-cpu systems, and write code in such a way to exploit them. This, too, is not a new need. However, it is still not a part of general C++ culture, and not even a well-understood and well-provisioned aspect of programming in general.

The question is, how will it play out?


Straight ‘C’ and native assembly will always outperform C++ in terms of speed and efficiency. I do like C++, but I don’t use it for high performance critical code. As somebody who has had a great deal of experience working with the Cell Broadband Engine, to get the best performance out of your application, there is no replacement for directly using the SPU intrinsics and/or writing assembly language. Optimization of your data access patterns for optimal DMA transactions is equally if not more important to achieve the best performance and efficiency. And of course, arranging your algorithm such that it can be made parallel with minimal synchronization/data dependencies across processors is important too.


So far, no one has commented on the fact that the article doesn’t say *anything* negative about “RapidMind”. My guess is that this:

Program p = RM_BEGIN {
In<Value3f> a, b;
Out<Value3f> c;
Value3f d = f * exp(a) * sin(b);
c = d + a * 2.0f;
} RM_END;

is turned into “stuff” plus a string. If this is true, you aren’t going to get any warnings when you compile, even if you type in complete garbage.

Also, I surmise that “RapidMind” is a commercial product, and the history of commercial general-purpose languages doesn’t bode well for it. It also seems a bit out of place in the Linux world, where there is generally an open source aspect to “stuff” (e.g. MySQL, QT).


Hi, I’d like to respond to a couple of the comments posted here.

First, regarding the post by pdhackett: actually, the interface shown above is native C++ which is completely type-checked using normal mechanisms when you invoke your existing C++ compiler. RapidMind just provides some types and macros in a header file, using normal C++ semantics. There is no special preprocessor, as you were probably assuming, so what you see above is NOT ever turned into a string. RapidMind code IS ordinary, portable, ISO-standard C++ code. The C++ compiler *will* tell you if you have a malformed program.

What actually happens is the RapidMind numerical types, like Value, are instrumented so the RapidMind platform (which is linked in like a library) can observe the sequence of operations that are applied to them. The BEGIN (which is just a macro wrapping a function call into the API) starts a “trace” of these operations, and END (another function call) stops the trace. In addition to this “retained” usage, Values also work in “immediate” mode as ordinary numerical types outside of BEGIN/END blocks. Immediate mode is handy for modifying non-local variables, as described below.

Once a trace has been captured, at runtime the platform uses a staged code generator (completely separate from the C++ code generator) to construct some optimized machine language that reimplements that trace so it can run in parallel.

“Program” objects are basically brand-new functions, built at runtime, that you can use to (asynchronously, as it happens) kick off a parallel version of the sequence of operations captured in the trace. In other words, the RapidMind platform interface described above adds the capability to C++ to dynamically construct parallelized functions in a safe way.

What this means is:

- C++ overhead is eliminated. Operations that are not on RapidMind types just act as “scaffolding” that organizes the sequence of numerical operations on RapidMind types. This scaffolding is completely ignored by the platform’s code generator. You can use all the C++ modularity constructs like classes, namespaces, virtual member functions etc. that you want freely, then “compile them out”.

- The modularity and scope constructs of C++ get automatically transformed into interprocessor communication patterns. Non-local variables declared outside of a BEGIN/END work for Programs as you would expect for functions. This means that binding between code on the “host” and code on the “co-processors” follows the same scope rules as the rest of C++, although the implementation is significantly more involved internally: the co-processor code may be running on separate processors that may not even share the same memory space. RapidMind hides this complexity completely.

- The “metaprogramming” approach enables some interesting alternative programming models, which can significantly reduce the size of code, without reducing performance, and in some cases even enhancing it significantly. For example, you can easily generate parameterized variants of functions programmatically, or variants that depend on data only known at runtime. In particular, you can turn interpreters into compilers trivially, which is a rather extreme form of overhead elimination and run-time dependency. It should be noted that the staged code generator, since it operates on small kernels, is very fast.

This also addresses an earlier comment: “C++ is not an HPC language”. Our approach lets you have your cake (modularity and abstraction) and eat it too (performance). We’ve seen practical examples of portable code that’s 1/10 the size and far easier to understand, but with nearly twice the performance of a non-portable C implementation on the same hardware. Abstraction helps with some of the code size reduction, but the scope rules and the embedded approach get rid of all the annoying glue code as well. The “kernel” language IS the API. What you see above is IT.

As for chitown76’s comments: we have a lot of experience with the Cell too, and actually, code written this way can significantly outperform code written at a low level on the Cell, and with much less effort, because
(a) you can try out more high-level optimizations faster
(b) you can programmatically generate code with RapidMind that would be insane to try to build by hand
(c) you can write parameterized code and then automatically search for the sweet spot for block sizes, loop unroll factors, etc.
(d) the platform automates the more common optimizations, like double-buffering and prefetch, so they’re always there, but are invisible and portable.

I should emphasize that we are not limited by the semantics of C++ in our code generator: the RapidMind platform has its own, which is designed around the use of vector operations. But if you really, really, really want to, and you are willing to break portability, you CAN use explicit asynchronous DMA transfers and assembly intrinsics through the interface. Most Cell SPU assembly instructions can be specified with a simple function call, for instance.

Such drill-down is occasionally useful, but should only be done after profiling a more generic implementation, and should be hidden whenever possible inside a suitable abstraction (which our approach makes zero-cost). Also, often higher-level transformations to the algorithm and the data layout have a bigger impact on performance, and should not be neglected.

Simple things should be easy. Difficult things should be possible.

Michael McCool


Thanks for the insightful information. I have a better perspective on your offering now. In response, I have no doubt that the set of tools you mention reduces the amount of effort required to implement an application on a multi-core system such as the Cell, but I am not yet convinced that using these tools you could significantly outperform (or even outperform) a hand-coded optimized implementation. We can continue to discuss this point over and over, but until an example is openly and readily available, I am still in the camp that for the performance-intensive parts of your application, where every cycle matters, you’re still not going to get any better than a hand-coded, assembly-optimized implementation.
Thanks again for your insightful article! As a member of the multi-core development community, I look forward to hearing more from you in the near future.
