Where do all good ideas go? Into the CPU of course.
Back in the day, we used processors that did not have floating point hardware. You may find it hard to believe, but the first microprocessors did not have on-board floating point units. Floating point was possible, but it was done in software. When the Intel 8086 hit the market, there was an option to add a math co-processor called the i8087. If you included this co-processor, floating point calculations got much faster, provided you had software that could use it. Almost all systems had an empty socket for the i8087. It could be purchased with the system or added later. Back then, processors came in a 40-pin DIP (Dual In-line Package). The trick was to push the rather large chip into its socket without bending or breaking the pins.
The separate and sometimes absent math co-processor resulted from both manufacturing and marketing decisions. From a manufacturing standpoint, the additional circuitry was too much to fit on the processor. In terms of marketing, it was a way to keep the system cost down so that only those who needed fast floating point would have to buy the co-processor.
The Intel i8087 Co-processor
This trend continued for several generations and included the 80286/80287 and the 80386/80387 pairings. The 80486 was the first Intel processor with integrated floating point hardware. As an aside, there was some clever (or wasteful, depending on how you look at it) marketing done with the 80486. Intel created the 80486 processor with a floating point unit, but also sold a lower-cost 80486SX with either a disabled or broken floating point unit. If you wanted a floating point unit, there was a part called the 80487. It turns out the 80487 was just an 80486 with a working math processor!
The point I want to make is that “processors always absorb functionality.” A more recent example is the on-die memory management unit on AMD and Intel processors. My prediction is that over the next two years virtually all low-end video processing will move onto the processor die. In fact, this has already started. Consider the recent demonstration of AMD Fusion. Intel will soon be introducing the new Sandy Bridge architecture with embedded video. This video capability is more than basic functionality. It is “reasonable” graphics performance, reasonable enough that most people will never look past the CPU for their graphics needs. There is no doubt that high-end graphics users will continue to use high-end graphics hardware, but the low end is going to change quite a bit.
In terms of HPC this could get interesting. While the embedded graphics will not be as powerful as a full-scale NVidia Tesla or AMD FireStream card, they will give processors a real boost in terms of data/stream/parallel processing (i.e., there will now be a SIMD unit on the processor). Another game changer will be the ability of the on-board SIMD units to use system memory. There will be no need to transfer data over the PCIe bus. Thus, almost all HPC applications will have the opportunity to use GP-GPU technology.
There has been some discussion about how this will affect NVidia, which has no processor into which it can embed its CUDA cores. If the low-end video card market goes away due to embedded “good enough” graphics, then NVidia will become a high-end graphics company and not have the low end to help subsidize new products. This situation is covered quite well by Greg Pfister in his Perils of Parallel blog (don’t forget to read the comments).
When the i8087 was added to the hardware mix, some software had to be re-compiled (or even reprogrammed) to use its capabilities. New compilers quickly became aware of whether a floating point co-processor was available and made use of its features. The situation is similar to the use of SSE2 for floating point calculations. Initially, compilers had to be told explicitly to use the SSE instructions, but now it is largely automatic. (There are some x86 processors that do not support any SSE functionality.) The point is that support for the extra CPU hardware eventually ends up as an optimization flag in the compiler. Most users don’t even know it is there. I believe SIMD units will eventually be treated the same way.
The Portland Group has implemented a high-level interface to NVidia GPUs, which could possibly work for all types of SIMD units and even CPU cores. Eventually, compilers may get smart enough to do this automatically. Recently, the Portland Group also announced CUDA support for x86, which should greatly extend the reach of the popular NVidia CUDA model.
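The Portland Group interface works through directives rather than a new language. The sketch below uses the `#pragma acc region` spelling from PGI's Accelerator model documentation; a compiler without accelerator support simply ignores the unknown pragma and runs the loop on the CPU, which is exactly what makes the approach portable:

```c
/* Directive-based acceleration in the PGI Accelerator style.
 * The pragma asks an accelerator-aware compiler to offload the loop;
 * other compilers ignore it and execute the loop normally. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    #pragma acc region
    {
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }
}
```

The same source compiles for a GPU, a SIMD unit, or plain CPU cores, which is why this style could outlive any one piece of hardware.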
Additionally, there is the MCUDA Toolset, a Linux-based tool designed to compile the CUDA programming model to a CPU architecture. There is also Ocelot, a dynamic compilation framework for heterogeneous systems that currently allows CUDA programs to be executed on NVidia GPUs and x86 CPUs at full speed without recompilation. Eventually, these compilers should be able to recognize and use extra SIMD hardware automatically. It should also be pointed out that OpenCL can target both GPU and CPU cores. And finally, there is Intel’s Ct language, which is supposed to target both GPU and CPU cores as well. (Interestingly, the Ct page now redirects to Intel Array Building Blocks, which is a combination of Intel’s Ct technology and RapidMind technology.)
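The core idea behind running CUDA on a CPU is mechanical: a kernel that CUDA runs once per thread becomes a CPU loop over the thread indices. A hedged sketch of that transformation in plain C (the names `grid_dim` and `block_dim` mirror CUDA's thread hierarchy but are my own illustration, not the actual internals of MCUDA or Ocelot):

```c
/* The "kernel body": in CUDA, each GPU thread runs this once with its
 * own thread index. */
static void vec_add_body(int tid, const float *a, const float *b, float *c)
{
    c[tid] = a[tid] + b[tid];
}

/* The CPU "launch": serialize the grid of threads into ordinary loops,
 * which is roughly what a CUDA-to-CPU compiler generates. */
void launch_vec_add(int grid_dim, int block_dim,
                    const float *a, const float *b, float *c)
{
    for (int blk = 0; blk < grid_dim; blk++)
        for (int t = 0; t < block_dim; t++)
            vec_add_body(blk * block_dim + t, a, b, c);
}
```

Once the program is expressed this way, the outer loops are exactly the kind of code a vectorizing compiler can map onto whatever SIMD hardware happens to be present.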
Like all good processor additions, the GP-GPU will fade into the background. Programmers may still have to explicitly specify parallel execution using the CUDA or OpenCL model (or something else), but the binaries will use whatever cores are available. And, trust me, there will be cores.