In last week’s installment, I talked about Nvidia and the subsequent announcement of its new Fermi GPU. I also invited you to take a look at the Fermi white paper. I’m not going to rehash the white paper here, but I do suggest reading it.
Before I dive into Fermi the GPU, I want to take a moment to pay tribute to Enrico Fermi the person, for whom the new GPU was named. For you young whippersnappers out there, Enrico Fermi is one of the giants of physics. He was instrumental in advancing physics on many fronts, including quantum theory, nuclear and particle physics, and statistical mechanics (a favorite subject of mine). He had a rare combination of talents that allowed him to be both an excellent theorist and an excellent experimentalist. His legacy is legendary: he has an element named after him (fermium, a synthetic element discovered in 1952), a national lab (the Fermi National Accelerator Laboratory), and a class of particles that bear his name (fermions). No lightweight, this Fermi fellow.
Which brings me to the Nvidia Fermi architecture. In terms of HPC, it is also no lightweight. I believe it is going to be a game changer. Let me explain. My previous opinion of GP-GPU computing was certainly positive, and I could see it changing the HPC game in some corners. Users were reporting fantastic speed-ups in some areas, new users could experiment with existing video cards, and the larger video market was going to keep the cost down. There were, however, some fundamental issues raised by many of the more traditional HPC users. Until those issues were resolved, I assumed they would limit just how far GPU computing could go in the HPC world. Based on the Fermi technical material I have read, Nvidia has been listening, and many of these issues have been addressed head-on. I predict this device and others like it will change the HPC landscape, but first a few details about the Fermi architecture.
One concern, voiced by Michael Feldman of HPCwire, was the lack of ECC memory. This opinion, combined with a recent University of Toronto/Google paper, “DRAM Errors in the Wild: A Large-Scale Field Study,” points to the need for ECC in the data center. (As an aside, I’ll have more to say on this topic in the coming months. I have been rigorously testing non-ECC systems and I find some of them quite robust.) Nvidia has been listening to the market, and one of the new features offered by Fermi is support for ECC memory. This new feature is not an afterthought, as ECC protection extends from DRAM down through the L2 and L1 caches, shared memory, and register files. Scratch that one off the list.
Another issue was double precision floating point performance. The rule of thumb for current GT200 Tesla systems is a double precision FLOPS rate that is about one eighth of the single precision rate. This gap was due to the limited number of double precision units on the chip. The Fermi design team has worked to correct this mismatch and reports an eightfold increase in peak double precision floating point performance. Also of note is full support for the IEEE 754-2008 floating point standard in both 32-bit and 64-bit precision.
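To see how “one eighth” and “eightfold” fit together, here is a back-of-the-envelope sketch in Python. The per-clock unit counts are assumptions taken from commonly cited Nvidia specifications for GT200 and Fermi, not figures stated in this column, and the function names are made up for illustration. Clock speeds are ignored entirely, so this is a rough per-clock comparison and nothing more.

```python
# Back-of-the-envelope comparison of double precision (DP) and single
# precision (SP) throughput per clock. All unit counts are assumptions
# based on commonly cited Nvidia specifications, not measured values.

GT200 = {"sp_per_clock": 240, "dp_per_clock": 30}    # assumed
FERMI = {"sp_per_clock": 512, "dp_per_clock": 256}   # assumed


def dp_to_sp_ratio(chip):
    """Fraction of the single precision rate available in double precision."""
    return chip["dp_per_clock"] / chip["sp_per_clock"]


def dp_speedup(old, new):
    """Per-clock double precision gain of `new` over `old`."""
    return new["dp_per_clock"] / old["dp_per_clock"]


if __name__ == "__main__":
    print(f"GT200 DP/SP ratio: {dp_to_sp_ratio(GT200):.3f}")
    print(f"Fermi DP/SP ratio: {dp_to_sp_ratio(FERMI):.3f}")
    print(f"Per-clock DP gain: {dp_speedup(GT200, FERMI):.1f}x")
```

Under these assumptions the double precision rate moves from one eighth of the single precision rate to one half, and the per-clock double precision gain comes out at roughly eight and a half, which is in line with the reported eightfold increase.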
Other features of note include a total of 512 CUDA cores per chip. A CUDA core executes one floating point or integer instruction per clock for a thread. All memory (thread private local, block shared, and global) is now fully addressable, which allows for global pointers and clears the way for C++ applications on Fermi. Concurrent kernel execution is another new feature. In past designs, CUDA kernels (sections of code that run on the GPU) executed one at a time. If a program had multiple small kernels, each would have to wait its turn. With Fermi, multiple small kernels can run at the same time. This, combined with much improved context switching and scheduling, makes for better utilization of the entire GPU. There are many other features in Fermi that I don’t have room to mention. Check out the white paper for more detail.
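The payoff from concurrent kernels can be sketched with a toy timing model in Python. This is not how the Fermi scheduler actually works. It assumes a GPU with 16 interchangeable execution units (a hypothetical number chosen for illustration) and kernels that each hold a fixed number of units for a fixed time. All names here are invented for the example.

```python
import heapq


def serial_time(kernels):
    """Older picture: kernels run one at a time, so durations add up."""
    return sum(duration for _, duration in kernels)


def concurrent_time(kernels, total_units=16):
    """Toy model of concurrent execution.

    Kernels start in submission order; each waits only until enough
    execution units are free, then runs alongside whatever is still active.
    Each kernel is a (units_needed, duration) pair.
    """
    free = total_units
    running = []      # min-heap of (finish_time, units_held)
    now = 0.0         # earliest time the next kernel may start
    makespan = 0.0
    for units, duration in kernels:
        if units > total_units:
            raise ValueError("kernel needs more units than the GPU has")
        # Retire the earliest-finishing kernels until this one fits.
        while free < units:
            finished_at, released = heapq.heappop(running)
            now = max(now, finished_at)
            free += released
        free -= units
        end = now + duration
        heapq.heappush(running, (end, units))
        makespan = max(makespan, end)
    return makespan


if __name__ == "__main__":
    # Eight small kernels, each using 4 of 16 units for 1 time unit.
    small_kernels = [(4, 1.0)] * 8
    print("serial:    ", serial_time(small_kernels))
    print("concurrent:", concurrent_time(small_kernels))
```

In this model the eight small kernels take 8 time units back to back but only 2 when four of them can share the chip at once. A kernel large enough to fill the whole GPU gains nothing, which is why the feature matters most for programs made of many small kernels.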
Let’s step back and look at the big picture for a minute. When I asked a friend of mine what he thought of GPU computing, he said, “Well, almost everybody in HPC can use an array processor.” “That is quite true,” I replied, “and now most everybody can afford them.” Then it dawned on me: one way to look at GPU computing is as the return of the vector or array processor to HPC.
In the past, I have made the point that low-volume specialized CPUs, like those found in very high-end vector supercomputers, had become too expensive to manufacture. In response, the market swung to commodity processors and clusters. Great reductions in price-to-performance ensued, but now we have come to the question of just how far we can scale these clusters before power, space, and even the speed of light become an issue. We can continue to add cores to commodity processors, but in effect we are just adding more general purpose scalar nodes to the equation. Collections of general purpose scalar nodes (i.e., commodity x86 cores) can work on SIMD (Single Instruction Multiple Data) problems just fine, but a modern GPU has been optimized for this class of problem, which includes both graphics and HPC solution spaces. In fairness, the general purpose scalar node can also work on MIMD (Multiple Instruction Multiple Data) problems, for which GPUs were not designed. (For those who like to break the rules, however, check out the MIMD on GPU project.)
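For readers who have not met the SIMD and MIMD labels before, here is a minimal Python sketch of the two shapes of work. Python runs both sequentially, so this only illustrates the structure of each problem, not any hardware behavior, and the function names are invented for the example.

```python
import math


def simd_style(a, xs, ys):
    """One operation (a*x + y) applied to every element; elements are independent."""
    return [a * x + y for x, y in zip(xs, ys)]


def mimd_style(tasks):
    """Each task pairs its own function with its own data."""
    return [func(arg) for func, arg in tasks]


if __name__ == "__main__":
    print(simd_style(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
    print(mimd_style([(math.sqrt, 16.0), (abs, -3), (len, "fermi")]))
```

The first function is the kind of work a GPU is built for: the same arithmetic repeated across a large array, with no element depending on another. The second mixes unrelated operations, which suits independent general purpose cores better.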
The GPU, and in particular the Fermi architecture with its HPC features, is bringing vector processors back into supercomputing. Enhancing the commodity scalar core with a commodity vector processor makes a boatload of sense. And now that Fermi has planted the GPU-HPC flag firmly in the ground, I see others following. Like the physicist for whom it is named, I believe Fermi will contribute in big ways to many areas. I do have one request for Nvidia, however. When you sit down to plan the next generation GPU, use the code name Boltzmann. Ludwig and the whole S=k*logW thing could use some of that GPU limelight.