Before we get started, I wanted to mention my recent review of the new Appro Tetra GPU server, which is part of the Fermi wave hitting the market. The Appro box is rather unique in that they managed to cram four Fermi cards (M2050) into a 1U box. Speaking of Fermi, yesterday I was at the HPC Financial Markets conference in New York City. As I perused the vendor tables, I noticed a vendor touting 2.01 TeraFLOPS using four M2050 cards in a big tower case. Having just spent a lot of time getting intimate with this hardware, I asked them how they managed to get such a mind-boggling number. They said it was actually the sum of the theoretical FLOPS for all four cards. Sigh. As politely as I could, I suggested that they not advertise such BS (Bovine Solids) because it creates impossible expectations for the market. In addition, real HPC users are focused on the performance of their own applications and not Linpack, unless, that is, they use Linpack, in which case they already know what you are telling them.
Another interesting aspect of my trip to the Big Apple was my bus ride. I live about 80 miles west of NYC in the Lehigh Valley, PA. There is very good bus service between the valley and the city, so I always take the bus for these HPC field trips. To my surprise, the bus line now has wi-fi on the bus. I was able to do email (thanks Squirrel Mail), browse a bit, and even work on this column. I could not ssh, however, but hey, I’ll take web access on the bus any day. Of course, we now need to start proposing “bus clusters.”
Getting back to HPC and parallel computing, there was some discussion on the Beowulf Mailing List about a recent short paper by Google’s Urs Hölzle on “wimpy cores” vs “brawny cores.” As I understand his comments, he is discussing the trend of using multiple low power servers in place of bigger, faster, power hungry servers. Let’s assume he means something like the Intel Atom vs the Intel Xeon. I did a very rudimentary comparison of these two processors and found the following (I used data for an i7 model 870 running at 2.93GHz, which is close enough to a Xeon 5570).
The Nehalem Xeon is clocked about 1.8 times faster, generates 7.3 times as much heat (TDP), and costs 22 times as much as the D510 Atom. Overall Xeon performance is 7.7 times higher, but when you factor in price-to-performance, the Atom is about 3 times better than the Xeon solution (22 / 7.7 ≈ 2.9). Interestingly, the TDP/performance ratios are almost identical for both processors (7.3 / 7.7 ≈ 0.95). Thus, there is no real power advantage with either processor.
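For anyone who wants to check my arithmetic, here is a minimal Python sketch of the two ratios. It uses only the relative figures quoted above (Xeon divided by Atom); the variable and function names are my own invention for illustration, and nothing here is a measurement.

```python
# Relative figures quoted in the text: Xeon-class part divided by Atom D510.
tdp_ratio = 7.3           # heat (TDP)
price_ratio = 22.0        # cost
performance_ratio = 7.7   # overall performance

def per_unit_of_performance(resource_ratio, perf_ratio):
    """How much more of a resource the Xeon spends per unit of work, relative to the Atom."""
    return resource_ratio / perf_ratio

price_per_perf = per_unit_of_performance(price_ratio, performance_ratio)
tdp_per_perf = per_unit_of_performance(tdp_ratio, performance_ratio)

print(f"price/performance: the Xeon costs {price_per_perf:.1f}x more per unit of work")
print(f"TDP/performance: the Xeon draws {tdp_per_perf:.2f}x the power per unit of work")
```

A value near 1.0 for the TDP line means neither part has a real power advantage, while the price line is where the Atom pulls ahead.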
There are two things to note: the price/performance is 3 times better for the Atom, and the TDP/performance ratios are about the same. Taking price first, the Atom is presumably low cost because it is easier to manufacture and is produced in more volume than the Xeon processors. Low power laptops, “pads”, routers, set-top boxes, low-end NAS boxes, and other appliances represent huge volumes for processors like the Atom. (The AMD Bobcat, VIA Nano, Qualcomm Snapdragon, and ARM Cortex-A15 are a few other low power designs.)
The second thing to note is the similarity in the TDP/performance ratios. My guess is that most low power processors will be in this range. This means there is no real secret sauce in the low power designs. Some designs, like the ARM processors, can be more efficient, however, because they don’t need to devote transistors to legacy x86 support. Thus, the advantage with “wimpy cores” is probably a better price-to-performance ratio. If your application is scalable, you may be able to take advantage of this trend.
With this in mind, Urs Hölzle is proposing that for Google workloads the “brawny” Xeons are better than the “wimpy” Atoms. His argument is based on the need to parallelize code on the wimpy cores at the thread level (inside the box) to achieve a comparable response time. The brawny cores are fast enough that they only need to be parallelized at the request level (outside the box). His conclusion is that the extra programming overhead needed for wimpy cores will negate any of the cost savings. He also mentions some scheduling issues and the need for more hardware (fans, power supplies, cases, etc.) for wimpy cores.
I certainly believe Urs knows what is best for Google, but at the same time I want to remind everyone of a few lessons learned from the house of parallel. First, coding for parallel is indeed expensive, so do it right the first time and plan for the future. This task is tricky. You want to make sure you have scalable code, not just code that works today. There may also be better non-procedural approaches to parallelism (e.g. functional languages). There is a premium price for parallel programming, and it may be possible to make programmers more productive with higher level tools.
Second, as he mentions, Amdahl’s law will often trump any lofty dreams. Experience in HPC has shown that the parallel portion of a program often increases as either the data set gets larger or the amount of parallel processing gets larger. Thus, even the brawny cores may need to be parallel at some point. One should not count on increased clock rates to save the day. This remedy may have worked in the past, but increased speeds are going to be slow in coming, which is why we are having this discussion.
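To put some numbers on Amdahl’s law, here is a small Python sketch. It assumes the textbook form of the law (speedup = 1 / (serial fraction + parallel fraction / cores)); the function name and the sample fractions are mine and are purely illustrative, not taken from Hölzle’s paper or from any real workload.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Textbook Amdahl's law: best-case speedup when only part of the run time scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Illustrative parallel fractions only; real codes have to be measured.
for p in (0.50, 0.90, 0.99):
    row = ", ".join(f"{n} cores: {amdahl_speedup(p, n):.1f}x" for n in (4, 16, 64))
    ceiling = 1.0 / (1.0 - p)
    print(f"parallel fraction {p:.2f} -> {row} (ceiling {ceiling:.0f}x)")
```

The ceiling is the point: no matter how many cores you add, the speedup never exceeds 1 / (serial fraction). That is also why a growing parallel portion on bigger data sets matters so much, since it raises the ceiling.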
And finally, once the important things are parallel and portable (which, by the way, may not always be possible), you open yourself up to more flexibility in execution. For instance, MPI codes will run on both clusters and SMP systems. Optimization is another issue, but it is far easier to optimize when you have a working baseline. At some point, even adding cache-coherent (SMP) cores is going to hit the wall, and parallel designs are going to be the only way to increase performance. Those that plan for a “cluster centric” design will be well rewarded.
The lesson for today: the future is parallel, get used to it. The HPC community has been living and working with parallel clusters for more than a decade. While number crunching may not be on everyone’s list, the mass market lessons of the high end always head downstream. As the pioneers of this new parallel world, it has been our job to weave the common into the exceptional. Based on the success of the HPC market, I am sure others will follow.