As I mentioned, the Appro 1U Tetra GPU Server is a truckload of number-crunching hardware. Indeed, NVidia benchmarked the Appro Tetra GPU at an amazing 1116 GigaFLOPS. Yes, that is correct: 1.1 TeraFLOPS. According to NVidia, this is the first 1U server to break the TeraFLOPS barrier. A little perspective may be helpful. Without the GPUs, the 12 cores provided by the dual Intel X5670 processors can reach 121 GigaFLOPS (I have seen reports of 130 GigaFLOPS, but I have data for the 121 GigaFLOPS number). If you subtract that from 1116 and divide by four, you get roughly 249 GigaFLOPS per Tesla M2050. Of course, your application, which is probably not Linpack based, will determine your actual performance.
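The per-GPU estimate above is simple arithmetic, and a short Python sketch makes it easy to rerun with other figures. The function and variable names are my own, and the calculation assumes the CPU and GPU contributions simply add, which is a rough approximation for HPL rather than a measured breakdown.

```python
# Figures quoted in the text, all in GigaFLOPS
total_gflops = 1116.0   # NVidia-reported HPL result for the full server
cpu_gflops = 121.0      # dual X5670 processors without the GPUs
num_gpus = 4            # Tesla M2050 cards in the Tetra


def per_gpu_gflops(total, cpu, gpus):
    """Estimate each GPU's share, assuming CPU and GPU contributions add linearly."""
    return (total - cpu) / gpus


estimate = per_gpu_gflops(total_gflops, cpu_gflops, num_gpus)
print(f"{estimate:.2f} GigaFLOPS per GPU")
```

Swapping in the 130 GigaFLOPS CPU figure lowers the per-GPU estimate by only a couple of GigaFLOPS, so the conclusion does not depend much on which CPU number you trust.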
I decided to try running HPL (High Performance Linpack) on my own. Appro and NVidia provided me with the software and some pointers on how to run it on the Tetra server. Before I present results, I want to state that generating an HPL number takes time and tuning, which means multiple runs and re-testing. I did not have the time to reach the 1 TeraFLOPS number reported by Appro and NVidia, but I did get a sense for how fast this combination of hardware can be. First, I ran on a single Tesla M2050 and got 370 GigaFLOPS. Next, I tried using all four GPUs and got 597 GigaFLOPS. To generate the xhpl program, I used gcc/gfortran (V4.1.2), Intel MKL (10.2.6.038), Open MPI (V1.2.7), and the CUDA SDK (V3.0). Based on the individual M2050 performance, I am confident that with more time I could have reached the TeraFLOPS threshold as well.
In running HPL, my main goal was to see if I could “crank” up the whole system and run for prolonged periods of time on a “real” application. The Appro 1U Tetra GPU Server hit high marks in this area. There was no overheating, no crashes, and absolutely no problem in delivering the high performance this hardware offers.
Power and Cooling
The Tetra GPU Server is a high performance system and thus is going to have a high power appetite. Reporting power usage without performance data is not very useful, however. Experienced users know that FLOPS are paid for in Watts. To get an idea of the power budget for the Appro Tetra, I used my handy Kill-A-Watt wall socket power meter to measure the amount of power used by the system. When idle, the Tetra used 468 Watts, which is understandable given the amount of hardware inside. GPUs are known for their power appetite even when idle. When I was running HPL, the power usage fluctuated and hit 1389 Watts at one point. Using my data (597,000 MegaFLOPS / 1389 Watts), I get 430 MegaFLOPS/Watt, which would be position 9 on the latest Green 500 list. If I were to use the higher TeraFLOPS number with the same 1389 Watts, that would put the Appro Tetra at the top of the Green 500 list with 803 MegaFLOPS/Watt.
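The efficiency figures above can be reproduced with a few lines of Python. This is only a sketch: the function name is my own, it uses the single highest wall-socket reading rather than an averaged measurement, and it assumes the tuned 1116 GigaFLOPS run would draw the same 1389 Watts, which I did not measure.

```python
def mflops_per_watt(gflops, watts):
    """Convert GigaFLOPS to MegaFLOPS, then divide by power draw in Watts."""
    return gflops * 1000.0 / watts


peak_watts = 1389.0  # highest Kill-A-Watt reading seen during my HPL run

measured = mflops_per_watt(597.0, peak_watts)    # my four-GPU HPL result
reported = mflops_per_watt(1116.0, peak_watts)   # NVidia-reported HPL result (same power assumed)

print(f"measured: {measured:.0f} MegaFLOPS/Watt")
print(f"reported: {reported:.0f} MegaFLOPS/Watt")
```

Because the divisor is a peak reading, both numbers are probably on the conservative side compared with a figure based on average power over the whole run.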
In order to keep the air moving through the box, Appro employs 12 small fans. Small fans can be loud and the Tetra moves a lot of air through the box. The Tetra GPU Server is definitely a data center machine and should be in a professional data center with adequate air flow and a tolerance for fan noise. I would not recommend this system for an office or lab rack unless you can guarantee adequate cooling and noise insulation.
In November 1998, the 1116 GigaFLOPS Appro Tetra GPU 1U server would have been a close second to the 1338 GigaFLOPS ASCI Red machine (9152 processors) at Sandia National Labs as reported on the Top500 list. Skipping forward to November 2004, the same performance, by then position 325, could be had with 256 dual Pentium Xeon 2400 nodes. Today, this level of performance has fallen off the list and is equivalent to a 10-12 node dual Xeon X5670 cluster with 12 cores per node.
From my experience, the Appro Tetra GPU is a worthy, well constructed, efficient HPC workhorse. With a starting price of $12,945, it could also set a new record for cost per HPL-GigaFLOPS (my guess is around $10-$12/HPL-GigaFLOPS, depending on how you configure it). If GP-GPU has been working for your applications, this box can add a serious performance boost to your HPC workload. Here is my quick summary:
Pros: well constructed, very fast, ready to run, cost/energy efficient
Cons: fans are loud, small front panel
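For readers who want to check the cost-per-GigaFLOPS guess, here is a minimal Python sketch. The function name is my own, and it assumes the $12,945 starting price buys the same configuration that produced each HPL number, which may not be the case.

```python
def dollars_per_gflops(price_usd, gflops):
    """Cost of one HPL-GigaFLOPS at a given system price."""
    return price_usd / gflops


starting_price = 12945.0  # starting price quoted in the review; actual configuration is an assumption

reported_cost = dollars_per_gflops(starting_price, 1116.0)  # NVidia-reported HPL result
measured_cost = dollars_per_gflops(starting_price, 597.0)   # my untuned four-GPU run

print(f"${reported_cost:.2f} per HPL-GigaFLOPS (reported result)")
print(f"${measured_cost:.2f} per HPL-GigaFLOPS (my untuned run)")
```

The reported result lands inside the $10-$12 range, while my untuned run comes out at roughly twice that, which shows how much the figure depends on HPL tuning.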
As received, the box included the Appro Tetra GPU server, a Quick Start Guide, mounting rails/hardware, a power cord, and a USB stick with updated drivers. To learn more, contact Appro International.
Douglas Eadline is the Senior HPC Editor for Linux Magazine.