Hitting The Cluster Wall

Two months ago, the Kronos “value cluster” set a new record for price-to-performance, yielding 14.53 gigaflops at the cost of $171 per gigaflop. But is that the best Kronos can do? Or can some additional investment of time and effort push the extremes a little further? Discover if Kronos hits the proverbial wall, learning more cluster optimization techniques along the way.

January 2006

Extreme Linux

A Study in Cluster Optimization

Douglas Eadline

The November 2005 feature story “Life, The Universe, and Your Cluster” (available online at http://www.linux-mag.com/2005-11/value.html) presented Kronos, the “value cluster,” built for a paltry $2,500. While small and inexpensive, Kronos nonetheless delivered 14.53 gigaflops (52 percent of peak) on the famous Top 500 High Performance Linpack (HPL, http://www.top500.org/) benchmark. That’s a record of sorts: only $171 per gigaflop. Can Kronos crunch even faster?

When fixing a car, here’s a general rule of thumb: Replace the cheapest, most accessible part first. Then, if necessary, move again and again to the next easiest and cheapest part to replace, continuing until you (hopefully) correct the problem. As any grease monkey will tell you, it’s far simpler and cheaper to replace a gas filter than it is to replace a carburetor.
When optimizing a cluster, the same rule applies. As was done in November to tweak the value cluster, you can start with the easiest and most accessible components: fiddle with parameters, choose the best Basic Linear Algebra Subprograms (BLAS) library, and tune the Ethernet driver. From there, you can make more substantive changes, switching out interconnects, drivers, Message Passing Interface (MPI) “middleware,” and compilers.
But while parts may be (relatively) cheap, time and effort typically aren’t. For example, if you want to evaluate compilers and MPI implementations, there are many, many to choose from (see the sidebar “Too Much of a Good Thing”). Short of testing them all, you may find that “good enough” suffices.
So, what’s good enough for Kronos? Here are the results.

A Nudge, Not a Bump

To continue to push Kronos’s capabilities, let’s pick up where the last article left off: let’s tune TCP values, change the MPI library, change the compiler, and finally, use a kernel by-pass MPI library. And while it’s impossible to try every possible combination (see Sidebar XX), changing a small number of components at a time is still a valuable exercise, and one that promises even the most stalwart cluster creator a few new gray hairs. (For Kronos’s specifications, see the sidebar “The Previous Kronos Record.”)
The easiest thing to try is tuning TCP parameters (see http://www-didc.lbl.gov/TCP-tuning/linux.html). In newer Linux kernels, the kernel fine-tunes the buffer size based on the communication pattern it encounters. Listing One shows settings that increase the maximum tunable size of the buffers. To enable the changes, add the lines to the /etc/sysctl.conf file and run sysctl -p.
LISTING ONE: New parameters to enlarge the maximum tunable size of the TCP buffers

# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

On Kronos, re-running the best case with the new settings yielded a small increase to 14.57 gigaflops (multiple runs confirmed that this increase is statistically significant). This change was easy and took hardly any time, but didn’t yield much improvement, either.
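To double-check that the new limits actually took effect, a short script can compare the entries from Listing One with the values the running kernel reports. What follows is a minimal Python sketch, not part of the original Kronos setup: the helper names (parse_sysctl, live_value) are made up for illustration, and it assumes a Linux system where each sysctl key maps to a file under /proc/sys.

```python
from pathlib import Path

# The settings from Listing One, exactly as they would appear in /etc/sysctl.conf.
LISTING_ONE = """\
# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
"""


def parse_sysctl(text):
    """Parse sysctl.conf-style text into {key: [int, ...]}, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = [int(v) for v in value.split()]
    return settings


def live_value(key, root="/proc/sys"):
    """Read the running kernel's value for a key, or None if it is unavailable."""
    path = Path(root) / key.replace(".", "/")
    try:
        return [int(v) for v in path.read_text().split()]
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for key, wanted in parse_sysctl(LISTING_ONE).items():
        current = live_value(key)
        status = "ok" if current == wanted else "differs"
        print(f"{key}: wanted {wanted}, running {current} ({status})")
```

Note that 16777216 bytes is 16 MB (16 x 1024 x 1024), so the listing raises the ceiling for autotuning rather than forcing every connection to use buffers that large.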

Pick an MPI

So far, all of the Kronos system benchmarks have been based on LAM/MPI, the default MPI used in the Warewulf distribution. The version tested here was 7.0.6. So, let’s try a different MPI.
The most logical implementation to try is MPICH from the Argonne National Laboratory. The latest version of MPICH is MPICH2, available from http://www-unix.mcs.anl.gov/mpi/mpich.
On the test system, MPICH2 compiled easily enough, but the HPL code failed to run because each of the nodes was missing some shared libraries. However, since a Warewulf cluster uses a minimal RAM disk on each node, it took only five minutes to copy the missing libraries to the cluster’s Virtual Network File System (VNFS) and build a new node RAM disk image.
Unfortunately, MPICH2 also uses daemons (instead of ssh) to start remote jobs, and the daemons require a specific version of Python to run. Rather than expend more effort on MPICH2, the effort switched to focus on the tried-and-true MPICH1.
After a quick configure; make install, MPICH1 ran on the cluster. After some fiddling with environment variables and the HPL Makefile, re-running the benchmark resulted in 13.9 GFLOPS. This result was good, but it wasn’t the best. (For the MPI jihadists out there, this result does not necessarily mean LAM is always better than MPICH. Results vary largely depending on each application, and MPICH can exceed LAM, too.)
Given the MPICH1 results, trying OpenMPI seemed worthwhile. OpenMPI is a new, highly-modular MPI being written by the LAM/MPI, FT-MPI, LA-MPI, and PACX-MPI teams, and it’s very near a final release. However, after downloading and building the libraries, HPL basically stalled. Evidently, this is a known issue. It was time to move on again.

Pick a Compiler and an MPI

The compiler is one of those really cool tools that can give your application a really nice kick in the asymptote. Again, much depends on your application, but many of the commercial compilers have more hardware-specific options than the GNU compilers and are therefore worth a try. Presently, though, most of the commercial compilers are focused on x86-64 level processors and have no great interest in optimizing for Sempron.
To continue with Kronos, The Portland Group (PG, http://www.pgroup.com) compiler was the next choice. It’s reliable and robust, and you can use it free for fifteen days. In any case, it is just a recompile, right?
There are three basic components in the HPL program: the program itself, the BLAS library, and the MPI library. The cleanest way to build an application with a new compiler is to build all of its supporting libraries as well. Otherwise, you may face a multitude of linking problems.
Building MPI code with alternate compilers is well-documented, so the task shifted to building a new version of MPI with PG, building a new version of Atlas with PG, and building a new version of HPL with PG, finally linking all of the components together.
Building MPI went smoothly, so the next step was building Atlas. Simple enough (seemingly): change the Atlas configuration and enter make. Alas, no joy. According to the HPL documentation, Atlas should not be built with the PG compiler. (Another known issue validated.)
However, the PG compiler is very good at linking in GNU compiled code, so it can link the GNU Atlas libraries and build HPL. After some Makefile magic, HPL built with some cool optimization flags: -fastsse -pc=64 -O2 -tp athlonxp. The run produced 13.92 gigaflops. At this point, the “MPICH1-PG” version is slightly better than the “MPICH1-GNU” version, but worse than the “LAM/MPI-GNU” version.
Undaunted, the next thing to try was a LAM/MPI-PG combination. After some more Makefile madness, the code produced a new record of 14.90 gigaflops. However, that’s no cause to celebrate. The amount of time spent with the MPI/compiler rebuilds was easily two days, yielding a scant improvement of 0.33 gigaflops. Evidently, a new tack was needed.

Bring On The Big Guns

There have been two constants in the tests so far: The GNU/Atlas library and the TCP-based MPI libraries.
A quick check finds another BLAS library from the Texas Advanced Computing Center called GotoBLAS (http://www.cs.utexas.edu/users/flame/goto). Good things have been reported about these optimized libraries, but, alas, the documentation indicates that GotoBLAS isn’t supported on the Sempron processor.
The other parameter that’s remained static is the use of TCP to communicate between nodes. As mentioned, TCP uses buffers. When a communication takes place, data is copied to the operating system buffer, then across the network to the other node’s operating system buffer, then copied again to the user space application. HPC practitioners have known for years that this extra copying slows things down, so “kernel bypass” software and hardware has been built to copy data directly from user space to user space.
Normally, a kernel bypass requires some fancy and expensive hardware. But since adding new hardware wasn’t a possibility, that left one option: Ethernet kernel by-pass. Fortunately, such a project exists and works on Kronos’s Intel Gigabit Ethernet PCI cards. The project is called the Genoa Active Message Machine (or GAMMA, http://www.disi.unige.it/project/gamma/), and it’s maintained by Giuseppe Ciaccio of Dipartimento di Informatica e Scienze dell’Informazione in Italy.
GAMMA requires a specific kernel (2.6.12) and must be built with some care. The current version of GAMMA takes over the interface card when GAMMA programs are running, but allows standard communication otherwise. The Kronos cluster has a Fast Ethernet administration network to help as well. Of course, Warewulf needed to be configured to use GAMMA.
Without too much trouble, Kronos was soon running the GAMMA “ping-pong” test. The results were as follows:
Average latency 9.54739 microseconds
Maximum Throughput: 72 MBytes/sec
Previous tests using Netpipe TCP (http://www.scl.ameslab.gov/netpipe) showed a 29-microsecond latency and 66 MB per second throughput. (Recall that Kronos is using 32-bit, 33 MHz PCI cards, so the top bandwidth is going to be limited by the PCI bus. In any case, such numbers were quite astounding for this level of hardware.)
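One rough way to read these numbers is the common linear cost model, in which sending a message takes a fixed latency plus the message size divided by the peak throughput. The Python sketch below applies that model to the measurements quoted above. The model is a simplifying assumption (measured curves are not this smooth), it treats a megabyte as 10^6 bytes, and the function name is made up for illustration. It also works out the theoretical peak of a 32-bit, 33 MHz PCI bus.

```python
def transfer_time_us(size_bytes, latency_us, throughput_mb_s):
    """Estimated one-way time in microseconds: latency + size / throughput.

    With throughput in MB/s (10**6 bytes per second), bytes divided by MB/s
    comes out directly in microseconds.
    """
    return latency_us + size_bytes / throughput_mb_s


# Measurements quoted in the text: (latency in microseconds, throughput in MB/s).
TCP = (29.0, 66.0)
GAMMA = (9.54739, 72.0)

# Theoretical peak of a 32-bit (4-byte), 33 MHz PCI bus, in MB/s.
# Every device on the bus shares this, so real throughput is lower.
PCI_PEAK_MB_S = 4 * 33

if __name__ == "__main__":
    print(f"PCI peak: {PCI_PEAK_MB_S} MB/s")
    for size in (64, 1024, 65536, 1048576):
        tcp = transfer_time_us(size, *TCP)
        gamma = transfer_time_us(size, *GAMMA)
        print(f"{size:>8} bytes: TCP {tcp:10.1f} us, GAMMA {gamma:10.1f} us, "
              f"ratio {tcp / gamma:.2f}")
```

Under this model, small messages are dominated by latency, where GAMMA holds a roughly threefold advantage, while large messages are dominated by throughput, where the two are much closer.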
There is MPI support for GAMMA as well. The GAMMA authors have modified MPICH version 1.1.2 to use the GAMMA API.
However, before seeing the effect of GAMMA on HPL, it was useful to see the difference between GAMMA-MPI, LAM/MPI, and MPICH1. Fortunately, the Netpipe benchmark has an MPI version. By leveling the playing field, GAMMA improvements can really stand out. The results are shown in Figure One, where throughput is charted against block size. At the beginning and the end of the graph, GAMMA-MPI is the clear winner; in the middle portion, however, the other MPIs have an advantage over GAMMA-MPI.
Figure Two shows the difference in small packet latency for the various MPIs. In this case, GAMMA-MPI was the clear winner.
Another thing to notice is that the TCP latency was previously found to be 29 microseconds. Adding an MPI layer increased this to over 40 microseconds. As is often the case, adding an abstraction layer adds overhead. In this case, the portability of MPI is often a welcome trade-off for the increase in latency.
FIGURE ONE: Throughput vs block size for MPI-GAMMA, LAM/MPI, and MPICH1

In all fairness, each MPI can be tuned somewhat to help with various regions of the curve. In addition, there are other implementation details of each MPI library that come into play when a real code is used. (Hence, the results in Figure One and Figure Two are not the sole predictor of MPI performance.)
FIGURE TWO: Latency signature for MPI-GAMMA, LAM/MPI, and MPICH1

Armed with the GAMMA results, let’s create and execute a new version of HPL. On Kronos, the new benchmark ran successfully, but produced lackluster results. When GAMMA ran on the cluster, the amount of free memory decreased by 20 percent. Some adjustments got this number down to 10 percent, but the HPL problem size needed to be reduced.
To work as fast as possible, GAMMA needs to reserve memory for each connection it creates. So the cost for speed is memory. In the case of a GAMMA HPL, the problem size is smaller and fewer gigaflops are possible.
Nonetheless, it was possible to run a problem size of 11,650 successfully. This run resulted in 14.33 gigaflops, nowhere near a new record. To see the real effect, both LAM/MPI and MPICH1 were re-run using this problem size to see how MPI-GAMMA helped performance. At that problem size, MPICH1 returned 13.66 gigaflops and LAM/MPI returned 14.21 gigaflops.
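The link between problem size and memory is simple arithmetic: HPL factors a dense N x N matrix of double-precision values, so the matrix alone takes about 8 x N x N bytes spread across the cluster. The Python sketch below compares the 11,650 problem size with the 12,300 used in the earlier runs (the size listed in Table One). It counts only the matrix, ignoring HPL workspace and MPI buffers, and the function name is made up for illustration.

```python
def hpl_matrix_bytes(n):
    """Bytes needed for an n x n matrix of 8-byte double-precision values."""
    return 8 * n * n


ORIGINAL_N = 12300  # problem size of the earlier record runs
GAMMA_N = 11650     # largest problem size that ran with GAMMA loaded

if __name__ == "__main__":
    original = hpl_matrix_bytes(ORIGINAL_N)
    reduced = hpl_matrix_bytes(GAMMA_N)
    saved = 1 - reduced / original
    print(f"N={ORIGINAL_N}: {original / 1e9:.2f} GB")
    print(f"N={GAMMA_N}: {reduced / 1e9:.2f} GB")
    print(f"Memory reduction: {saved:.1%}")
```

The smaller matrix needs roughly 10 percent less memory than the original one, which is in line with the free memory that GAMMA claimed.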
It seems as though Kronos may have hit a wall. Even if GAMMA-MPI could run the original problem size, the improvement probably wouldn’t be that great.

The Wall

A summary of all of the tests is given in Table One. After all the efforts, the best Kronos could do was 14.90 gigaflops. Some tuning, tweaking, and twitching could break 15 gigaflops, but the time needed to eke out another 0.1 gigaflops would probably be two or three days. Hence, the performance is “good enough” for this application on this cluster.
Another indication that Kronos is near its maximum is shown in Figure Three, which displays the output of wwtop, a cluster top-like application that shows the processor, memory, and network load on the cluster. You can clearly see that the processors that aren’t communicating are calculating at close to 100 percent, and those that are communicating are running high as well. The head node, which was used in the calculations, isn’t shown, but is assumed to have similar data.
FIGURE THREE: Loads while running HPL on cluster

Was It Worth It?

Although the exercise really did not set a new record worth shouting about, it did reveal a few things about the cluster and the application. First, the previous efforts, which required far less time, produced great results. Second, swapping MPIs and compilers had very little effect, which means that any bottlenecks probably do not reside in these areas. And finally, there are always trade-offs on the road to “good enough.”
TABLE ONE: Summary results for optimization. PG = Portland Group compilers; + buffers = increased TCP buffer range

Test  MPI          Compiler  BLAS   TCP buffers  Problem size  Gigaflops
1     LAM/MPI      GNU       Atlas  default      12300         14.53
2     LAM/MPI      GNU       Atlas  + buffers    12300         14.57
3     MPICH1       GNU       Atlas  + buffers    12300         13.90
4     MPICH1       PG        Atlas  + buffers    12300         13.92
5     LAM/MPI      PG        Atlas  + buffers    12300         14.90
6     MPICH1       GNU       Atlas  + buffers    11650         13.66
7     LAM/MPI      GNU       Atlas  + buffers    11650         14.21
8     MPICH-GAMMA  GNU       Atlas  NA           11650         14.33

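To put the numbers in perspective, the Python sketch below expresses each result as a percentage change from the 14.53 gigaflops baseline, using the figures quoted in the text (including the 14.90 gigaflops reported for the LAM/MPI and PG combination). The labels and the function name are made up for illustration.

```python
BASELINE_GFLOPS = 14.53  # LAM/MPI, GNU, default TCP buffers, N=12300

# (label, problem size, gigaflops) as reported in the text
RESULTS = [
    ("LAM/MPI GNU + buffers", 12300, 14.57),
    ("MPICH1 GNU + buffers", 12300, 13.90),
    ("MPICH1 PG + buffers", 12300, 13.92),
    ("LAM/MPI PG + buffers", 12300, 14.90),
    ("MPICH1 GNU + buffers", 11650, 13.66),
    ("LAM/MPI GNU + buffers", 11650, 14.21),
    ("MPICH-GAMMA GNU", 11650, 14.33),
]


def percent_change(value, reference):
    """Relative change of value against reference, in percent."""
    return 100.0 * (value - reference) / reference


if __name__ == "__main__":
    for label, n, gflops in RESULTS:
        change = percent_change(gflops, BASELINE_GFLOPS)
        print(f"{label:<24} N={n:<6} {gflops:6.2f} gigaflops  {change:+.1f}%")
```

The best result sits roughly 2.5 percent above the baseline, and at the smaller problem size GAMMA comes out less than 1 percent ahead of LAM/MPI.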
Is there more performance to be had? Probably. The Atlas library is one candidate. A hand-tuned assembler library might work faster, but clearly isn’t worth the effort. Would profiling the code with the Performance Application Programming Interface (PAPI) help? Perhaps. If you run an HPL-like application day in and day out, you might consider the effort. For Kronos, there are far more interesting things to pursue.
A final bit of advice: keep an eye on the big picture. The amount of time spent optimizing a $2,500 cluster might lead one to ask, “Why not just buy faster hardware?” That’s an utterly excellent point.

Douglas Eadline can be found swinging from the trees at clustermonkey.net. Doug wishes to acknowledge AMD for funding this project.
