It’s All About Speed

Every cluster builder wants to know how fast his or her computer is. After all, speed is the primary reason to build a Linux cluster -- aside from the gains in data capacity and resource redundancy. But how do you measure speed? CPU clock speed is one thing, but using those cycles to actually get work done is quite another. And in a cluster, the network connections between nodes can quickly become a crippling bottleneck.

Every cluster builder wants to know how fast his or her computer is. After all, speed is the primary reason to build a Linux cluster — aside from the gains in data capacity and resource redundancy. But how do you measure speed? CPU clock speed is one thing, but using those cycles to actually get work done is quite another. And in a cluster, the network connections between nodes can quickly become a crippling bottleneck.

Theoretical peak performance is easy to calculate. However, this number is almost meaningless: it’s the unachievable upper bound of performance (measured in operations per second) for a given computer system. On distributed memory systems, like Beowulf clusters, few computational problems scale linearly, so most applications achieve a decreasing fraction of theoretical peak performance as the number of processors increases.

What’s needed is a standard benchmark for measuring system performance. Unfortunately, there are lots of “standard” benchmarks.

However, for high performance computing, a single package has been the gold standard for gauging achievable computational performance for many years: Linpack, a set of algorithms for solving linear algebra problems. Every hardware vendor quotes Linpack results. Great. So you run Linpack and compare your results against others’, right?

But hold on a minute! Who cares about how fast a computer system runs someone else’s software package? Unless you bought or built your machine to run Linpack day in and day out, your system’s performance on Linpack isn’t all that important. What is important is how fast a computer runs your own real-world application.

If Cluster A runs some benchmark faster than Cluster B, but Cluster B runs your simulation model faster than Cluster A, you’re going to choose Cluster B, all other things being equal. In fact, many cluster owners run only their target applications to gauge and tune performance — they never run a standard benchmark.

That’s not to say that benchmark results are meaningless, only that they have limited meaning. Certainly, computer scientists have come to realize this, and recently there’s been a move afoot to build a table of performance results for standard simulation models — from climate and weather codes to protein folding and quantum chromodynamics models — on a wide variety of hardware platforms.

Running standard benchmarks is akin to drag racing. And the best organized and most well-known, high stakes drag race in the computing world is called the TOP500 Project.

Top of the Computer World, Ma

Released twice each year, the “Top 500 List of Supercomputing Sites” (http://www.top500.org) tracks the world’s most powerful computer systems, where performance on the Linpack benchmark is used to rank the systems. The twenty-first edition of the list, released in June 2003, contains every kind of computer, from the most expensive commercial supercomputers to self-made Linux clusters.

Topping the list for the second time is the Earth Simulator supercomputer built by NEC and installed last year at the Earth Simulator Center in Yokohama, Japan. With its Linpack benchmark performance of 35.86 Tflops (trillions of operations per second), its performance beats that of the number two machine, the ASCI Q machine at Los Alamos National Laboratory, by a significant margin. However, at 13.88 Tflops, ASCI Q is only the second supercomputer to officially exceed the 10 Tflops mark.

Third on the Top 500 list is the MCR Linux Cluster at Lawrence Livermore National Laboratory (LLNL). No Linux cluster has ever achieved a position that high on the list before. Built by Linux Networx and using the Quadrics interconnect, this 1,152-node cluster achieved 7.634 Tflops. The MCR cluster consists of 2,304 Intel 2.4 GHz Xeon processors, 4.6 TB of aggregate memory, and 138.2 TB of aggregate local disk space. The system uses the Lustre open source cluster-wide file system, and utilizes the Quadrics network for IP transfers, providing bulk data delivery to cluster applications.

The number of clusters in the Top 500 continues to grow, and now totals 149 systems. Of these, 23 are designated as self-made, meaning that the machines were designed and constructed by the cluster users themselves. The remaining 126 clusters were built by commercial vendors. These statistics, along with the recent success of the ClusterWorld Conference & Expo, demonstrate the strength of the cluster computing market.

Now you’re probably wondering how your cluster compares with the big boys. What’s the best way to get your cluster souped up for a run around the track? Grab HPL, the High-Performance Linpack benchmark.

The High-Performance Linpack Benchmark

HPL is a free implementation of the Linpack benchmark for distributed-memory computers. It solves a random, dense linear system of equations in double precision arithmetic in parallel. Developed at the Innovative Computing Laboratory at the University of Tennessee, HPL provides a testing and timing program to quantify the accuracy of the obtained solution and the time it took to compute it.

HPL is very scalable (at least to thousands of processors), and can be configured in a numbers of ways so that a maximum performance result can be achieved on a variety of computer architectures. HPL requires either the Basic Linear Algebra Subprograms (BLAS) or the Vector Signal Image Processing Library (VSIPL), and the Message Passing Interface (MPI) for communications among distributed-memory nodes. Generic implementations and machine- and compiler-specific versions of these three libraries are available via the Internet.

The main HPL algorithm solves a linear system of equations of order n (Ax = b) by first computing the LU factorization with row-partial pivoting. The solution is obtained by solving the upper triangular system since the lower triangular factor is applied to the right hand side, b, as factorization progresses. Data is distributed onto a two-dimensional P-by-Q grid of processes in blocks to assure good load balance and scalability of the algorithm. The n-by-n+1 coefficient matrix is partitioned into nb-by-nb blocks that are cyclically distributed onto the P-by-Q process grid. In each iteration, a panel of nb columns is factorized, and the trailing submatrix is updated.

To check the solution, the input matrix and right hand side are regenerated. Three residuals are computed, and the solution is considered “correct” when all three of the quantities are less than a specified threshold value. These residuals are printed in the output from each run.

Building HPL

A gzip‘ed tar file of HPL can be downloaded from the HPL website (see below). Once unpacked, you should create your own Make.<arch> file at the top level; this file specifies which compilers to use, and names the locations for compile-time libraries for BLAS or VSIPL and MPI. The setup/ directory in the HPL distribution contains example files that can serve as models for building your own.

When customizing the Make.<arch> file, you must choose between the BLAS Fortran 77 interface, the BLAS C interface, or the VSIPL library, depending on which is available on the system. Since HPL is written in C, if the BLAS Fortran 77 interface is chosen, you must fill out the machine-specific C to Fortran 77 interface section of the Make.<arch> configuration file.

A hardware-optimized version of BLAS is recommended for benchmarking purposes. The reference implementation will work, but won’t give the best performance on any individual computer system. You can build your own optimized version using something called ATLAS — Automatically Tuned Linear Algebra Software. This package is also freely available, and has been shown to generate well-optimized routines for a wide range of computer platforms.

An optimized version of MPI should also be used to yield the best performance results. If your system has a high bandwidth, low latency interconnect, the manufacturer of your system probably has a specific, customized, highly-optimized implementation of MPI for your hardware. If Fast or Gigabit Ethernet is the interconnect, the latest version of MPICH (from Argonne National Laboratory) should do.

Once the desired libraries are in place and the Make.<arch> file’s been created, type make arch=arch, where arch is the extension of the file you just created. For example, if your file’s called Make.Linux_PIII_FBLAS, you’d type make arch=Linux_PIII_FBLAS to create an executable called xhpl in bin/Linux_PIII_FBLAS/. A starting input file, called HPL.dat, is also placed in the same directory as the executable. Tuning of the parameters in this input file is described next.

Configuring HPL

The three most significant parameters that you can tune are the problem size, N, the block size NB, and the process grid ratio, PxQ. You can set the parameters as you wish to maximize the performance for a given architecture.

To find the best performance for a cluster, the problem size, N, should be as large as possible while still fitting completely in memory. The memory used is approximately the size of the coefficient matrix. Therefore, for four dual-processor nodes each with 1 GB of RAM, you should assume a size of 512 MB per processor — enough memory for 512 million 8-byte array elements. To find the number of rows or columns in the square matrix we take the square root of 512M to obtain 22,627 rows or columns. That means that a problem size of 20,000 is a good upper bound since it leaves some memory for the operating system and other system processes. Picking too large a problem size will cause nodes to swap, resulting in a significant drop in performance.

The block size, NB, is used for data distribution and for computational granularity. A small block size is usually better for load balancing, and large values are generally bad. However, picking too small a block size increases communications, and the best values depend on the computation to communication performance ratio of the cluster being tested. Good block sizes typically range from 32 to 256. A block size around 60 is good for fast Ethernet, and larger values are better for high bandwidth interconnects.

The process grid PxQ also depends on the network interconnect used by the cluster. Usually P and Q are approximately equal, with Q being slightly larger. This rule works well for a mesh or switched network, but on a unswitched simple Ethernet network, performance is limited to flatter process grids like 1×4, 1×8, 2×4, etc.

Listing One contains an example HPL.dat configuration file created for benchmarking four dual-processor 1.0 GHz Pentium III nodes with 1 GB of RAM each using MPICH over fast Ethernet. The first two lines are merely comments. The third line lists the name of the file used to capture output (if stdout is specified on the next line). Line 4 lists the device number where output is sent: 6 for stdout, 7 for stderr, and any other integer sends output to the specified file.

Listing One: HPL.dat, a configuration file for benchmarking HPL

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
6 # of problems sizes (N)
1000 2500 5000 7500 10000 20000 Ns
2 # of NBs
50 60 NBs
2 # of process grids (P x Q)
1 2 Ps
8 4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
8 NBMINs (>= 1)
1 # of panels in recursion
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

Line 5 denotes the number of problem sizes, N, to try, and line 6 lists those problem sizes. Based on the calculations above, 20,000 is the largest problem size appropriate for this configuration. Line 7 lists the number block sizes NB to try, and those block sizes are listed on the eighth line. Since fast Ethernet is being used, block sizes of 50 and 60 will be tested. Line 9 specifies the number of process grids, PxQ, to try, line 10 lists the Ps, and line 11 lists the Qs. In the example file, a flat 1×8 grid and a more balanced 2×4 grid is going to be tested.

Line 12 specifies the residual threshold for answer comparisons. The remaining lines provide for specification of algorithmic features. These options allow you to control the mechanisms used in solving the line equations. They are explained fully on the HPL website.

Running HPL

Now that HPL is built and configured, it’s ready for a test drive. As usual, mpirun kicks off the program. Output One contains an excerpt from the run using the configuration file shown in Listing One. After printing a title and explanation of parameters, the program echoes the parameter values it read from the HPL.dat input file, prints out the equations it uses for residual checking, and then begins solving the dense linear system, cycling through the specified problem sizes, block sizes, and process grids.

Output One

$ mpirun -np 8 xhpl
HPLinpack 1.0 — High-Performance Linpack benchmark — September 27, 2000
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 1000 2500 5000 7500 10000 20000
NB : 50 60
P : 1 2
Q : 8 4
PFACT : Right
NDIV : 2
RFACT : Right
BCAST : 1ringM
SWAP : Mix (threshold = 60)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words


- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

T/V N NB P Q Time Gflops
W11R2R8 1000 50 1 8 1.79 3.732e-01

||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.9052085 …… PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0219706 …… PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0053079 …… PASSED

Finished 24 tests with the following results:
24 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.

End of Tests.

Upon completion of each solution, the relevant parameters are printed along with the time to solution and the performance in units of Gflops (gigaflops, or millions of floating point operations per second). Next, the residuals are calculated and printed. After all tests are completed, a summary is printed.

The results of all 24 tests on the four dual Pentium III nodes are shown in Table One and Table Two. For both block sizes and grid layouts, the largest problem size gave the best results. The 2×4 process grid consistently gave better results than the 1×8 process grid, and a block size of 60 yielded better results than a block size of 50.

Table One: Results of HPL with block size NB=50

PxQ:1×80.371.001.712.262.773.83 Gflop/s
2×40.771.742.522.953.273.88 Gflop/s

Table Two: Results of HPL with block size NB=60

PxQ:1×80.371.021.792.382.894.21 Gflop/s
2×40.751.782.783.333.764.54 Gflop/s

The Checkered Flag

While these four nodes won’t show up in the winner’s circle of the Top 500 drag race — together they’re 2,000 times slower than the MCR Cluster — you’ve now seen one standard method for benchmarking cluster computers.

It is interesting to see how various hardware configurations perform on standard benchmarks like Linpack, and to get a sense of the relative speeds of different clusters and supercomputers. Just keep in mind that what’s important is “time to solution” for your own problem using your own software.

Your own code is the only benchmark that matters.


Top 500 Supercomputer Sites http://www.top500.org/

MCR Linux Cluster http://www.llnl.gov/linux/mcr/mcr.html

High-Performance Linpack (HPL) Benchmark for Distributed-Memory Computers http://www.netlib.org/benchmark/hpl

Automatically Tuned Linear Algebra Software (ATLAS) http://math-atlas.sourceforge.net/

MPICH – A Portable Implementation of MPI http://www-unix.mcs.anl.gov/mpi/mpich

Lawrence Livermore National Laboratory http://www.llnl.gov

The Earth Simulator Center http://www.es.jamstec.go.jp/esc/eng

Linux Networx http://www.linuxnetworx.com

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov. You can view the entire output of the HPL benchmark run used in this month’s column at http://www.linux-mag.com/downloads/2003-09/extreme.

Comments are closed.