As Beowulf-style clusters have proliferated, many computational scientists have discovered that, although clusters provide adequate CPU performance for most applications, more finely-grained models are often limited by the performance of the network that interconnects nodes. While the network is typically Fast Ethernet (100 Mb/s) or Gigabit Ethernet (1.0 Gb/s), it's still not fast enough to run applications originally developed for shared memory parallel systems or commercial memory systems featuring high performance, custom-designed switched interconnects.
As Beowulf-style clusters have proliferated, many computational scientists have discovered that, although clusters provide adequate CPU performance for most applications, more finely-grained models are often limited by the performance of the network that interconnects nodes. While the network is typically Fast Ethernet (100 Mb/s) or Gigabit Ethernet (1.0 Gb/s), it’s still not fast enough to run applications originally developed for shared memory parallel systems or commercial memory systems featuring high performance, custom-designed switched interconnects.
Such applications have stimulated the market for scalable networks that provide high bandwidth and low latency on commodity clusters. (Bandwidth is the throughput rate of a data channel, while latency is the time it takes to open the channel and initiate the data transfer.) These custom networks — often adding 50% or more to the per-node cost of Linux clusters — are not commodity items, and some people claim that, as a result, a cluster sporting one of these sexy networks can not be called a Beowulf. Religious arguments aside, many researchers consider these high performance interconnects a necessity for their computational clusters.
High Wire Performance
Three of the most popular high performance cluster networks are Myrinet from Myricom, Inc. (http://www.myricom.com/), Scalable Coherent Interface (SCI) from Dolphin (http://www.dolphinics.com/), and QsNet from Quadrics (http://www.quadrics.com/). All three of these networks utilize standard PCI host interfaces, support larger installations (thousands of nodes), require software drivers and/or optimized MPI (Message Passing Interface) implementations to make use of the hardware.
Myrinet (ANSI/VITA 26-1998, http://www.myri.com/open-specs/) is a full-duplex packet-switched network. Each node has an interface card and is connected via fiber optic cables to cut-through crossbar switches that can scale to tens of thousands of hosts. Myrinet host interfaces execute a control program to interact directly with host processes (bypassing the operating system) and directly with the network to send, receive, and buffer packets.
Myricom’s General Messaging (GM) message passing system software is reported to show sustained, one-way data rates of about 245 MBytes/s between processes on different nodes, and short-message latencies as low as seven microseconds. MPI, VI (Virtual Interface Architecture), PVM (Parallel Virtual Machine), and other middleware layered on top of GM is available from Myricom and third parties. Myrinet/GM is supported for Linux, FreeBSD, IRIX, Solaris, Tru64, VxWorks, Mac OS, and Windows on various hardware platforms.
SCI (ANSI/IEEE 1596-1992, http://www.scizzl.com/) is a high speed point-to-point packet link protocol that uses ring or switched-star topologies. Most computational cluster configurations consist of a 2-D or 3-D torus distributed switch topology that eliminates the need for centralized switches. Network interface cards are interconnected point-to-point via multiple 16-bit parallel copper cables.
The software used for SCI clusters is produced by Scali Scalable Linux Systems (http://www.scali.com/), and is called WulfKit or the Scali Software Platform (SSP). SSP consists of tools and applications for configuring and monitoring the network as well as an implementation of MPI optimized for the high speed interface. Scali reports throughput of greater than 300 MBytes/s with latencies of less than four microseconds (for zero byte messages). Supported operating systems and platforms include Red Hat Linux on IA32, IA64, and Alpha; SuSE Linux on IA32; and Solaris on UltraSPARC II and III.
QsNet consists of interface cards connected via parallel copper cables to multistage switches that provide high bisectional bandwidth (no matter how you split the cluster in half, the transfer rate between the two halves is high). QsNet is the interface used by Compaq in its AlphaServer SC series supercomputers. The interface provides direct memory access transfer between processes and can perform I/O to and from paged virtual memory.
Quadrics’ Resource Management System (RMS) software provides monitoring, partitioning, and scheduling tools to manage large scale systems. Other software from Quadrics includes support for their own parallel filesystem and an optimized MPI implementation for Tru64 Unix and Linux. QsNet has a reported peak bandwidth of 340 MBytes/s with 2-5 microsecond latencies. Quadrics supports Linux on Intel and Alpha processors, as well as Tru64 on Alphas.
How Fast Is Fast?
While using different underlying protocols and topologies, all three of these networks offer similar performance with total bandwidth approaching that of the PCI bus itself. The relative merits and costs of these interconnects are frequently debated by Beowulf practitioners, but actual network performance — like computational performance — depends strongly on the application or model being run. Network benchmarks, while useful for gauging the capabilities of the hardware, tell little about the actual realized performance expected for any particular application. Nevertheless, benchmarks can suggest strategies for interprocess communication for particular hardware solutions.
To be a good test of network performance, a benchmark should use the same message passing software used by the target application. Real world models perform certain kinds of message passing usually using MPI or PVM; therefore, a good network benchmark should demonstrate performance using the same software layers and protocols. The most basic network benchmark shows point-to-point performance; others test collective communications performance. Commonly used network benchmarks are listed in Resources.
A typical point-to-point communication benchmark is a ping-pong test. In a ping-pong test, a message is sent from Process A to Process B, which subsequently sends the message back to Process A. Processes A and B should be run on separate nodes to test the interconnect. Because bandwidth and latency are dependent on message size, messages of various lengths are exchanged to understand the characteristic performance of any particular network.
|Figure Two: Latency on an SCI and Fast Ethernet network|
|Figure One: Bandwidth on an SCI and Fast Ethernet network|
A similar benchmark exchanges messages of various lengths simultaneously (using, for instance, MPI_Sendrecv()). Figure One shows bandwidth results of both ping-pong and exchange benchmarks run on an SCI network using Scali ScaMPI and on a Fast Ethernet network using MPICH. Figure Two shows latency results from the same tests.
In general, bandwidth increases with message size up to a point, after which it levels off or drops measurably. Therefore, large messages are considered more cost-effective than small messages. However, network latency also increases with message size. Both factors must be considered when attempting to optimize a message passing strategy for any particular high performance interconnect.
These results show that for messages of a reasonable length, the throughput offered by an SCI network is considerably greater than that from Fast Ethernet. Moreover, the latency for a message of any length is lower on SCI than on Fast Ethernet. In general, the MPI_Sendrecv() exchange provides higher bandwidth and lower latency than the two-part ping-pong exchange for most message sizes.
Results similar to these can be obtained using Myrinet, QsNet, or other high bandwidth, low latency interconnects (but performance curves will be different for each kind of interconnect). More complex network benchmarks can yield additional insight into the characteristic performance of these networks, but no benchmark can replace tests of real-world applications on a cluster. After all, throughput is the most important factor in selecting any computational solution. If you need more bandwidth for your cluster, check out these high performance networks.
Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at firstname.lastname@example.org.