Designing a Cluster Solution

Explore some of the factors involved in designing a cluster from scratch. A case study serves as a pedagogical example.
Over the last several months, “Extreme Linux” has reviewed a number of Linux-based cluster operating systems and toolkits. Each cluster distribution has advantages and disadvantages, and features from each could be rolled into a custom-designed toolkit to meet your own particular needs. Unlike proprietary closed source systems, freely-available open source tools offer the ultimate in customizability and utility.
This opportunity for customization extends, to a large extent, to commodity cluster hardware as well. Processors, disks, memory, and network interconnects are merely building blocks that can be combined and arranged in a way to maximize the performance of one or more target applications. Ideally, all clusters should be designed for an intended set of applications. Given the affordability of commodity components, multiple clusters with different configurations could be constructed to meet the diverse requirements of multiple applications in an organization.
Obviously, processor type, quantity, and speed are important, as is sufficient memory. Performance characteristics of both Intel and AMD 32- and 64-bit processors (the most common commodity processors) are all different, and the choice of processor and motherboard architecture are primary concerns for CPU-limited applications. In addition, cache size and speed can have an enormous impact on computational performance, and bandwidth to memory may be important for input- and output-intensive applications.
Another important factor in cluster design is the granularity of the parallel application. Coarse-grained problems require very little communication among parallel processes, while fine-grained problems require very tight coupling among processes and significant communications. As a result, coarse-grained applications tend to be far less latency dependent than fine-grained applications. Coarse-grained models are best served by commodity clusters; fine-grained codes warrant the purchase of expensive low latency, high bandwidth interconnects. The finest-grained models may have their best performance on large commercial supercomputers.
Most scientific models tend to fall somewhere between coarse and fine granularity. They usually require a fair amount of interprocessor communication to coordinate activities and share data. For example, values for cells on a map may depend on neighboring cell values. If the map is decomposed into two pieces, each being processed on a separate CPU, the processors must exchange cell values on adjacent edges of the map.
Problem decomposition — splitting up the space and/or time coordinates of the system being modeled so that calculations can be performed independently and simultaneously on multiple processors — determines the size of the cluster solution required, while the model granularity determines the network topology and interconnect required to meet the needs of the application. A good cluster system designer knows the intricacies of the target application codes — and may even be the developer.
Often overlooked are the storage input/output (I/O) requirements of the application. If all of the I/O is mediated by a master process, then a centralized Redundant Array of Inexpensive Disks (RAID) should do the job. However, if all processes need high-speed access to a large database at runtime, a set of distributed filesystems (using PVFS2, for example) may be required.
For even better I/O performance, shared Fibre Channel or InfiniBand access to a common set of disks may be appropriate. If, on the other hand, shared input data are static, replicating the entire database across a farm of I/O nodes on private networks distributed around the cluster is far less expensive. Although the latter solution is less elegant than having a centralized data server, the scheme is scalable to any sized cluster by simply adding I/O nodes as needed.

A Real-World Example

As an example, let’s look at a system designed and constructed for production use at Oak Ridge National Laboratory about a year ago. The first consideration is the application. In this case, the application was a terrestrial biogeochemistry model that had to be run at a high spatial resolution (about 100 square meters) for the entire continental United States.
Fortunately, this terrestrial model can be run independently for individual geographic cells, so it is a very coarse-grained application (sometimes called embarrassingly parallel). As a result, a Beowulf-style cluster built from commodity components can perform these simulations about as quickly as an expensive commercial supercomputer, and at roughly one-tenth the cost. Moreover, a low-latency interconnect is not needed for interprocess communication, since cells are simulated without regard to results from neighboring cells.
Unfortunately, this code is rather I/O-bound. The model is driven by 18 years of stored meteorological data, and these data are needed to “spin up” each cell (individually) for thousands of simulated years to reach a steady state equilibrium for modeled pools of carbon in the biosphere and soil. After the spin-up phase is complete, the model is run for the 18 years corresponding to the meteorological data set. The model outputs 18 years of biome productivity data, which are stored to disk for later statistical analysis.
A parallel wrapper was developed using Message Passing Interface (MPI) routines to orchestrate the processing of each map cell. The wrapper assigns groups of cells to each MPI process for simulation. The code — written in C, just like the model — was developed and tested on a small cluster of dual-processor, 32-bit Intel Xeon nodes accessing data from a RAID filesystem via the Network File System (NFS). Two MPI processes were run on each dual-processor node, and no shared memory parallelism was exploited.
Later, the model and wrapper were ported to 64-bit AMD Opteron systems. On dual-processor Opteron nodes, the model performance increased significantly, primarily because of the independent, high-bandwidth HyperTransport channels to memory. For this I/O-bound application, the AMD Opteron nodes were faster, cheaper, and cooler to run. The same may not be true for other applications; model testing on various processors is needed to make the optimum choice for your application.
Given the production turn-around time required for the model results and the estimated performance on dual-processor Opteron nodes, it was estimated that a 128-processor (64-node) cluster would be required. This estimate assumed no contention for the 250 GB meteorological data set. While an expensive I/O interconnect to a single copy of the data could have been purchased, it was cheaper to replicate the data onto multiple I/O nodes and continue using NFS to distribute the data.
Tests showed that NFS performance for this application scaled to about 32 processes or 16 dual-processor nodes. Therefore, four I/O nodes would be required to support the 64 compute node cluster. Individual Gigabit Ethernet I/O networks would interconnect each of the compute nodes to its respective I/O node.
As shown in the drawing in Figure One, the final cluster solution employs a hierarchical design consisting of four computational “constellations” (named after stellar constellations). Each constellation contains 16 dual-processor Opteron 246 compute nodes with 4 GB of PC3200 ECC DDR memory, plus a dual-processor Opteron 246 I/O node, also with 4 GB of memory, providing 1.0 TB of RAID5 disk capacity through a multi-channel Serial ATA (SATA) hardware RAID interface. Each constellation has its own Gigabit Ethernet switch (the red network), and the four switches are interconnected with each other through a back-plane cable at 32 Gb/s.
Figure One: A schematic drawing of a hierarchical cluster with four constellations of compute nodes designed for high resolution terrestrial biogeochemical simulations

A second Gigabit Ethernet network consisting of two cooperating switches is used to interconnect all the nodes (the yellow network). With this design, the input and output data are sent over the red network to the local I/O node; messages and other traffic pass over the yellow network. Each I/O node is individually connected to the routed, building-wide Gigabit Ethernet network (the white network).
To guard against failures and ensure timely delivery of results in this production environment, a hot spare drive was obtained for each RAID5 system and two extra compute nodes were purchased to serve as cold spares. Whenever a compute node shows signs that it may fail, one of the cold spares can be dropped in its place while the failed node is returned to the vendor for repair.
Using rack-mounted 1U compute nodes and 3U I/O nodes, two constellations fit in a 47U rack (90 inches tall) along with switches, a terminal server, and a pull-out flat panel display and keyboard. All four constellations fill two 47U racks, use eight 208-volt, 20-amp circuits, and require 7 tons of HVAC cooling. Each rack also has a set of fans on top and a blower at the bottom to push air up the front of the rack to help keep nodes cool.
Given this cluster system design, it is possible to enlarge this operation indefinitely (assuming adequate space, power and cooling are available). Because the meteorological data are replicated on each I/O node, additional constellations of the same design can be brought to bear as needed to meet shortened timelines.
Moreover, since the software wrapper for the model employs a master/slave organization in which blocks of map cells are assigned to compute nodes, new constellations containing newer and faster processors can be added without unbalancing the processing. The new faster constellation would simply do more than its share of the computational work, guaranteeing maximum performance even in a cluster being incrementally upgraded or grown.
Figure Two shows the completed cluster as it was delivered in November 2004. The system has performed better than expected for production terrestrial biogeochemistry simulations. It is a testament to careful system design and planning.
Figure Two: (Left) “Extreme Linux” columnist Forrest Hoffman stands next to the cluster; (right) a rear view of the cluster. (Photos courtesy of Oak Ridge National Laboratory.)

As this example shows, applying careful application performance and system analysis can result in the design of a cluster solution that meets all of your application’s requirements. While it may not be possible to anticipate all the requirements of a cluster, particularly in an academic environment, the factors discussed here should at least be considered so that certain capacity and capability expectations can be achieved.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov. Reference to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply an endorsement or recommendation. Only a careful analysis of your own computational usage and requirements can determine if a product is suitable for your needs.