Cluster Computing: Linux Taken to the Extreme

The free availability, high reliability, and relative efficiency of Linux has been a boon to computational science in the 1990s, and grows ever-popular with computational scientists everywhere. Using Linux, scientists have been able to turn off-the-shelf personal computers into effective UNIX workstations suitable for a number of tasks, including number-crunching for scientific models. Beowulf-style cluster computing -- pioneered by Thomas Sterling, Donald Becker, and others at NASA's Goddard Space Flight Center (http://www.beowulf.org) -- has extended the utility of Linux to the realm of high performance parallel computing. Additionally, the open source nature of Linux has allowed programmers to add features directly to the operating system to meet the unique needs of cluster computing. A collection of these enhancements is now distributed under the name Extreme Linux (http://www.extremelinux.org) -- "It's Hot and It's Cool" -- by Red Hat Software, Inc. (http://www.redhat.com).

Extreme Linux 1
These Guys Don’t Like Chairs: Authors Forrest Hoffman (left) and Bill
prefer lounging on their massive creation, the stone SouperComputer.

The free availability, high reliability, and relative efficiency of Linux has
been a boon to computational science in the 1990s, and grows ever-popular with computational
scientists everywhere. Using Linux, scientists have been able to turn off-the-shelf personal
computers into effective UNIX workstations suitable for a number of tasks, including
number-crunching for scientific models. Beowulf-style cluster computing — pioneered by Thomas
Sterling, Donald Becker, and others at NASA’s Goddard Space Flight Center (http://www.beowulf.org)
– has extended the utility of Linux to the realm of high performance parallel computing.
Additionally, the open source nature of Linux has allowed programmers to add features directly to
the operating system to meet the unique needs of cluster computing. A collection of these
enhancements is now distributed under the name Extreme Linux (http://www.extremelinux.org) — “It’s
Hot and It’s Cool” — by Red Hat Software, Inc. (http://www.redhat.com).

We became involved in cluster computing in the summer of 1997 by heading a proposal to fund the
construction of a then-large Beowulf-style cluster of new PCs. The proposed cluster would support
the development of parallel environmental applications, and would be used for research runs by Oak
Ridge National Laboratory staff. The proposal itself was rejected. But with significant effort
already expended toward the design of a new high-resolution landscape ecology application, we
decided to continue building, using the resources that were available: surplus Intel 486 PCs
destined for salvage.

Commandeering a nearly-abandoned computer room and scavenging as many surplus machines as
possible — from Oak Ridge National Laboratory, the Y-12 production plant, and the former K-25 site
(all federal facilities in Oak Ridge, Tennessee) — we setup a “chop shop” to process machines, and
by September 16, 1997, had a functional parallel computer system built out of no-cost hardware.

Aptly named the Stone SouperComputer, after the children’s fable entitled Stone Soup, the highly
heterogeneous cluster grew slowly to 126 nodes as surplus PCs became available. The nodes, arranged
in back-to-back rows on the raised floor of the old computer room, contain a wide variety of
motherboards, processors (of varying speed and design, primarily 66 MHz 486DX2s, some 100 MHz
486DX4s, and 16 Pentiums), controllers, disk drives, and cases (see Figure 1). Each node has at
least 20 MB of memory (most have 32 MB), at least 400 MB of disk space (for booting and local file
access), and is connected to a private 10 MB/s Ethernet network for inter-node communications. The
first node in the cluster is also connected to the external network through a second Ethernet card
for logins and file transfers.

Extreme Linux 2
It’s (Not) Dead, Jim: Note the “toe tags” that describe each computer’s
“illness,” and the crash carts for recovering sick nodes or building new ones.

We’re always on the lookout for unused PC hardware which can be added to the Stone
SouperComputer. When a surplus PC is collected, it undergoes a diagnostic process to determine if it
can be made into a node. In homage to the almost Frankensteinian process by which we’ve constructed
our lab, we use a wealth of emergency room metaphors to describe the steps in this process, which we
call, appropriately enough, “triage.” Component parts taken from a large number of “organ donor”
machines in a “morgue” are combined to build a node meeting the minimum criteria for inclusion in
the collective machine. We’ve developed a “triage disk” containing utilities which can be used to
boot prospective nodes, determine their CPU characteristics, and configure their network cards.
Masking tape “toe tags” on the top of each CPU case allow for easy identification of internal
components. And because nodes are “headless” (no keyboards or monitors), a pair of “crash carts” –
each containing a monitor, a keyboard, a triage disk, and Linux boot disks — are wheeled around to
“sick” nodes for diagnosis or to load Linux onto new nodes (see Figure 2).

If a new machine or an existing node offers resistance — if it behaves erratically, becomes
unstable, or exhibits signs of hardware failure, we cannibalize it, thereby making it an “organ
donor”, and then retire it to the “morgue”. In this world, hardware is disposable and no maintenance
contracts are required. As new versions of Microsoft Windows are released, better hardware becomes
available for assimilation into the cluster, since users must frequently buy new desktop PCs in
order to run the latest release. Staying just behind the curve means the Stone SouperComputer will
have an endless supply of upgrades.

Red Hat Linux is loaded onto new or replacement nodes over the private Ethernet network from the
first node in the collective. Unlike the configuration of some Beowulf clusters in which the root
filesystem is NFS-mounted from a central server, a complete operating system is loaded onto every
node of the Stone SouperComputer so that we minimize unnecessary traffic on the private network and
give each node some autonomy. Swap partitions twice the size of real memory are created on the local
disk of each node for the fastest virtual memory access. After the base operating system is loaded
onto a node, an “assimilation script” is executed. This loads additional components and configures
the node for coordinated cluster operation. While some large disks scattered throughout the cluster
are NFS-mounted onto every node, each node has its own work disk which is used by parallel
applications for local storage. This requires additional work when setting up each node, but results
in much better application performance. Large disks are used for long term storage of model results,
or for large input datasets which cannot easily be distributed or held in memory.

The Story of Stone Soup

Once upon a time, somewhere in post-war Eastern Europe, there was a great famine in which people
jealously hoarded whatever food they could find, hiding it even from their friends and neighbors.
One day a wandering soldier came into a village and began asking questions as if he planned to stay
for the night.

“There’s not a bite to eat in the whole province,” he was told, “Better keep moving on.”

“Oh, I have everything I need,” he said. “In fact, I was thinking of making some stone soup to
share with all of you.” He pulled an iron cauldron from his wagon, filled it with water, and built a
fire under it. Then, with great ceremony, he drew an ordinary-looking stone from a velvet bag and
dropped it into the water.

By now, hearing the rumor of food, most of the villagers had come to the square or watched from
their windows. As the peddler sniffed the “broth” and licked his lips in anticipation, hunger began
to overcome their skepticism.

“Ahh,” the soldier said to himself rather loudly, “I do like a tasty stone soup. Of course, stone
soup with cabbage — that’s hard to beat.”

Soon a villager approached hesitantly, holding a cabbage he’d retrieved from its hiding place, and
added it to the pot. “Capital!,” cried the soldier. “You know, I once had stone soup with cabbage
and a bit of salt beef as well, and it was fit for a king.”

The village butcher managed to find some salt beef… and so it went, through potatoes, onions,
carrots, mushrooms, and so on, until there was indeed a delicious meal for all. The villagers
offered the soldier a great deal of money for the magic stone, but he refused to sell, and traveled
on the next day. The moral is that by working together, with everyone contributing what they can, a
greater good is achieved.

We have encountered very few problems in the construction and operation of the Stone
SouperComputer. Red Hat Linux is extremely stable, and nodes rarely crash or drop out due to
software problems. Occasionally, as expected with such a large number of machines, nodes experience
hardware failures — most often a hard disk crash or a fried Ethernet card. Such nodes undergo an
“organ transplant”, are closed back up, and are slid back into line.

Management of such
a large, heterogeneous cluster poses many challenges. In order to keep the machine synchronized as
users, NFS disks and software are added or removed, and a script is run by each node every hour.
This script does a number of housekeeping tasks, including updating the password and group files,
checking for new NFS disk resources to mount, creating directories on the local work disk for users,
copying configuration files, and executing any other series of commands which are added to the
script kept on the master node. Script execution on the other nodes is staggered in time so that all
the nodes are not trying to get information from the master node simultaneously. While it isn’t
elegant, this method of keeping nodes in sync is fairly efficient and eliminates the need for
multiple daemons which would have to run continuously. Cluster management tools are now being
developed in many Computer Science departments to assist with these routine tasks.

System backups are performed only on home directories and NFS filesystems which contain valuable
datasets. Since local work disks on nodes are used only for temporary storage, these partitions are
not backed up. When these disks fail, we merely replace, reformat, and reinstall them.

The most unusual problem we’ve encountered is an intermittent loss of performance on certain 486
systems with Northgate or Mylex BIOS versions 6.xx on the motherboard. This problem drove us to
develop a very simple benchmark, which performs many multiplication operations, just to check CPU
performance on a regular basis. While most machines in the cluster run this benchmark in
approximately the same amount of time on every execution, these machines “flip” between a normal
user time and a much longer user time — up to 14 times slower. These unusual nodes alternate
between normal and slow execution apparently at random, and never exhibit intermediate execution
times. Rebooting these nodes does not necessarily improve the sluggish behavior of the machines, and
within 24 hours all nodes can be observed to operate slowly at least once. Fortunately, we have
never experienced this kind of problem with AMI BIOS 486 motherboards or any kind of Pentium
motherboard. We no longer include the questionable motherboards in the Stone SouperComputer.

By tracking system performance in compiling and running the benchmark, we can identify realized
CPU speeds and, to a lesser extent, I/O speeds. In this heterogeneous environment, these speeds fall
into a few recognizable tiers representing the various CPU types (66 MHz 486s, 100 MHz 486s, 90 MHz
Pentiums, 120 MHz Pentiums, 166 MHz Pentiums, etc.). Small differences within these tiers reflect
configuration differences or minor hardware differences. The benchmark results are often useful in
identifying poorly configured machines, and are also used to generate a list of nodes sorted by
approximate performance from fastest to slowest. Parallel applications use this “machines” list to
choose which nodes will be used for model runs. Since the list is sorted by speed, if ten nodes are
desired for a particular run, the top ten performers are automatically selected and used.

Software and Applications

The Stone SouperComputer uses the GNU C (gcc), C++, and FORTRAN (g77) compilers, as well as a
commercial Pro Fortran compiler provided by Absoft Corporation (http://www.absoft.com). For
inter-node communication within parallel applications, we use either PVM (Parallel Virtual Machine,
(http://www.epm.ornl.gov/pvm) or MPI (Message Passing Interface, (http://www.mcs.anl.gov/mpi).
These freely-available APIs allow the programmer to define the set of cluster nodes that will be
used for an application, and to make simple function calls in order to pass data between nodes
during computation.

Because ecological applications tend to be statistical in nature, they are particularly amenable
to parallelization. This means that these kinds of problems can often be decomposed into smaller
pieces which are solved independently. As a result, we frequently write parallel applications in a
master/slave organization in which a single node (the master) distributes the work to many slave
nodes. The analogy of a card game is useful here. The master node acts as a card dealer and the card
deck represents the entire problem which is divisible into smaller parcels of work — the cards.
This kind of organization is particularly good for a heterogeneous cluster of nodes if the work is
split into many more parcels than there are players (slave nodes), because as players finish
processing their “cards”, they return to the dealer for another “hit”. This results in dynamic load
balancing because the faster nodes end up doing more work than the slower nodes, and the problem
finishes much sooner than it would if the work were evenly distributed to all nodes. Moreover, the
heterogeneity works in our favor because no two machines will finish at exactly the same time. This
reduces network contention for communications with the master.

We generally use the Single Program Multiple Data (SPMD) method for writing parallel codes. This
means that a single program is run on all nodes, but the program branches to different instructions
depending on whether it is being run by a master or a slave. SPMD codes are usually easier to write,
run, and debug than codes which use separate programs for master and slave operations. We prefer
writing parallel codes in C using MPI for message passing. Codes developed on the Stone
SouperComputer are directly portable to even the largest and most expensive parallel computers
because C and MPI are available on all such systems.


Since its creation, the Stone SouperComputer has been successfully used for applications as
diverse as temperature and humidity interpolation, multivariate geographic clustering,
individual-based fish simulation modeling, finite-element groundwater simulation, and continental
vegetation modeling. A few of these applications are not inherently parallel, but because they must
be run on multiple datasets or multiple times with varying parameters, they benefit from the
availability of many nodes on this parallel system. Although denigrated by some, this mode of
operation requires virtually no additional coding time and enjoys nearly perfect linear speedup.

While not offering the performance of newer Beowulf clusters, this machine is an excellent
platform for developing parallel models which will port directly to other systems, and for easily
solving parallelizable problems like multivariate geographic clustering
(http://www.esd.ornl.gov~hnw/esri98). The Stone SouperComputer has proven to be a fast, cheap, and
robust scalable parallel machine which is dedicated to our ecological applications and controlled by
our own priorities. It has inspired a number of small universities and colleges to build their own
clusters using existing equipment. Working with a highly heterogeneous machine where node memory is
tight forces one to produce clean, efficient, load-balancing, fault-tolerant parallel code, and the
advantages of being forced into good coding practices are many.

We continue upgrading the Stone SouperComputer as newer surplus hardware becomes available, and
expect to add new hardware in the near future. In addition, new environmental applications are being
developed for the system. The latest description and photos of the machine as well as more
information about its applications are available at http://www.esd.ornl.gov/facilities/beowulf.

Commodity cluster computing is moving into the mainstream. Today hardware vendors — including
VA Research (http://www.varesearch.com) and others — are offering a range of commodity computer
systems featuring the Linux operating system, and many are even selling fully configured parallel
cluster systems. These offerings, combined with the availability of easy-to-use tools and the
increasing popularity of the Linux operating system, mean that grass-roots cluster computing is now
accessible to industry as well as academia.

Forrest M. Hoffman is a computer specialist in the Environmental Sciences Division at Oak Ridge
National Laboratory in Oak Ridge, Tennessee.
William W. Hargrove is a member of the research faculty
at the University of Tennessee’s Energy, Environment, and Resources Center, serving on contract to
the Oak Ridge National Laboratory’s Geographic Information and Spatial Technologies

[*] Oak Ridge National Laboratory, managed by Lockheed Martin Energy Research Corp. for the U.S.
Department of Energy under contract number DE-AC05-96OR22464. “The submitted manuscript has been
authored by a contractor of the U.S. Government under contract No. DE-AC05-96OR-22464. Accordingly,
the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the
published form of this contribution, or allow others to do so, for U.S. Government

Comments are closed.