Since presenting a case study for a low-cost cluster in the March column (available online at http://www.linux-mag.com/2005-03/extreme_01.html), the "Extreme Linux" column has been focusing on Linux-based cluster software distributions suitable for such a system. While the modest four-node cluster (which cost only about $1,600 to build) isn't very powerful, it is well-suited to testing clustering technologies and to developing scalable parallel software applications and models. This month, let's round out our discussion of cluster software with a look at Scyld Beowulf, a commercial Linux cluster distribution.
While a majority of enthusiasts and some researchers prefer completely free cluster software toolkits, businesses and many research labs may be better served by a commercially developed, tested, and supported alternative. Having a vendor to call is important to organizations with limited in-house expertise, and turn-key cluster solutions can deliver the power of high-performance computing to those who might not otherwise be able to afford to deploy the technology.
While a variety of commercial cluster toolkits and distributions have popped up in recent years, the first notable and best-known commercial package is Scyld Beowulf from Scyld Software (http://www.scyld.com/). Scyld Software was founded in 1998 by Donald Becker, a member of the team at NASA's Goddard Space Flight Center that created the first commodity-based, high-performance cluster computer for the original Beowulf project. Since its initial release, the Scyld Beowulf product has matured, and the company has been acquired by Penguin Computing, Inc. As Founder and Chief Scientist, Don continues to lead research and development at Scyld Software.
Scyld Beowulf Series 29 software is based on the same source used in Red Hat Enterprise Linux ES, with a 2.4 kernel on the master node and a lightweight operating system on compute nodes. Only the master node requires a full Linux distribution, and the master provides a single point of management for the entire cluster. Scyld Beowulf provides a unified process space for the entire cluster on the master node using BProc. As such, Scyld Beowulf is architecturally most similar to the Clustermatic Linux distribution described in the April column (http://www.linux-mag.com/2005-04/extreme_01.html).
Scyld Beowulf runs on Intel Pentium 4, Xeon, and Xeon EM64T, as well as AMD Opteron platforms, and it supports Ethernet and Gigabit Ethernet interconnects. Bundled with the product are the MPI parallel libraries (a version of MPICH modified to work with BProc, along with LAM 7.1.2) and PVM, the GNU compilers (C, C++, and Fortran 77), the PVFS cluster file system (http://www.linux-mag.com/2002-07/extreme_01.html), and the Ganglia system monitoring package (http://www.linux-mag.com/2003-08/extreme_01.html). An overview of the Scyld Beowulf software components is shown in Table One. In addition, commercial compilers and libraries supporting other interconnects (such as Myrinet) can be added to the system.
Table One: Scyld Beowulf software components

beosetup: A graphical user interface for configuring the cluster
beofdisk: A utility for remote partitioning of compute node hard disks
beoboot: A set of utilities for booting compute nodes
beoserv: The BeoBoot server, which runs on the master node
bproc: The Beowulf distributed process space package
bpmaster: The BProc front-end daemon, which runs on the master node
bpslave: The BProc compute node daemon
bpstat: A program to display cluster status information
bpctl: A program to control node ownership and node state
bpsh: A program to execute commands on a compute node
bpcp: A program to copy files between nodes
mpich: A version of the Message Passing Interface modified to work with BProc
mpprun: A parallel job creation package
beostatus: A graphical user interface for monitoring the status of the entire cluster
beostat: A text-based tool for monitoring the status of the entire cluster
Scyld Beowulf is distributed on a set of CD-ROMs or on a DVD. The first disc is used to install the operating system on the master (or head) node of the cluster, while the second disc may be used as a boot CD-ROM for compute nodes. Compute nodes are preferably booted over the network (using the Preboot eXecution Environment, or PXE), or from a boot floppy or CD-ROM if the nodes do not have PXE boot firmware. Alternatively, if hard disks are available on the compute nodes, they can be partitioned and loaded with a boot image that initiates the BeoBoot process.
When the new master node is booted from the first Scyld Beowulf disk, the operating system may be installed in either graphical or text mode. The installation program is based on the Anaconda installer used by Red Hat and other Linux distributions. Since the master node is basically a complete Linux system with some additional services to support compute nodes and parallel programs, the installation process looks very much like installing Red Hat Enterprise Linux or any other distribution.
In the installation options, a pre-defined “Beowulf Master Node Installation Type” can be used for automatic package selection. It installs the Scyld Beowulf Cluster Operating System and the GNOME X Window System environment. Disk partitioning can be performed with Disk Druid or fdisk, and either Grub or LILO may be chosen as the bootloader.
The master node typically has two Ethernet interfaces: one connects to the normal routed network, and the other connects to the private network containing only cluster nodes. In the network configuration portion of the installation, the first interface (usually eth0) can be set to use DHCP if it's supported on the routed local area network, but the second interface, connected to the private cluster network (usually eth1), should be set to a static IP address. This should be a private, non-routable address, usually from the 192.168.0.0/16 or 10.0.0.0/8 address blocks.
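As a sketch, a static configuration for the private interface under Red Hat-style network scripts might look like the fragment below; the specific address and file contents here are illustrative assumptions, not values from the Scyld documentation.

```
# /etc/sysconfig/network-scripts/ifcfg-eth1 (illustrative)
DEVICE=eth1
BOOTPROTO=static
IPADDR=192.168.1.1      # private, non-routable address for the cluster network
NETMASK=255.255.255.0
ONBOOT=yes
```

The public eth0 interface, by contrast, would typically use BOOTPROTO=dhcp when the routed network provides it.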
In the firewall configuration, the private Ethernet device (usually eth1) must be set up as a trusted device. In general, the public Ethernet device should not be a trusted device, since it can receive malicious traffic. However, DHCP and SSH should be selected in the "Allow Incoming" box so that the master node can obtain its public IP address (if using DHCP) and so that users can log in to the master node to compile and run programs. The remaining steps of the installation are standard for any Linux distribution.
Once the master node is installed and running, compute nodes can be booted and provisioned into the cluster very rapidly. The easiest and fastest way to boot compute nodes is using PXE; however, if the nodes do not have boot ROMs, Scyld provides the BeoBoot system for creating boot media such as floppy disks or CD-ROMs. Creating this media and configuring compute nodes into the cluster is accomplished using the BeoSetup cluster configuration tool.
BeoSetup is a graphical user interface (started by typing beosetup) for configuring and controlling a Scyld Beowulf cluster. The root user may use it to monitor cluster node state, run commands on nodes, and read node logs. Node floppy disks and CD-ROMs can be created by clicking on the appropriate buttons, entering the correct drive and kernel boot flags in the pop-up window, and clicking OK. This media can then be loaded into the compute nodes that need it.
Next, compute nodes should be booted in the order in which they should be numbered (so that they do not have to be rearranged in the BeoSetup list). In BeoSetup, compute nodes will be listed by MAC address in the order in which they contact the master node. These addresses will be listed in the box labeled “Unknown.” From there they may be dragged over to the node list box to incorporate them into the cluster.
The list of compute nodes will include the ID, MAC address, IP address, and state of the node, as well as the user, group, and mode of the resource. When all the compute nodes are properly listed with IP addresses in the node list, click the “Apply” button for changes to take effect. This saves the changes to the configuration file and signals the Beowulf daemons to re-read the configuration file.
If compute nodes have hard disks that are already partitioned, then the /etc/beowulf/fstab file on the master node should be updated to record the mapping of the partitions to filesystems. If compute node hard disks haven’t been partitioned, default or custom partitions may be created from the master node using beofdisk.
Finally, as the root user on the master node, reboot all of the compute nodes as follows:
$ bpctl -S all -R
The status of each compute node should be updated in the BeoSetup window. All nodes should show a status of “up” when they have successfully rebooted.
Once the cluster is installed and running, a few checks should be performed to ensure that it’s operating correctly. The bpstat command displays status information for each compute node in the cluster, as shown in Figure One:
[root@bladerunner root]# bpstat
Node(s) Status Mode User Group
4-31 down ---------- root root
0-3 up ---x--x--x root root
Here, nodes 0 through 3 are up while nodes 4 through 31 are down. Any nodes listed with a status of boot are in the process of booting while nodes listed with a status of error have experienced a problem with initialization and are not available for use.
The number of nodes expected in the cluster is determined from the IP range ("iprange") specified in BeoSetup. Configuring a large range (for example, one big enough for 32 nodes) doesn't require you to actually have that many nodes in your cluster.
Another useful tool for cluster monitoring is the Beostatus graphical user interface program. It shows current CPU utilization, memory and swap utilization, disk consumption, and network bandwidth for every node in the cluster (including the master node). Beostatus may be run by typing beostatus at a shell prompt.
Figure Two shows an example of the Beostatus monitoring interface. Six dual-processor nodes are up, including the master node, which is listed as Node -1 at the top. Nodes 5 through 31 are down. The system is quiescent (nothing is running); only the master node shows any memory and disk consumption, and the compute nodes have no swap space available.
Figure Three shows the Beostatus display for the same cluster with a 10-processor parallel job running across all five compute nodes. The CPUs are all 100% busy except for CPU0 on node 0, which is partially involved in some network operation as evidenced by the activity shown in the network column.
Using the Cluster
Now that the cluster appears to be working, let's see how to build and run MPI codes on it. A modified version of Argonne National Laboratory's MPICH is usually used on Scyld Beowulf clusters. Programs are compiled in the standard way using mpicc, mpif77, and so on. Running programs on the Scyld cluster is somewhat non-standard, however, because a job mapper is used to "map" tasks onto compute nodes. First, compile the program:
[forrest@bladerunner forrest]$ mpicc -O -o hello-world hello-world.c
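The source of hello-world.c isn't shown in the column; a minimal MPI program that produces this kind of output (the exact source here is an assumption, not the author's code) would look something like:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* start up the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */
    MPI_Get_processor_name(name, &len);     /* host name (".0", ".1", ... on compute nodes) */

    printf("Hello world! I'm rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```

Each task prints its rank, the total task count, and the name of the node it landed on, which is how the mapping examples below can be observed.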
Instead of using mpirun to execute the program on the Scyld Beowulf cluster, we simply set the NP environment variable to the number of desired processes/tasks and execute the program as follows:
$ NP=4 ./hello-world
Hello world! I’m rank 0 of 4 on bladerunner.ornl.gov
Hello world! I’m rank 3 of 4 on .2
Hello world! I’m rank 2 of 4 on .1
Hello world! I’m rank 1 of 4 on .0
As you can see, the first process ran on the master node while the remaining processes ran on nodes 0, 1, and 2. The mapper figures out how to distribute tasks across nodes, minimizing the number of tasks on each node. To run the code without using the master node, the NO_LOCAL environment variable may be set as follows:
$ NP=4 NO_LOCAL=1 ./hello-world
Hello world! I’m rank 2 of 4 on .2
Hello world! I’m rank 3 of 4 on .3
Hello world! I’m rank 1 of 4 on .1
Hello world! I’m rank 0 of 4 on .0
The four tasks were mapped only onto compute nodes.
To run an MPI code filling the entire cluster (including the master node), the ALL_CPUS environment variable can be set, eliminating the need to specify the number of processes altogether. For example, hello-world can be run on every CPU as follows:
$ ALL_CPUS=1 ./hello-world
Hello world! I’m rank 0 of 10 on bladerunner.ornl.gov
Hello world! I’m rank 1 of 10 on bladerunner.ornl.gov
Hello world! I’m rank 9 of 10 on .3
Hello world! I’m rank 8 of 10 on .3
Hello world! I’m rank 6 of 10 on .2
Hello world! I’m rank 5 of 10 on .1
Hello world! I’m rank 4 of 10 on .1
Hello world! I’m rank 3 of 10 on .0
Hello world! I’m rank 2 of 10 on .0
Hello world! I’m rank 7 of 10 on .2
Additional environment variables include ALL_LOCAL, which runs every process on the master node (for debugging purposes); EXCLUDE, which when set to a colon-delimited list of nodes causes the mapper to avoid the specified nodes during assignment; and BEOWULF_JOB_MAP, which allows you to specify an explicit list of nodes (colon-delimited) on which to run the tasks.
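Since EXCLUDE and BEOWULF_JOB_MAP both take colon-delimited node lists, a small shell check can verify that a list is well-formed before a run. The helper below is purely illustrative; only the colon-delimited format comes from the column.

```shell
#!/bin/sh
# Illustrative sanity check of a colon-delimited node list, the format
# expected by the EXCLUDE and BEOWULF_JOB_MAP mapper variables.
nodes="0:0:2:2"
ok=yes
for n in $(printf '%s' "$nodes" | tr ':' ' '); do
    case "$n" in
        ''|*[!0-9]*) ok=no ;;   # each field must be a numeric node ID
    esac
done
echo "list $nodes valid: $ok"
```

A list that passes this check could then be exported, for example as BEOWULF_JOB_MAP="$nodes", before launching the program.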
Is Scyld the Right Solution for You?
The Scyld Beowulf Cluster Operating System offers a nice, easy-to-use and easy-to-manage platform for running parallel programs on commodity clusters. Since it’s a commercial offering, technical assistance and support are available through Scyld Software, making it easier for many small- to mid-sized organizations to take advantage of commodity cluster technology.
If you aren’t confident enough to go solo with the other cluster software distributions discussed in previous columns (Clustermatic, Rocks, and OSCAR), a little professional help is the perfect prescription.
Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at