Clustermatic: A Complete Cluster Solution

The Clustermatic Linux distribution, produced by the Cluster Research Lab at Los Alamos National Laboratory, is a collection of software packages that provides an infrastructure for small- to large-scale cluster computing. Consisting of LinuxBIOS, BProc, and a job scheduler, Clustermatic 5 (released in November 2004) runs on 32- and 64-bit x86 platforms, as well as the 64-bit PowerPC processor. A number of large Linux clusters are running Clustermatic, including the 2,816 AMD Opteron processor cluster called “Lightning” at Los Alamos.

This month, let’s look at Clustermatic and build a small cluster using the software.

LinuxBIOS

LinuxBIOS (http://www.linuxbios.org) replaces the system BIOS with a little bit of hardware initialization code and a compressed Linux kernel that can be booted directly from a cold start. LinuxBIOS allows the operating system to control a cluster node from power on, and bypasses the proprietary, antiquated, slow, and often buggy BIOS in common use today. As a result, a cluster node can be booted and operational in as little as three seconds!

LinuxBIOS gunzip s a small Linux kernel straight out of non-volatile RAM (NVRAM), then loads a full kernel over Ethernet, Myrinet, Quadrics, SCI, or from some other device. Even this “full” kernel can be rather lightweight: nodes can be as simple as just a CPU and memory, with no hard disks, no floppies, and no filesystems, making for fast and efficient compute nodes with little autonomy.

A variety of motherboards are known to work with LinuxBIOS. (A list of these is available on the LinuxBIOS web site.) Additionally, a number of cluster integrators support LinuxBIOS, including Linux Networx, which built the 11.2 teraflop per second “MCR Cluster” at Lawrence Livermore National Laboratory using LinuxBIOS and Red Hat Linux.

Flashing a Disk on Chip requires some skill with hardware and the right motherboard. Alternatively, a flash burner can be used to put the LinuxBIOS image into firmware. Again, instructions for a few hardware combinations are available on the web site. Some of these procedures involve potential risks to hardware and even to personal safety, so take care when working with energized equipment. (And if you're not comfortable handling sensitive electronic components, you shouldn't attempt to install LinuxBIOS onto PROMs or NVRAM yourself. Instead, hire a qualified electronics technician or find a systems vendor to burn and install the desired ROM image.)

Fortunately, LinuxBIOS isn’t required to get a Clustermatic cluster going. Nodes can be booted over the network or from CD or floppy media after the traditional BIOS has done its thing.

BProc

BProc, the Beowulf Distributed Process Space (http://bproc.sourceforge.net), was introduced recently in the February 2005 “Extreme Linux” column (available online in May 2005 at http://www.linux-mag.com/2005-05/extreme_01.htm). It provides a single process space (akin to a single system image, or SSI) across an entire cluster, meaning that all application processes show up in the process table of the master node and can be controlled directly from the master.

BProc consists of a set of kernel patches, kernel modules, master and slave daemons, and utility programs used to start, migrate, and manage application processes across an entire cluster. In addition, a library of BProc system calls is available for controlling process migration and performing a variety of functions on cluster nodes.

February’s column included installation instructions for BProc and introduced the utilities used to start programs on nodes (bpsh), copy files between nodes (bpcp), and check the status of nodes (bpstat). When using the Clustermatic distribution, building BProc separately isn’t necessary, as it’s provided as a set of RPMs along with a modified kernel RPM and patched kernel sources. Also included are beoboot, software for booting and configuring cluster nodes; beonss, a node nameservice; bjs, the BProc job scheduler; and mpich, a free MPI implementation modified to work with BProc.
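
For a quick refresher, typical invocations look something like this (a sketch; the node number and file paths here are hypothetical):

[root@master1 root]# bpsh 0 uname -r                  # run a command on node 0
[root@master1 root]# bpcp input.dat 0:/tmp/input.dat  # copy a file to node 0
[root@master1 root]# bpstat                           # show the status of all nodes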

Installing Clustermatic

The easiest way to get Clustermatic up and running is to download the ISO CD-ROM image from the Clustermatic website (http://www.clustermatic.org/) and burn it onto media. The Clustermatic 5 disk, the latest release as of this writing, contains all of the kernel and software package RPMs and SRPMs for the x86, x86_64, ppc, and ppc64 platforms. These packages should be installed on a functioning Linux distribution on the machine that’s intended to be the master node.

As an example, let’s install Clustermatic on an x86 system running Fedora Core 2.

First, you need a modified kernel that supports BProc. The i686 kernel package provided on the Clustermatic CD includes BProc and has SMP and 64 GB memory support built in, but it may lack some of the features of kernels from various distributions. For standard x86 systems, especially those running Fedora Core, from which the Clustermatic kernel was derived, the stock Clustermatic kernel should be fine. However, the patched kernel source is provided (in the noarch/ directory) in case you need to build a custom kernel. The PowerPC kernel is built for the Power4/970/G5.

The new kernel can be installed without removing the existing kernel on the system by typing:

[root@master1 i686]# rpm -ivh kernel-2.6.9-cm46.i686.rpm

Unlike normal Fedora kernel RPMs, this RPM does not create initrd images or reconfigure the boot loader. These steps, which are different for different Linux distributions, must be performed manually.

On Fedora, you can build the initrd image as follows:

[root@master1 root]# /sbin/mkinitrd /boot/initrd-2.6.9-cm46 2.6.9-cm46

Next, edit /etc/grub.conf to point to the new kernel by appending the following lines:

title Clustermatic (2.6.9-cm46)
root (hd0,0)
kernel /vmlinuz-2.6.9-cm46 ro root=LABEL=/ rhgb quiet
initrd /initrd-2.6.9-cm46

If you want to boot this kernel by default, change the default line in /etc/grub.conf to default=1 (assuming the Clustermatic entry is the second entry in that file). Next, reboot the system to load the new kernel, log in, and type uname -r to verify that the 2.6.9-cm46 kernel is running. (Similar installation instructions for SuSE and Yellow Dog Linux are provided in the README on the CD.)
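
If all went well, the check should look something like this:

[root@master1 root]# uname -r
2.6.9-cm46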

Next, the BProc and associated RPMs should be loaded (from the i586/ directory on x86 systems) as follows:

[root@master1 i586]# rpm -ivh b*.rpm m*.rpm

This installs the beoboot, beonss, bjs, bproc, bproc-devel, bproc-libs, and mpich-p4 packages.

Now that all of the packages are installed, the system must be configured. Edit /etc/clustermatic/config to establish your configuration. The interface directive must refer to the correct network interface(s) on which the BProc master daemon should listen. The default port can be changed using the bprocport directive. A master directive line should be included for each master node in the cluster, and a range of IP addresses for cluster nodes can be specified using the iprange directive. The slave node boot file should be specified with the bootfile directive. A list of libraries to export to slave nodes can be specified with the librariesfrombinary and libraries directives. Finally, a series of node directives specifying node numbers and MAC addresses should be included so that beoboot will respond to the nodes' RARP requests. The node list can be added later using the nodeadd utility.

Listing One contains an example configuration file (sans comments).

Listing One: A sample /etc/clustermatic/config file (without most comments) for a cluster with three slave nodes

interface eth1
master master1
iprange 0 10.0.2.2 10.0.2.4                               # Nodes 0-2 have addresses from this range.
bootfile /var/clustermatic/boot.img
librariesfrombinary /bin/sleep /bin/ps /bin/ping /bin/ls  # get libc, resolver
libraries /usr/lib/libstdc++* /usr/lib64/libstdc++*       # C++ support
libraries /usr/lib/libbproc.so* /usr/lib64/libbproc.so*   # BProc, of course.
libraries /lib/libnss_bproc* /lib64/libnss_bproc*         # BProc resolver
libraries /lib/libnsl*                                    # Added by Forrest for portmap for NFS support
node 0 00:e0:4c:c3:d3:32
node 1 00:e0:4c:c3:ca:06
node 2 00:e0:4c:c3:d8:18

The boot images for the slave nodes should be created next. Clustermatic uses its own network boot scheme, at least for x86 and x86_64. Booting occurs in two phases. The phase 1 image can be loaded from firmware (using LinuxBIOS), over the network using PXE, or from CD or floppy media. The phase 1 kernel then downloads the phase 2 image, usually from beoserv, the beoboot image server.

The Clustermatic CD can be used to boot slave nodes, and burning a handful of CDs may be the simplest way to get a small test cluster up quickly. The CD contains the phase 1 boot image for x86 and x86_64, and it can be used to load the phase 2 boot image for ppc64. In fact, the CD can be booted in a laptop to provide another slave node in a pinch without affecting the hard disk or operating system on the laptop. When you’re done, simply eject the CD and reboot to get your normal laptop back.

Assuming the slave nodes use a CD to boot phase 1, only a phase 2 image is needed. To create a phase 2 boot image of the currently running kernel, use the beoboot program:

[root@master1 root]# beoboot -2 -n

beoboot supports a wide variety of options for creating boot images. For example, special kernel command-line arguments can be added using the -c option:

[root@master1 root]# beoboot -2 -n -c "console=ttyS0 apm=off"

The phase 2 boot image is written to /var/clustermatic/boot.img, which was specified in the configuration file in Listing One.

At this point, Clustermatic (and the BProc job scheduler) should be started either by rebooting the master node or by executing the startup scripts:

[root@master1 root]# /etc/init.d/clustermatic start
[root@master1 root]# /etc/init.d/bjs start

If the MAC addresses of nodes weren't included in the configuration file, these can be added by phase 1 booting the slaves and running /usr/lib/beoboot/bin/nodeadd on the master. As nodes are detected, they're added to the configuration file. If the -a option is specified, nodeadd automatically sends HUP signals to beoserv to force it to re-read the configuration file as nodes are added. Otherwise, beoserv should be signaled manually as nodes are added:
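
For example, a minimal run on the master might look like the following (a sketch; see the nodeadd usage message for other options):

[root@master1 root]# /usr/lib/beoboot/bin/nodeadd -a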

[root@master1 root]# killall -HUP beoserv

Using the Cluster

Once the master node is configured, the Clustermatic daemons are running, and the slave nodes are booted, the status of the cluster can be reported using bpstat.

[root@master1 root]# bpstat
Node(s)   Status   Mode         User   Group
2         down     ----------   root   root
0-1       up       ---x------   root   root

Here, nodes 0 and 1 are up, while node 2 is down. The nodes are “owned” by the root user and root group with the execute bit set only for the user. Slave nodes can be controlled, individually or en masse, using the bpctl program.

For example, to set the permission bits of the “up” nodes so that all users can execute jobs on them, type:

[root@master1 root]# bpctl -S allup -m 111

Running bpstat again shows the updated status:

[root@master1 root]# bpstat
Node(s)   Status   Mode         User   Group
2         down     ----------   root   root
0-1       up       ---x--x--x   root   root

The bpctl command can also be used to change the user and group IDs of nodes, or to reboot, halt, or power off one or more nodes.
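
For example, rebooting a single node or all of the "up" nodes looks something like this (the -R flag appears again later in this column; check the bpctl usage message for the flags that halt or power off nodes):

[root@master1 root]# bpctl -S 2 -R      # reboot node 2
[root@master1 root]# bpctl -S allup -R  # reboot all nodes that are currently up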

Now you need to be sure MPI is working. The simplest test is to compile and run the "Hello World!" code included in this column back in March 2002 (http://www.linux-mag.com/2002-03/extreme_02.html). Compiling and running that code should yield:

[forrest@master1 forrest]$ mpicc -O -o mpi_hello mpi_hello.c
[forrest@master1 forrest]$ mpirun -np 2 ./mpi_hello
Hello world! I’m rank 0 of 2 on n0
Hello world! I’m rank 1 of 2 on n1

And it works!

That was pretty easy. Given a standard Fedora Core 2 distribution, you loaded a new kernel, did a little configuration, and booted the slaves off of the Clustermatic CD to create a little compute cluster.

At this point, bjs (the BProc job scheduler) can optionally be configured, more nodes can be added, and the system is ready to run parallel MPI codes. However, if your parallel programs expect to have access to data files contained on the master node, some additional configuration is required.

Adding On to Clustermatic

Clustermatic systems can support almost any network interconnect, weird library, or shared filesystem. These capabilities, in the form of kernel modules, shared libraries, and/or binaries, merely need to be added to the configuration file (/etc/clustermatic/config), the config.boot file, or the node_up script. This can be a little tricky and involve some trial and error, but checking the /var/log/clustermatic/node.X files often reveals the problems encountered by slave nodes as you modify the configuration files.

Clustermatic can support a variety of high performance shared filesystems, but small clusters often use the plain old Network File System (NFS). Developer Michal Jaegermann has provided some sample scripts for enabling NFS with Clustermatic 5.

To use these scripts, the /etc/clustermatic/node_up script must be modified as shown in Listing Two.

Listing Two: Modified version of /etc/clustermatic/node_up

#!/bin/sh
#
# This shell script is called automatically by BProc to perform any
# steps necessary to bring up the nodes. This is just a stub script
# pointing to the program that does the real work.
#
# $Id: node_up.stub,v 1.3 2003/11/12 23:30:59 mkdist Exp $

seterror () {
    bpctl -S $1 -s error
    exit 1
}

if [ -x /usr/lib64/beoboot/bin/node_up ] ; then
    /usr/lib64/beoboot/bin/node_up $* || seterror $*
else
    /usr/lib/beoboot/bin/node_up $* || seterror $*
fi

# we are “sourcing” these scripts so variable assignments
# remain like in here; pass a node number as an argument
# if you want to _run_ them from a shell and wrap in a loop
# for multiple nodes
#
# Turn the next line on for NFS support on nodes
. /etc/clustermatic/nfs.init $*

exit

This new script catches errors from the node_up binary, but allows other scripts to be sourced afterward. Listing Two sources a new script called nfs.init at boot. Listing Three contains the nfs.init script.

Listing Three: /etc/clustermatic/nfs.init script for loading NFS support and mounting /home from the master node

#!/bin/sh
#
# A sample how to get NFS modules on a node.
# Make sure that /etc/modules.conf.dist for a node does not
# define any ’install’ actions for these
#
# Michal Jaegermann, 2004/Aug/19, michal@harddata.com
#

node=$1
# get the list of modules, and copy them to the node
mod=nfs
modules=$( grep $mod.ko /lib/modules/$(uname -r)/modules.dep )
modules=${modules/:/}
modules=$(
    for m in $modules ; do
        echo $m
    done | tac )
( cd /
  for m in $modules ; do
      echo $m
  done
) | ( cd / ; cpio -o -c --quiet ) | bpsh $node cpio -imd --quiet
bpsh $node depmod -a
# fix the permissions after cpio
bpsh $node chmod -R a+rX /lib
# load the modules
for m in $modules ; do
    m=$(basename $m .ko)
    m=${m/_/-}
    case $m in
        sunrpc)
            bpsh $node modprobe -i sunrpc
            bpsh $node mkdir -p /var/lib/nfs/rpc_pipefs
            bpsh $node mount | grep -q rpc_pipefs || \
                bpsh $node mount -t rpc_pipefs sunrpc /var/lib/nfs/rpc_pipefs
            ;;
        *) bpsh $node modprobe -i $m
    esac
done
# these are for the benefit of rpc.statd
bpsh $node mkdir -p /var/lib/nfs/statd/
bpsh $node mkdir -p /var/run
bpsh $node portmap
bpsh $node rpc.statd
bpsh $node mkdir /home
bpsh $node mount -t nfs -o nfsvers=3,rw,noac 10.0.2.1:/home /home
# mount the swap partition
#bpsh $node /sbin/swapon -p 0 /dev/hda2

This script first obtains a list of modules required by NFS and copies those modules over to the slave node (using cpio through a pipe to bpsh). Next, it runs depmod -a on the slave to create a list of module dependencies prior to loading the required modules using modprobe. A special directory and filesystem are needed to get sunrpc working correctly.

Two directories (/var/lib/nfs/statd and /var/run) are created, and then portmap and rpc.statd are executed. Finally, the /home directory is created on the slave node, and /home from the master (using an IP address of 10.0.2.1) is mounted there via NFS. (The correct IP address of the master node should be used in place of the one shown in Listing Three.)

The command at the bottom of the listing is commented out. If uncommented, it enables swap on the slave node (assuming a swap partition is available on /dev/hda2). If slave nodes have disks and limited memory, enabling swap may be desirable.

A couple more configuration changes are required to enable NFS. Since modprobe requires two configuration files (specifically /etc/modprobe.conf.dist and /etc/modprobe.conf), these need to be copied onto slave nodes before nfs.init runs. This is easily accomplished by adding the following two lines to /etc/clustermatic/node_up.conf.

plugin miscfiles /etc/clustermatic/node/modprobe.conf.dist > /etc/modprobe.conf.dist
plugin miscfiles /etc/clustermatic/node/modprobe.conf > /etc/modprobe.conf

Next, modprobe.conf.dist and modprobe.conf need to be created in /etc/clustermatic/node/ on the master node. The stock versions of these two files should be modified so that they do not define any "install" actions for modules that might confuse slave nodes. Versions of these files tested with Fedora Core 2 are available online at http://www.linux-mag.com/downloads/2005-04/extreme. It was necessary to add the NSL library to /etc/clustermatic/config (as shown in Listing One) to get NFS working with Fedora Core 2.
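
One way to create those stripped-down copies on the master (a sketch; the grep pattern simply drops lines beginning with "install", so review the results before relying on them) is:

[root@master1 root]# mkdir -p /etc/clustermatic/node
[root@master1 root]# grep -v '^install' /etc/modprobe.conf.dist > /etc/clustermatic/node/modprobe.conf.dist
[root@master1 root]# grep -v '^install' /etc/modprobe.conf > /etc/clustermatic/node/modprobe.conf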

Finally, /home on the master should be exported by adding a line to /etc/exports like this…

/home 10.0.0.0/8(rw,sync,no_root_squash)

… and by running exportfs -ra to export the filesystem. If the NFS server is not already running on a Fedora master node, run chkconfig nfs on and service nfs start to get it going. If these commands fail, NFS may not have been installed when Fedora was loaded on the system.
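
Taken together, the NFS server steps on a Fedora master look something like this:

[root@master1 root]# exportfs -ra
[root@master1 root]# chkconfig nfs on
[root@master1 root]# service nfs start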

Now the slave nodes can be rebooted using bpctl -S allup -R. Once they're reported as "up" by bpstat, check to see if they successfully mounted the home directory from the master node using bpsh:

[root@master1 root]# bpsh -a -L -p df
1: Filesystem 1K-blocks Used Available Use% Mounted on
1: 10.0.2.1:/home 77718080 16219008 57551168 22% /home
0: Filesystem 1K-blocks Used Available Use% Mounted on
0: 10.0.2.1:/home 77718080 16219008 57551168 22% /home

Ready to Roll!

Clustermatic is a relatively simple and elegant cluster distribution that’s easy to install. Using LinuxBIOS can further simplify and speed operation of thousands of cluster nodes. Since the operating system is loaded from the master node, it’s easy to operate and maintain. To upgrade the entire cluster, simply upgrade the master, load a new patched kernel, create a new phase 2 image with beoboot, and reboot all the slaves with bpctl.
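
As a rough sketch, that upgrade cycle uses only commands shown above (the kernel package name here is a placeholder for whatever newer Clustermatic kernel RPM is being installed):

[root@master1 i686]# rpm -ivh kernel-<new-version>.i686.rpm
(rebuild the initrd, update the boot loader, and reboot the master as before)
[root@master1 root]# beoboot -2 -n
[root@master1 root]# bpctl -S allup -R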

While this system may not meet everyone’s needs, it’s very extensible and easy to use. When evaluating compute cluster solutions, Clustermatic is worth a look!

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov.
