Using BProc

This month’s column focuses on building and using Beowulf Distributed Process Space (BProc) software used by the commercial Scyld Beowulf and the Clustermatic Linux distributions for high performance computing (HPC) clusters. BProc provides a single process space across an entire cluster of slave or compute nodes, meaning that all application processes show up in the process table of the master node and can be controlled from the master even though they’re actually running on slave nodes.

BProc consists of a set of kernel patches, kernel modules, master and slave daemons, and client programs used to start, migrate, and manage application processes across the entire cluster. A library of BProc system calls is also available for controlling process migration and performing a variety of functions on cluster nodes. In addition, commands are provided for running programs on, and copying files to, individual nodes or all nodes at once.
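As a taste of the library interface, a program can migrate itself to a slave node with a couple of calls. This is only a sketch: it assumes libbproc and its sys/bproc.h header are installed on the master, and must be linked with -lbproc. The function names follow the BProc library API; the target node number is arbitrary.

```c
/* Sketch: migrating the current process to a slave node using the
 * BProc library.  Assumes libbproc and <sys/bproc.h> are installed;
 * compile with:  gcc -o mover mover.c -lbproc */
#include <stdio.h>
#include <stdlib.h>
#include <sys/bproc.h>

int main(void)
{
    int node = 0;   /* target slave node number (illustrative) */

    printf("starting on node %d\n", bproc_currnode());

    /* Move this process to the target node.  On success, execution
     * continues there, and the process remains visible in the
     * master's process table as a ghost. */
    if (bproc_move(node) != 0) {
        perror("bproc_move");
        exit(EXIT_FAILURE);
    }

    printf("now running on node %d\n", bproc_currnode());
    return 0;
}
```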

The History of BProc

First developed about five years ago on a 64-node cluster at NASA’s Goddard Space Flight Center (GSFC) by Erik Hendriks and colleagues, BProc initially provided a means for observing and controlling remote processes. Using ghost processes on a master (front end) node to represent processes executing on remote slave nodes, it offered an early step toward a single system image (SSI) running across a Linux cluster. The ability to remotely fork processes on compute nodes through process migration was added later by the group at Goddard.

Dan Ridge, also at NASA, was the first to build a cluster with lightweight compute nodes based on BProc. His work set the stage for production use of BProc in Beowulf-style clusters where the compute nodes have just enough of the Linux operating system to run programs spawned from the master node. Later, both Erik and Dan joined Scyld Computing, which incorporated BProc into its professional turn-key cluster management system called Scyld Beowulf.

Beoboot, software for loading the lightweight system onto compute nodes over the network, was developed to eliminate the need to install any software on the compute nodes. The root file system was simply a ramdisk sent over the network from the master node when the slave node booted. This advance greatly simplified cluster installation and administration, and allowed for use of diskless compute nodes without the necessity of mounting root filesystems via the Network File System (NFS).

Some time later, Erik left Scyld to join Ron Minnich at Los Alamos National Laboratory (LANL) where they combined LinuxBIOS and BProc into a single suite of software. LinuxBIOS places the first phase boot image into the flash memory of slave nodes, bypassing the awkward system BIOS used on most PC-based systems. Now Erik periodically releases these software tools as a single package he calls “Clustermatic,” a name inspired by Ronco’s Vegematic. Clustermatic is used on a variety of clusters at LANL and other sites.

Installing BProc

BProc ships with the Clustermatic distribution or may be built and used independently under Linux. It is an open source project released under the GNU General Public License, and can be downloaded from http://bproc.sourceforge.net/. In addition, a mailing list with online archives is available at the same location. As of this writing, BProc 4.0.0pre8 is the latest version, and it includes patches for the stock 2.6.9 kernel available from http://www.kernel.org/.

The BProc kernel patch should be applied to the kernel source code, and then the kernel should be configured using any of the standard methods. To start with a configuration used previously or supplied with a Linux distribution, that configuration file should be copied to the top of the kernel tree as .config. Then make oldconfig will use this configuration and prompt for any new options. Be sure to enable CONFIG_BPROC. Use make to compile the sources, make modules_install to install the new kernel modules, and make bzImage to build the compressed Linux kernel. The compressed kernel is usually placed in /boot, and an entry for booting this kernel should be added to the loader (usually LILO or GRUB) configuration file.
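The steps above might look like the following on an i386 system. This is a sketch, not a recipe: the patch filename, source tree location, and install paths are illustrative and will vary with your distribution.

```shell
# Sketch of the kernel build steps described above (paths illustrative)
cd /usr/src/linux-2.6.9
patch -p1 < /path/to/bproc-kernel.patch  # apply the BProc kernel patch
cp /boot/config-existing .config         # start from a known configuration
make oldconfig                           # prompts for new options; enable CONFIG_BPROC
make                                     # compile the kernel sources
make modules_install                     # install the new kernel modules
make bzImage                             # build the compressed Linux kernel
cp arch/i386/boot/bzImage /boot/vmlinuz-2.6.9-bproc
# then add an entry for this kernel to the LILO or GRUB configuration
```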

Once a new kernel is built and booted, the BProc kernel module, libraries, and client programs can be compiled and installed. The main configuration file, located in /etc/clustermatic/config, should be created or modified to reflect the configuration of the cluster. The interface to be used for communication should be specified on the interface line, and the range of IP addresses which the slave nodes use may be specified on individual ip lines or using the iprange option.

The port number used by the master daemon may be changed using the bprocport option. By default it uses 2223. In addition, a list of libraries to be made available to compute nodes is specified, along with a list of binaries which should be exported to these nodes. Additional options in this file, like the list of MAC hardware addresses used on slave nodes, apply to Beoboot which is often used in conjunction with BProc.
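Putting those options together, a minimal configuration might look like the following sketch. The interface name, address range, paths, and MAC address are illustrative, and the exact keyword spellings should be checked against the sample file shipped with the BProc or Clustermatic distribution.

```
# Illustrative /etc/clustermatic/config sketch
interface eth1                    # cluster-facing network interface
iprange 0 10.0.4.10 10.0.4.14     # IP addresses for slave nodes 0 through 4
bprocport 2223                    # master daemon port (the default)

# libraries and binaries to make available to slave nodes
librariesfrombinary /bin/ls /usr/bin/gdb

# Beoboot option: MAC address of a slave node's boot interface
node 00:50:56:00:00:01
```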

After the configuration file is set up, the bproc module should be loaded (modprobe bproc) and the master daemon, bpmaster, should be started on the front end node. Similarly, the slave daemon, bpslave, should be started on each compute node using bpslave master_ip [port].

BProc Clients

Once BProc is running, a number of useful client programs are immediately available. One of those is bpsh, a replacement for rsh that can be used to run a program on another node. bpsh can run a program on a single node, a subset of nodes, or all nodes at once.

For example, the command…

$ bpsh 3 ls -l /tmp
total 224856
-rw-rw-r-- 1 forrest 9220 230021709 Dec 17 01:38 myoutput

… shows a directory listing of /tmp on node 3 where the file myoutput resides. A comma-separated list of nodes can be specified or a range of nodes can be specified as follows:

$ bpsh 1,3 date
Fri Dec 17 01:41:12 EST 2004
Fri Dec 17 01:41:12 EST 2004
$ bpsh 1-3 date
Fri Dec 17 01:41:28 EST 2004
Fri Dec 17 01:41:26 EST 2004
Fri Dec 17 01:41:28 EST 2004

The -a flag can be used to run commands on all available nodes:

$ bpsh -a ls -l /tmp
total 0
total 0
total 0
total 0
total 224856
-rw-rw-r-- 1 forrest 9220 230021709 Dec 17 01:38 myoutput

However, this may not be useful if it isn’t known where the displayed results originated. A series of other flags solves such problems. The -L option makes bpsh wait for an entire line of output before printing it; the -p option prefixes each line with the node number; the -s option sorts the results by node number; and the -d option inserts a divider between the output from each node. To find out where the myoutput file is located, one could do the following:

$ bpsh -a -p -s -d -L ls -l /tmp
0 ---------------------------------------------------------------------
0: total 0
1 ---------------------------------------------------------------------
1: total 0
2 ---------------------------------------------------------------------
2: total 0
3 ---------------------------------------------------------------------
3: total 224856
3: -rw-rw-r-- 1 forrest 9220 230021709 Dec 17 01:38 myoutput
4 ---------------------------------------------------------------------
4: total 0

A similar replacement for rcp is also provided by BProc. bpcp can be used to copy files between master and slave nodes or among slave nodes. For example, the commands…

[forrest@master forrest]$ bpcp input.dat 2:/tmp
[forrest@master forrest]$ bpcp 2:/tmp/input.dat 4:/tmp

… copy the file input.dat from the master to slave node 2, placing it in the /tmp directory, and then copy that file from node 2 to node 4. These two commands can also be combined. To copy the myoutput file from /tmp on all nodes without knowing where it resides, one could do the following:

$ bpsh -a -p bpcp /tmp/myoutput
1: rcp: /tmp/myoutput: No such file or directory
2: rcp: /tmp/myoutput: No such file or directory
0: rcp: /tmp/myoutput: No such file or directory
4: rcp: /tmp/myoutput: No such file or directory

Notice the errors on nodes 0, 1, 2, and 4. The file existed only on node 3, and that node copied it to the master successfully. Had that file existed on more than one node, this command would likely have produced undesired results.

The bpstat command can be used to check the status of the BProc nodes. For example:

$ bpstat
Node(s)  Status  Mode        User  Group
5        down    ----------  root  root
0-4      up      ---x--x--x  root  root
$ bpstat -l
Node  Address  Status  Mode        User  Group
0              up      ---x--x--x  root  root
1              up      ---x--x--x  root  root
2              up      ---x--x--x  root  root
3              up      ---x--x--x  root  root
4              up      ---x--x--x  root  root
5              down    ----------  root  root

This shows that nodes 0 through 4 are up, while node 5 is down. It also shows the modes of the slave nodes, the owning user, and the owning group. Like files in a Linux filesystem, modes can be set for individual nodes, and these nodes can be owned by individual users or groups, so they can be generally available to anyone or dedicated to a particular set of users. Using the -l flag displays the IP address and status of each slave node on a separate line. Using -P, bpstat can take the output from ps and display it with a column of node numbers. For example:

$ ps xf | bpstat –P
6647 ? S 0:00 sshd: forrest@pts/0
6649 pts/0 Ss 0:00 \_ -bash
1130 ? S 0:00 sshd: forrest@pts/2
1133 pts/2 Ss 0:00 \_ -bash
1343 pts/2 R 0:08 \_ ./hello-world
1344 pts/2 S 0:00 | \_ ./hello-world
0 1345 pts/2 R 0:08 | \_ [hello-world]
0 1346 pts/2 R 0:00 | | \_ [hello-world]
1 1347 pts/2 R 0:08 | \_ [hello-world]
1 1348 pts/2 R 0:00 | | \_ [hello-world]
2 1349 pts/2 R 0:08 | \_ [hello-world]
2 1350 pts/2 R 0:00 | | \_ [hello-world]
3 1351 pts/2 R 0:08 | \_ [hello-world]
3 1352 pts/2 R 0:00 | | \_ [hello-world]
4 1353 pts/2 R 0:00 | \_ [hello-world]
4 1354 pts/2 S 0:00 | \_ [hello-world]
1355 pts/2 R+ 0:00 \_ ps xf
1356 pts/2 R+ 0:00 \_ -bash

Here, hello-world is running in parallel on nodes 0 through 4. Since all application processes are represented in the process table on the master node, a ps listing includes these processes even though they are actually running remotely. Likewise, these processes appear when running top or any other process monitoring package. In addition, signals sent to these processes on the master node are propagated to the slave nodes producing the desired result.
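For instance, a remote process can be signaled from the master exactly like a local one. The PID below is taken from the listing above; BProc forwards the signal to the slave node where the process actually runs.

```shell
# Terminate the hello-world process running on node 0 (PID 1345 in
# the ps listing above); the signal is forwarded to the slave node.
kill -TERM 1345
```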

Stay Tuned for More

BProc is a relatively simple package that provides powerful features for Beowulf-style clusters. It offers the advantages of a single system image and avoids the scalability problems and communications overhead incurred when using diskless slaves with NFS-mounted root filesystems. It has a few basic and familiar tools for copying files, running programs, and displaying status information.

Future columns will further explore BProc as well as the other components of the Clustermatic and Scyld Linux cluster distributions.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov.