Running Commands on Many Nodes

Since this column became a regular part of Linux Magazine in January (2002), it’s covered a wide variety of Linux high-performance computing topics. In the past year, we’ve configured network topologies, deployed job schedulers, explored high-bandwidth, low-latency networking hardware, and used open source tools such as MPI and PVM to develop parallel software applications.

While other popular chronicles stop after presenting general philosophies or recommending turn-key collections of packages on a single CD, this column has taken a different tack. Each month, it has presented specific working examples of code or configuration files to help you get your feet wet and your hands dirty. Hopefully, you agree that this more pedagogical approach has provided vital, insightful information that has helped you incrementally improve your own Beowulf-style cluster.

Given that broad exposure to parallel computing, we’re now ready to evaluate custom cluster distributions and pre-packaged cluster solutions. During the course of the next year, this column will review a few of the more popular “cluster in a box” collections. Of course, this column will also continue to provide the kind of in-depth discussions of system configuration and parallel programming methods that you’ve become accustomed to.

Thanks to sophisticated but commodity hardware components, it’s pretty cheap and easy to hook lots of computers together to build a reasonably powerful supercomputer. Unfortunately, getting them all to work together is a challenge for both system administrators and application developers. Managing a large number of identical nodes in a cluster is the tedious and thankless job of a Beowulf system administrator. As mentioned in previous columns, an assortment of new tools can help simplify the otherwise repetitive nature of cluster management. Among those tools is the C3 Toolkit.

The C3 Toolkit

The Cluster Command and Control (C3) toolkit is a collection of Python scripts that perform rudimentary tasks across every node in a cluster. Included as a component of the Open Source Cluster Application Resources (OSCAR) tool set (http://oscar.sourceforge.net), put together by the Open Cluster Group (http://www.openclustergroup.org), the C3 scripts are designed to make the task of administering identical nodes in a small- to medium-sized cluster easier. A few of the scripts are also of use to cluster users. The C3 scripts were developed at the Oak Ridge National Laboratory, and are available separately at http://www.csm.ornl.gov/torc/C3.

The dozen or so scripts that make up the C3 toolkit may be invoked from the command line or included in higher-level scripts. The most fundamental script, cexec, executes a command string or sequence on all compute nodes in the cluster. Similar to but more sophisticated than the brsh shell script developed in the February column (available online at http://www.linux-mag.com/2002-02/extreme_01.html), the C3 version comes in two flavors: cexec for parallel operation, and cexecs for serial operation.

Using the C3 Utilities

Invoking cexec command causes command to execute on all compute nodes simultaneously, while cexecs command causes command to execute on each compute node sequentially, one after the other. cexec forks separate processes, opens connections to every compute node at once, and then arranges the output by node number for display. cexecs opens one connection at a time, and immediately shows the output from a node before moving on to the next node. The parallel version is useful for executing long-running processes, such as compiling and installing software from local disks on all machines simultaneously. The serial version is better for quick tasks where the output is short but important, like checking system status as shown in Figure One.




Figure One: The results of running cexecs uptime


--------- node02 ---------
10:36pm up 3 days, 3:05, 0 users, load average: 0.00, 0.00, 0.00
--------- node03 ---------
10:36pm up 3 days, 3:35, 0 users, load average: 0.00, 0.00, 0.00
.
.
.
--------- node64 ---------
10:36pm up 3 days, 3:12, 0 users, load average: 0.00, 0.00, 0.00
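The parallel flavor pays off when every node has substantial work to do. As a hedged sketch (the source path is hypothetical, and it assumes the code was already distributed to each node), a cluster-wide rebuild might look like:

cexec 'cd /usr/local/src/mypkg && make && make install'

Since connections to all nodes are opened at once, the build finishes in roughly the time taken by the slowest node rather than the sum of all of them.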

Most of the other C3 scripts operate in parallel by default. This reduces the total time to complete a single operation since the operation occurs on all nodes simultaneously. Unfortunately, this limits the scalability of the toolkit to clusters having at most a few hundred nodes. On clusters with thousands of nodes, thousands of processes and network connections would be opened simultaneously, resulting in the symptoms of a denial of service attack. Moreover, anything that copies data from compute nodes to the master, like cget (described below), would cause the master to grind to a halt. The C3 authors are considering these scalability problems, and plan to develop a mechanism to solve them.

The cpush, cget, crm, and ckill scripts are useful for normal users as well as system administrators. The cpush script sends one or more files to every compute node in the cluster. It uses rsync to perform the transfer because rsync only sends the instructions and new data needed to bring files in sync, thereby reducing network load for files that are only slightly different. The cget script retrieves a file from every node in the cluster. When each file is saved on the local disk, the cluster name and node name are appended to the file name.
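For instance, a user staging a job might push input data out and pull results back like this (the file names are hypothetical; the syntax matches the examples elsewhere in this column):

cpush /home/forrest/input.dat /tmp/
cget /tmp/job_output /home/forrest

After the cget completes, the local directory holds one copy of job_output per node, each distinguished by its appended cluster and node name.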

The C3 scripts can operate across multiple clusters, or can be limited to a subset of nodes in a single cluster. For example, both cpush and cget can be forced to operate on all nodes in all clusters using the --all option.

Alternatively, scripts may be forced to operate only on a subset of nodes in the local cluster by specifying an index range (e.g., cget :6-8 /tmp/job_output /home/forrest), or on a subset of clusters by specifying a list of clusters (e.g., cget cluster1: cluster2: /tmp/output /home/forrest). Be aware that node indices, like the 6-8 shown above, refer to positions in the list of nodes in the configuration file (described below). These numbers are independent of the numbers included in the node names. Node index 6 may not map to host node06.

The crm utility deletes one or more files from nodes in one or more clusters. Deletes can be performed recursively by using the -r option and may be “forced” using the -o option (equivalent to -f in rm). Figure Two shows a simple example of the use of cpush, cexecs, cget, and crm.




Figure Two: Output of some C3 commands


[root@node01 root]# cpush /etc/hosts /root/
[root@node01 root]# cexecs ls /root
************************ penguin ************************
--------- node02 ---------
hosts
--------- node03 ---------
hosts
.
.
.
--------- node64 ---------
hosts
[root@node01 root]# cget /root/hosts
[root@node01 root]# ls
hosts_penguin_node02 hosts_penguin_node23 hosts_penguin_node44
hosts_penguin_node03 hosts_penguin_node24 hosts_penguin_node45
.
.
.
hosts_penguin_node22 hosts_penguin_node43 hosts_penguin_node64
[root@node01 root]# crm /root/hosts

The ckill utility runs kill on each node, sending an optionally specified signal to the named process. For example, ckill a.out sends the normal TERM signal to all processes named a.out on every node in the default/local cluster. Typing ckill -s HUP sendmail sends the HUP signal to the sendmail process on every node.

The root user may also specify a user name, thereby limiting signals to processes owned by the specified user having the specified process name. For instance, ckill --user bob a.out kills any processes called a.out that are owned by user bob on every node.

The cpushimage and cshutdown scripts are especially useful for system administrators. cpushimage pushes a hard drive image created with SystemImager to one or more nodes in the cluster. This causes the nodes to rebuild their file systems using the pushed image, and (optionally) reboot, offering system administrators an easy method for updating or upgrading operating systems or large software packages on all nodes at once. (SystemImager is now a part of the System Installation Suite, which includes System Configurator, SystemImager, and System Installer. The System Installation Suite is available at http://www.sisuite.org.)
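A sketch of that update cycle might look like the following; the image name is invented here, and the exact spelling of the reboot option should be verified against the cpushimage documentation:

cpushimage -r compute-image-rh73

With a reboot requested, each node re-images its disks from the pushed image and comes back up running the new system.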

The cshutdown script runs the shutdown command on every node at a specified time with an optional warning message. Although rarely needed on most clusters, this utility comes in handy for scheduled outages. cshutdown supports various options including -r for reboot, -H for halt, -k to “fake” a shutdown, as well as others. The --onreboot option will reboot all nodes, making them load the image associated with the specified lilo label.
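For instance, assuming cshutdown passes its time and message arguments through to shutdown in the usual form, a scheduled maintenance reboot might be requested as:

cshutdown -r 22:00 "Nightly maintenance: all nodes will reboot at 10 PM"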

Installing and Configuring C3

The C3 scripts make certain assumptions about the layout of the cluster, and depend on a number of other packages. It’s assumed that the master node is not a computational node. As a result, all script functions execute only on (or against) the computational nodes, not the master. Care must be taken if the master is included in the list of computational nodes in the configuration file (discussed below).

As mentioned above, these scripts rely on other packages to function. They are written in Python2, which must be installed on all nodes. The scripts use rsync for file transfers/updates, and use OpenSSH (and therefore OpenSSL) to start remote processes by default. The utilities can be instructed to use rsh or any other similar remote shell by setting the C3_RSH environment variable (as in export C3_RSH=rsh). To use the disk image features for installing Linux, DHCP must be running on the master/server, and SystemImager must be installed.

Once these packages are installed, the C3 Toolkit can be downloaded and installed on the master using the install script provided. Next, a configuration file must be created to inform the scripts about the accessible clusters and nodes. Called /etc/c3.conf, this configuration file lists all available clusters and the nodes contained within each cluster. Figure Three shows an example configuration file.




Figure Three: The contents of /etc/c3.conf


cluster penguin {
    penguin01:node01  # head node
    node0[2-9]        # compute nodes
    exclude 4         # do not do anything to node04
    node[10-60]       # more compute nodes
    exclude [45-55]   # do not do anything to node45-node55
    dead node61       # node61 is down for the count
    node[62-64]       # more compute nodes
}

cluster doom {
    doom:zim          # head node
    gir
    dib
    gaz
    dead skooge
}

In this example, two clusters are known: penguin and doom. The penguin cluster is the local cluster, and its head or master node is called penguin01 on the public/routed network and node01 on the internal network. The cluster consists of compute nodes called node02 through node64. Because the nodes prior to node10 contain a leading zero, the list must be split into at least two lines — specifying node[02-64] will not work.

The nodes are assigned an index for the purposes of specifying ranges of nodes for the C3 scripts described earlier. In this case, node02 will be 0, node03 will be 1, node04 will be 2, and so on. Nodes may be excluded from access by the C3 scripts by listing the appropriate node number or range as an exclude entry immediately below the line in which the nodes are included. In this example, node04 as well as node45 through node55 will not be touched by the C3 scripts.

Some nodes may be down for long periods of time. To avoid the renumbering of nodes or altering their index assignments, such nodes may be listed as dead in the configuration file. Here node61 is listed as dead, and will not be accessed by C3 scripts. Note that the node is not otherwise included in other entries in the cluster list.

In the second cluster, called doom, the head node is named doom on its external network interface, while it’s known as zim on its internal interface. Other compute nodes in the cluster are called gir, dib, gaz, and skooge. Note that skooge is dead.

At this point, the master node is set up and ready to go. The only piece that must be installed on the compute nodes is the ckillnode script. Because the master is already set up, the C3 scripts may be used to make the appropriate directory on all the compute nodes and to copy (or push) the ckillnode script to each node. Figure Four shows all the steps involved in installing the C3 scripts on the local cluster.




Figure Four: Pushing the C3 scripts to all nodes


[root@node01 src]# wget http://www.csm.ornl.gov/torc/C3/Software/c3-3.1.tar.gz
[root@node01 src]# tar xzf c3-3.1.tar.gz
[root@node01 src]# cd c3-3.1
[root@node01 c3-3.1]# ./install
[root@node01 c3-3.1]# cexec mkdir /opt/c3-3
[root@node01 c3-3.1]# cpush /opt/c3-3/ckillnode

Once installation and configuration are complete, the remaining scripts can be used to analyze the configuration file and understand the numbering of nodes. The clist script describes all the clusters listed in the configuration file as either “direct local,” meaning the command was issued on the master of the local cluster, “direct remote,” meaning the cluster is directly accessible on the network and is not the local cluster, or “indirect remote,” meaning the cluster is accessible remotely only through another (front-end) host.

The cname utility returns the node names implied by a specified range, while cnum returns the index assigned to a named node. Both are handy for double-checking how the configuration file maps indices to hosts.
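As a hedged illustration against the Figure Three configuration (the exact invocation and output formatting may differ in practice):

cname :0-1     # should name node02 and node03, the first two indexed nodes
cnum node62    # reports the index the configuration assigns to node62

Figure Five turns to a more substantial use of the toolkit: push_passwd, a short administrative script that keeps account files synchronized across the cluster.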




Figure Five: push_passwd, an example script built on the C3 tools


#!/bin/sh

/opt/c3-3/cpush /etc/passwd
/opt/c3-3/cpush /etc/group
if [ -f /etc/shadow ]; then
    /opt/c3-3/cpush /etc/shadow
fi
if [ -f /etc/gshadow ]; then
    /opt/c3-3/cpush /etc/gshadow
fi

exit 0

The C3 scripts can be used in user or administrator scripts invoked manually or on a scheduled basis by cron. Users may find that distributing data prior to a parallel job and collecting results back to a single node afterward is most easily handled with cpush and cget calls in their own job scripts. System administrators may build scripts, like the push_passwd script shown in Figure Five, to perform simple operations like syncing up password and group files across the entire cluster.
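For example, installing push_passwd in a location like /usr/local/sbin (a hypothetical path) and adding a crontab entry on the master keeps accounts synchronized automatically:

15 2 * * * /usr/local/sbin/push_passwd

With that single line, every node receives fresh copies of the account files at 2:15 each morning.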

While many of the functions provided by the C3 scripts could just as easily be performed by scripts written by your own hand, having a standard set of utilities simplifies administration of multiple clusters and provides consistency for users. Table One lists the name and purpose of each of the C3 tools. As you can see, the toolkit automates many typical cluster administration tasks and many common user tasks, including “pushing” and “pulling” files from multiple nodes and cleaning up after distributed job runs. C3 is especially handy when the individual utilities are combined in custom-made scripts. Try the C3 toolkit on your own cluster and see how you like it.





Table One: The C3 Toolkit scripts

cexec        Executes a given command string on each node simultaneously
cexecs       Executes a given command string on each node in turn
cget         Retrieves a file from each node and places it in a target directory
ckill        Sends a specified signal to the specified process on each node
clist        Lists the names and types of clusters in the configuration file
cname        Returns the node names of clusters, optionally from a specified range
cnum         Returns a node index from a specified cluster and node name
cpush        Transfers file(s) from the local node to each node
cpushimage   Transfers hard drive images (created by SystemImager) to each node
crm          Deletes file(s) from each node
cshutdown    Halts or reboots each node

And thanks for reading “Extreme Linux” over the past year. I hope you’ve enjoyed reading it as much as I have enjoyed writing it. Comments and feedback are always welcome. Have an EXTREME holiday season, and tune in again next year!



Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov.
