The March 2005 column discussed how to get started in cluster computing with a (very) low cost cluster. The sample cluster shown was a four-node, diskful cluster, costing only about $1,600. And while that system consisted of modest 2.3 GHz Intel Celeron processors interconnected via 100 Mbps Ethernet, it was quite suitable for testing the technology and the scalability of parallel software applications.
Last month’s column discussed the Clustermatic Linux distribution, a collection of software packages that provides infrastructure for running a cluster and includes BProc (presented in the February column, available online at http://www.linuxmagazine.com/2005-02/). That column showed how to install Clustermatic on the low-cost cluster created previously, and presented a detailed procedure for setting up shared filesystems using NFS.
While Clustermatic runs well on all kinds of compute clusters, it is certainly not your only choice for a turnkey cluster solution. A variety of distributions or software suites for existing Linux installations are available for free or from Linux cluster integrators. In addition to Clustermatic, the more popular free distributions include Rocks, Oscar, and Warewulf.
These cluster distributions tend to be combinations of individual packages (many of which have been discussed previously in this column) and automated installation and configuration programs and scripts. Each suite has its own strengths, weaknesses, and idiosyncrasies, but over the last few years they have become more usable, useful, and customizable.
This month, let’s continue our investigation of these toolkits by looking at Rocks.
Rocks (often referred to as NPACI Rocks) is distributed by the San Diego Supercomputer Center (SDSC) at the University of California at San Diego (UCSD) with support from the National Science Foundation’s National Partnership for Advanced Computational Infrastructure (NPACI). Partnering with SDSC on the development of Rocks are Scalable Systems in Singapore, the HPC Group at the University of Tromsø in Norway, the SCE Group at Kasetsart University in Thailand, and the cluster development group at KISTI in Korea. SDSC also receives equipment donations from Sun Microsystems, Dell, AMD, Infinicon Systems, and Intel.
Rocks is based on Red Hat Linux releases and uses Red Hat Kickstart files to load cluster nodes. ISOs for the Rocks distribution are available on the Rocks web site. Rocks supports 32- and 64-bit Intel architecture processors (including x86, x86_64, and ia64) and Ethernet and Myrinet interconnects. With the Rocks 3.3.0 (“Makalu”) release on October 19, 2004, support was added for EM64T (Intel’s Nocona processor) and for Infiniband interconnects.
Rocks clusters are highly customizable. In fact, the distribution is available in a number of flavors depending on how a cluster is to be used. While 80 percent of the software is the same, a variety of packages may be alternatively loaded if, for example, the cluster will be used for high performance computing (HPC), for visualization, or as a grid resource. Most of these package combinations can be downloaded separately (cafeteria-style) from the Rocks web site as packages called rolls, so that a tailored configuration can be easily constructed.
Rolls are available for the base system, HPC, the kernel, AMD library support (amd roll), security services (area51 roll), Condor, the Globus Toolkit (grid roll), Infiniband support for Infinicon hardware (ib roll), LAM MPI and MPICH built with Intel compilers (intel roll), Java SDK and JVM (java roll), Portable Batch System job scheduler (pbs roll), web-based graphical cluster management tools from Scalable Systems (RxC roll), Open Scalable Cluster Environment (sce roll), Sun Grid Engine job scheduler (sge roll), visualization support (viz roll), small boot disk for WAN installation (boot roll), workload management for Platform Rocks (lava roll), and Infiniband support for Voltaire hardware (ib-voltaire roll). The base, HPC, and kernel rolls are required; all others are optional.
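The required-versus-optional split can be checked mechanically before burning any CDs. The following is a minimal sketch in Python, not anything shipped with Rocks: the roll names come from the list above (lowercased, with "condor" and "rxc" assumed as short names), and the check_rolls() helper is hypothetical.

```python
# Roll names are taken from the article; the validation logic and the
# check_rolls() helper are illustrative assumptions, not part of Rocks.
REQUIRED_ROLLS = {"base", "hpc", "kernel"}
OPTIONAL_ROLLS = {
    "amd", "area51", "condor", "grid", "ib", "intel", "java", "pbs",
    "rxc", "sce", "sge", "viz", "boot", "lava", "ib-voltaire",
}


def check_rolls(selected):
    """Return (missing_required, unknown) for a proposed roll selection."""
    chosen = {name.lower() for name in selected}
    missing = sorted(REQUIRED_ROLLS - chosen)
    unknown = sorted(chosen - REQUIRED_ROLLS - OPTIONAL_ROLLS)
    return missing, unknown


missing, unknown = check_rolls(["base", "hpc", "sge", "ganglia"])
print("missing required rolls:", missing)
print("unrecognized rolls:", unknown)
```

Here the selection lacks the kernel roll and names something that is not in the roll list, so both problems are reported before installation starts.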
Scalable Systems in Singapore offers what it calls Scalable Rocks, a version of Rocks 3.3.0 that includes the Intel Software Development products and the RxC management tools mentioned above. It’s available as a Community Edition (free, with restrictions on commercial and OEM/VAR use), an Open Edition (with no support or updates), and supported Academic and Commercial Editions that also include full Red Hat Enterprise Linux licenses.
The Rocks Philosophy
Like many cluster toolkits, Rocks takes the position that it is easier to reload an operating system (OS) than it is to maintain identical disk images across compute nodes through individual updates. But unlike some toolkits, Rocks actually loads the OS onto a hard drive in each compute node. (Remember that Clustermatic took a true minimalist approach: compute nodes have only minimal kernels and OS files running right out of a RAMdisk.) Rocks compute nodes must build their OS using a Kickstart file passed to them from the front end node at boot time. Since Kickstart installations can automatically detect hardware and install appropriate drivers or modules, Rocks can better support heterogeneous hardware; all nodes need not be identical.
Essential configuration information about cluster nodes is maintained in a MySQL database on the front end node. This information is used to build DHCP configuration files, /etc/hosts files, PBS node files, etc. In a similar fashion, information required to build Kickstart files is maintained in a set of XML files (in the Rocks Kickstart Graph) associated with the various rolls. This simplifies adding new rolls into the distribution.
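The idea of generating several configuration files from one table of node records can be sketched in a few lines of Python. This is only an illustration of the approach, not the Rocks implementation: the NODES list stands in for database rows, and the field names, MAC addresses, IP addresses, and the "local" domain are all made up.

```python
# Hypothetical node records standing in for rows in the front end's MySQL
# database; field names and values are illustrative, not the Rocks schema.
NODES = [
    {"name": "compute-0-0", "mac": "00:11:22:33:44:00", "ip": "10.1.1.10"},
    {"name": "compute-0-1", "mac": "00:11:22:33:44:01", "ip": "10.1.1.11"},
]


def hosts_lines(nodes, domain="local"):
    """Render /etc/hosts-style lines: IP, qualified name, short name."""
    return [f"{n['ip']}\t{n['name']}.{domain}\t{n['name']}" for n in nodes]


def dhcp_host_blocks(nodes):
    """Render ISC dhcpd-style host declarations pinning each MAC to an IP."""
    blocks = []
    for n in nodes:
        blocks.append(
            f"host {n['name']} {{\n"
            f"    hardware ethernet {n['mac']};\n"
            f"    fixed-address {n['ip']};\n"
            f"}}"
        )
    return blocks


print("\n".join(hosts_lines(NODES)))
print("\n".join(dhcp_host_blocks(NODES)))
```

Because every file is derived from the same records, adding or renaming a node in one place keeps the generated files consistent with each other.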
Rocks requires x86 (Intel Pentium 3/4/Xeon, AMD Athlon), x86_64 (AMD Opteron, Intel EM64T), or IA-64 (Intel Itanium) processors. Supported networks are Ethernet (whatever is supported by Red Hat), Myrinet (Lanai 9.x), and Infiniband (Infinicon or Voltaire). The front end node must have at least 16 GB of disk capacity, 512 MB of memory, and two physical Ethernet interfaces (for example, eth0 and eth1). Compute nodes also require 16 GB of disk capacity (since an OS will be built on each node), 512 MB of memory, and a single Ethernet interface.
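Before committing hardware to the cluster, the minimums above can be turned into a quick checklist. The sketch below is a hypothetical helper, not a Rocks tool; the thresholds are the ones quoted in this column, and the role names "frontend" and "compute" are labels chosen for the example.

```python
# Minimums quoted in the article; the helper itself is hypothetical.
MIN_DISK_GB = 16
MIN_MEMORY_MB = 512
MIN_ETHERNET_PORTS = {"frontend": 2, "compute": 1}


def shortfalls(role, disk_gb, memory_mb, ethernet_ports):
    """List the ways a node falls short of the stated minimums."""
    problems = []
    if disk_gb < MIN_DISK_GB:
        problems.append(f"disk: {disk_gb} GB < {MIN_DISK_GB} GB")
    if memory_mb < MIN_MEMORY_MB:
        problems.append(f"memory: {memory_mb} MB < {MIN_MEMORY_MB} MB")
    needed = MIN_ETHERNET_PORTS[role]
    if ethernet_ports < needed:
        problems.append(f"ethernet: {ethernet_ports} port(s) < {needed}")
    return problems


# A compute node that meets the minimums, and a front end that does not.
print(shortfalls("compute", disk_gb=40, memory_mb=512, ethernet_ports=1))
print(shortfalls("frontend", disk_gb=10, memory_mb=256, ethernet_ports=1))
```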
The first Ethernet interface on all nodes, including the front end, must be interconnected with an appropriate Ethernet switch. It is assumed that the second Ethernet interface on the front end node is connected to the outside Internet. The front end node will run a variety of services (NFS, DHCP, NTP, MySQL, HTTP, and so on), which should only be accessible on the internal network for the benefit of the compute nodes. An additional high performance, low latency interconnect like Myrinet or Infiniband may optionally be used to enable high performance message passing for parallel codes.
Installing and Configuring Rocks
To have a functional Rocks cluster, the Rocks Base CD, the HPC roll, and the kernel roll must be installed. The Base image should be burned onto a CD for installation. The HPC and kernel rolls are available as a single ISO, which should also be put onto a CD for installation. The front end node must be loaded first.
The front end node should be booted using the Base CD. When the boot prompt appears, type frontend to begin the process. Note that pressing return, or waiting without typing frontend, starts the compute node installation instead. Once booted, the node will enter a text-based installation process.
The installer will prompt with “Do you have a roll CD/DVD?” Select “Yes” and the system will eject the Base CD. Next, insert the HPC and kernel roll CD and select “OK.” A window should be displayed showing that the system found the HPC and kernel rolls. When prompted for another roll CD, select “No” to continue with the installation.
The system will request that the Base CD be loaded again into the drive. Insert the Base CD and select “OK.” At this point, a Cluster Information screen will appear requesting information about the cluster. Only the “Fully Qualified Hostname” is required, and a real hostname with domain name should be entered in this field (not an alias). Optionally, the Country, Cluster Name, Contact e-mail address, Organization name, URL, Locality (city), State, and even latitude and longitude of the cluster may be provided. Select “OK” to continue.
Next, a disk partitioning screen will appear. The “Autopartition” option is recommended, and it will create a 6 GB root partition, a 1 GB swap partition, and put the remainder of the drive in an /export partition. If “Disk Druid” is selected to manually partition the drive, at least a 6 GB root partition and a separate /export partition must be created.
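To see how much space the automatic layout leaves for /export on a given drive, the arithmetic can be written out as follows. This is a sketch of the sizes described above, assuming whole gigabytes and ignoring filesystem overhead; autopartition_plan() is a hypothetical name.

```python
# Sizes from the article's description of the "Autopartition" option.
ROOT_GB = 6
SWAP_GB = 1


def autopartition_plan(disk_gb):
    """Split a disk into 6 GB /, 1 GB swap, and the remainder as /export."""
    export_gb = disk_gb - ROOT_GB - SWAP_GB
    if export_gb <= 0:
        raise ValueError(f"{disk_gb} GB leaves no room for /export")
    return {"/": ROOT_GB, "swap": SWAP_GB, "/export": export_gb}


# The 16 GB minimum front end disk leaves 9 GB for /export.
print(autopartition_plan(16))
```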
The network configuration for the two Ethernet devices must be performed next. The installer defaults for eth0 (“Activate on boot” with an address of 10.1.1.1 and a netmask of 255.0.0.0) should be accepted, unless special network configurations are required on what should be a network internal to the cluster. For eth1, the appropriate static IP address and netmask should be entered. While DHCP is presented as an option for the public network connection, it is not actually supported by Rocks at this time.
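The default eth0 settings put the whole cluster in a single private 10.0.0.0/8 network. Python's standard ipaddress module can confirm what that address and netmask imply; the is_internal() helper and the sample addresses tested are made up for illustration.

```python
import ipaddress

# The installer default for the front end's internal interface (eth0).
frontend_eth0 = ipaddress.ip_interface("10.1.1.1/255.0.0.0")


def is_internal(address):
    """True if the address falls inside the cluster's internal network."""
    return ipaddress.ip_address(address) in frontend_eth0.network


print("internal network:", frontend_eth0.network)
print("addresses available:", frontend_eth0.network.num_addresses)
print("10.255.255.254 internal?", is_internal("10.255.255.254"))
print("192.168.1.20 internal?", is_internal("192.168.1.20"))
```

A check like this is handy when deciding whether the public address planned for eth1 could collide with the internal range.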
The Gateway and Domain Name Servers should be entered on the next screen, and the Time Zone and Network Time Server should be entered on the Time Configuration screen. Next, the root password must be set, then the system will begin installing all the necessary packages onto the front end node. Once installation is complete, the front end node can be booted from its hard disk.
Now the OS can be installed on the compute nodes. The Base CD will be needed for loading the compute nodes. First, log in to the front end node as root and run…
# insert-ethers
… to begin constructing the database of MAC addresses from compute nodes. If a managed switch is being used on the internal network, it may send DHCP requests which could cause it to be considered a compute node. To prevent this problem, first select “Ethernet Switches” from the “Appliance Type” list and allow that DHCP request to be registered as a switch. Then quit the program using the F10 key, and run insert-ethers again to begin configuring compute nodes.
Choose “Compute” from the Appliance Type list and select “OK” to begin recording MAC addresses of compute nodes. A blank box will be presented. As DHCP requests are observed, nodes will be added to this list. Now boot the first compute node using the Base CD (just press return at the boot: prompt).
Once the CD has been booted and the compute node has made a DHCP request, its address is added to the “Inserted Appliances” list on the front end node along with a node name (such as compute-0-0 for node 0 in cabinet 0). The compute node should receive an IP address assignment from the front end node and should then request a Kickstart file. Once that file has been sent to the new compute node, an asterisk is shown in parentheses to the right of the node name.
The installation process should take about 10 minutes. When the installation of the compute node is complete, the Base CD is ejected. All the remaining nodes in a group or a rack should be loaded in similar fashion. Once the first rack of nodes is loaded, quit insert-ethers by pressing F10, and run it again as…
# insert-ethers --cabinet=1
… to load and name the compute nodes in the next cabinet (compute-1-0, compute-1-1, and so on). Once all nodes are booted, you have yourself a working Rocks cluster.
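The naming scheme described above (compute-CABINET-RANK, assigned in the order nodes are first seen) can be modeled with a small class. This is a toy illustration of the bookkeeping, not how insert-ethers is written; the ApplianceRegistry class and the MAC addresses are invented for the example.

```python
class ApplianceRegistry:
    """Toy model: name nodes in the order their first DHCP request is seen."""

    def __init__(self, cabinet=0, basename="compute"):
        self.cabinet = cabinet
        self.basename = basename
        self.by_mac = {}  # MAC address -> assigned node name

    def observe(self, mac):
        """Record a DHCP request; a MAC seen before keeps its existing name."""
        mac = mac.lower()
        if mac not in self.by_mac:
            rank = len(self.by_mac)  # next free position in this cabinet
            self.by_mac[mac] = f"{self.basename}-{self.cabinet}-{rank}"
        return self.by_mac[mac]


# First cabinet: nodes are powered on one at a time, in rack order.
rack0 = ApplianceRegistry(cabinet=0)
for mac in ["00:11:22:33:44:00", "00:11:22:33:44:01"]:
    print(mac, "->", rack0.observe(mac))

# Second cabinet: numbering restarts at rank 0 under the new cabinet number.
rack1 = ApplianceRegistry(cabinet=1)
print("00:11:22:33:44:10", "->", rack1.observe("00:11:22:33:44:10"))
```

Since names follow boot order, powering nodes on one at a time from the top or bottom of the rack keeps node names aligned with physical positions.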
Additional rolls may now be applied to the cluster by inserting the desired roll CD into the drive on the front end node, mounting it as /mnt/cdrom, and running:
# rocks-dist copyroll
Next, unmount the CD and repeat for each additional desired roll. Then distribute the roll as follows:
# cd /home/install
# rocks-dist dist
Give It a Spin!
Now that the cluster is running, try compiling the usual “Hello, World!” MPI code and running it with mpirun. To spawn processes on compute nodes, use cluster-fork. For instance, try:
$ cluster-fork ps -u$USER
The cluster-fork command is a powerful tool for controlling compute nodes. Check out all the command line flags on its man page.
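Conceptually, a tool like this amounts to running the same command on every compute node in turn. The Python sketch below shows that pattern under the assumption that the nodes are reachable over passwordless ssh; it is not the cluster-fork source, and the function names and node list are hypothetical.

```python
import subprocess


def fork_commands(nodes, command):
    """Build one ssh argument list per node; command is one shell string."""
    return [["ssh", node, command] for node in nodes]


def run_on_nodes(nodes, command, runner=subprocess.run):
    """Run the command on each node in turn and collect exit codes."""
    results = {}
    for node, argv in zip(nodes, fork_commands(nodes, command)):
        print(f"{node}:")
        completed = runner(argv)  # blocks until this node has finished
        results[node] = completed.returncode
    return results


# Show what would be executed, without contacting any node.
for argv in fork_commands(["compute-0-0", "compute-0-1"], "uptime"):
    print(" ".join(argv))
```

Calling run_on_nodes() with the same arguments would actually execute the command over ssh; a nonzero exit code in the returned dictionary points to a node that needs attention.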
All the additional packages for scheduling jobs (SGE, PBS, and others) and monitoring the cluster (Ganglia, cluster top, and the like) may now be configured and run as desired. And don’t forget to run HPL to see how your cluster ranks on this (one specific) benchmark.
These instructions are meant to get you started in using Rocks on your own cluster, but they are in no way comprehensive. Rocks includes a wide variety of cluster packages, each of which can be individually configured to meet your needs. The documentation on the Rocks web site can help guide you to the right set of rolls and options to meet your specific needs. In addition, several Rocks mailing lists are available for keeping up with development and releases or for discussing experiences or problems with Rocks.
Rocks is a (rock) solid and highly customizable cluster distribution based on the well-established Red Hat Linux distribution. Since it includes all the free job schedulers and monitoring tools in common use, it can save you time and help get that cluster up and running quickly. Try it out for yourself!
Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at