A Nodal Philosophy

What’s on your nodes? Do you really need all that stuff? Uh, it depends.
The question “What makes a cluster a cluster?” seems to be one of those one hand clapping questions — there’s not a right answer, yet the endless debates can easily turn philosophers into pugilists. But since I’m feeling scrappy this month, I’ll take the question on and challenge all comers.

Let’s Get Ready to Rumble

To begin, let’s go back to a definition introduced in the original How to Build a Beowulf book (Thomas Sterling, John Salmon, Donald J. Becker and Daniel F. Savarese, MIT Press, ISBN 0-262-69218-X):
A Beowulf is a collection of personal computers interconnected by widely-available networking technology running one of several open source, Unix- like operating systems.
I should point out a few things about this definition. First, the definition has plenty of room for interpretation and thus arguments. Everything from today’s Knoppix clusters (http://clusterknoppix.sw.be/) to fail-over web servers can be included in the definition.
Notice, too, the word Linux isn’t even mentioned! However, the key phrases are “open source,” which is now taken to mean Linux, and “personal computers,” which translates to commodity hardware. Open source and personal computers, along with the Internet, allowed the cluster to become the preferred way to solve high-performance computing problems.
Notice also that there’s no mention of an application using any or all of the PCs at once. This omission invites both capacity and capability clusters to the party. More on this distinction will follow.
Another subtle point in the definition is that the PCs only need to be running the operating system. There is no mention of a full distribution on the nodes.
Thus, the cluster definition is rather elusive. As I consider my mission to both inspire and incite, let’s push the envelope to see what is possible with today’s technology and hopefully still fit in the old definition.

The Hard Disk Issue

One of the nagging issues with clusters is the abundance of disk space. A typical hard drive today is 100 gigabytes (or more). This size is about ten times more space than most people tend to use on cluster nodes.
Diskless cluster owners will argue that the hard disk isn’t needed — spinning platters just add to an ever increasing power budget and add yet more components that can break. Although worthy arguments, it seems that manufactures and customers find it hard to omit the disk drive. I suppose it’s like asking a car dealer to not install the radio. The dealer could do it, of course, but when you are done with the car, maybe the next person would like a radio, so you leave it in, but don’t turn it on.
Traditionally, cluster nodes are setup to have a swap partition, a Linux distribution, and scratch space. The /home directory is usually mounted via NFS. Several gigabytes of swap space makes for a nice safety net, but if everything is running correctly, a node shouldn’t hit the swap partition with any frequency. The OS distribution can take up to 2-3 GB and gives a node a bit of independence (it can boot and run in or out of a cluster). Finally, the scratch space can be used for local storage or even things like the Parallel Virtual File System (PVFS). Adding up the needed space, a small 10 GB drive should do nicely. Unfortunately, such a small disk is almost non-existent and higher density drives end up in nodes.
The big problem with hard disks on clusters is keeping every hard disk in sync across the cluster. Synchronization isn’t an easy task. In response to this and other issues, “diskless” methods have been developed that don’t rely on the nodes booting and running from a local hard drive. The most notable of these is the Scyld distribution (http://www.scyld.com/). Other projects, like the Warewulf Toolkit (http://www.warewulf-cluster.org/), have also demonstrated that a local hard drive is not necessary for many clusters.
As in any good debate, sometimes issues get confused or perhaps forgotten. One of the more subtle points is that “diskless booting nodes” do not preclude having a hard disk on the node. Indeed, getting your OS essentials (kernel and libraries) from one central location has many benefits, including simplified administration. Somehow the “We can boot the nodes without a hard disk” got translated to “We don’t want need no stinkin’ hard disks.”
In any case, it’s more about the OS than the hard drive. You can have diskless booting and local swap and scratch space if you want. You can even have an OS distribution on the hard drive read for booting. The two methods are not mutually exclusive.
The real question that you should be asking is, “What do I need on my nodes?” Having a large disk drive doesn’t mean the answer is “anything I can think of including the Korean Ham Radio HowTo. ” At first blush, this whole issue falls into the “So What?” category, but as we shall see, there may be more to it than meets the eye.

Capacity Versus Capability

Let’s pause for a moment and make a distinction about how clusters are used.
Typically, a “capacity” cluster is one where single process user jobs are run on the cluster like it were a big SMP machine. There are some restrictions, but the user’s program may transparently run on another node or the user may end up transparently logging into the least loaded worker node. The decisions are usually up to a migration or scheduler program (packages like OpenMosix (http://http://openmosix.sourceforge.net/) are particularly good at dynamically balancing the work load across a cluster). Additionally, applications like Sun Grid Engine (http://gridengine.sunsource.net/) can be set up to statically balance single interactive jobs as well. With a capacity cluster you want to ensure that the software environment on the worker nodes is the same all over the cluster. The environment could be thought of as a workstation farm. Users don’t care which one they use as long as all the right “stuff” is there.
The other type of cluster is the “capability” cluster, where a parallel program is are spread across cluster nodes. Users never use these nodes directly as interactive login nodes, although nodes with a full OS distribution are often used in this way. Beyond libraries and the OS, these type of cluster nodes usually have very little need for any extra stuff. A binary is started on a node, libraries are dynamically linked, and the crunching begins. The OS is contacted if there is I/O, unless the nodes are using kernel bypass communication common with many high speed interconnects. If programs are statically linked, there’s little need for things other than the OS on the nodes.
Because clusters are used in different ways, these two approaches to the nodes have developed. But there is a third approach: “less is better.” In this approach, only the bare necessities are made available to the nodes. Read that again. The operative term is available — dependencies may or may not originate on the node disk drive. More importantly, only services that are absolutely required are run on the node. Things like HTTP, FTP, and rsync don’t run on the node.
System administrators understand this approach very well. “If you don’t need it, get rid of it, because it can only cause problems,” is the battle cry of many an overworked administrator. A capability cluster can work well with this type of configuration.

The More Is Better Approach

With all the potential goodness that comes from the minimalist approach, why are nodes still carrying the weight of a full OS? There can be several reasons. First, as mentioned, the cluster owner may want the cluster to work in a capacity mode (workstation farm) as well as capability mode. Second, “diskful” nodes make cluster setup easier: just snapshot or do a Kickstart install, run yum once a week, and hope every thing stays in sync. Third, and perhaps most interesting is the assumption, “Why not, what could it hurt?”
Sure, a full OS won’t stop anything from working, but in the quest for the “high” in high-performance, the assumption above is worth checking out.

What’s Running on Your Nodes?

In September of 2005, there was an interesting thread on the Beowulf mailing list. The discussion started when Greg Kurtzer remarked that, for some reason, a new cluster running his Warewulf cluster distribution was getting 30 percent better performance than a ROCKS (http://www.rocksclusters.org/) installation on the same cluster (a Dell cluster with Infiniband).
These results were curious, as the kernel and much of the underlying middleware was the same. Looking at the differences, the Warewulf Toolkit is based on a minimal approach to the nodes, where the ROCKS Distribution is based more on a maximal approach.
There was some discussion about these results. More data was presented and assumptions were refined. First, the reported speedup was not for every application. The “Top 500” benchmark (HPL) ran as expected, but one code, an in-house geophysics application ran much better under Warewulf. The in-house code was more fine-grained and required more barriers (synchronization points) than HPL.
The story may have seemed to end there, but a link to a research paper was posted by Mark Hahn that seemed to support a possible theory as to why this would happen. In the paper, entitled “The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q” (http://www.sc-conference.org/sc2003/paperpdfs/pap301.pdf), authors Fabrizio Petrini, Darren J. Kerbyson, and Scott Pakin are puzzled by the the performance of the ASCI Q cluster when running a program called SAGE (a hydrodynamics code consisting of about 150,000 lines of Fortran and MPI code). A performance model for the program predicted a cycle time of. 6 seconds on ASCI Q, but the authors realized something far worse, 1.2 seconds, for large numbers of processors.
In summary, the authors found a significant difference between the expected performance and that which they observed; the performance difference only occurred when using four processors per node; there was a high variably in the performance from cycle to cycle; and the performance problems seemed to originate from the collective operations, especially in allreduce.
Surprisingly, making allreduce seven times faster had little effect on performance. Their investigation focused on the “noise” on each node. Noise was defined as periodic system activity not needed by the computation at hand. Based on the results of the analysis, they removed 10 daemons from each node, decreased the frequency of monitoring, and centralized some other cluster daemons. The changes doubled SAGE performance, close to their performance model predictions.
The overall finding bears repeating here:
Substantial performance loss occurs when an application resonates with system noise. High frequency, fine-grained noise effects only fine-grained applications; low frequency, course-grained noise affects only course-grained applications.
Without any more detailed evidence, one could speculate that the same effect explains the Warewulf/ROCKS difference and why a minimalist approach may be best if performance is important for fine-grained codes. Indeed, looking at what is actually running on your nodes may be revealing and offer a simple optimization.

Edge Of the Envelope

In all fairness, there’s more work to be done to understand the “collective effects” that may develop on clusters. The move to “diskless booting” has allowed nodes to be easily tuned and still take advantage of a local disk drive.
Perhaps the real goal here is not so much the right way or the wrong way to build a cluster, but a flexible way that can address the many different needs facing users of both capability and capacity clusters, all the while remaining true to the spirit of the original Beowulf definition. I’ll talk more about this next month.
Of course, in hindsight, it all seems so simple. There are good reasons for both minimum and maximum provisioning of nodes. I can almost here the sound of one hand slapping my forehead as the obviousness of it all dances in front of me.

Douglas Eadline can be found swinging through the trees at ClusterMonkey.net.

Comments are closed.