The Art of Linux HPC Clustering

In case you haven’t noticed, the high-performance computing (HPC) market is now ruled by the Linux cluster. And while Linux clusters have made serious number crunching affordable, this disruptive change still has perils. Unlike more traditional HPC methods, a cluster presents a myriad of variables and trade-offs to the cluster designer and end-user. However, whenever there are choices that aren’t completely right or wrong, there is an opportunity for the artist and engineer to shine in all of us.

July 2006

Extreme Linux

Building machines around problems requires both halves of your brain

Douglas Eadline

Linux high performance computing clusters have generated quite a bit of interest over the last decade. While the notion of tying many computers together to solve a single problem isn’t really all that new (as anyone who used a DEC VAX Cluster will quickly remind you), the last ten years have seen a remarkable evolution in clustering, as processor, networking, memory, and many other technologies advanced. Indeed, with each revision of the fabled “Top 500” list, the cluster paradigm was seemingly recreated anew.
What lies at the root of such upheaval and what can you expect in the future? Both questions and others are becoming more important, not only to the cluster craft’s high-end practitioners, but to the computer community as a whole. A new engineering art is at hand, and the paints, brushes, and canvases are now commodity hardware, open source, and the Internet. And, as with any art form, there are masterpieces and there are, well, velvet paintings. Welcome, students, to Cluster Art Appreciation 101.

The relatively recent birth of cluster computing is well-known to most. In the beginning, Thomas Sterling and Don Becker built a “commodity, off the shelf” (COTS) cluster using 486 processors, Ethernet, and Linux. At the time, it was something of a fringe effort largely dismissed by the mainstream, although COTS systems were gaining popularity. For example, while at Purdue University, Hank Dietz wrote the Linux Parallel Processing HOWTO, in addition to doing some pioneering work with commodity hardware. (See Dietz’s Is Parallel Computing Dead? at dynamo.ecn.purdue.edu/~hankd/Opinions/pardead.html.)
Before COTS, high performance computer systems (nee supercomputers) were generally assembled and delivered as a single, monolithic system. Hardware and software were carefully and intricately knit together by artisans employed by the supercomputer manufacturers. Support was autocratic. And while supercomputers were expensive, the systems worked well and offered a level of performance that was unmatched by smaller business computers. Even when the PC appeared on desktops everywhere, the performance gap between supercomputer and simple computer was immense. Supercomputers were king.
But in the race between the supercomputer and the commodity semiconductor, the latter soon edged ahead, and then expanded its lead rapidly. In the world of semiconductors, the laws of physics and economics are at least fair. The designers of both supercomputers and commodity CPUs had to obey the same rules. As the cost to fabricate a CPU increased, so did the need to sell it in large quantities to recover the investment in research and development. In such battles, commodity always wins. Commodity processors got faster, and a hungry market justified further investments in the technology. Meanwhile, the supercomputer trudged along and things were not so upbeat. Companies were bought and sold and those pesky commodity processors were getting pretty darned fast.
Yet fast processors alone do not make a supercomputer. Memory subsystems and processor communication are just as important. There, too, the commodity market delivered. Ethernet snaked its way across the land, and reasonably fast memory subsystems followed. Commodity hardware building blocks were available and inexpensive, thanks to the price pressures brought about by market forces.
In addition to hardware improvements, parallel computing was considered a promising avenue to extend the reach of any underlying hardware. The road to modern parallel computing is littered with casualties (see Resources Sidebar), but the writing was on the wall. Parallel computing holds promise because it leverages the strengths and savings of commodity hardware. So, in spite of the difficulties, it made its way to the forefront of supercomputing. Systems still had six- to seven-digit price tags, but performance was being delivered.
The planets were almost aligned. Fast commodity hardware was widely available, parallel architectures were shown to be useful for many problem domains, and switched Ethernet was emerging in the network world. Taking note of these emerging hardware trends, Sterling and Becker set out to build a COTS computer from commodity hardware.
But what software would power the machines? Unix delivered the capabilities sought by end-users, but BSD was embroiled in a legal dispute that made it unavailable. The other option? A nascent Linux, which still needed an Ethernet driver. Now a part of computing history, the project was called “Beowulf” (see Sterling’s own account of the project at http://www.linux-mag.com/2003-06/breakthroughs_01.html) and Becker’s Ethernet driver was a significant contribution to the advancement of Linux.
HPC has never really been quite the same.

After the Storm

While Sterling and Becker’s work would have surely been noticed anyway, it was also discussed and made available openly on the Internet. With a diverse and intensely interested constituency collaborating out in the open, the growth of COTS systems accelerated, with many notable price-to-performance records falling by the wayside. Many of the most interesting discussions took place on the Beowulf Mailing List (http://www.beowulf.org/mailman/listinfo/beowulf/). To this day it remains an invaluable source for new and experienced users.
In all, three forces seemed to be marshaled by the original Beowulf project: fast commodity hardware, open source, and open collaboration. A perfect and disruptive storm if there ever was one. The storm laid waste to many a price/performance curve. In the early days of clustering, an improvement by a factor of ten or more was not uncommon. System cost went from the high six digits to the low five digits. It was a jaw-dropping time for HPC. Problems that ran as fast as, or faster than, they did on more expensive hardware fueled a competitive environment.
However, the once highly-integrated system was replaced with a multi-vendor, hands-on kind of machine. Instead of dealing with one vendor, end-users had to manage many aspects that were previously handled by the artisans at the supercomputer company. In addition, a whole new raft of choices was now available. Questions such as “What compiler?”, “Which MPI version?”, and “What motherboard?” started to become recurring conversations on mailing lists. (These conversations continue even now.)
There were some real cost savings due to commodity hardware and freely-available software, but there was also quite a bit of cost shifting. Administrators and end-users became responsible for more than they really wanted to admit, and the art of clustering began to emerge. So did a joke of sorts. Often echoed on the Beowulf Mailing List, the right answer to any clustering question is, “It depends [on the application].” Ironically, that is both a true answer and a worthless one at the same time. The devil lives in the cluster details.
Interestingly, actual answers come in various forms. One might think that performance would determine the answer, but the cluster equation almost always has a price dimension. After all, the time/hassle factor shows up indirectly in the price.
For instance, choosing an MPI library is often based on ease of use and past experience. There are easily over fifty MPI/compiler combinations available to cluster users. In many cases, testing all fifty isn’t worth the time (cost) if the program runs “adequately” and doesn’t crash. Here, less hassle wins out over performance. However, there are cases where spending time and hassle can pay off. For instance, choosing another variant can lead to significant improvements in large, long-running programs. Enter the cluster artist.

Problems First, Machines Second

In a way, the cluster revolution has reversed the way we look at computing. In monolithic days, a user had to fit the problem to the machine, crafting the application to take advantage of the specialized hardware. With clusters, the tables have turned. Instead of trying to fit the problem to the machine, the machine is often designed around the problem. Of course, the problem must still be cast in the parallel mold, but the cluster artist can pick and choose from a variety of media.
For example, one of the most important decisions is the nodes-to-network trade-off. Given a fixed budget, the number of nodes (servers) depends on the type of interconnect to be used. If the interconnect is something standard (and embedded in the motherboard), such as Gigabit Ethernet, then the cost of the network is small compared to the node. If, on the other hand, a high-speed network is needed, the network cost per node reduces the number of nodes that can be purchased. Similar analyses can be made for memory, processors, and any other variable that is part of the cluster equation.
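The nodes-to-network arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name and every price below are hypothetical, picked so the numbers come out round, and a real quote would also include racks, power, and support.

```python
def nodes_for_budget(budget, node_cost, interconnect_cost_per_node):
    """Return how many whole nodes fit in a fixed budget.

    All prices are hypothetical inputs. interconnect_cost_per_node should
    cover the adapter, the cable, and that node's share of the switch.
    """
    per_node_total = node_cost + interconnect_cost_per_node
    if per_node_total <= 0:
        raise ValueError("per-node cost must be positive")
    return int(budget // per_node_total)


# Hypothetical figures, chosen only for illustration.
BUDGET = 80_000
NODE_COST = 2_400

gige_nodes = nodes_for_budget(BUDGET, NODE_COST, 100)  # on-board Gigabit Ethernet
fast_nodes = nodes_for_budget(BUDGET, NODE_COST, 900)  # faster, low-latency network

print(gige_nodes, fast_nodes)  # prints "32 24"
```

With these made-up prices, the faster network costs eight nodes. Whether the remaining twenty-four run your application faster than thirty-two would is exactly the question the budget cannot answer by itself.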
File systems are another area where the application tends to dictate the design of the machine. There is no sense in buying a large number of nodes if the single Network File System (NFS) server is a bottleneck. Your effective performance will be quite low, even though each node can quickly chew through the data once it gets it.
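A back-of-the-envelope model shows how quickly a single file server can become the limit. The sketch below assumes every node reads at the same time and the server divides its bandwidth evenly; it ignores caching and protocol overhead, and the function name and bandwidth figures are hypothetical.

```python
def per_node_read_rate(server_mb_per_s, node_link_mb_per_s, nodes):
    """Rough MB/s each node sees when all nodes read from one server at once.

    Assumes the server's bandwidth is shared evenly and nothing is cached.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    fair_share = server_mb_per_s / nodes
    # A node can never read faster than its own network link allows.
    return min(node_link_mb_per_s, fair_share)


# Hypothetical numbers: a server that can deliver 100 MB/s in total,
# and nodes whose links could each carry 100 MB/s.
for n in (1, 4, 16, 64):
    print(n, per_node_read_rate(100.0, 100.0, n))
```

In this simple model, each node's share falls in direct proportion to the node count, so adding nodes to an I/O-bound job buys little until the storage side is improved.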
Many machines are also built for a range or class of problem and therefore must be general in design and performance. The design process is still the same, however. Unless you have an unlimited budget, deciding how much or how little of something is going to be important for the success of your cluster.
The “buy only what you need” approach to clustering is one of the reasons it is so economically attractive. The traditional supercomputer has many engineered features that provide great performance for some programs. If your application is not one of the lucky ones, though, you’re paying for technology you don’t really need. For example, in some problem domains, global shared memory is highly desirable. In others, global shared memory isn’t needed, as a parallel, message passing approach works just as well.
Similar arguments can be made for the software used by HPC clusters. The use of a freely-available, open source infrastructure has allowed interoperability and lawyer-free interaction among applications and end-users. In the case of Linux, or more properly, a Linux distribution, cluster users only need to deploy software that is important to their particular problem.

Setting Cluster Expectations

When a new technology becomes available, it’s important to set reasonable expectations. Of course, given enough time, most things can be “made” to work as you expect, but time is rarely a luxury you have.
A cluster can be used in a variety of ways. First, it can be used as a compute or capacity server where single process batch jobs are submitted for execution. Many bioinformatics clusters operate in this way. Capacity clusters can also be used as interactive machines. Process migration packages, such as OpenMosix, can make a cluster look like a single, large SMP machine to end users. (There are some limitations to this approach, but for many applications it works quite well.) Another method is to use something like Condor to manage and load balance the cluster.
The important thing to remember about capacity clustering is that none of the codes need to be modified to use the cluster. There may be some recompilation, but in general, migrating single process jobs from a workstation to a cluster is straightforward. On the other hand, if your application is multithreaded and previously ran on an n-way SMP machine, you may have difficulties or performance issues running on a single cluster node, as threads do not travel well across the cluster.
A cluster can also be used to run parallel jobs. This mode, often referred to as capability computing, is where many of the price-to-performance breakthroughs have been realized. Parallel tasks require more than one CPU, and thus are spread across the cluster at run-time. The most important thing to remember about parallel cluster computing is the following: Creating or moving existing sequential codes to a parallel environment requires reprogramming.
There is no silver bullet, which is why understanding your application is important. Knowing what is feasible and practical is essential before you invest the time to create a parallel code. There are algorithmic or performance issues that may be complete show stoppers. (Because writing or converting a parallel program isn’t easy, the time spent understanding how you can run your code in parallel is very important.) Moreover, a parallel code doesn’t automatically imply faster execution. Many a programmer has gone back to the drawing board after running his newly converted code on a parallel computer.
Fortunately, there are many popular and useful codes that do run well in parallel. Again, this is no guarantee that a specific application will run well on every cluster. Some codes are designed to run on Gigabit Ethernet, while others are designed to run over faster, low-latency networks. Looking at previous benchmark data is quite valuable in this case.
Finally, with some configurations, clusters can be designed to support both capacity and capability workloads. Segmentation and scheduling policies are important in this scenario.
Some other issues to keep in mind are:
*Administrators face a learning curve, but if set up properly, subsequent administration of the cluster is not that difficult.
*Plan for a “shake-in” period. Large clusters purchased from vendors are literally built on-site. Allow a month for “burn-in” and testing after the vendor leaves (they will be back). If you are building your own or have purchased a bare rack of servers, then plan for some burn-in time as well. As a general rule, the larger the cluster, the longer the burn-in.
*Expect heat. Unless specifically designed for a cubicle or small office, a cluster needs to be placed in a machine room with adequate cooling. An empty corner is not necessarily a good place for dissipating 5,000 watts of heat.
*Things break, so be prepared. If you want 128 nodes running 24/7, buy a few extra nodes and have them running so that you never dip below 128 nodes. If one breaks, have it repaired right away as part of a service contract. Also, you want to ensure that losing a node or two doesn’t bring down the cluster. If uptime is critical, single points of failure can be made redundant for additional cost.
*Unless you have access to a support organization or consultant, don’t expect answers to problems to be immediate. The community does provide answers, but there is no guarantee. If you go the community support route, allow system administrators time to find answers.
Finally, be prepared for performance. Designed and built correctly, clusters deliver astounding performance and good reliability. Your investment in Linux and open source can pay off nicely because now you control your cluster destiny. More nodes are sure to follow if you can demonstrate how well the concept works.
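To put the “expect heat” item in terms a facilities person will recognize, the sketch below converts electrical load into cooling load. It assumes that essentially all of the power a cluster draws ends up as heat in the room, and it uses the standard conversions of roughly 3.412 BTU per hour per watt and 12,000 BTU per hour per ton of refrigeration. The function name is hypothetical.

```python
WATTS_TO_BTU_PER_HOUR = 3.412    # 1 W is about 3.412 BTU/h
BTU_PER_HOUR_PER_TON = 12_000    # 1 ton of refrigeration = 12,000 BTU/h


def cooling_load(watts):
    """Return (BTU per hour, tons of cooling) for a given electrical load.

    Assumes all electrical power is eventually dissipated as heat.
    """
    btu_per_hour = watts * WATTS_TO_BTU_PER_HOUR
    tons = btu_per_hour / BTU_PER_HOUR_PER_TON
    return btu_per_hour, tons


btu, tons = cooling_load(5_000)
print(f"{btu:.0f} BTU/h, {tons:.2f} tons of cooling")
```

For the 5,000-watt example, this works out to about 17,000 BTU per hour, or a little under one and a half tons of cooling, before any allowance for people, lighting, or other equipment in the room.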

Some Fundamentals

Like any art form, there are certain clustering fundamentals that guide decisions. Weighing the fundamentals is the first thing to do, and well before you put brush to canvas. Is absolute performance important or is reliability more crucial?
The following are some important criteria to consider before you commit to any hardware or software:
*Performance. Performance is why you want a cluster in the first place. First and foremost, performance should be determined by what you want to do with the cluster and not how it compares to the “Top 500.” Unless you’re running codes similar to the HPL benchmark, don’t bother running the program. Indeed, there are faster parallel algorithms to consider instead. Additionally, performance is best measured in terms of some other metric. Price and power are often used in conjunction with performance, but sustained performance or throughput is also a good measure. Remember, define performance in terms of your application(s) and not an artificial or low-level measure of cluster performance.
*Scalability. A scalable program is one that you can speed up by adding more processors. Ultimately limited by Amdahl’s law, all applications have an optimum number of processors for a given interconnect and processor (or more properly a point of diminishing return). Interestingly, many applications find it hard to scale to over sixty-four nodes.
*Reliability. By its very nature, a cluster is rather robust. If high reliability is important, then engineering redundancy from the beginning is important. In general, extra reliability may mean a reduction in performance, as redundancy often introduces overhead. The trade-off is a function of how important or timely the result is to your organization. Typical single points of failure are switches, file servers, and queuing systems. Quality components can make a big difference in many of these areas and many clusters run reliably using non-redundant systems.
There is also another school of thought concerning reliability. When administrators think of reliability they often think of monitoring as well. In pedestrian terms, redundancy and monitoring actually add another layer of things that can break, and may actually reduce reliability. To this end, many cluster nodes are designed without hard disk drives because moving parts break more often than non-moving parts. Here’s another opportunity for art. Engineered correctly and with the right software, diskless nodes can offer all of the advantages of diskful nodes, but eliminate a large number of failure modes. Of course, the other side of the argument is that nodes with two hard drives in a redundant, RAID 1 configuration also guard against problems.
*Balanced Computing. Ultimately, parallel computing is about balance. There is a cost to send and retrieve data from another node. The cost to transmit the data should be less than the return in performance increase. That is, if you execute in parallel, it better be faster than doing it sequentially. Data transfer takes time and contributes to the overall runtime of your parallel program. Balancing communication time with compute time is where the parallel pay-off lives. If you really cannot afford a fast interconnect, then buying the fastest processors available may not be the best use of your funds. If your applications scale well, using a larger number of slower processors with a slower interconnect may actually yield better performance.
*Power and Cooling. Modern CPUs run hot. One way to visualize the heat is to replace (in your mind) each processor with a 100-Watt light bulb. Removing heat from a cluster is important for proper operation and reliability. Prolonged overheating may not cause any problems initially, but studies have shown that it can increase failure rates. Performance per watt is becoming an important measure of clustered systems.
*Local integration. Clusters are almost never a standalone affair, yet integrating a cluster into a local IT environment is often an afterthought. The first thing to consider is file sharing. Desktops need to work with NFS, OpenAFS, and Samba shares. User interaction is often best handled with a web browser for standard repetitive tasks. (And yes, part of your population of people will want to telnet into the cluster.)
Weaving these factors together is where the art comes into play. Individually, each factor can be easily addressed, but collectively, the factors tend to be very nonlinear. For instance, scalability affects price-to-performance, but better scalability may require reduced node counts, as more expensive interconnects are needed, which in turn may reduce the throughput for some programs.
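The scalability and balance points can be made concrete with a small model. Amdahl’s law says that if a fraction p of a program can run in parallel, the best possible speedup on n processors is 1 / ((1 - p) + p / n). The sketch below adds a deliberately crude communication penalty that grows linearly with the node count. That penalty term, the function names, and the sample values are assumptions for illustration only; real applications and interconnects behave in more complicated ways.

```python
def amdahl_speedup(parallel_fraction, n):
    """Ideal Amdahl's-law speedup on n processors."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)


def speedup_with_comm(parallel_fraction, n, comm_cost_per_node):
    """Amdahl speedup with a hypothetical linear communication penalty.

    comm_cost_per_node is the extra run time, as a fraction of the
    single-processor run time, added for each additional node.
    """
    serial = 1.0 - parallel_fraction
    comm = comm_cost_per_node * (n - 1)
    return 1.0 / (serial + parallel_fraction / n + comm)


def best_node_count(parallel_fraction, comm_cost_per_node, max_nodes=1024):
    """Node count in 1..max_nodes with the highest modeled speedup."""
    return max(
        range(1, max_nodes + 1),
        key=lambda n: speedup_with_comm(parallel_fraction, n, comm_cost_per_node),
    )


# A code that is 95 percent parallel can never exceed a 20x speedup.
print(round(amdahl_speedup(0.95, 64), 2))

# Hypothetical per-node communication costs: a faster and a slower interconnect.
print(best_node_count(0.95, 0.001), best_node_count(0.95, 0.01))
```

Even in this toy model, the slower interconnect reaches its point of diminishing return at a much lower node count, which is the interplay between scalability, interconnect, and price described above.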

The Latest is Not Always the Greatest

There is a rule of thumb in the automotive industry that says, “Never buy a new model car the year it’s introduced.” Wait a year, the adage goes, so problems can be identified and fixed (hopefully).
Similarly, clusters call for conservative and well-tested decisions. Waiting a year may be a bit extreme, but letting the new hardware “breathe” for several months is a wise choice.
Every choice you make for a cluster node will be multiplied by N, the number of nodes. Generally speaking, the latest and greatest computer hardware, in addition to carrying a premium price point, often needs at least three to four months of channel exposure before it’s ready for prime time use. Let some other brave soul find problems and issues before you. Loading up a cluster with brand-new hardware is a risky play, even if you think it will get you bragging rights in your corner of the HPC world.
In a similar vein, low-cost parts can also be risky. Clusters normally push hardware harder than most other applications. An application may run for hours, days, and even weeks before it finishes. The cluster community is rife with stories of cheap memory failing randomly after the cluster has been running fine for several days. Similarly, compiling the latest kernel, driver, or library may not mean better performance or reliability. There have been plenty of cases where the exact opposite has been true.

A Paradigm Shift

Clusters have been officially called a “disruptive technology,” which is “marketoid speak” for “Boy, we never saw this coming.” In the fall of 2005, IDC reported that clusters had grown 49 percent in the previous two years, while traditional supercomputers were down 29 percent in the same period. In 2004, clusters accounted for one third of the market, yet in 2005 they accounted for over one-half of the market.
When changes of this magnitude come about there is usually something fundamental at work. Here, that “something” is the “cluster paradigm,” where commodity hardware and open source software are combined to solve problems or meet specific needs of end-users. As clusters have shown, it is a powerful combination.
As networks continue to grow, technologies like grid computing represent the next level of this paradigm shift. Open, virtual, and on-demand computing systems designed for specific problem domains are already on-line.

The Artist in You

In general, while many aspects of cluster design are important, few should be considered absolute show stoppers. Placing thirty-two servers in a rack and connecting them with Gigabit Ethernet is a pretty safe way to proceed. The cluster will work, so the question becomes how well does it work? Would twenty-four nodes and a faster interconnect, at the same cost, actually provide better performance for some codes?
There may not be a totally right or wrong answer, as “it all depends.” Navigating these questions while weighing costs, performance, power, heat, convenience, and support has become somewhat of an art form in the cluster community. Keeping the key issues in mind and studying previous works is a good way to get started with clusters. And like any great medium, there is a place for the good, the bad, and the ugly. The good news is you can decide. After all, your cluster is your masterpiece.

Douglas Eadline can be reached at deadline@clustermonkey.net. If you would like to learn more about “The Art of Linux HPC Clusters and Building Machines Around Problems,” you are invited to visit Doug’s book preview page at http://www.clustermonkey.net/content/view/128/53/.
