Just a short ten years ago, "big iron" ruled the world of high performance computing. But by combining then-nascent technologies -- the PC, Ethernet, and Linux -- Dr. Thomas Sterling and others created the Beowulf cluster, forever shifting the accepted norms and economies of high performance computing. Here, Dr. Sterling gives a personal account of the rise of the Linux commodity cluster.
A commodity cluster of almost 10,000 PC processors running Linux has been installed and is operational at Lawrence Livermore National Laboratory (LLNL), making it one of the world’s most powerful computing systems. While this new system’s performance is extraordinary, its components are rather ordinary, and commodity clusters are quite commonplace.
In fact, about half of the world’s highest performing computing systems (as measured by the Linpack benchmark and registered on the “Top 500″ list at http://www.top500.org) are commodity clusters, with more than 10% of these fastest machines running Linux.
In light of this emerging dominance of commodity clusters, including Beowulf-class PC/Linux clusters, it’s difficult to imagine a time when such systems were of limited scale, utility, and acceptance. Yet, only a decade ago, the world of supercomputing was very different, an age when “big iron” dominated and vectors ruled. But like the dinosaurs they were to metaphorically emulate in their subsequent demise, those costly, once-traditional, high performance computing (HPC) solutions created the environment for their successors, small and inconsequential as they were, to evolve.
The genesis of Beowulf-class clusters was ignited by the inadequacies of classical supercomputers and the opportunities of emerging hardware and software technologies. Indeed, Beowulf itself would catalyze a paradigm shift in high performance computing that would dominate the beginning of the 21st Century. But in its most incipient phase, nothing appeared more unlikely.
In the Beginning
Ten years ago, the fastest computer in the world was delivering about 60 gigaflops on Linpack. [LINPACK is a collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems.] Vector processors, massively parallel processors (MPPs), and some single instruction, multiple data (SIMD) computers made up the vast majority of top performing machines. But those machines were very expensive and not very accessible. Moreover, because the architectures of each family of machines were so different, porting code between machines was very difficult.
Commercially, a peak gigaflop of performance cost a million dollars or more. A Convex SPP-1000 or a MasPar-2, both capable of a gigaflop, could cost that much, while a Cray Research C-90 single-headed system with the same peak capability could cost two to four times more. Routinely, these systems were employed to simulate complex physical phenomena, from atmospheric conditions on Earth, through plasma convection on the Sun, to the evolution of entire galaxies. In engineering, simulation of materials, electronic circuits, chemical processes (including fire and explosions), and structural deformation under stress (read: automobile crashes) required extensive simulation.
The output data resulting from these large computations were themselves very large, and their manipulation and visualization also required significant computational resources. Scientific workstations had been developed to address this challenge to some degree, albeit at a cost of as much as $50,000 for a single user. But often, neither the computational ability nor the storage capacity of workstations was sufficient to handle the wealth of result data coming from the time-consuming and costly simulations performed on the big supercomputers.
Another approach was necessary. But first the enabling concept and technology was required. Enter Metcalf, the creator of Ethernet.
An Unlikely Source of Inspiration
A confluence of technologies and methodologies had been accruing that would offer the alternative, but the solutions came from an unexpected sector of the computing world: the highest end computing community would look to the lowest end for help.
The concept of clustering had been employed since almost the beginning of the era of the digital, electronic, stored program computer. The “Sage” cluster, developed for the US Air Force to support NORAD, was a set of loosely integrated derivatives of the Whirlwind vacuum tube computer developed at MIT in the late ’40s and early ’50s. IBM developed this system and later combined pairs of computers to provide independent computing and I/O capability (e.g., 7090/7040).
By the mid-1980s, Digital Equipment Corporation (DEC) had experimented with collections of VAX mini-computers, being the first to use the term “cluster” to represent a coordinated ensemble of standalone systems (previously developed for the broader market) employed to scale some critical metric like reliability, accessibility, memory capacity, throughput, or time-to-solution performance. The emergence of commercial local area networks (LANs) in the 1980s, principally Ethernet, permitted the sharing of common resources such as printers, file servers, and Internet ports. By the end of that decade, workstation farms integrated with Ethernet were ubiquitous.
For many of us, the first attempts to harness collections of desktop computers were focused on conducting parametric studies, running the same application on many workstations at the same time, but with different input data sets. This was particularly valuable for conducting ensemble calculations like Monte Carlo simulations or deriving a range of resultant behavior for some simulated physical or engineering system across a range of input values. This kind of so-called embarrassingly parallel workload, while limited in utility, nonetheless is an important niche. And by the beginning of the 1990s, I, and colleagues at the IDA, were using this approach to meet a number of computational demands, spreading the same code among dozens of workstations over Ethernet and running these problems with different data sets. The University of Wisconsin developed Condor, a widely used software system that enables and supports the use of workstation farms for cycle harvesting of this type.
My first exposure to using a cluster for a more tightly coupled computation was at the NASA Lewis Research Center (now Glenn) in 1992, where a steady state simulation of a jet propulsion engine was being performed on eight IBM workstations. While the computation was partitioned to minimize inter-workstation communication, it nonetheless represented real work being performed on a single application across multiple desktop nodes. I was intrigued.
Enter Oak Ridge National Laboratory and Argonne National Laboratory.
Clusters at the National Laboratories
Throughout the previous decade and more, the Communicating Sequential Processes model of computation had evolved, formulating a strategy of message passing among otherwise logically independent processes running on separate nodes of a distributed computer. In a number of forms and implementations, this model had gained adherents. The message-passing model, as it has become more popularly known, was embodied by workers at Oak Ridge National Laboratory and Emory University in the PVM system, and subsequently made available (for free) to the research community. It provided a set of tools for writing and executing distributed application programs.
Later, the computer science community would come together to specify a new standard, the Message Passing Interface or MPI, which would ultimately dominate parallel processing, thanks in significant degree to Bill Gropp, Rusty Lusk, and Argonne National Laboratory, who developed MPICH, perhaps the most widely distributed and ported implementation of MPI. But PVM was an important step, and it’s a system that’s still used today. It was also the first widely-available, free, message passing software package.
A logical means for coordinating actions on separate computing nodes had been provided, but what were those nodes? Enter Intel and IBM.
The personal computer, or PC, emerged at the end of the 1970s from several vendors as a low cost, entry-level platform for education, word processing, and spreadsheets (not to mention games). But it was not until IBM introduced the IBM-PC that it achieved sufficient credibility to become a dominant force in the commercial sector. IBM opened the architecture to third party hardware and software vendors, a strategy that also created the clone market, resulting in a de facto standard known as IBM compatible. Based on the Intel x86 microprocessor architecture family, these systems became increasingly sophisticated until, by the Intel 80386 and the more powerful and successful Intel 80486, they were capable of supporting both significant applications and operating systems.
Enter Linus Torvalds and Don Becker.
While the PC market was dominated by the MS-DOS operating system, which enabled the mass-market software revolution, the technical computing community adopted Unix, originally developed by Bell Labs. Inspired by MIT’s Multics, and significantly enhanced by the DARPA-sponsored University of California at Berkeley Unix (BSD) operating system, Unix became ubiquitous, hosted on many platforms, including the pervasive DEC PDP-11 and VAX-11, and eventually migrated in various forms to Sun Microsystems workstations, IBM mainframes and workstations, and even Cray supercomputers.
But legal issues related to BSD made it inaccessible. So when Linux was introduced, it provided the framework for collaboration enabled by the Internet. Contributors around the world worked to develop a new Unix-like operating system that was both rich in capability and offered as free, open source. To be sure, Linux wasn’t the first Unix-like operating system to run on a PC, and in the beginning, it wasn’t even the best. But unlike anything that had come before it, Linux was the focus and consequence of a revolutionary phenomenon made possible through the Internet: hundreds of coders from around the world, most of which had never met each other, working together, sharing expertise on a new operating system.
Linux, initiated by Linus Torvalds, harnessed the talents of many skilled in the art of operating system kernel development. Under the GPL, Linux evolved at an accelerated pace. Today, it’s become the second most widely used operating system in the world, embraced by researchers, industry, vendors, commerce, and government. But in 1994, it was almost unusable. Yet it provided the foundation and opportunity for supporting the operation of a cluster of PCs.
Surely, you don’t require yet another short history of the extraordinary Linux phenomenon. But I would like to mention one contributor, my friend and colleague Donald Becker, who appreciated the potential of this new environment in spite of its many early limitations, dare we say, cruftiness. Don began by developing Ethernet drivers for different vendors’ Ethernet NICs and contributing them to the maturing body of Linux distributions.
By mid-1993, all of the pieces were in place. Some early work on clusters of workstations had begun at the University of Wisconsin and UC Berkeley. But if necessity motivates invention, money fuels it. Enter Jim Fischer and NASA.
Blasting Off from NASA
I was part of the NASA Earth and Space Sciences (ESS) project at the Goddard Space Flight Center (GSFC). The project manager was Jim Fischer. The machine room at GSFC had a number of the classic supercomputers, with more than one system from Cray Research, Seymour Cray’s old company. Goddard is the home of most Earth orbit space missions, as well as a significant number of projects in climate modeling, data assimilation and storage, scientific visualization, and computational techniques.
Years earlier, Goddard had been instrumental in the development of the MPP produced by Goodyear, perhaps the first of the truly large-scale SIMD computers that ultimately led to the MasPar-1 and -2 computers. The same machine rooms that had the Cray machines also had the smaller MasPar and a new Convex SPP-1000, two distributed, shared-memory computers, both capable of about a gigaflop sustained performance, and both costing Goddard about a million dollars. Compared to a Cray C-90 with a single head (one vector processor), the MasPar and Convex computers exhibited very good performance to cost, at least for certain classes and scale applications.
All of these systems could perform substantial simulations, but once the result data had been acquired, there was the challenge of how to store, organize, analyze, test, and visualize it. Even with expensive (up to $50,000) scientific workstations, which were available with various accelerator cards, sometimes the data sets were simply too large or the amount of computing power required too great. The expanse of result data required that the original supercomputers analyze and visualize the data they produced instead of performing new simulations.
Something new was needed: a personal supercomputer. Something as inexpensive as a scientific workstation, but able to do the work of a gigaflops-scale supercomputer like a small Cray, MasPar, or Convex machine.
Jim Fischer needed it, knew he needed it, wanted it, and went one step further, further than anyone else: he made it a goal of his ESS project to find it. Even before he or his staff figured out what it would be, they determined what it had to do and they established the budget to make it happen. They had the money, but they didn’t have the seminal idea.
I was fortunate to be at the right place at the right time.
The unique aspect of the NASA requirement was that they needed high capability and capacity computing but at bounded cost, one that was ten times better than those of typical, vendor-provided solutions. I had considered this challenge for several months prior to proposing what was to become the Beowulf project. But the moment of conception came from an unlikely event: Don Becker alerted me to the possibility that he might not be able to continue his work with Linux at his current site of employment. Forced to think of Don as a resource rather than colleague and friend, I realized that his skill was the single missing link (almost literally) to a process that would start with a pile of PCs and end up with something — we did not have a name for it yet — that could achieve the low end of the supercomputing range of one gigaflop.
I did an analysis and determined that there were several factors required to fit the NASA requirements. Besides peak performance, there was local disk capacity, internal communication bandwidth, memory capacity, and of course, cost. The numbers worked.
At $50,000, everything but peak performance was consistent with the stated requirements. And while, in late 1993, I could not achieve one gigaflop using the existing mass market PC technology, I could reach one gigaops. Close enough. I wrote the proposal and submitted it to Jim Fischer, later briefing him in person after the first shock wore off. I proposed to build a pile of PCs and do real scientific computation with it. Jim got it. Almost no one else did. I got the money.
It’s possible that you’ve read this far for the single purpose of finding out where the name “Beowulf” came from. For a long time, we used the cover story about seeking a metaphor for the little guy defeating the big bully, as in “David and Goliath.” The cover story continued that “Goliath” had already been used by a famous historic computer, and the giant was the wrong side of the metaphor anyway. However, I didn’t think I would ever get famous building a computer named “David.”
Searching other cultures for equivalent folklore, I recalled the epic saga of Beowulf, the Scandinavian hero who saved the Hrothgar’s Danish kingdom by defeating the monster Grendel. (I didn’t need to search the archives of ancient civilizations to discover the saga of Beowulf. My mother did her graduate studies in old and Middle English literature and I was brought up on the stuff.) Good story, but completely untrue; so, I discarded the name as a possibility.
But in truth, I’d been struggling to come up with some cutesy acronym and failing miserably. With some small embarrassment, you can find examples of this in our early papers, which included such terms as “piles of PCs” and even “PoPC.” The first term was picked up by others at least briefly. Thankfully, the second never was.
Then one afternoon, Lisa, Jim Fischer’s accounts manager, called me and said, “I’ve got to file paperwork in 15 minutes and I need the name of your project fast!” or some words to that effect. I was desperate. I looked around my office for inspiration, which had eluded me the entire previous month, and my eyes happened on my old, hardbound copy of Beowulf, which was lying on top of a pile of boxes in the corner. Honestly, I haven’t a clue why it was there. As I said, I was desperate. With the phone still in my hand and Lisa waiting not all that patiently on the other end, I said, “What the hell, call it ‘Beowulf.’ No one will ever hear of it anyway.” End of story.
But the other truth is I didn’t actually name Beowulf, the computer. I only named Beowulf, the project. Someone out there in the land of the press coined the term “Beowulf-class system,” not me. I would love to know who it was. That’s the real irony: I get the credit for naming Beowulf clusters and actually I didn’t do it.
The Beowulf Project was formerly organized and funded in November of 1993, and a small team was assembled over the next few months, including Don, Daniel Savarese, Chance Reschke, Dan Ridge, John Dorband, and others. By mid-1994, we had built the first operational, Beowulf-class PC cluster, named “Wiglaf,” at the NASA Goddard Space Flight Center. The system was applied to Earth and space science applications under the sponsorship of the NASA High Performance Computing and Communications program. Comprising sixteen Intel 100 MHz 80486-based PCs, each with 32 Mbytes of memory and a gigabyte hard disk, and interconnected by means of two, parallel 10-Base-T Ethernet LANs, this first PC cluster delivered sustained performance on real world, numerically-intensive, scientific applications (e.g., PPM) in the range of 70 Mflops. While modest by today’s standards, on a per-node basis it was within a factor of two and sometimes comparable to such contemporary systems as the Intel Paragon, yet cost a factor of 10 to 40 times less than the MPPs of the same period (in spite of the much higher peak floating point capability of the i860 processor employed by the Paragon).
At this point in time, the biggest challenge facing Beowulf was inter-processor communication. Ethernet, although widely available, was not cheap, with routers being especially expensive. Furthermore, in spite of the relatively modest performance of the PC nodes, the 10 Mbps channel bandwidth offered by Ethernet was not always adequate. Channel bonding addressed these challenges to some degree by allowing more than one Ethernet port to operate simultaneously and transparently to the user application, providing sustained aggregate bandwidth significantly greater than a single channel could deliver. Initially, switches were too expensive, so hubs were used, which were single point blocking devices where network contention could become serious. Channel bonding addressed the contention issue as well. During this time, third party hardware vendors offered new Ethernet boards, and the Beowulf project invested significant resources to provide Linux drivers for them, releasing these as open source with each successive Linux distribution.
A second Beowulf cluster was implemented in 1995 using sixteen, new Intel Pentium processors (100 MHz), with 64 Mbytes per processor, but using a new Fast Ethernet, 100-Base-T LAN with a hub rather than previous multidrop. The combination of these improvements resulted in a significant sustained performance gain to approximately 240 Mflops. “Hrothgar” was the first of our machines exported to the community, with a clone machine going to Drexel University in Philadelphia. These results were very encouraging and a number of our colleagues at GSFC who were computational scientists were more than interested to port their various applications to Hrothgar, demonstrating consistent performance to cost benefits that often approached a factor of 50 with respect to the alternatives available. An Era ends. Or so I thought.
Enter Joe Bredekamp and John Salmon.
The End of an Era?
My research interests, while almost always related to high-end computing, have varied over the years, and in 1996, I had a unique opportunity to work on a petaflop-scale systems architecture at Caltech and the NASA Jet Propulsion Laboratory (JPL). Sadly, I’d have to leave my involvement in PC clusters behind — or so I assumed. The Beowulf experiment had been performed, the results were in, and the concept basically demonstrated. In any case, I had never intended to spend more than a couple of years on the Beowulf project. Only, things don’t always work out as planned.
As I was departing the east coast for southern California, Joe Bredekamp of NASA HQ gave me a going away present. (Joe is one of those rare individuals who can deal with the day-to-day challenges of managing government sponsored R&D while keeping the dream alive. Joe ran NASA’s Space Science Computation research programs, and supported not only Goddard, but JPL as well.) He gave me a new Beowulf. However, there was little enthusiasm on the part of my immediate JPL management to assemble, install, and run this new system at JPL. Things are very different now, but then, most people in charge did not appreciate the potential of this low-grade computing medium.
However, one very bright senior researcher at Caltech, John Salmon, had been thinking along these lines for some time with his collaborator, Mike Warren of Los Alamos National Laboratory. Together, they had developed the most advanced scientific code to perform large-scale N-body problems, such as you would use to simulate the evolution of the Andromeda Galaxy (M31). Five years earlier, they had their code running on the Intel Touchstone Delta at Caltech, then the world’s most powerful computer.
With Joe’s generous gift, I, John, Jan Lindheim, and others implemented “Hyglac,” yet another sixteen processor, Beowulf-class cluster. But the technology had progressed to a critical point in both processing and networking that we were able to build a 200 MHz Pentium Pro-based system with Fast Ethernet using a true non-blocking switch for under $50,000. By the end of 1996, Hyglac was fully operational in the machine room at the Caltech Center for Advanced Computing Research (CACR) and was running John and Mike’s N-body code. It sustained 1.26 gigaflops performance on this complicated application, a breakthrough in performance to cost. But still no one was paying very much attention. That was about to change dramatically.
The big annual meeting in the supercomputing community is the IEEE Supercomputing Conference (SC), held every November since 1988. This conference mixes industry, academic, and government lab contributions in a number of forms, including a large show of booths from just a few square feet for academic projects to enormous corporate displays.
At SC96, both Caltech and Los Alamos National Laboratory (LANL) had such a booth. By coincidence, both institutions decided to bring their Beowulf-class PC clusters — Hyglac from Caltech and “Loki” from LANL — both systems ran the N-body code, and both booths were separated only by a few feet of carpet.
Needless to say, there were many spare parts in case things broke and lots of extra cable. We realized that we had the means to fully interconnect Loki and Hyglac. Running sixteen Fast Ethernet cables under the narrow carpet separating the two research booths, the two machines became one. Without any additional tuning, this aggregate system delivered a sustained performance of 2.13 Gflops.
Unfortunately, in doing so, we had intruded on the prerogatives of the local unions. Caltech was required to pay $3000, which was how much the unions would have charged us to run those 16 wires under the carpet. We could have added another two nodes to our system for that amount of money!
But the systems attracted a writer for Science, the preeminent, weekly US science journal. Two weeks later, Science published a full page story, complete with a color picture, entitled, “Do It Yourself Supercomputing.” A new saga of Beowulf had begun. The world was about to change.
Epilogue or Foreword?
The epilogue of this story is longer than the story itself. But allow me to mention only one window in time, almost exactly a year later, at SC97.
Again, CACR had a booth and again Chip brought a Beowulf, but not Hyglac. “Naegling” comprised over a hundred PCs and towered over everything else on the floor nearby. I, John, Don, and a number of our colleagues gave a full day tutorial at the conference on “How to Build a Beowulf.” It was the most heavily subscribed of the 14 tutorials at the conference that year. It also led to the MIT Press book, “How to Build a Beowulf,” published in 1999, and co-authored by John, Don, Daniel, and myself.
But the most important event at that meeting was centered on Mike Warren, who led a large team, including those of us at CACR, in a submission to the Gordon Bell Prize committee. That year, on behalf of all of us, Mike accepted the Gordon Bell Prize for Performance-to-Cost. Beowulf had been accepted by the mainstream, high performance computing community.
Today, commodity clusters dominate high performance computing, and Beowulf and Linux are — and were — an important part of the cluster success story.
Dr. Thomas Sterling is a Principal Scientist at the NASA Jet Propulsion Laboratory and a Faculty Associate at the California Institute of Technology Center for Advanced Computing Research. In 1994, Dr. Sterling initiated the Beowulf Project at the NASA Goddard Space Flight Center and shared the Gordon Bell Prize for this work in 1997. He is a principal investigator on the DARPA-sponsored Cray Cascade Project to develop the world’s first advanced-concepts petaflops computer, and has been a long-time advocate of the use of Linux for high performance computing. You can reach Dr. Sterling at email@example.com.