The Future of HPC Clusters

What’s next (and current) in cluster computing? Multiple cores, blades, new networks, plus plenty of small, traditional clusters.

High-performance computing (HPC) using clusters has come a long, long way from its early days. Back then, a cluster was a network of disparate workstations, which often sat on people’s desks, harnessed together into a Parallel Virtual Machine (PVM) computation. Back then, nascent Beowulf clusters consisted of cheap tower PCs literally stacked up on shelves.

Early on, cluster computing was done by only a tiny handful of people: usually computer scientists who were building the future, or people doing real science (with computers) who needed the future a bit before the future was ready for them. To those pioneers, a “network” generally meant 10-Base-2 Ethernet, a daisy chain of RG-58 coaxial cable terminated with resistors at both ends, with a coaxial “T” connector inserted at each station and nifty (and expensive) little AUI doohickeys connecting the very expensive Unix workstations to the cable. Or worse, it meant thickwire Ethernet, which used a bloodsucking device known as a “vampire tap” to make the actual connection to the wire (preferably placed on one of the cable’s 2.5-meter marks to minimize reflections).

The Unix used in those pre-Beowulf days almost certainly belonged to a vendor. Very little in the arrangement was free, very little was open source, but it was still much cheaper than getting compute cycles in large quantities any other way.

Now, of course, using clusters to do HPC is absolutely ubiquitous. Cluster computers have actually beaten chess masters and competitions are held every year that pit cluster (and programs written for the cluster!) against cluster. Cluster computers have “always” been used to perform research in physics, chemistry, mathematics, biology, statistics, and so on, but over the last three or four years, the list of applications has expanded to include economics, sociology, music composition, mainstream gaming, political science, and medical research of all kinds. Even in the world of business, a similar expansion has occurred, with clusters assuming “traditional” HPC tasks (largely inherited from their counterparts in academia) in engineering, simulation, and design, and most recently to do advanced modeling, visualization and graphic arts, the ever-popular making of movies, and far more.

Six or seven years ago, commodity cluster computers — computers built out of commodity off the shelf (COTS) parts — would appear, however briefly, in the “Top 500” list of supercomputers. Now cluster computers of one sort or another are the Top 500. (One can argue a bit here and there about whether all of the Top 500 are strictly made with COTS components per se, but even the exceptions are usually architected using “generic” components that are, at heart, just PCs on a network, regardless of how fancy their packaging is.)

An interesting fact is that the majority of all HPC cluster computers in the world run Linux. Linux has absolutely dominated cluster computing since the mid-Nineties, when the Beowulf project in particular demonstrated how easy and cheap it was to purchase COTS computers and turn them into a supercomputer, at a tenth of the cost per floating point operation per second (FLOPS) of its “big iron” equivalent from supercomputer vendors and half the cost of network clusters made from proprietary Unix-based systems. Equally telling is the observation that the second-most popular cluster operating system is probably FreeBSD, another open source operating system. “Open source operating system” is essentially synonymous with cluster computing for good and fairly obvious reasons.

Clusters of every size and for nearly any purpose imaginable are clearly here to stay — they are an established part of mainstream computing in both the corporate world and academia, in both the developed world and in the developing world. In all cases, cluster computing using COTS components forms an inexpensive, accessible, and scalable route to the kind of computing power required to be competitive in academia and research or world markets.

What then of the future of cluster computing? In the coming years, how will it differ from the way it is today and how it’s been in the past?

Please enter my tent and take a seat. I hope the incense doesn’t bother you, rather heady isn’t it? A cup of tea? No? Please put on these funny glasses and stare at this monitor. See the round, three-dimensional crystal sphere with precisely rendered shadows and reflections? No? Uncross your eyes a bit. There. Now look deep beyond the virtual surface, and see…

Hardware of the Future and Present

The performance of cluster computers on “most” tasks is bounded by one of several relatively simple bottlenecks that appear in the nodes of nearly every cluster design. In order of importance, the bottlenecks are:

1. The CPU’s number of instructions per second, usually floating point instructions.

2. Memory bandwidth and latency, how efficiently memory accesses can be parallelized with CPU function, and how much memory can be addressed per task thread.

3. Network bandwidth and latency, how efficiently network utilization can be parallelized with CPU and memory function, and how things like network topology and details of the network architecture act as discrete rate-limiting parameters.

4. Reading and writing to a “hard store,” factors that include disk bandwidth, latency, and total capacity.

5. Packaging, such as towers, rackmounts, blades, and networked PDAs (just kidding).

Let’s take a brief look at each of these. (Since the breadth of architectural options available even in the present is truly dazzling, the discussion will necessarily be pretty shallow, lest we take over the entire magazine.)
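As a rough illustration of how the bottlenecks in the list above compete, here is a minimal sketch that estimates how long each resource would take to do its share of one step of work and reports the slowest. Everything in it is an assumption made for illustration: the class and function names are hypothetical, the rates and workloads are invented rather than measured, and the model ignores latency and any overlap between resources.

```python
from dataclasses import dataclass


@dataclass
class NodeRates:
    """Sustained per-node rates (illustrative assumptions, not measurements)."""
    flops: float    # floating point operations per second
    mem_bw: float   # bytes per second between CPU and memory
    net_bw: float   # bytes per second over the interconnect
    disk_bw: float  # bytes per second to the hard store


@dataclass
class TaskWork:
    """Work one node performs in one step of a hypothetical application."""
    flop_count: float
    mem_bytes: float
    net_bytes: float
    disk_bytes: float


def time_per_resource(node: NodeRates, work: TaskWork) -> dict:
    """Seconds each resource would need if it were the only one in use."""
    return {
        "cpu": work.flop_count / node.flops,
        "memory": work.mem_bytes / node.mem_bw,
        "network": work.net_bytes / node.net_bw,
        "disk": work.disk_bytes / node.disk_bw,
    }


def dominant_bottleneck(node: NodeRates, work: TaskWork) -> tuple:
    """Return (resource name, seconds) for the slowest resource."""
    times = time_per_resource(node, work)
    name = max(times, key=times.get)
    return name, times[name]


# Invented numbers: a node that streams a lot of data through memory.
node = NodeRates(flops=4e9, mem_bw=6e9, net_bw=1e8, disk_bw=5e7)
work = TaskWork(flop_count=2e9, mem_bytes=8e9, net_bytes=1e7, disk_bytes=1e6)

name, seconds = dominant_bottleneck(node, work)
print(f"slowest resource: {name} ({seconds:.2f} s per step)")
```

With these made-up numbers the memory system, not the CPU, sets the pace. Change the workload and a different resource takes over, which is the whole point of the list.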

The CPU

Let’s start with CPUs. Over the next five years, the crystal ball tells us that cluster processors will have the following features:

*64-bit. Yes, there are still 32-bit CPUs going into new clusters and will be for another year, but three years from now? Unlikely.

The AMD64 family are cheap, COTS, mainstream 64-bit CPUs. Intel’s EM64T is a newer alternative, albeit more expensive and consequently less mainstream, but competition and time will inevitably correct this. The 64-bit PowerPC G5 is also available, as are 64-bit variants of non-COTS CPUs (such as SPARC). 128-bit system buses are already in the works (including Itanium 2); however, the shadowed pixels of the cracked crystal ball show 128-bit processors (and the Itanium 2) not being relevant to HPC cluster computing during the foreseeable future, at least not until Intel realizes that for most HPC applications, a single processor all by itself shouldn’t cost more than two AMD64 systems.

*Multicore. This, if anything, is the big news of the immediate present and the foreseeable future. Processor makers have so embraced cluster computing that they are putting a “cluster” of processor cores on each processor: multiprocessing in a uniprocessor package.

AMD is shipping the first dual-core Opterons and dual-core Athlon 64 X2, with clocks unreduced from the single-core versions. Intel’s dual-core Pentium family is also just beginning to appear, but at a reduced clock rate compared to their single-core processors. An alliance consisting of IBM, Sony, and Toshiba will soon release the CELL processor system.

*Vector Support. I’m using this term to wrap up a lot of things — Streaming SIMD Extensions (SSE 1,2,3) support, 3DNow! instructions, embedded or attached digital signal processor (DSP) units, and VMX/Altivec. Important for some, not for others, yet if it’s important for you, you’ll want to pay especial attention to memory issues, below, such as a…

*Multilayered Cache. In particular, the addition of an L3 cache to mainstream processors, on top of the already standard L1 and L2 caches. This process has already begun, and is necessary to help narrow the ever widening gap between processor speed and memory speed.

Typically, L1 and L2 are on the actual CPU core and L3 sits between processor and memory, but this is very much open to change in future designs.

Single-core processors won’t go away, but a lot of cluster builders will find multi-cores to be an extremely useful building block. AMD promises 8-way, dual-core Opteron support, yielding up to 16 general purpose cores in a single chassis. Intel promises 4-way.

Memory

In addition to the currently standard DDR memory, vendors are just starting to introduce DDR-2 memory that should just barely enable multi-cores to function without a significant increase to the already wide gap between CPU clock scaling and memory speed scaling.

To put it another way, there are plenty of HPC tasks that are already bottlenecked by memory speed for single-CPU systems; others are bottlenecked by memory size available to a single CPU. The disparity will only continue to rise geometrically over the next few years.
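One way to put numbers on "bottlenecked by memory speed" is a roofline-style bound, a standard back-of-the-envelope technique rather than anything specific to this article: a kernel can sustain no more than the smaller of the CPU's peak rate and the memory bandwidth multiplied by the operations it performs per byte moved. The sketch below assumes a hypothetical core and memory system; `PEAK_FLOPS` and `MEM_BW` are invented values.

```python
def attainable_flops(peak_flops, mem_bw, intensity):
    """Upper bound on sustained FLOPS for a kernel.

    intensity is the number of floating point operations performed
    per byte moved between memory and the CPU.
    """
    return min(peak_flops, mem_bw * intensity)


PEAK_FLOPS = 4.0e9  # assumed peak floating point rate of one core
MEM_BW = 6.4e9      # assumed memory bandwidth in bytes per second

# The vector update y[i] = a * x[i] + y[i] on 8-byte doubles performs
# 2 operations per 24 bytes moved (load x, load y, store y).
axpy_intensity = 2 / 24

one_core = attainable_flops(PEAK_FLOPS, MEM_BW, axpy_intensity)

# Two cores sharing the same memory bus each see half the bandwidth.
per_core_of_two = attainable_flops(PEAK_FLOPS, MEM_BW / 2, axpy_intensity)

print(f"one core:          {one_core / 1e9:.2f} GFLOPS of {PEAK_FLOPS / 1e9:.1f} peak")
print(f"each of two cores: {per_core_of_two / 1e9:.2f} GFLOPS")
```

Under these assumptions the streaming kernel reaches only a fraction of peak on one core, and adding a second core on the same bus adds nothing in aggregate. A kernel that does many operations per byte would hit the CPU limit instead.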

CPU memory architecture is also a critical difference in the various multicore architectures. Each has its own solution to the CPU-memory bottleneck (and its associated problem, pipeline stalls in ever-longer pipelines). Intel’s solution is Hyperthreading, a processor-centric approach with a fairly traditional bridge, which, when combined with multiple cores, permits efficiently parallelized access to heterogeneous resources. AMD’s solution is called HyperTransport. HyperTransport can be thought of as putting CPU, memory, and other peripherals on a very small, low latency, high-bandwidth network, doing away with the bridge concept altogether. CELL is very new; which new CPU-memory concepts, if any, will be used in its actual application is unknown, although the chip itself is quite radically different.

(It’s interesting to note that AMD and other members of the HyperTransport consortium, including Sun, Transmeta, Nvidia, Cisco, Apple, and Broadcom, view HyperTransport as an eventual replacement for the entire peripheral interface, making individual computers very much like a mini-network where anything can talk to anything else without requiring the use of the CPU. An impressive idea, if it works.)

Power consumption, heat production, and CPU clock are gnarly topics, worthy of an entire article. In a nutshell, CPU and memory design has reached something of a power-dissipation, clock-speed-scaling crisis as ever more transistors switch ever more times in a limited volume.

For cluster design, ignore features that throttle back power during idle periods, as there likely won’t be any, and beware of designs that throttle back clock during busy periods, as that can seriously affect performance expectations. Also recognize that Hyperthreading is less useful in a homogeneous tasking environment like an HPC cluster, where all the processors (or processor cores) in a system are more likely to be in contention for a single resource more often than trying to access different resources at the same time.

System Packaging

There are currently three generic cluster package formats.

*Tower units stacked on ordinary shelving (or the floor) will continue to be popular in small clusters, home clusters, and clusters with very tight budgets.

*Rackmount units, usually 1U enclosures containing one to four processors, will dominate “professional” grade clusters and larger clusters with pro-grade infrastructure support in the cluster server rooms.

*Blade clusters will continue to gain ground in this latter market, but the fact that fully-loaded blade enclosures are relatively expensive per CPU, difficult to reliably power and cool (at something like 12 kW per square meter of occupied floor space), and come with only limited configurability (such as networking) will also continue to limit their growth.

Two to three years out things could look very different, and blades may be taking over the universe, or people may have gotten disgusted with them and returned to SMP multicores in traditional racks.

Connectivity

As 1000-Base-T Ethernet (especially its switches) continues to get cheaper and cheaper, it will gradually displace 100-Base-T Ethernet as the lowest-common-denominator networking standard for vanilla clusters and grids (which will continue to dominate the cluster/grid marketplace because of sheer cost issues, however interesting and powerful the high-end cluster networks become).

Beyond that, the best that can be said is that today’s bleeding edge cluster networks will continue to evolve as long as they command enough market share to fund competitive development. Key technologies are (and will continue to be) Myricom’s Myrinet, Dolphin’s Scalable Coherent Interface (SCI), Intel’s InfiniBand (with its own consortium of developers and cluster applications that extend beyond a mere node interconnect), and Quadrics’s QsNet. All of these offer higher bandwidth, lower latency, and less CPU involvement in message transfer than does TCP/IP over Gigabit Ethernet, but at a price.
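Whether that price is worth paying depends heavily on message size. A common first-order model, sketched here with invented numbers, treats the time to deliver a message as a fixed latency plus the message size divided by bandwidth. The two interconnects below are hypothetical stand-ins, not measurements of any product named above.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Seconds to deliver one message: fixed latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Hypothetical interconnects; both sets of figures are assumptions.
COMMODITY = {"latency_s": 50e-6, "bandwidth_bytes_per_s": 100e6}
LOW_LATENCY = {"latency_s": 5e-6, "bandwidth_bytes_per_s": 250e6}


def speedup(message_bytes):
    """How many times faster the low-latency network delivers a message."""
    slow = transfer_time(message_bytes, **COMMODITY)
    fast = transfer_time(message_bytes, **LOW_LATENCY)
    return slow / fast


for size in (100, 10_000, 10_000_000):
    print(f"{size:>10} bytes: {speedup(size):.1f}x faster")
```

In this model, small messages see nearly the full ratio of the two latencies, while large messages see only the ratio of the bandwidths. An application that exchanges many tiny messages therefore gains far more from a premium network than one that ships a few large blocks.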

Hard Storage

For many HPC clusters, hard storage is not a terribly important issue. For others, it is the critical component.

If you’re in the latter group (perhaps doing genomics, image rendering for movies, visualization and simulation, or processing accelerator data), then you’re already trying to figure out how to get from today’s terabyte stores to tomorrow’s petabyte stores in some sort of scalable way. Important technologies for storage are storage area networks (SANs), network-attached storage (NAS), Fibre Channel, and RAID.

In a nutshell, mass storage is currently provided almost without exception by either a specialized cluster that shares a special or general purpose interconnect (such as Fibre Channel, InfiniBand, or Gigabit Ethernet) with the cluster nodes, or a more general cluster design that does about the same thing, but hangs the actual storage off of dedicated cluster nodes in a separate sub-cluster. Costs per terabyte can vary wildly, depending on your capacity, scalability, reliability, bandwidth, latency, and other requirements, as well as by vendor.

The Bottom Line

The fundamental thing to keep an eye on as the cluster future unfolds is how the price/performance of the various hardware alternatives compares across the various new designs, both as they first come to market (when they tend to be relatively costly) and later, as demand and competition stabilize prices and the technology enters the mainstream. Price/performance, in turn, will be dictated by how well each of the competing designs manages access to resources shared by all the processor cores. This is a nontrivial problem: in some sense, it’s the fundamental problem of cluster computing.

There are a variety of very different design futures out there. For example, Intel and AMD will initially offer 2-4 general purpose cores per CPU, while IBM’s CELL processor is a general PowerPC core with 8 DSP elements it can use as SIMD processors. Systems will hold 2-8 CPUs (or more, as time evolves). There are also very different solutions (Hyperthreading and HyperTransport) to the associated memory and resource bandwidth problem that all multicore (and SMP) architectures create.

Being bold, the crystal ball predicts that AMD’s designs will continue to win on price/performance for HPC clusters, at least in the short term. Since AMD has clearly had superior price/performance for most HPC cluster applications with their Opteron and Athlon designs for the last few years, this is a pretty safe bet, like saying the weather tomorrow is likely to be pretty much like the weather was today.

Given that price/performance is (and should be) a dominating element in cluster design, expect to see a lot of AMD-based clusters over at least the next two years, although Intel will continue to sell a lot of systems in the cluster market. It’s never wise to ignore the technical and marketing clout of IBM.

Beyond two years it is a competition-driven horse race — if any of the various designs (which can be ably defended on paper or in presentations by any of their proponents) turn out to have real world price/performance advantages in the long run, the market will naturally favor them.

To conclude, it is worth noting that this article says almost nothing about the art of matching your application requirements to a suitable cluster architecture; it just attempts to prepare you to address your problem’s specific requirements. No crystal ball can peer into your server room, and it’s a proverb of the cluster computing business that because of the differences between applications and hardware, “Your Mileage May Vary” (YMMV).

Just one more prediction that appears out of the random mist of dots being generated by a dot stereogram program inside the crystal ball: Over the next five years, most clusters will be based on Linux, installed on single processor, single core nodes, equipped with DDR (transitioning smoothly to DDR-2 as it replaces ordinary DDR) memory, 100-Base-T (transitioning to 1000-Base-T) Ethernet, and with either minimal hard disks or no hard disk at all.

There will also be a lot of these “vanilla” clusters built with single processor, multicore nodes, and dual-processor, multi-core nodes where it is cost-beneficial to do so. Relatively few of these cluster nodes will have bleeding-edge CPU clocks; most will be purchased from the price/performance “sweet spot” several clock advances back that gets the most bang for the buck. Most of these cluster nodes will be in 1U form factor rackmount chassis, with plain old tower units on heavy duty shelving taking a close second.

Most of these vanilla clusters won’t ever make the Top 500 (though some will!), but there will be a lot of them, because this is the sort of design that will buy the greatest number of FLOPs per dollar spent then, just as it is the design that buys the greatest number of FLOPs per dollar spent now, for most cluster and grid applications.
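The "sweet spot" argument is simple arithmetic, and a sketch makes it concrete. The node options, prices, and sustained rates below are entirely invented, and the aggregate figure assumes an embarrassingly parallel workload that scales perfectly with node count, which real parallel programs often will not.

```python
from dataclasses import dataclass


@dataclass
class NodeOption:
    """A hypothetical node configuration (all figures are made up)."""
    name: str
    price_usd: float
    sustained_gflops: float  # ideally measured on your own application

    @property
    def gflops_per_dollar(self) -> float:
        return self.sustained_gflops / self.price_usd


def aggregate_gflops(option: NodeOption, budget_usd: float) -> float:
    """Total GFLOPS a fixed budget buys, assuming perfect scaling."""
    node_count = int(budget_usd // option.price_usd)
    return node_count * option.sustained_gflops


options = [
    NodeOption("bleeding-edge clock", 4000.0, 5.0),
    NodeOption("sweet-spot clock", 2000.0, 4.0),
    NodeOption("older clock", 1500.0, 2.4),
]

budget = 60000.0
best = max(options, key=lambda o: o.gflops_per_dollar)

for option in options:
    total = aggregate_gflops(option, budget)
    print(f"{option.name:>20}: {total:6.1f} GFLOPS for ${budget:,.0f}")
print(f"best price/performance: {best.name}")
```

With these assumed prices, the fastest node buys the least total computing for the money, and the node a few clock steps back buys the most. Your own numbers will differ, so plug in rates measured on your own code.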

The trick is to be aware of the different needs of exceptional applications, especially real parallel programs with interprocessor communications or any sort of program that stresses the CPU-memory or CPU-peripheral interface. If one of your applications has these sorts of specialized needs, the technologies listed here should help you Google up plenty of online white papers and references so that you can assess the suitability of the really spectacular new technologies that are just around the corner.

Robert Brown is a physicist at Duke University who has been actively engaged in cluster computing since the early 90’s and has participated actively on the Beowulf list since late 1996 or thereabouts. He is author of the online book “How to Engineer a Beowulf Style Compute Cluster” and until recently was a regular columnist for ClusterWorld Magazine.
