InfiniBand for Linux Clusters

InfiniBand's future is becoming more certain as researchers find a home for the once-floundering I/O architecture among Linux clusters.


InfiniBand: When pipe dreams become reality

Amit Asaravala

Red Hat developer Pete Zaitcev once described InfiniBand in terms of an ancient parable. Speaking to a birds-of-a-feather session at the 2001 Ottawa Linux Symposium, he told of how a group of blind men came upon an elephant one day while walking in the woods. The first blind man reached out and grabbed the elephant’s tail and decided that they had come across a rope hanging from the ceiling; the second blind man put his arms around the elephant’s leg and declared that they had come across a tree; the third reached for the elephant’s trunk and thought it was a snake; and the story carries on.

To Zaitcev, this was the perfect analogy for the way different people perceived InfiniBand at the time. Some were excited about the specification’s potential to provide low-latency interconnects, some talked about powerful 30 Gbps links, still others saw it as a replacement for PCI, and so on.

Today, though, Zaitcev believes those perceptions have changed dramatically. “The analogy of the blind men used to describe the situation accurately in 2001,” he says. “But it’s 2003 now, and the tail has grown bigger than the rest of the elephant. The clustering aspect [of InfiniBand] has subsumed all the rest.”

Indeed, InfiniBand’s promise of becoming an all-encompassing I/O architecture seems to have all but disappeared as more and more people focus instead on the specification’s usefulness in high-performance clusters. Gone are the predictions that InfiniBand will completely replace both the PCI bus and existing storage area network (SAN) technologies. In their place are visions of dozens or even hundreds of commodity servers linked together to form the ultimate high-performance cluster. And this time, there’s little doubt that the promise is about to become a reality.

Faster Than Any Other

Created in 1999 by the merger of Future I/O and Next Generation I/O — competing architectural specifications developed by Compaq, IBM, and Hewlett-Packard on one side, and Dell, Intel, and Sun on the other — the InfiniBand Architecture specification started out under the moniker “System I/O.” The name was later changed, but the goal for the technology remained the same: to create an open standard for high-performance I/O hardware that’s efficient, reliable, and scalable.

The impetus for such a goal was the inability of existing I/O networks to keep up with the industry’s need to transfer more data at higher and higher speeds. Although processors continue to get faster in accordance with Moore’s Law, the relatively slow PCI bus in many servers forms a bottleneck that can’t be ignored. “Today’s PCs are saddled with a 33 MHz and 32-bit PCI bus that gives peak throughput of 1 Gbps,” explains Progressive Strategies analyst Naresh Sharma. “Theoretically, a single 1 Gb Ethernet card in a single burst can saturate the entire PCI bus.”
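
Sharma’s numbers follow directly from the bus specification: peak bandwidth is simply clock rate multiplied by bus width. The short C sketch below works the arithmetic for classic PCI and for the faster PCI-X bus discussed later in this article; treat the results as theoretical peaks, since real-world throughput is lower once arbitration and protocol overhead are counted.

/* Back-of-the-envelope check of the bus figures quoted above:
 * theoretical peak bandwidth = clock rate x bus width. Real-world
 * throughput is lower once arbitration and protocol overhead are counted. */
#include <stdio.h>

static double peak_gbps(double clock_mhz, int width_bits)
{
    return clock_mhz * 1e6 * width_bits / 1e9;   /* bits per second -> Gbps */
}

int main(void)
{
    printf("PCI   (32-bit/33 MHz):  %.2f Gbps\n", peak_gbps(33.0, 32));   /* ~1.06 Gbps */
    printf("PCI-X (64-bit/133 MHz): %.2f Gbps\n", peak_gbps(133.0, 64));  /* ~8.51 Gbps */
    printf("Gigabit Ethernet link:  1.00 Gbps\n");  /* enough to fill classic PCI */
    return 0;
}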

InfiniBand replaces the bus with a channel architecture that prevents any one device from bogging down the entire system.

It also enhances communication speeds by allowing for remote direct memory access (RDMA). “Using InfiniBand, one could build a Beowulf cluster using off-the-shelf components that would run much faster due to RDMA capabilities and the lack of Ethernet/TCP/IP overhead,” says Sharma.

Although there are plans to add RDMA capabilities to Ethernet, the design of TCP/IP requires more processing than the InfiniBand protocols, which are mostly built into the hardware. This makes Ethernet slower than InfiniBand in terms of latency. Specifically, the latency of today’s Ethernet products is generally measured in tens of microseconds, while InfiniBand comes in under 10 µs.

InfiniBand also surpasses Ethernet and other networks in terms of bandwidth. At 1X speed, an InfiniBand link can carry data at 2.5 Gbps in each direction. More common, though, is 4X speed, which can carry data at 10 Gbps (although actual rates today are clocked at 6.5 Gbps). This far surpasses Myrinet, Quadrics, and other low-latency networks commonly used in high-performance computing, which tend to run in the 2 – 3 Gbps range.

Wires, Switches, and Adapters

InfiniBand fabrics, as they are called, consist of four types of hardware components: wiring, switches, Target Channel Adapters (TCAs), and Host Channel Adapters (HCAs). As defined by the specification, the wiring can come in 4-pin, 16-pin, or 48-pin configurations, which correspond to 1X, 4X, and 12X speeds, respectively. The cabling can be either copper or fiber, and there are different ports for each.
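
The width-to-speed relationship is linear: each lane signals at 2.5 Gbps, and each copper lane uses four conductors, which is where the 4-, 16-, and 48-pin figures come from. The brief C sketch below simply tabulates that relationship, including the 30 Gbps 12X links mentioned earlier; treat it as an illustration of the arithmetic rather than a statement about any particular product.

/* Link width vs. raw signaling rate: an InfiniBand link is built from one
 * or more 2.5 Gbps lanes, each carried over four copper conductors, so both
 * the pin count and the bandwidth scale linearly with the link width. */
#include <stdio.h>

int main(void)
{
    int widths[] = { 1, 4, 12 };   /* 1X, 4X, and 12X links */
    int i;

    for (i = 0; i < 3; i++) {
        int w = widths[i];
        printf("%2dX link: %2d pins, %4.1f Gbps signaling rate\n",
               w, w * 4, w * 2.5);
    }
    return 0;
}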

The switches are the heart of an InfiniBand-connected cluster. Like IP switches, InfiniBand switches manage data packets as they pass between ports. Multiple switches can be linked to provide failure protection or access to other subnets. The market for these components is dominated by a small handful of companies, most notably InfiniCon, InfiniSwitch, JNI, Mellanox, Paceline, and SBS Technologies.

Some of these companies also build HCAs and TCAs, which translate and manage InfiniBand traffic to and from the native I/O subsystems on servers and I/O devices, respectively. From a physical standpoint, the adapters essentially add InfiniBand ports to all the nodes.

As of this writing, there are two ways to configure HCAs on servers. The first is to purchase cards that plug into the existing PCI or PCI-X slots on your motherboards. Alternatively, you can buy a motherboard with the HCA already built in, like Mellanox’s Nitro II Blade Reference Design. Keep in mind, though, that neither option completely eliminates the need for a bus architecture on the motherboard. Since there are currently no pure InfiniBand system boards on the market, the HCA must eventually send all messages over a bus to get them to the CPU and memory. If that bus happens to be a standard 32-bit PCI architecture, the bottleneck could drastically reduce any benefit in speed that might be gained from using InfiniBand.

Even at 1X speed (2.5 Gbps), an InfiniBand HCA on a standard PCI bus would not be able to move data any faster than 1 Gbps. Some servers, like the Mellanox Nitro II blades, get around the limitation by using a PCI-X bus, which offers considerably more headroom. The industry-standard bandwidth for a 4X InfiniBand fabric connected to a PCI-X bus is 6.25 Gbps (roughly 800 MB/s) — still short of the promised 10 Gbps with InfiniBand alone, but faster than Ethernet as an interconnect, nonetheless.
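
Put another way, an HCA can never deliver more than the slower of the two segments it bridges. The hypothetical C sketch below makes that explicit, using the theoretical peak bus rates worked out earlier; actual measured throughput, like the 6.25 Gbps figure above, falls short of even these ceilings because of protocol overhead.

/* The end-to-end ceiling is set by the slower of the InfiniBand link and
 * the host bus the HCA plugs into. Figures are theoretical peaks; measured
 * rates (such as the 6.25 Gbps cited above) come in lower still. */
#include <stdio.h>

static double ceiling_gbps(double link_gbps, double bus_gbps)
{
    return link_gbps < bus_gbps ? link_gbps : bus_gbps;
}

int main(void)
{
    double pci  = 1.06;   /* 32-bit/33 MHz PCI, theoretical peak   */
    double pcix = 8.51;   /* 64-bit/133 MHz PCI-X, theoretical peak */

    printf("1X IB (2.5 Gbps) on PCI:   %.2f Gbps ceiling\n", ceiling_gbps(2.5, pci));
    printf("4X IB (10 Gbps)  on PCI:   %.2f Gbps ceiling\n", ceiling_gbps(10.0, pci));
    printf("4X IB (10 Gbps)  on PCI-X: %.2f Gbps ceiling\n", ceiling_gbps(10.0, pcix));
    return 0;
}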

It should be noted that getting rid of the bus altogether is certainly a goal of InfiniBand proponents, but the timeline for achieving this is still undefined. “To really see the full benefit of InfiniBand, it has to be implemented as part of the processor-memory-I/O nest,” says IBM Distinguished Engineer Greg Pfister, “but you don’t ship it unless it’s tied to your next update — and that’s tied to so many other things, like whether the processor and the software and the marketing is all ready.”

Pfister, who is also chairman of the InfiniBand Trade Association Management Workgroup, notes that IBM is committed to providing a channel architecture on its i-, p-, and z-series servers, but that the company isn’t ready to say publicly when those products will come to market.

Not surprisingly, the cost of redesigning system boards to use channel architectures is likely to be quite large. It’s unclear how these costs will be recouped, but some InfiniBand customers are optimistic that they will come out ahead when calculating the total cost of ownership.

“As InfiniBand storage targets, boot services, and console services materialize, we will be able to consolidate all the networks in a parallel machine,” says Sandia National Laboratories’ Curtis Janssen, who has been testing InfiniBand in a high-performance Linux cluster. “This will take us from a total of three — Gigabit Ethernet for file-system I/O, Myrinet for message passing I/O, and serial console servers for management — to one. This means we will need fewer cables, which will give rise to higher reliability, lower cost, and higher density solutions.”

Software Development

An InQuest Market Research report released in 2001 criticized InfiniBand, in part due to analysts’ predictions that an enormous software effort would be needed to make use of the technology, especially on clusters. “Clustering is successfully deployed today in a variety of environments, but to realize the InfiniBand vision of clustering, an unimaginable amount of software development is required,” the report read. “It’s not impossible, but we cannot imagine how long it would take. On the surface, it’s a bottomless pit.”

Indeed, rewriting standalone applications so that they can operate in a clustered environment can take considerable effort. However, recent case studies, such as those at Sandia National Laboratories, have shown that existing clustered applications are comparatively easy to install and run with InfiniBand. The only major change is swapping out the Message Passing Interface (MPI) layer for one that supports InfiniBand protocols. MVAPICH, created by Professor D. K. Panda at Ohio State University, already provides this support in a freely available package. And Mississippi-based MPI Software Technology recently added InfiniBand support to its commercial MPI/Pro offering.
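
The reason the swap is so painless is that MPI hides the interconnect behind a standard API. The minimal C program below is an ordinary MPI ping-pong; nothing in it is InfiniBand-specific, so the same source can be built against an Ethernet-based MPI or an InfiniBand-aware implementation such as MVAPICH, typically just by recompiling with that implementation’s compiler wrapper (the exact build and launch commands vary by installation).

/* A minimal MPI ping-pong between ranks 0 and 1. Nothing here is
 * InfiniBand-specific: moving the cluster from Ethernet/TCP to InfiniBand
 * means rebuilding against an InfiniBand-aware MPI library such as
 * MVAPICH, not editing the application. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, msg = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("rank 0: message bounced off rank 1 and came back\n");
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}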

Additionally, major application vendors may soon get into the act by adapting their software to support InfiniBand directly. At the 2001 Intel Developer Forum in San Jose, California, a team demonstrated DB2 running successfully over InfiniBand. And last year, Oracle participated in a similar demonstration at its OracleWorld event in San Francisco, showing how Oracle 9i RAC performed on a 4X InfiniBand fabric.

Drivers, too, are becoming less of a concern for many IT managers, since product manufacturers are now supplying them for a number of operating systems, including Linux. This leaves only the subnet manager, an application that, among other things, assigns local IDs to devices and determines the best path between two points in the subnet — and even that is readily available from startups like Lane 15 and VIEO. Paceline has also started embedding subnet managers in its switches, eliminating the need to install the software on one of the nodes.

Finally, Intel is backing a SourceForge project to develop an entire suite of open source software packages to help both hardware manufacturers and consumers better support InfiniBand with Linux. The project consists mainly of various drivers and a subnet manager.

Project administrator David Lomartire, who also manages Intel’s InfiniBand Software Enabling group, notes that most of the development has been straightforward so far. “There was really only one [problem] and it is related to the SDP [Sockets Direct Protocol] development,” says Lomartire. “Presently, there is no provision in Linux for network offload devices to work transparently with legacy socket applications. The existing infrastructure requires a new socket address family for new hardware based transport engines — which then requires changes to existing applications. There is no infrastructure that allows legacy applications to span multiple TCP protocol providers. This infrastructure was implemented as a part of the SDP portion of this project.”
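
The problem Lomartire describes is visible right at the socket call. A legacy application asks for AF_INET and gets the kernel’s TCP stack; without the new infrastructure, routing that same connection over an offload transport such as SDP would mean asking for a different, transport-specific address family, which is exactly the kind of source change the project set out to avoid. The C sketch below illustrates the point; AF_INET_SDP is a made-up constant used purely for illustration and is not part of the standard Linux headers.

/* Illustration of the address-family problem described above. AF_INET_SDP
 * is a hypothetical constant standing in for a transport-specific address
 * family; it is not part of the standard Linux headers. */
#include <sys/socket.h>
#include <unistd.h>

#define AF_INET_SDP 27   /* hypothetical value, for illustration only */

int open_legacy_socket(void)
{
    /* What existing applications do today: plain TCP over AF_INET. */
    return socket(AF_INET, SOCK_STREAM, 0);
}

int open_offload_socket(void)
{
    /* Without transparent kernel support, using an offload transport such
     * as SDP would force every application to request a new address
     * family -- the source change the SDP infrastructure work avoids. */
    return socket(AF_INET_SDP, SOCK_STREAM, 0);
}

int main(void)
{
    int a = open_legacy_socket();
    int b = open_offload_socket();   /* fails unless the kernel knows this family */

    if (a >= 0) close(a);
    if (b >= 0) close(b);
    return 0;
}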

The group’s IP-over-InfiniBand patch has already been accepted into the Linux 2.5 kernel. Lomartire expects that the infrastructure patch, which provides a way for legacy applications to work transparently with offload devices as mentioned above, will be available with the SDP code by the time you read this. “The real challenge now lies with the adoption of InfiniBand by the Linux Community,” he says.

In Use: Sandia Labs

The research community, which has long been a fan of Linux, is already beginning to look at InfiniBand as a replacement for aging I/O networks. Among the facilities taking the lead is Sandia National Laboratories. As one of the nation’s leading research and development facilities, Sandia is often called upon to perform computational simulations of, for example, nuclear and chemical reactions. Because these simulations are so resource-intensive, they require multiple processors with high-bandwidth and low-latency interconnects. Although principal staff member Curtis Janssen and his fellow researchers have been using a number of different networks to accomplish this, they saw in InfiniBand an opportunity to improve performance even further.

“For smaller and midrange systems, high performance must be obtained as cost effectively as possible, which implies that commodity technology must be employed,” says Janssen. “InfiniBand is one technology under evaluation, and Sandia and its partners have made much progress in demonstrating that InfiniBand is a viable network component for high-performance computing.”

In 2001, Janssen and his team began working with vendors to build two Linux cluster testbeds. The first is composed of 12 Mellanox Nitro II Reference Design blade servers. Each server has a 2.2 GHz Pentium 4 CPU and two 4X InfiniBand connections. The team uses only one of these connections per server, though, which it links to a 16-port InfiniBand switch. The switch is also connected to a 1U server with an HCA card installed. The server hosts both the subnet manager and the user file system, which is exported via NFS and mounted with IP-over-InfiniBand.

The second testbed consists of 24 Dell 2650 nodes with dual 2.4 GHz Pentium 4 Xeon CPUs. The cluster is tested with both an 8-port 4X switch from Paceline, which contains a built-in subnet manager, and a 24-port 4X switch from InfiniSwitch. The cluster also uses HCAs from both Paceline and InfiniSwitch.

Janssen says that the Linux software that his team has chosen is “fairly generic.” To get up and running, they first installed each vendor’s SDK. Because the Linux versions of the SDKs were often written with Red Hat in mind, the clusters run a Red Hat distribution as a base install. From there, the team installed MVAPICH to handle message passing.

Janssen notes that using open source software is important to Sandia, and suggests that the closed-source nature of some of the SDKs could become a liability in the future. “InfiniBand could possibly even fill a role for very large scale machines,” he says, “but only if we have access to source code to provide the scalability enhancements needed for, say, initializing an IB subnet with 15,000 nodes.” He has not yet had a chance to evaluate software from the SourceForge project.

Currently, the testbeds are mainly used for benchmarking purposes, but the team has included a parallel quantum chemistry application as part of their testing procedures. The team also plans to ramp up testing of more applications, including nuclear simulations, when a new set of hardware and SDKs arrive in the lab around the time this article hits newsstands.

“Sandia’s primary mission is stockpile stewardship,” says Janssen. “That means we help make sure that the nation’s nuclear weapons are safe and reliable. Simulation plays an incredibly important role in this work, since no [live] nuclear tests are now done.”

Sandia also plans to benchmark parallel I/O on its InfiniBand clusters. Rather than using NFS, the team will look to the Lustre parallel file system, primarily because it allows managers to add more bandwidth simply by adding more storage targets, which are then load balanced. NFS, on the other hand, limits file-system bandwidth to that of a single server’s connection to the network.
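
The appeal is straightforward arithmetic: with a parallel file system, aggregate bandwidth grows roughly linearly with the number of load-balanced storage targets, while an NFS server is capped by its one link to the network. The C sketch below just works that comparison for a few cluster sizes; the per-target rate is an assumed placeholder, not a measured figure.

/* Rough illustration of the scaling argument above: a parallel file system's
 * aggregate bandwidth grows with the number of load-balanced storage targets,
 * while NFS is capped by a single server's network link. The per-target rate
 * is an assumed placeholder, not a benchmark result. */
#include <stdio.h>

int main(void)
{
    double per_target_gbps = 1.5;   /* assumed delivery rate per storage target */
    double nfs_link_gbps   = 1.5;   /* single NFS server link, same assumption  */
    int targets;

    for (targets = 1; targets <= 8; targets *= 2) {
        printf("%d target(s): parallel FS ~%4.1f Gbps, NFS ~%.1f Gbps\n",
               targets, targets * per_target_gbps, nfs_link_gbps);
    }
    return 0;
}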

It should be noted that InfiniBand is not the only network architecture being evaluated at Sandia. Yet, Janssen says, they’re pleased so far with InfiniBand. “It’s a competitive multi-vendor solution, and it can be applied to a much larger problem domain than high performance computing, meaning that prices will be driven down by large volumes,” he says.

Using InfiniBand, the lab has also managed to achieve higher bandwidth rates — at 6.6 Gbps — than with any other interconnect technology it has used. The latency has also been low, at 7.5 – 7.7 µs, as reported by the latest version of MVAPICH.

What Lies Ahead

As the work at Sandia shows, InfiniBand is fast approaching maturity. Although the technology is still relatively young when compared to other interconnects, the value it brings to high-performance computing scenarios is readily apparent.

With a latency that is comparable to proprietary networks, bandwidth rates that far exceed any similar interconnect, and an open specification that allows for competitive pricing and industry cooperation, InfiniBand is well-poised to become the I/O network of choice in clusters. This is a welcome sight to those who, like Pete Zaitcev, have watched the technology go from being an overarching promise to a well-defined solution.

“We’ve gone through the hype,” says Pfister, “and now we’re starting on the reality.”


When InfiniBand isn’t being called a replacement for PCI, it’s often compared to a variety of other network architectures, from Myrinet and RapidIO to Fibre Channel and 10 Gb Ethernet. Yet, experts involved in the development of InfiniBand argue that it won’t replace any of these architectures completely. Rather, it’s only going to push them to the edge of the network fabric, they say. Here’s a look at how popular I/O architectures compare to InfiniBand:

10 Gb Ethernet:

Critics of InfiniBand argue that Ethernet can deliver similar high bandwidth and is a more ubiquitous technology. But Ethernet lacks remote direct memory access (RDMA), an important feature for those worried about latency. Although some manufacturers are planning to release Ethernet hardware with support for RDMA later this year, Ethernet still faces additional latency problems due to its reliance on TCP. But don’t expect Ethernet to disappear completely from clusters. IT managers expect that legacy networks will simply be pushed to the edge of the fabric. In other words, Ethernet bridges will reside farther and farther away from the system board, with InfiniBand running between the two.

Fibre Channel:

As the preferred method of connecting nodes in a storage area network, Fibre Channel is likely to maintain its status even while more and more people adopt InfiniBand. The reason for this is that, like Ethernet, Fibre Channel will live at the edge of the network, connecting devices that don’t need to be upgraded to InfiniBand.


Myrinet:

Launched by Myricom in 1994, Myrinet offers the toughest challenge to InfiniBand adoption in the high-performance computing space. For one thing, the company has been around long enough that researchers have grown comfortable with it. For another, IT managers who have no qualms with Myrinet’s current 2 Gbps full-duplex bandwidth cap and 7 µs latency are unlikely to be able to justify scrapping one investment in adapters, cables, and switches for another. Of course, those who need more bandwidth will find that Myrinet falls short when compared to the 10 Gbps bandwidth of 4X InfiniBand. Additionally, Myrinet’s inherent drawbacks — such as the fact that the specification and pricing are controlled by a single company — may be too much for IT buyers looking to build new clusters.

PCI (and PCI-X):

The industry standard in most desktops and Pentium-based servers, PCI is used to connect microprocessors to attached devices using a local bus. While this has led to a number of advancements in home and office computing, the 32-bit/33 MHz bus hasn’t been able to keep up with growing I/O and processor speeds. To meet the need for a faster interconnect, HP and IBM designed PCI-X as a replacement. With PCI-X, the bus runs at 64-bit/133 MHz speeds. The upgrade has certainly helped, but Sharma notes that a bus architecture can only go so far. “PCI is an I/O technology to provide a bus between the processor and hardware for one system,” he says. “Imagine a rack having 250 systems — it’s not cost effective or resource effective to draw a PCI bus between all 250 systems.” Given that PCI is limited in this way to local I/O, it’s certainly possible that native InfiniBand motherboards (which use a channel architecture) could take over a large portion of the high-performance computing market when they finally become available.

RapidIO:

Although the RapidIO Trade Association was formed by member companies looking to increase the overall speed of I/O — a history closely resembling that of the InfiniBand Trade Association — RapidIO and InfiniBand do not currently overlap. The former is intended to connect multiple processors in a single server, whereas the latter connects multiple servers and I/O devices, often across large distances. However, it is possible that the two architectures will compete if and when server manufacturers begin to build InfiniBand into the system board itself. The RapidIO Trade Association counters this threat by comparing the effort needed to implement both systems. “Both InfiniBand and the RapidIO technology reach into the card-to-card communications domain,” says the Trade Association’s technology FAQ. “In this domain where we overlap, InfiniBand provides a more abstracted interface to allow complete decoupling of the subsystems. To accomplish this abstraction, InfiniBand requires modification of legacy software, more transistors to implement, and specialized management software.” Whether these differences will lead to mixed systems, with RapidIO on the board and InfiniBand outside the box, remains to be seen.

Amit Asaravala is an independent journalist based in San Francisco, CA. Prior to becoming a full-time writer, he was the founder of New Architect magazine and the Editor-in-Chief of Web Techniques magazine.


