The Network IS the Cluster: Infiniband and Ethernet Network Fabric Solutions for HPC (Part One)
GigE, 10 GigE, Infiniband: Which one is right for you?
A question often asked by new HPC cluster users is, “What kind of interconnects (networks) are available?” This question is important because the network is perhaps the single most important factor in cluster performance. It can limit top-end performance and also determine scalability (that is, the ability to improve performance as you add more processing power). The most common interconnect in use today is Gigabit Ethernet (GigE). It offers good performance for many applications and is inexpensive (most server motherboards already have at least two GigE NICs and most desktops have at least one). But there are other networks that are gaining popularity.
These are usually higher performing networks such as Infiniband and 10 Gigabit Ethernet (often written “10GigE”). As the volume of shipments increases, the price of these interconnects has been falling, and they may now be the price/performance winner for many applications. Moreover, with the advent of multi-core, these high speed networks may change how clusters are designed and operated. The interconnect is an important choice when designing a cluster, and ultimately the choice depends on your code, requirements, and budget.
In the first part of this article we will discuss 10GigE and Infiniband networks and how they have become an almost essential element in cluster design. I’ll start by discussing the networks themselves – GigE, 10GigE, and Infiniband. Then I will present some micro-benchmarks of the interconnects. In Part Two, we will look at some real benchmarks of CAE (Computer Aided Engineering) applications that compare GigE to Infiniband (there aren’t any 10GigE benchmarks for these applications yet). Then I will discuss why people are choosing Infiniband (and soon, 10GigE). Finally I will try to summarize the current situation.
Ethernet has an enduring history. It was developed principally by Bob Metcalfe and David Boggs at Xerox Palo Alto Research Center in 1973 (man, these guys were busy). Gigabit Ethernet (GigE), which runs at 1 Gbps (gigabits per second), became an IEEE standard (802.3z) in 1998. Initially it was a fairly expensive interconnect, but the commodity push drove prices down until it became inexpensive.
Currently, there are a large number of manufacturers of GigE Network Interface Cards (NICs). Many of the associated Ethernet drivers have parameters that can be adjusted. For example, because the Ethernet standard still moves data around 1500 bytes at a time (called the Maximum Transmission Unit or MTU), communication at GigE speeds creates a large number of processor interrupts. With modern GigE chipsets, interrupts can be “coalesced” to reduce processor overhead or “un-coalesced” to improve small packet latency. You can also increase the MTU size to get more efficient data transmission for services like NFS. In addition, there are various parameters you can change in the Linux kernel and operating system to help improve performance. For example, you can increase the size of the TCP send and receive buffers. So the set of driver parameters and kernel parameters gives you a great deal of freedom to tune your cluster for improved performance. These choices provide you the opportunity to “design the cluster for your application”, rather than the other way around, as with traditional HPC hardware where you had to design your code around the hardware.
Communication over GigE is usually done using the kernel TCP services. The advantage of using this standard is that virtually every MPI (Message Passing Interface) implementation, whether open-source or commercial, supports it. TCP is ubiquitous in HPC. In fact, a very large percentage of the systems in the Top500 use GigE with TCP as their computational interconnect.
The portability advantage of TCP is paid for by the introduction of latency for small messages. Kernel TCP transfers require copying memory to and from the user’s application space. These copies introduce communication lags that may limit the maximum performance of a cluster. As with most specialty HPC networks (those available from a single vendor), the best performance is achieved by using “kernel by-pass” or “user-space” communication mechanisms that eliminate excess copying of data. Another name for these mechanisms is “zero-copy” protocols.
For HPC clusters it is worth mentioning that kernel by-pass is available for GigE. GAMMA, the Genoa Active Message MAchine, is a project maintained by Giuseppe Ciaccio of Dipartimento di Informatica e Scienze dell’Informazione in Italy. It is a project to develop an Ethernet kernel by-pass capability for some Intel and Broadcom Ethernet NICs. MPI/GAMMA is also available. As we will see in Table One, GAMMA can provide low latency for some GigE NICs.
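To put a number on the interrupt load mentioned above, here is a minimal sketch in Python. It ignores Ethernet header and framing overhead, so it is an upper bound only, and the 9000-byte “jumbo” MTU is an assumption chosen for illustration (the article does not name a specific larger MTU). At the standard 1500-byte MTU, a saturated GigE link delivers roughly 83,000 full-size frames per second.

```python
def frames_per_second(link_bps: float, mtu_bytes: int) -> float:
    """Upper bound on full-size frames per second on a saturated link.

    Header and inter-frame overhead are ignored, so real rates are lower.
    """
    return link_bps / (mtu_bytes * 8)


GIGE_BPS = 1e9        # 1 Gbps, as stated in the article
STANDARD_MTU = 1500   # bytes, as stated in the article
JUMBO_MTU = 9000      # bytes; an assumed, commonly used larger MTU

for mtu in (STANDARD_MTU, JUMBO_MTU):
    rate = frames_per_second(GIGE_BPS, mtu)
    print(f"MTU {mtu}: about {rate:,.0f} frames per second")
```

Without coalescing, each frame can mean one interrupt, which is why both interrupt coalescing and a larger MTU reduce processor overhead.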
Beyond GigE: 10 Gigabit Ethernet
Once GigE was established, the development of the next level of Ethernet, 10 gigabits per second, or 10GigE, was started. One of the key tenets of the development was to retain backwards compatibility with previous Ethernet standards.
Primarily because of the speed of 10GigE, the initial installations used fiber optic cables. While fiber optic cables are expensive, as are the NICs, they do make things much easier to build since the cables are lightweight and have a reasonably small bend radius. However, copper cables are becoming increasingly popular because of cost. There are NICs and switches that support 10GBASE-CX4, which uses the same copper CX4 cabling and connectors as Infiniband 4X. These cables are currently limited to 15m in length. There is also a new copper standard, 10GBASE-T, for 10GigE. It uses unshielded twisted-pair wiring, like the cat-5e or cat-6 cables that we use for GigE now. This is the proverbial “Holy Grail” of 10GigE. These cables are much thinner than the CX4 cables, so the bend radius is much smaller. They are also much cheaper than CX4 cables. There aren’t any 10GigE NICs using 10GBASE-T yet, but many vendors have 10GBASE-T NICs in beta testing. Early prototypes of the 10GBASE-T cards are showing high power requirements, but the next several generations of the NICs should lower the power. Also, 10GBASE-T switches are coming soon.
Since 10GigE is a somewhat new interconnect it’s worth briefly reviewing the NIC manufacturers and then the switch vendors.
There are several vendors currently developing and selling 10GigE NICs. Here is the current list of vendors that supply 10GigE NICs (Those vendors of which we are aware).
The lowest cost per NIC is about $795 list price. However, there are some manufacturers trying to lower the price to below $500, perhaps down to $400 a NIC as volumes increase.
To go along with the NICs you need one or more switches. The typical HPC switch vendors such as Cisco, Foundry, Force10, and Extreme all make 10GigE line cards for their existing switch chassis. They have been developing these line cards primarily for the enterprise market, but now they realize that as the costs of the line cards and the NICs come down, they may have a product line suitable for the HPC market. Since the switches are fairly important and typically drive the cost in the 10GigE market, I will spend some time on them.
There are several 10GigE switch manufacturers, including switches from the dominant switch companies, and some from smaller companies and start-ups. Here is a brief list of them:
Cisco has a 10GigE switch, called the Catalyst 6509, that can accommodate up to 32 10GigE ports.
Foundry has a large chassis (14U, 16 line cards). With their 4-port 10GigE line cards (copper or fiber) you get up to 64 10GigE ports that run at full line rate at about $4,000 per port.
Force10 has a large single chassis switch, called the E Series. They have 4-port 10GigE line cards that get you to 56 total 10GigE ports, running at full line rate at about $7,400 per port.
Force10 also has a line card for that same switch chassis that has 16 10GigE ports. Unfortunately they are not full line rate cards (possibly 4:1 oversubscribed), resulting in a minimum of 2.5 Gbps to each port. But you can get a total of 224 10GigE ports at about $2,700 per port.
You can see that the large switch manufacturers don’t have large (port count) 10GigE switches at full line rate (Force10 has a larger number of ports, but not at full line rate). Smaller companies are developing new 10GigE switch ASICs (Application Specific Integrated Circuits) and creating larger 10GigE switch fabrics from them. For example,
Fujitsu has a single-chip 10GigE switch ASIC that they have used to create their own small switches, called the XG700-1200, which have 12 ports in a compact form factor (1U or 2U). The switch can be configured for SR, LR, and CX4 connections. It has a very low latency of 450 ns and a reasonably low cost of about $1,200 per port.
Quadrics has a 10GigE switch that uses the Fujitsu single-chip ASIC. It is an 8U chassis with 12 slots for 10GigE line cards. Each line card has 12 ports in total: 8 ports for external 10GigE CX4 connections, and the remaining four ports are used to internally connect the line cards in a fat tree configuration, for a total of 96 external ports. This means that the network is 2:1 oversubscribed (5 Gbps per port). The price is around $1,600 per port.
Fulcrum Micro has also developed a 10GigE switch ASIC. The switch has a latency of about 200 ns (VERY good) and uses cut-through rather than store-and-forward for better latency and throughput. There are a number of manufacturers who are looking to use this chip in efficient 10GigE switches.
Woven Systems, a new company, has recently introduced the EFX 1000 switch. It is a 10U chassis with up to 144 10GigE CX4 ports. Woven says that you can use multiple EFX 1000 switches in a multi-path mesh topology to create large 10GigE fabrics (over 4,000 total ports). They say that the latency is about 4 microseconds, which is just a bit higher than Infiniband. They also claim it’s 1/5 the price of other 10GigE implementations, but they don’t say which one(s).
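The oversubscription figures quoted for the Force10 16-port line card and the Quadrics switch follow from simple arithmetic. The sketch below is a minimal illustration in Python; the function names are hypothetical, and it assumes the worst case where every port is busy at once, so each port gets the link rate divided by the oversubscription ratio.

```python
LINK_GBPS = 10.0  # nominal 10GigE port speed


def oversubscription_ratio(external_ports: int, uplink_ports: int) -> float:
    """Ratio of outward-facing ports to internal (uplink) ports."""
    return external_ports / uplink_ports


def guaranteed_gbps(link_gbps: float, ratio: float) -> float:
    """Worst-case bandwidth per port when every port is sending at once."""
    return link_gbps / ratio


# Quadrics line card: 8 external ports share 4 internal ports.
quadrics_ratio = oversubscription_ratio(8, 4)
print(f"Quadrics: {quadrics_ratio:.0f}:1, "
      f"{guaranteed_gbps(LINK_GBPS, quadrics_ratio)} Gbps per port")

# Force10 16-port card: the article suggests it is possibly 4:1 oversubscribed.
print(f"Force10: 4:1, {guaranteed_gbps(LINK_GBPS, 4.0)} Gbps per port")
```

A full line rate switch corresponds to a ratio of 1:1, where every port can run at the full 10 Gbps simultaneously.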
Just in case you haven’t noticed, the per port prices for 10GigE are still a little bit high for HPC. There are a number of people hoping that 2007 is the year that 10GigE drops enough in price for HPC to start adopting it (they said the same thing of 2005 and 2006 though). Many people are pinning their hopes on the Woven switch to really drop the per port switch costs. But right now, 10GigE has only appeared in a few small clusters. The limiting factor in adopting 10GigE has been the cost. Nevertheless, the commodity push, and hence lower prices, still holds significant promise for 10GigE in HPC.
As with the growth of GigE, people have been watching 10GigE closely because it is still Ethernet. Thus it is “plug and play” for most cluster software (the kernel modules take care of the actual hardware). This allows you to use just about any MPI implementation, commercial or open-source, as long as it supports TCP or Ethernet. Virtually every system administrator understands Ethernet and TCP.
Infiniband (IB) was created as an open standard to support a high-performance I/O architecture that is scalable, reliable and efficient. It was created in 1999 by the merging of two projects: Future I/O supported by Compaq, IBM, and HP, and Next Generation I/O supported by Dell, Intel, and Sun.
Much like other standards, IB can be implemented by anyone. This freedom has the potential to allow for greater competition. It is similar to 10GigE in that there are HCAs (Host Channel Adapters, IB-speak for NICs) and IB switches (IB is a switched network). Today there are four major IB companies: Mellanox, Cisco, Qlogic, and Voltaire. However, Mellanox is the main manufacturer of Infiniband ASICs, while Qlogic builds its own IB ASICs that it sells under the name Infinipath (an Infinipath network uses Infinipath HCAs and standard IB switches).
As with other high-speed interconnects, IB does not use IP packets. Rather, it has its own packet definition. However, IB can carry IP traffic, a capability called ‘IP-over-IB’. This allows anything written to use TCP to run over IB, albeit with a performance penalty compared to native IB.
The IB specification starts at a 1X link width, which allows an IB link to carry 2.5 Gbps (gigabits per second) in each direction. The next width is 4X, which specifies a data rate of 10 Gbps. The next level up is 12X, which provides a data rate of 30 Gbps. There are also standards for Double Data Rate (DDR) transmission, which transfers twice as much data per clock cycle as Single Data Rate (SDR), and Quad Data Rate (QDR) transmission, which transfers four times as much data per clock cycle as SDR. For example, a 4X DDR HCA can transfer 20 Gbps and a 4X QDR HCA will transfer 40 Gbps. Currently the IB world is using 4X DDR HCAs. There are still 4X SDR HCAs available, but they cost about the same as DDR HCAs, so the popularity of SDR HCAs is rapidly decreasing. Plus, later in 2007, Mellanox will be introducing 4X QDR HCA ASICs and a new HCA ASIC with a latency at or below 1 microsecond.
As mentioned, IB is a switched network. That is, the HCAs connect to switches that are used to transmit the data to the other HCAs. A single chassis switch can be used, or the switches can be connected in some topology. Today there are a wide variety of switches from the major IB companies, including switches that have IB ports as well as 10GigE ports, allowing you to combine data from different networks.
When IB first came out, the supporting software (subnet managers, MPI stacks, etc.) was specific to each vendor. But the IB vendors knew that to make it a success they had to work together on the software stack. So they jointly developed OFED (OpenFabrics Enterprise Distribution). OFED also includes iWARP, which will hopefully also work with 10GigE NICs. OFED is freely available.
Another obstacle to IB adoption has been the fact that it is a new protocol and network technology compared to TCP. Consequently, administrators have had to retrain to understand how to debug IB network problems. But IB clusters have been in use for several years now, so the network knowledge is expanding rapidly.
There are a number of MPI implementations that support IB. Some are open-source, such as MVAPICH and Open MPI, and others are closed-source, such as Scali MPI Connect.
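The link speeds above all come from one multiplication: the number of lanes (1X, 4X, 12X) times the 2.5 Gbps base lane rate times the data rate multiplier (SDR, DDR, QDR). The following Python sketch tabulates them. The names are hypothetical, and the figures are the nominal rates quoted in the text, not measured throughput.

```python
BASE_LANE_GBPS = 2.5  # 1X SDR rate per direction, from the text

# Data per clock cycle relative to Single Data Rate.
RATE_MULTIPLIER = {"SDR": 1, "DDR": 2, "QDR": 4}


def ib_link_gbps(lanes: int, rate: str = "SDR") -> float:
    """Nominal IB link rate in Gbps for a given width and data rate."""
    return lanes * BASE_LANE_GBPS * RATE_MULTIPLIER[rate]


for lanes in (1, 4, 12):
    for rate in ("SDR", "DDR", "QDR"):
        print(f"{lanes}X {rate}: {ib_link_gbps(lanes, rate):g} Gbps")
```

For instance, `ib_link_gbps(4, "DDR")` gives the 20 Gbps quoted for the 4X DDR HCAs that are in common use today.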
Contrasting the Networks
Now that the networks have been introduced, let’s take a look at their performance from a micro-benchmark perspective. (In Part Two we will look at some real applications.)
Table One below is a collection of micro-benchmarks for the various interconnects. The table lists latency in microseconds, bandwidth in megabytes per second (MBps), and the N/2 packet size in bytes. The N/2 packet size is the packet size at which throughput reaches half of the interconnect’s single-direction bandwidth (a measure of how fast the throughput curve rises). The smaller the number, the more bandwidth (speed) small packets will get. In other words, a small N/2 means that a large range of packet sizes gets close to the full bandwidth of the network. You would be surprised how many codes send small packets.
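To see how latency, bandwidth, and N/2 relate, it helps to use a simple linear cost model in which sending n bytes takes the latency plus n divided by the peak bandwidth. This model is an assumption made for illustration, not the method used to produce Table One, and the example numbers below are hypothetical rather than measurements. Under this model, throughput reaches half of peak exactly when n equals latency times peak bandwidth, so lower latency gives a smaller N/2.

```python
def throughput(n_bytes: float, latency_s: float, peak_bytes_per_s: float) -> float:
    """Achieved bytes per second for one message under the linear model
    time = latency + n_bytes / peak_bandwidth."""
    return n_bytes / (latency_s + n_bytes / peak_bytes_per_s)


def n_half(latency_s: float, peak_bytes_per_s: float) -> float:
    """Message size at which throughput is half of peak.

    Solving n / (L + n/B) = B/2 for n gives n = L * B.
    """
    return latency_s * peak_bytes_per_s


# Hypothetical interconnect: 50 microsecond latency, 100 MBps peak bandwidth.
LATENCY_S = 50e-6
PEAK_BPS = 100e6

size = n_half(LATENCY_S, PEAK_BPS)
print(f"N/2 is about {size:.0f} bytes")
print(f"Throughput at N/2: {throughput(size, LATENCY_S, PEAK_BPS) / 1e6:.1f} MBps")
```

Real throughput curves are not perfectly linear, so measured N/2 values can differ from this estimate, but the model shows why N/2 is a compact summary of how quickly the curve rises.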
The table below is intended as a 50,000 foot look at cluster interconnect technologies, particularly 10GigE and IB. The performance (and the devil) is in the details. The final selection of an interconnect should be done by testing your codes or a set of benchmark codes that you can correlate to your codes on these interconnects.
[Table One – Interconnect Summary: Performance Metrics. Columns: latency (microseconds), bandwidth (MBps), N/2 (bytes). Rows include: 10 GigE: Chelsio (Copper); Infiniband: Mellanox Infinihost (PCI-X); Infiniband: Mellanox Infinihost III EX SDR; Infiniband: Mellanox Infinihost III EX DDR]
There are some interesting observations that can be made from this table. Some of the ones I see, in no particular order:
The latency of GAMMA is about the same as 10GigE, but the bandwidth of GAMMA, since it is GigE, is much less.
The latency of GigE/TCP has been reported as low as 17 microseconds.
The N/2 for GigE is fairly large, but much smaller than that of the 10GigE NIC (this will undoubtedly be corrected soon).
The N/2 for Infiniband is very small. This should help performance a great deal if the application uses lots of small packets.
The bandwidth for DDR IB is huge! That is likely to be more bandwidth than any application will use (but then again SDR IB and even 10GigE should have more than enough bandwidth for almost all applications).
If codes are latency bound then going from GigE to IB should increase their performance.
If your codes are bandwidth bound, then IB and 10GigE will give a huge performance boost over GigE.
Remember that this is just the interconnect. While it is a major limiting factor, there are many other aspects that affect the overall performance – memory bandwidth, CPU speed, CPU cache, etc. But in many cases, the network performance is the driving force behind the scalability of applications.
I’m not aware of any application that uses a great deal of bandwidth, but multi-core may change this conclusion. That is, with the industry fast approaching eight (or more!) cores per motherboard, you will need a bigger pipe to keep data flowing. Until then, if you use 10GigE or Infiniband you are going to have extra bandwidth that you may not need.
In addition to the micro-benchmark performance of the interconnects, let’s also look at the costs of the various network options and the total network costs for various cluster configurations.
The cost of using an interconnect in a cluster is the sum of the cost of the NICs (or Host Channel Adapters, HCAs) and the switch(es). Dividing this sum by the number of switch ports used in the cluster gives an important metric that is called the “per port” cost. If you divide the sum by the number of nodes you get the “per node” cost. From the costs in the table below, you can easily compute these metrics.
Table 2 is a cost comparison for various cluster sizes and interconnect technologies. Aside from direct price comparisons, it was also created to illustrate how cost scales with cluster size. Because of switch sizes and prices, the network costs do not increase linearly with the number of nodes. Please consult the footnotes after the table for the assumptions and numbers that went into computing the costs. Remember, prices are changing, so this table should be used as a guideline and NOT as a definitive price index.
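As a worked example of these metrics, the Python sketch below plugs in the prices listed in notes 1 and 2 for Table Two. The function names are hypothetical. It assumes one NIC and one switch port per node (so the per port and per node costs coincide), and for 10GigE it assumes you pay the quoted per-port switch price for exactly as many ports as you have nodes. The results are derived from those notes and are not copied from the table itself.

```python
# Prices in US dollars, taken from the notes for Table Two.
GIGE_NIC = 26
GIGE_SWITCH = {8: 50, 24: 320, 128: 24_000}                   # whole-switch prices
TENGIGE_NIC = 795
TENGIGE_SWITCH_PER_PORT = {8: 1_200, 24: 1_800, 128: 2_700}   # per-port prices


def network_cost(nodes: int, nic_price: float, switch_price: float) -> float:
    """Total interconnect cost: one NIC per node plus the switch(es)."""
    return nodes * nic_price + switch_price


def per_node_cost(nodes: int, nic_price: float, switch_price: float) -> float:
    """Total interconnect cost divided by the number of nodes."""
    return network_cost(nodes, nic_price, switch_price) / nodes


for nodes in (8, 24, 128):
    gige = per_node_cost(nodes, GIGE_NIC, GIGE_SWITCH[nodes])
    # Assumption: buy exactly one 10GigE switch port per node.
    tengige_switch = TENGIGE_SWITCH_PER_PORT[nodes] * nodes
    tengige = per_node_cost(nodes, TENGIGE_NIC, tengige_switch)
    print(f"{nodes:>3} nodes: GigE ${gige:,.2f}/node, 10GigE ${tengige:,.2f}/node")
```

Under these assumptions the per node cost rises with cluster size for both technologies, because larger clusters need larger and more expensive switches.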
[Table 2 – Interconnect Summary: Pricing. Columns: 8 Node Cost, 24 Node Cost, 128 Node Cost. Rows include: 10 GigE: Chelsio (Copper) (see note 2)]
Notes for Table Two:
1 This assumes $26/NIC using the Intel 32-bit PCI NICs, a basic 8-port GigE switch at $50, $320 for a 24-port switch (SMC GS24-Smart), and for 128 ports a Force10 switch that costs approximately $24,000.
2 This assumes $795/NIC (list price). For an 8-port switch, I used an approximate price of $1,200/port (assuming a Fujitsu switch). For 24 ports, I assumed a price of $1,800/port using the new Quadrics 10 GigE switch. For 128-ports, I assumed a per port switch cost of $2,700 using the Force10 E1200 even though it’s not a full line rate line card.
3 Approximate industry average pricing.
Coming In Part Two
In Part Two we will look at real application performance and how these network technologies play out in the real world. The effect of multi-core systems will be discussed further as well. Stay tuned, it is only going to get more interesting.