The 10 GigE train is on its way. Simplicity and low cost have left the station.
I have been reading about 10 GigE (10 Gigabit Ethernet) lately. It seems 10 GigE is starting to enter the cluster picture. There are some things to consider, but in general, “Good Old Ethernet” is taking its next big jump. I am convinced that over the next year there will be a significant up-tick in 10 GigE HPC clusters.
Why am I certain about this prediction (vague as it may be)? It is quite simple: if history is any guide, Ethernet is going to keep on rolling. Before I begin my nostalgia-laced discussion of Ethernet, let me be the first to say that I do not believe in all-or-nothing scenarios (at least in HPC). InfiniBand (IB) is not going anywhere. There is room for IB and 10 GigE, just like there is room for different types of cars. They both get you where you want to go, but depending on your needs and budget, the one that is right for you may not be the best choice for the next guy. Therefore, just because I’m talking about 10 GigE does not mean I am predicting the demise of IB. It is more like I am predicting the demise of GigE use in clusters. You know, the interconnect used in 56% of the recent Top500 list. Okay, you in the back there, you can sit down now.
My 10 GigE prediction is based on the following rule of thumb: Speed, Simplicity, Cost, pick any two. I believe 10 GigE will win because of simplicity and cost. IB is already faster and has better latency, and if you need this level of performance you are not even looking in the Ethernet direction. The joy of clustering is that one size does not fit all and you can build your cluster around your needs.
Looking at cost, you might conclude that 10 GigE is expensive right now, and you would be right. Let’s jump in the WABAC machine and look at what a Fast Ethernet switch cost in the late 1990s. I can see a 2U Foundry Networks Fast Iron Workgroup switch with lots of lights and 16 ports for $4995. That is Fast Ethernet. Jumping ahead, I see similarly priced GigE switches in the not too distant past. In each case they were big, hot, and built to last. Now it is possible to buy a 16-port GigE switch for less than $200. Smaller versions, encased in plastic no less, are available at a lower price. The same can be said for network interface cards (NICs). Very often the NIC ends up on the motherboard as well. Thus, commodity pressure guarantees low cost.
The good news about the commoditization of technology is cost and availability. The bad news is it can also produce a lot of junk. Some of the plastic GigE switches I mentioned just don’t work. They may work for Joe or Jane SixPack, but when pushed they fall down. For this reason, swimming in the commodity stream requires testing some assumptions and/or paying for higher quality parts; again, your choice. I have seen benchmarks improve by 25% just by swapping one inexpensive GigE switch for another. The lesson here: the gems are out there, but test before you buy.
Let’s move on to the simplicity factor. For the most part, Ethernet is a plug-and-play technology. It just works. When you are trying to get a cluster up and running, having dependable networking makes life much easier. All the services you know and love (NFS, schedulers, MPI) run over Ethernet. The other nice thing about Ethernet is the simple, inexpensive cables. Click the cable in and presto, the link light goes on (unless something is broken). Just like there is a down side to low cost, there are some things to consider with the whole “plug-and-play” approach. Just because you can ping between nodes does not mean things are optimized. Indeed, many people are not aware that Ethernet connections can be tuned with either kernel module arguments or with ethtool. On almost all NICs you can also set the MTU size (the size of the Ethernet data packet). This feature becomes more important as the bandwidth increases because the standard 1500 byte Ethernet MTU creates a lot of overhead with GigE and 10 GigE. Tuning these settings can help (or sometimes hurt) performance. The good news is you can always fall back to the default mode if you goof up your settings.
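To put rough numbers on the MTU overhead, here is a small back-of-the-envelope sketch in Python. It is my own illustration, not a benchmark. It assumes plain TCP over IPv4 with no header options, the usual 38 bytes of per-frame Ethernet overhead on the wire (preamble, start delimiter, header, checksum, and inter-frame gap), and a 9000 byte “jumbo” MTU, which is a common vendor setting rather than part of the Ethernet standard. The function names are made up for this example.

```python
# Back-of-the-envelope MTU overhead estimate (illustrative assumptions only).

# Per-frame Ethernet cost on the wire, in bytes:
# preamble (7) + start delimiter (1) + header (14) + FCS (4) + inter-frame gap (12)
ETH_OVERHEAD = 7 + 1 + 14 + 4 + 12  # 38 bytes

# IPv4 header (20) + TCP header (20), assuming no options
IP_TCP_HEADERS = 20 + 20


def payload_efficiency(mtu: int) -> float:
    """Fraction of wire bytes that carry application data for full-size frames."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_per_second(link_bps: float, mtu: int) -> float:
    """Full-size frames per second needed to saturate the link."""
    return link_bps / ((mtu + ETH_OVERHEAD) * 8)


if __name__ == "__main__":
    links = {"GigE": 1e9, "10 GigE": 10e9}
    for name, bps in links.items():
        for mtu in (1500, 9000):  # 9000 is an assumed jumbo-frame size
            print(
                f"{name:8s} MTU {mtu:5d}: "
                f"{payload_efficiency(mtu):6.1%} payload, "
                f"{frames_per_second(bps, mtu):10,.0f} frames/s at line rate"
            )
```

The payload fraction only moves from roughly 95% to 99%, so the bigger effect is the frame count: at 10 GigE line rate, a 1500 byte MTU means the host must handle on the order of 800,000 frames every second, while the larger MTU cuts that by almost a factor of six. Whether that helps in practice depends on the NIC, the driver, and the switch, which all have to agree on the MTU.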
Another nice thing about Ethernet is its ability to do user-space communications. Once the province of the high-end interconnects, this capability lets Ethernet send messages without kernel overhead (copying and TCP/IP processing). A few projects are worth looking at in this regard. The first is the Genoa Active Message MAchine, or GAMMA, which is famous for achieving latencies of less than 10 µseconds over GigE. It does require a patch to the Ethernet driver and only supports certain Intel Ethernet chip-sets. Another optimized communication protocol is Intel® Direct Ethernet Transport (DET), which works by providing a uDAPL-like InfiniBand interface over GigE. uDAPL is the User Direct Access Programming Library that defines a single set of user APIs for all RDMA-capable transports. DET includes a kernel module and a uDAPL library for Ethernet and will work on almost any Ethernet NIC. It can be linked with any software requiring a uDAPL library. Finally, there is the Open-MX project. Open-MX is based on the Myrinet MX protocol and can run over any Ethernet connection. Essentially, any software that links to the Myricom MX library should be able to link with Open-MX. Depending on the chip-set, Open-MX latencies as low as 10 µseconds for GigE have been reported.
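If you want a baseline to hold those latency figures against, a simple ping-pong test over ordinary kernel TCP sockets is a reasonable place to start. The Python sketch below is my own illustration and has nothing to do with the GAMMA, DET, or Open-MX code. To stay self-contained it runs both ends on one host over the loopback address, so it only measures the kernel and interpreter path, not a real wire or switch. The function names, message count, and message size are arbitrary choices for the example.

```python
# Minimal TCP ping-pong latency sketch over the ordinary kernel socket path.
import socket
import threading
import time


def recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes, since TCP may deliver data in pieces."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def echo_server(listener: socket.socket, n_msgs: int, size: int) -> None:
    """Accept one connection and echo back n_msgs messages of `size` bytes."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle's algorithm so small messages are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(n_msgs):
            conn.sendall(recv_exact(conn, size))


def half_round_trip_usec(host: str = "127.0.0.1", n_msgs: int = 1000,
                         size: int = 1) -> float:
    """Average one-way latency estimate (half the round trip) in microseconds."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    server = threading.Thread(target=echo_server, args=(listener, n_msgs, size))
    server.start()

    payload = b"x" * size
    with socket.create_connection((host, port)) as client:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(n_msgs):
            client.sendall(payload)
            recv_exact(client, size)
        elapsed = time.perf_counter() - start

    server.join()
    listener.close()
    return elapsed / n_msgs / 2 * 1e6


if __name__ == "__main__":
    print(f"half round trip: {half_round_trip_usec():.1f} microseconds")
```

To compare against a real interconnect you would run the echo side on one node and the client on another, and a compiled MPI ping-pong benchmark will give tighter numbers than Python. Still, the same test run before and after a switch or driver change is a cheap way to see whether anything moved.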
With each iteration of Ethernet there are always some changes. Perhaps the biggest difference between 10 GigE and older Ethernet standards is the abandonment of half-duplex operation in favor of full-duplex. Therefore, backward compatibility should be considered before mixing Ethernet standards. In terms of cabling, there are now 10GBASE-T cards and switches that use Cat 6 cables and the familiar RJ-45 connectors for distances up to 55 meters (Cat 6a can be used for 100 meters; Cat 5e is not part of the specification, but should work similarly to Cat 6 cable).
How long before 10 GigE becomes a total commodity and hits the desktop? I’m not sure. As in the past, it starts out as a backbone network connecting switches, then shows up on server motherboards, and then in the desktop environment. Of course, there will be those who say the desktop will never need 10 GigE. Just like they said about Fast Ethernet, and GigE. To those network naysayers, how is that 10Base-T hub working out for you these days? You did use Cat 6 cable for everything, right?
In case you missed it, I had a big announcement on Twitter. Something about an HPC for Dummies free ebook that is not a real book, but good enough. Check it out.
Douglas Eadline is the Senior HPC Editor for Linux Magazine.