In a world of rack-'em-and-stack-'em clusters and grids, DAS-3 is implementing a new solution that redefines the concepts of both clusters and grids.
When you mention the term cluster, most people think of racks of servers crunching away in a data center or lab. When you mention grid, most people think of clusters connected by the Internet. Setting “What is a grid?” arguments aside, there is a real distinction between the two high-performance computing (HPC) methodologies. A cluster normally lives in one physical location and has one administrative domain. A grid, on the other hand, is usually built out of separate administrative domains.
In the traditional grid/cluster model, there’s an impedance mismatch of sorts. For a cluster, you want communication to happen as fast as possible. To accomplish this goal, you go around the kernel, doing all of your communication in user space using some sort of “zero-copy” protocol. On the grid side, you want robustness, standards, and the Internet making sure what you send is what’s received. Furthermore, the cluster/grid connection usually takes place through a special gateway node in the cluster. The node could be a login node or a node set up specifically to translate from cluster to Internet (and back). To help with bandwidth, there is sometimes more than one of these nodes, but in general, they represent a bottleneck between “out there” and “in here.”
For many, the cluster/grid mismatch is a real problem. While not quite as vexing as reconciling quantum mechanics and relativity, it leaves two distinct domains that must be seamlessly connected before large-scale distributed computing can become a reality.
There is a possible solution in the works. A project called DAS-3 may go a long way in solving this problem, but before we dive into the technology, a little background may be helpful.
What is DAS?
DAS stands for Distributed ASCI Supercomputer, where ASCI stands for Advanced School for Computing and Imaging (http://www.asci.tudelft.nl/). ASCI is a Dutch graduate school established in 1993 and accredited by the Royal Netherlands Academy of Arts and Sciences. Research groups of Delft University of Technology, Vrije Universiteit, University of Amsterdam, Leiden University, Utrecht University, University of Twente, University of Groningen, Eindhoven Technical University, and the Erasmus University Rotterdam participate in ASCI. The DAS-3 system is funded by NWO/NCF (the Netherlands organization for scientific research), the VLe project (Virtual Laboratory for eScience), and the participating universities.
The DAS project was founded to create a distributed grid supercomputer in the Netherlands. Though the Netherlands is not an especially large country (in speed-of-light terms, the Netherlands is a scant two milliseconds by three milliseconds in size), the project does connect a handful of geographically separate clusters.
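The parenthetical figure can be sanity-checked with a quick back-of-envelope calculation. Assuming the country spans roughly 200 by 300 km and that signals in optical fiber travel at about two-thirds the speed of light (a standard approximation; both figures are assumptions, not from the article), the round-trip propagation times come out close to the quoted two-by-three milliseconds:

```python
# Back-of-envelope: round-trip light propagation across the Netherlands.
# The ~200 km x 300 km extents and the 2/3-c fiber speed are assumptions.
C_VACUUM = 299_792_458          # speed of light in vacuum, m/s
C_FIBER = C_VACUUM * 2 / 3      # typical propagation speed in optical fiber

def round_trip_ms(distance_m):
    """Round-trip propagation delay in milliseconds over fiber."""
    return 2 * distance_m / C_FIBER * 1000

print(f"east-west   (~200 km): {round_trip_ms(200_000):.1f} ms")
print(f"north-south (~300 km): {round_trip_ms(300_000):.1f} ms")
```

In other words, at fiber speeds a signal can cross the whole country and return in about the time a single disk seek takes.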
The DAS-1 project began in 1997 and consisted of four clusters connected by wide-area ATM. Although located at four different ASCI universities, DAS-1 was used and managed as a single integrated system. DAS-2 began in 2002 and consisted of five clusters with a total of 200 dual-CPU nodes based on the 1 GHz Pentium III. Within each cluster, nodes were connected with Myrinet. Between clusters, communication was done over SURFnet, a high-grade computer network specially reserved for higher education and research in the Netherlands. DAS-2 continues operation today, while DAS-1 was decommissioned at the end of 2001.
The DAS-1 and DAS-2 philosophy was unusual in terms of wide-area grid computing. First, individual clusters were designed to be very homogeneous. While physically this was a simple goal, politically it required all the participating institutions to agree on specific hardware and software choices. In the end, choices like Myrinet, Linux, and x86 hardware prevailed, and researchers had one less variable with which to contend. According to principal investigator Henri Bal of Vrije Universiteit, Amsterdam, “This homogeneous structure made it much easier to do clean, [laboratory-like] experiments with DAS, to compute speed-ups of parallel programs running on multiple clusters at the same time. It also eased software exchange between the different groups.”
DAS-1 and DAS-2 proved to be very successful. In addition to supporting research on many aspects of distributed supercomputing, they were used to solve the game of Awari. A solved game is one in which, given two optimal players who know all of the game combinations, the best either player can do is end in a draw. (The Awari game originated in Africa and is played worldwide. It uses simple rules, but can require complex strategies to win.)
For reference, humans can quite easily solve (figure out all the possibilities of) lesser games like Tic-Tac-Toe, but more complex games like chess, checkers, and Awari require large “game spaces” to be considered. For instance, Awari has 889,063,398,406 positions that can possibly occur in a game. Using the power of DAS-2, the game was solved with 144 processors, creating a database 778 GB in size. The researchers used a new, fast parallel algorithm to compute the database in only 51 hours.
While solving a game may not seem like productive computing, it was a good test for the DAS-2 grid, which is dedicated to research. Of particular importance was the large amount of data generated by the solution: approximately 1.0 petabit of information was sent between nodes during the computation. Solving such a large problem helps researchers understand the limits of their machines and generates ideas for the next generation of systems.
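The figures quoted above hang together nicely, as a quick check shows (the position count, database size, processor count, run time, and traffic volume come straight from the text; only the arithmetic is added):

```python
# Sanity-check the Awari computation figures quoted in the article.
positions = 889_063_398_406   # reachable Awari positions
db_bytes = 778e9              # database size: 778 GB
hours = 51                    # wall-clock time on 144 processors
bits_moved = 1.0e15           # inter-node traffic: ~1.0 petabit

bytes_per_position = db_bytes / positions
positions_per_sec = positions / (hours * 3600)
avg_gbps = bits_moved / (hours * 3600) / 1e9  # average aggregate bandwidth

print(f"{bytes_per_position:.2f} bytes stored per position")
print(f"{positions_per_sec / 1e6:.1f} million positions solved per second")
print(f"{avg_gbps:.1f} Gbps average inter-node traffic")
```

Under a byte of storage per position, nearly five million positions retired every second, and a sustained multi-gigabit flow between nodes for more than two days: exactly the kind of stress test a research grid is built for.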
That brings us to the DAS-3 supercomputer. The DAS-3 project (http://www.cs.vu.nl/das3/) is not quite as homogeneous as the previous projects, but the researchers did incorporate many lessons from their past work. In the case of DAS-3, there are five clusters located at four ASCI sites. The clusters were built by ClusterVision BV and are connected by an optical backbone called StarPlane (http://www.starplane.org/). The cluster nodes use either single- or dual-core AMD Opteron processors ranging from 2.2 to 2.6 GHz. Three of the four sites use Myri-10G (see the sidebar) for the interconnect, while one uses standard Gigabit Ethernet. Like all of the previous DAS clusters, Linux is used as the operating system.
Figure One presents a high-level schematic of the DAS-3 computer. Some astute readers may suggest that it should properly be called a grid of clusters. Others may assume the architecture is just clusters using Myrinet connected to the Internet. Take a closer look at the figure. Ignore the colorful pathways for the moment, and focus on the Myricom boxes.
Now, That Is Different!
If you haven’t figured it out yet, the Myricom boxes represent Myrinet switches. The diagram seems a little funny though. Something is missing. There are no gateway nodes. The Myrinet switch is connected directly to the wide area network. Either DAS-3 has some really long Myrinet cables or there is something new going on in this project.
The key to this new design, and a solution to the grid/cluster impedance mismatch, is the use of Myricom’s Myri-10G interconnect (http://www.myri.com/). In addition to supporting both Myrinet and 10 Gigabit Ethernet (10 GigE) protocols at the data-link level, Myri-10G’s physical layer is designed to the 10 GigE standard. The data rate of either protocol is 10 Gbps (bits per second) at full duplex (10 Gbps in both directions). Because the physical protocol is 10 GigE, a direct connection to outside networks is possible; that is, the different data protocols can live on the same wires. This solution is novel and represents a convergence of both close-area cluster communication and wide-area grid communication.
For instance, within a cluster, communications can take advantage of the low-latency, high-throughput kernel-bypass protocols typical of Myrinet networks, and at the same time use TCP/IP over 10 GigE to communicate directly with nodes in other clusters.
Turning back to Figure One, you can see that each Myrinet switch is connected to the center ring by more than one port. Thus, the bandwidth between clusters is not limited to a single link through a gateway node. According to Henri Bal, DAS-3 expects to achieve between 40 and 80 Gbps between clusters. Interestingly, the size of the network, and thus the speed of light, has more to do with the latencies than the actual hardware. Latencies are expected to be only a few milliseconds over the network; longer distances, like those in the United States, would increase them.
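To see why the speed of light, not the hardware, sets the floor, compare the fiber propagation delay for a Dutch inter-site run with a hypothetical US coast-to-coast run (the distances and the 2/3-c fiber speed are illustrative assumptions, not figures from the project):

```python
# One-way fiber propagation delay: why distance, not hardware, dominates.
# The example distances and the 2/3-c fiber speed are assumptions.
C_FIBER = 2e8  # approximate signal speed in optical fiber, m/s (~2/3 c)

def one_way_ms(km):
    """One-way propagation delay in milliseconds for a fiber path."""
    return km * 1000 / C_FIBER * 1000

for label, km in [("Dutch inter-site path (~150 km)", 150),
                  ("US coast-to-coast     (~4000 km)", 4000)]:
    print(f"{label}: {one_way_ms(km):.2f} ms one way")
```

A switch hop adds only microseconds, so over a 150 km path the fiber itself accounts for nearly all of the millisecond-scale latency, and a continental-scale network pays tens of milliseconds no matter how fast the switches are.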
When examining this design, there is a certain sense that this type of convergence may be a harbinger of things to come in the HPC world. Of course, there is a need for new infrastructure, but wide-area computing, which I suggest is different from grid computing, is now possible. For instance, consider a university or business that has HPC cluster resources located at various sites. If the institution is planning (or already has) a 10 GigE backbone, connecting clusters to act as one can literally become a “plug-and-play” proposition. One advantage of a wide-area cluster is that it could be administered and used as a single entity. Recall that this concept differs from a grid, where distinct administrative domains are connected.
And Let’s Make It Configurable
The DAS-3 has one other design feature that merits some discussion.
First, a little background on optical networking. In an optical network, information is transmitted by light. Optical networks are used to traverse long distances because optical signals attenuate far less over distance than electrical signals. However, unlike electrons, photons are not manipulated by semiconductors and thus cannot be switched as easily. In the past, routing an optical network required converting the light into an electrical signal, routing it, then converting it back to light. There are now ways to redirect light by wavelength and eliminate the conversion. A wavelength-selectable switch (WSS) can select light of different wavelengths (or lambdas) and direct it in a specific direction. An example of a WSS is shown in Figure Two.
In a single-lambda (wavelength) optical network, users must share the bandwidth. While this may sound perfectly reasonable as optical bandwidths are quite high, for HPC clustering, more is often better. If the fiber employed multiple lambdas, then more bandwidth could be accommodated.
The DAS-3 optical network, called StarPlane, was designed to use four to eight dedicated 10 Gbps lambdas for a total bandwidth of 40 to 80 Gbps. Impressive, but the DAS-3 design takes it a step further. By using a WSS, the available lambdas can be reconfigured in less than a second as bandwidth requirements between the sites change. That capability means that when the network is running with all eight lambdas, 80 Gbps of bandwidth can be partitioned on a per-application basis between the five DAS-3 clusters.
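To make the idea of per-application partitioning concrete, here is a toy sketch of dividing eight 10 Gbps lambdas among site pairs by demand. The greedy policy, the site names, and the demand figures are all invented for illustration; the article does not describe StarPlane's actual scheduling algorithm.

```python
# Toy illustration: partition 8 x 10 Gbps lambdas between site pairs.
# The greedy policy and the demand figures are invented for illustration;
# this is NOT StarPlane's actual scheduler.
LAMBDA_GBPS = 10
TOTAL_LAMBDAS = 8

def allocate(demands_gbps):
    """Greedily hand whole lambdas to the site pair with the most
    unmet demand until all eight lambdas are assigned."""
    alloc = {pair: 0 for pair in demands_gbps}
    for _ in range(TOTAL_LAMBDAS):
        pair = max(demands_gbps,
                   key=lambda p: demands_gbps[p] - alloc[p] * LAMBDA_GBPS)
        alloc[pair] += 1
    return alloc

# Hypothetical demands between three site pairs, in Gbps
demands = {("VU", "UvA"): 40, ("VU", "Leiden"): 25, ("UvA", "Delft"): 15}
print(allocate(demands))
```

The point is simply that with sub-second WSS reconfiguration, an allocation like this can be recomputed and applied as applications start and stop, rather than being fixed when the network is provisioned.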
Switching at this level makes sense because the eight 10 GigE connections from each cluster, via the Myrinet switch, go directly into the optical switch (a Nortel Optical Multiservice Edge 6500). With eight 10 Gbps lambdas in the optical network, these connections can be used to create optimal application topologies between clusters. In a sense, the DAS-3 project has extended the switched, high-speed cluster network out into a wide-area optical network. The lines between cluster and Internet have begun to blur.
From Theory To Practice
The DAS-3 wide area cluster represents a new paradigm in HPC clustering. At this point, trying to define DAS-3 is interesting. The terms “distributed supercomputer,” “super-grid,” and “wide-area cluster” all seem to fit.
There doesn’t seem to be one overriding conclusion just yet, but the DAS-3 project will undoubtedly chart a new direction for all clustering methods. With the completion of DAS-3, the convergence of the cluster interconnect and the wide area network has begun. Or as Myricom CEO and founder Dr. Chuck Seitz likes to say, “There are now only two networks in the world: Ethernet and Ether-not.”