One of the challenges facing HPC are I/O rates and new clusters designs are paving the way to new levels of performance.
There is a tendency to assume all supercomputing revolves around floating point performance. While crunching numbers is important, it is not the only challenge facing the High Performance Computing (HPC) community. Indeed, one of the results of numeric supercomputing is the generation of answers that measure in TeraBytes (TB) in size. In addition, there are other extremely large data sets that challenge even the largest machines. For examples, bio-science (genome databases), climate modeling, earthquake modeling, astronomy, and other data rich areas are ripe for analysis — if only there were machines that were designed to do Input and Output (I/O) as fast as they can crunch numbers.
Meeting this challenge is an important goal for the HPC community. While chasing FLOPS (Floating Point Operations per Second) is important, designing machines that can all deliver sustained IOPS (I/O Operations per Second) is becoming just as important. Recently the San Diego Supercomputing Center (SDSC) at the University of California, San Diego, was awarded a $20 million NFS grant to develop the Gordon cluster. Gordon, slated for installation in mid-2011, is intended to be a production machine and is based on SDSC’s award winning Dash cluster.
Recently, I had the chance to talk with Mike Norman, SDSC Interim Director and project principal investigator, and Allan Snavely, Associate Director of SDSC and co-principal investigator, about the Gordon project. The cluster is being built by Appro International, so I also invited Steve Lyness, VP of HPC Solutions for Appro to join the conversation. Gordon, like its smaller precursor, Dash, has some novel features including the use of Flash SSD (Solid State Drive) for storage and ScaleMP software, but also will use dual rail InfiniBand in a 3D torus.
Linux Magazine: Looking at the description of the Gordon cluster there are some novel aspects, in particular the “supernodes,” I think that is somewhat of a departure from the standard single server node that most clusters use. Can you give us some insight into the hardware and software design of the supernodes?
Mike Norman: Let me begin by stating this project was in response to a call from the National Science Foundation to deploy a “data intensive” computer. That term, data intensive, is not very well defined. You can’t go to Wikipedia and find out exactly what that means. We started with a blank piece of paper and wanted a system that had very large memory nodes measuring in the TeraBytes because this is the typical size of the large working sets in science today. Of course you can’t get that many TeraBytes on a single board. So we are looking to use ScaleMP and aggregate the memory of many physical nodes into a supernode. We decided to push the envelope and have come up with a design that gives us 32 supernodes. Each supernode consists of 32 HPC nodes and is capable of 240 GFLOPS/node and 64 GigaBytes (GB) of RAM. A supernode also incorporates 2 I/O nodes, each with 4 (TeraBytes) TB of flash memory (using SSDs). When tied together by virtual shared memory, each of the system’s 32 supernodes has the potential of 7.7 TFLOP of compute power and 10 TB of memory (2 TB of DRAM and 8 TB of flash memory). We will also be using dual rail QDR InfiniBand to connect the nodes.
If you take 32 supernodes, that adds up to a pretty substantial amount of compute and I/O power. I was basically just trying to provide as much shared memory for a single application whether that was RAM or flash, we really think of them as the same and we got to 10 TeraBytes with this design and we think this is a pretty good number for the working sets that people have or plan to have in the next few years.
LM: Is the cluster made up of just supernodes or do you have regular standard nodes as well?
Mike Norman: Well a supernode is an aggregation of regular nodes and if we don’t boot the ScaleMP software it is just a 1024 node Linux cluster. We have the option of running it both ways and so you can say that say yes, we have both regular nodes and supernodes.
LM: Do you think the main usage is going to be the supernode model or is this something the user might say look I just need 64 or 32 regular nodes, can I just boot up them up regularly and run my application or are things going to be running primarily the supernode applications?
Mike Norman: We are going to feel our way along with that. We actually don’t know what the workload is going to look like in the future. We anticipate strong demand for supernodes, but we may only run half the system as supernodes at any one time.
LM: Along those lines, do you have a target set of applications you are looking to support or do you have people who have expressed interest in running specific problems on Gordon?
Mike Norman: We divide them in two basic types; one is which you might call traditional HPC predictive simulation kinds of applications and the other is data mining. In the first category you have your structure engineering, quantum chemistry, etc. These are the kinds of applications that can use as much RAM as you give them but they don’t scale particularly well. They don’t live well in the small memory nodes of the really big HPC clusters. So that’s our bread and butter, but the second kind, the data mining applications we think there is a new user base for this type of problem. This type of application is either using or accessing very large databases or very large data sitting in flat files. Both kinds are very prevalent in science and engineering but not very well supported by current architectures.
LM: The phrase you used is “data driven scientific exploration”, that’s the second category you were talking about?
Mike Norman: Right, Allan may want to amplify on the second kind a bit.
Allan Snavely: If you think about a complex data base query where you are finding a piece of data that then tells you where to find the next piece of data. For example, maybe you are doing a search through a medical records database to find a patient and then you are going to use that patient’s number to find that person’s genome, perhaps in a second database or a correlated database. These kinds of queries are becoming more common in science and also more prevalent in business as well, for example, like the Wal-Mart problem where you are trying to do price-purchase correlation. If these are very large databases, they are latency bound. In other words, if the database actually falls out of memory (RAM) and sits on a disk you have to move the disk head. You have to move a physical thing made up of protons, which is heavy. And, you can spend a lot of energy and time to move that to find a piece of data; now this data in then used as the lookup index to go find the associative piece of data. You are actually swinging the disk kit back and forth quite a bit. This becomes a hopeless endeavor in terms of the slowness of the task. You hear “People are drowning in a sea of data” all the time. What they mean by that is there is a literal physical limitation getting the disk head to move around to find my data, so it takes forever to cover all the data of interest in that database. We plan on making this process at least 10 times faster so there will be an order of magnitude improvement in this type of query.
LM: That is a good segue into my next question which was about the SSD (Solid State Disk) drives. The people often ask is “what is the best way to use them in HPC?” SDSC certainly has some history with the Dash cluster, so maybe you can give some insight and background on how they will be employed and used in this cluster?
Allan Snavely: The simplest use, but the lowest hanging fruit, is to replace the spinning disk application with a flash application, in other words, to treat the SSD as a disk. SSD’s are interesting because they can be considered very dense but some what slow memory. They could be a replacement for RAM where you get access to more of the memory, but it is slower or they can be considered a fast disk drive. In another words, something that is maybe not as large capacity as a spinning disk, but faster. It is interesting to try to think which kind of applications can use either one. But the things where we have been seeing the most dramatic speed-up is something with a thrashing data access pattern on disk; that is moving the disk head all the time jumping from sector to sector. Now if you move this to an SSD and there is an instant benefit. This is the classic view of the data mining problem.
LM: Are these being used in a parallel file system capacity or just as an attached disk to nodes? In other words, how does the end user see them?
Allan Snavely: We don’t have a parallel file system running on them, although that is a possibility. What we do is take 16 individual drives and connect them up to a controller and set up a RAID so that you have 16×64 GB or 1 TB of storage, a rather big pool of flash. It just looks like one file system that you can write to. So there is parallelism in the performance of it because of the RAID stripping, but there isn’t user semantics for writing things to that in parallel.
LM: The other interesting thing about this system is it’s the Appro Extreme-X1 design which has a dual rail InfiniBand and redundant compute, data, and administration networks with fail over. I am interested to hear what your reasoning is behind that and how you can think leverage that with this cluster?
Allan Snavely: The main thing that we like about it is the dual rail, which we are using for performance for bandwidth not for redundancy. Of course, it is nice to have fail over and redundancy especially for very large systems because you know you’re going to have failure from time to time. However, Gordon is not particularly large. It’s large but it’s certainly not “state of the art large”, even today much less when its going to be deployed, a mere 1000 nodes. So, it’s really the dual rail for extra bandwidth. We do have lots of parallel applications and other applications moving lots of data over the network and we just want that double bandwidth on the rail. The other things are nice to have.
LM: I forgot to put this in the question, How is the network configured, is it fat tree, grid?
Allan Snavely: It is a 3D torus.
LM: Is that considered somewhat unconventional in a way? Is that largely due to the supernode idea that things are happening in supernodes you don’t need a full fat tree across the cluster or is the 3D Taurus adequate for what you think you will need to do?
Allan Snavely: The supernode is small enough that it actually has its own topology, it is just two switches. So, it hardly amounts to a torus — data goes in one switch and out the other to get to any other processor on in the supernode. It’s the group of those switches which are connected in a torus. The nice thing about a torus is it is symmetrical and easy to reason about how long communications are going to take to get to any other point in the torus. So, the users communication load balancing problem is a lot easier to understand in that environment than in a fat tree. Typically in a fat tree is what happens that your nearest neighbors are very close to you but your furthest neighbors may require a complex traversal up to the top and back down to the root of that tree. So, while aggregate, bandwidth and latency may be good, you’ve got these weird outliers. You actually really have to think a lot about the topology of a fat tree if you need to get good load balance and performance. You have to think about these same things in a 3D Taurus but it’s a little simpler to think about, at least I think so.
LM: I want to switch over and ask Appro some questions. Obviously the design of this cluster is a little different than most. I am wondering what features of Appro Cluster Engine ™ (ACE) management software may help with this type of cluster?
Steve Lyness: I think one of the very important parts of ACE that is critical to this implementation is going to be the ability to quickly and easily repartition the main cluster into virtual clusters. This will give SDSC the ability to run different OS images based on what they are trying to do. Being able to change a group of 34 nodes from running MPI jobs to a supernode running ScaleMP software within a couple of minutes will be very valuable. So the user can use it as a MPI machine or a big SMP box and automatically shift between the two modes in a mater of minutes.
LM:In turns of job scheduling, would each supernode be part of the cluster scheduler or have its own scheduler?
Steve Lyness: I think what will happen is that Grid Engine will probably view an SMP node as a single resource along with the other nodes. A job will have certain characteristics that a user will need. They get a supernode and it has 64 processors and one TB of memory, and access to the data disk farm. So it has all these things as if it is one large box.
LM: Anything else new with ACE that you think is going to be valuable in this system?
Steve Lyness: I think the ability to do this system, being able to have the SSD’s out there and being allocated to the individual compute note when it’s not in the SMP supernode mode. Being able to access those local hard drives will be very critical. We are also looking at adding some additional capabilities around using these SSD’s potentially as a buffer, as kind of a cache for data access.
LM: That is an interesting idea. One thing we did not cover was the name Gordon. Is there a special meaning?
Allan Snavely: The systemâ€™s name is Gordon and itâ€™s built on flash memory, so we figured people would get the connection â€“ and they have!
LM: In HPC we tend to measure clusters with one yard stick, that being HPL. If someone would say to you how does this cluster rank in terms of HPL and you had a chance to explain to them that is the wrong metric, what kind of metric would you provide? In other words, what kinds of things can Gordon do that some “HPL clusters” cannot do?
Mike Norman: We actually think of Gordon as a HPD machine, High Performance Data machine. Like I said in the beginning of the interview, there isn’t really a well defined definition for a data-intensive supercomputer. So we just have to make one up. What we wanted to do is max out all the metrics that have to do with I/O. I/O to disk, I/O to memory, and I/O to flash so the metrics that we are most proud of are things like IOPS (Input-Output Operations per Second) not FLOPS (Floating Point Operations per Second). So I would like people to start thinking about a different sector of this business — the HPD business — and were trying to pioneer what that means. Allan could tell you a little bit more big numbers we are trying to shoot for.
Allan Snavely: Well 32 million IOPS is the goal. That is the number, granted it has not been built yet, someone is going to have to work hard to beat that number. If we had a “Top 500″ list for random IOPS, at least on paper, we are sitting at number 1 with our proposed machine and that where we want to live.
LM: So you are basically finding that the ScaleMP is working as expected?
Allan Snavely: We are very pleased with it so far and the evaluation is ongoing. So I think Mike will probably say, “Solve me a big finance problem as he doesn’t care about the stream numbers.” We are trying out some very interesting science applications as well. These include a search of the transient database, like Mike mentioned earlier in astronomy and so we will see how it does on real science applications. But the low level benchmarks are promising and seem to show possibility or likelihood of doing well on real problems. We had Dash up in a short period of time and are now trying to solve real problems on SMP supernodes.
LM:Is there something else you want people to know about your system?
Allan Snavely: I want to give you a quote. SDSC has historically led in data-intense supercomputing and Gordon really takes us to the next chapter of that story, because it is unlike any other system in the NSF portfolio. We have high hopes it’s going to lead to new science. [Note: The SDSC Dash cluster recently won the SC09 Data Challenge.]
LM: You seem to be looking at this data driven cluster idea in a very top down way rather than just adding more nodes.
Mike Norman: We are trying to do something innovative with commodity parts. It is an interesting phase. You can go out and try and build a machine to do random memory access from scratch and burn a huge amount of R&D on it, for example, the Cray MTA would be an example of that. In other words, design a thing architecturally from first principles to do a big graph problem. I like that idea, but we can’t afford it and we are skeptical that you can make a single business model for that kind of machines — and more power to them if they can. Our idea is to take stuff off the shelf that has maybe been developed for a slightly different purpose. Take workstation processors and a commodity parts and try to connect those together with innovative software and hardware to solve science problems. So that’s an engineering challenge among other things.
LM: That is actually a good point. So in my experience I have seen that approach triumph more than building it from the bottom up.
Mike Norman: Yeah that is the history of high performance computing in the last twenty years. It went from Cray systems that were designed like a battleship for a specific purpose to big clusters that are just built out of processors that are also used on machines that play Halo or World of War Craft. And, you know it seems to be working.
Douglas Eadline is the Senior HPC Editor for Linux Magazine.