Large Scale SMP, Yes Really

How can the benefits enjoyed by large shared memory systems exist in a distributed cluster world?

In last week's column, I talked about “many-core” processors and why cache coherency may limit the number of cores you can place in a processor. This week I want to be fair and balanced. Large scale SMP (Symmetrical Multi-Processing) is possible, but is usually expensive. Turning a pile of servers into an SMP is, therefore, very desirable and has been advancing on both the hardware and software fronts. I have had a chance to ponder this topic quite a bit over the last several weeks, but not for the reasons you may think.

My desire for SMP is rather practical and has nothing to do with performance or cache coherency. Recently, I have been helping some clients with managing and upgrading their clusters. Running and managing clusters is reasonably well understood and, as we all know, produces great results. For me and my clients, I like to integrate things as much as possible. To do this, I take the time to craft RPMs so they do a lot of installation mojo behind the scenes. For instance, to make life easier for users, I use the Environment Modules package because using multiple MPI and math libraries with various compilers is an easy way to create confusion in my bash environment. Environment Modules is a great solution for this problem. As a result, I make sure that important packages are integrated into the Modules package as part of the RPM install process. And, they uninstall when the RPM is removed. There are other similar things I do to make life easier in cluster-ville, but you get the point: integration.

In addition to creating a nice integrated environment on the head node, I also want the nodes to be “ready to run” without any configuration. I’m a big believer in diskless provisioning and basically use the Warewulf RAM disk approach. Again, there is a level of integration I like to include so that the nodes just boot and things work. As a result I do a lot of testing, verification, fussing, sighing, and an occasional WTF. Oftentimes, I think “back in the day, those SMP systems were sure easier to manage.” I don’t mean to say that current SMP systems are not easy to manage; I mean to say it has been a while since I used one. Of course, any cluster node (head or otherwise) is an SMP box, but I am talking about a single SMP box with lots of processors that has one OS instance running.

There are SMP boxes from Cray, IBM, SGI, or Sun (now Oracle, I’m still having trouble with this change) that work quite well and cost quite a lot of money. The cost of convenience. There are also ways to make a cluster look like an SMP. While not a true SMP (e.g. migration of threads across nodes may not be efficient), software packages like Scyld, Kerrighed, OpenSSI, and others allow things like global process space and process migration. There is also ScaleMP that uses RDMA (Remote Direct Memory Access) to provide a full SMP experience across multiple servers. Their benchmarks have been impressive.

All of the “SMP emulation” approaches require Linux kernel modifications to function. Recently, Numascale has introduced “Plug-and-Play” SMP for AMD hardware. I find this approach interesting because it extends the HyperTransport bus beyond a single motherboard. Recently, I wrote a white paper, SMP Redux: You Can Have It All, for Numascale in which I described the advantages of large scale SMP and their Plug-and-Play solution. The technology is rather unique because it can take a pile of AMD servers and create a true SMP environment with a standard OS, that is, the hardware looks like real SMP hardware to the OS. And, the cost is significantly less than those big iron SMP systems.

The magic is in a single adapter card that plugs into a standard HTX connector found on commodity motherboards (HTX is the HyperTransport link used by AMD on its processors). Once connected, the OS just “sees” the other processors and memory as if they were a single SMP. Numascale has designed their NumaConnect technology to support up to 4096 nodes with a full 48 bit physical address space, providing a total of 256 Tbytes of memory. It also provides sub-microsecond MPI latency between nodes (ping-pong/2). Yes, you can run MPI jobs on an SMP. Each adapter card has an on-chip seven-way switch that allows for a switch-less network in a 1D, 2D, or 3D torus.
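To make those numbers concrete, here is a small Python sketch of the arithmetic and of how a torus wraps around. It is only an illustration: the function names are made up for this column, the even split of memory across nodes is an assumption, and the 16 x 16 x 16 layout is just one way to arrange 4096 nodes, not something Numascale prescribes.

```python
ADDRESS_BITS = 48   # physical address width quoted above
MAX_NODES = 4096    # maximum node count quoted above


def total_memory_tib(address_bits=ADDRESS_BITS):
    """Size of the address space in TiB (2**40 bytes each)."""
    return 2 ** address_bits // 2 ** 40


def per_node_gib(address_bits=ADDRESS_BITS, nodes=MAX_NODES):
    """Share of the address space per node in GiB, ASSUMING an even split."""
    return 2 ** address_bits // nodes // 2 ** 30


def torus_neighbors(coord, dims):
    """Direct neighbors of a node in a 1D, 2D, or 3D torus.

    coord and dims are tuples of equal length; each axis wraps around,
    so the node at the "edge" links back to the node at the other end.
    """
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            if tuple(n) != tuple(coord):
                neighbors.add(tuple(n))
    return sorted(neighbors)


if __name__ == "__main__":
    print(total_memory_tib())                               # address space in TiB
    print(per_node_gib())                                   # per-node share in GiB
    print(len(torus_neighbors((0, 0, 0), (16, 16, 16))))    # links used in 3D
```

A 48 bit address space works out to 256 TiB, which matches the 256 Tbytes figure. In a 3D torus each node talks to six neighbors, so a seven-way switch would presumably leave one port for the local node (my reading, not a statement from Numascale).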

Numascale addresses the cache coherency issue using a directory based cache coherency scheme. Recall that traditional cache methods that employ “snooping” have limited scalability. With a directory based cache scheme, each physical node has a directory that contains information about the state of the memory on that particular node. In addition, it has a second cache with a directory containing pointers to the nodes that share data with a particular cache line. Thus, a cache state change only needs to be sent to those nodes that actually share the data and not to any other nodes. This approach drastically reduces the amount of cache coherency information that needs to be exchanged in the system, with a corresponding reduction in the bandwidth requirements. Obviously, the effectiveness of this solution is application dependent (as are all cache based solutions). Overhead is relatively small and is reported to be less than five percent.
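The difference between snooping and a directory is easy to see in a toy model. The Python sketch below is not Numascale's protocol; the class and function names are invented for illustration, and it tracks only which nodes hold a copy of a line, ignoring real cache states, the second cache, and timing.

```python
class Directory:
    """Toy directory: maps each cache line to the set of nodes holding a copy."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}  # line address -> set of node ids

    def read(self, node, line):
        """A node reads a line and is recorded as a sharer."""
        self.sharers.setdefault(line, set()).add(node)

    def write(self, node, line):
        """A node writes a line.

        Returns the nodes that must be sent an invalidation (the other
        sharers only); afterwards the writer is the sole holder.
        """
        targets = self.sharers.get(line, set()) - {node}
        self.sharers[line] = {node}
        return sorted(targets)


def snoop_messages(num_nodes):
    """With broadcast snooping, every other node hears about each write."""
    return num_nodes - 1


if __name__ == "__main__":
    d = Directory(num_nodes=4096)
    d.read(3, 0x1000)
    d.read(17, 0x1000)
    print(d.write(3, 0x1000))       # only node 17 needs an invalidation
    print(snoop_messages(4096))     # a broadcast would reach every other node
```

In this example a write touches one other node instead of 4095, which is the whole point of the directory. How close a real application gets to that depends on how widely its data is shared.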

All this is done behind the scenes using coherent HyperTransport (cHT) and the Numascale adapter. If you have been around the HPC neighborhood for a while, you may think that Numascale seems to have some things in common with SCI (Scalable Coherent Interface), and you would be right. NumaConnect began at Dolphin, makers of SCI adapters, and was spun out as Numascale.

I think about plug-and-play SMP when I am working on clusters. Most nodes already have some kind of high speed interconnect (most likely InfiniBand). If that interconnect were NumaConnect, I would get pretty much the same computing capabilities, but with a lot less management overhead because my nodes would have joined the collective, as it were. Seems like a good idea and I expect it will develop further in 2011.

In the meantime, I will continue my clustering efforts. All that fussing around with nodes, networks, RPMs, sheet metal, and software keeps me in my basement and out of trouble.
