Large Scale SMP, Yes Really

How can the benefits enjoyed by large shared memory systems exist in a distributed cluster world?

In last week's column, I talked about “many-core” processors and why cache coherency may limit the number of cores you can place in a processor. This week I want to be fair and balanced. Large-scale SMP (Symmetric Multi-Processing) is possible, but it is usually expensive. Turning a pile of servers into an SMP is, therefore, very desirable, and the idea has been advancing on both the hardware and software fronts. I have had a chance to ponder this topic quite a bit over the last several weeks, but not for the reasons you may think.

My desire for SMP is rather practical and has nothing to do with performance or cache coherency. Recently, I have been helping some clients manage and upgrade their clusters. Running and managing clusters is reasonably well understood and, as we all know, produces great results. For my clients and myself, I like to integrate things as much as possible. To do this, I take the time to craft RPMs so they do a lot of installation mojo behind the scenes. For instance, to make life easier for users, I use the Environment Modules package, because using multiple MPI and math libraries with various compilers is an easy way to create confusion in my bash environment. Environment Modules is a great solution for this problem. As a result, I make sure that important packages are integrated into Environment Modules as part of the RPM install process, and that the integration is removed when the RPM is uninstalled. There are other similar things I do to make life easier in cluster-ville, but you get the point: integration.

In addition to creating a nice integrated environment on the head node, I also want the nodes to be “ready to run” without any configuration. I’m a big believer in diskless provisioning and basically use the Warewulf RAM disk approach. Again, there is a level of integration I like to include so that the nodes just boot and things work. As a result, I do a lot of testing, verification, fussing, sighing, and an occasional WTF. Oftentimes, I think “back in the day, those SMP systems were sure easier to manage.” I don’t mean to say that current SMP systems are not easy to manage; I mean to say it has been a while since I used one. Of course, any cluster node (head or otherwise) is an SMP box, but I am talking about a single SMP box with lots of processors running one OS instance.

There are SMP boxes from Cray, IBM, SGI, or Sun (now Oracle, I’m still having trouble with this change) that work quite well and cost quite a lot of money. The cost of convenience. There are also ways to make a cluster look like an SMP. While not a true SMP (e.g. migration of threads across nodes may not be efficient), software packages like Scyld, Kerrighed, OpenSSI, and others allow things like global process space and process migration. There is also ScaleMP that uses RDMA (Remote Direct Memory Access) to provide a full SMP experience across multiple servers. Their benchmarks have been impressive.

All of the “SMP emulation” approaches require Linux kernel modifications to function. Recently, Numascale has introduced “Plug-and-Play” SMP for AMD hardware. I find this approach interesting because it extends the HyperTransport bus beyond a single motherboard. Recently, I wrote a white paper, SMP Redux: You Can Have It All, for Numascale in which I described the advantages of large-scale SMP and their Plug-and-Play solution. The technology is unusual because it can take a pile of AMD servers and create a true SMP environment with a standard OS; that is, the hardware looks like real SMP hardware to the OS. And the cost is significantly less than that of those big iron SMP systems.

The magic is in a single adapter card that plugs into a standard HTX connector found on commodity motherboards (HTX is the HyperTransport link used by AMD on its processors). Once connected, the OS just “sees” the other processors and memory as if they were a single SMP. Numascale has designed their NumaConnect technology to support up to 4096 nodes with a full 48-bit physical address space, providing a total of 256 Tbytes of memory. It also provides sub-microsecond MPI latency between nodes (ping-pong/2). Yes, you can run MPI jobs on an SMP. Each adapter card has an on-chip seven-way switch that allows for a switch-less network arranged as a 1D, 2D, or 3D torus.
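To make those numbers concrete, here is a small Python sketch of the arithmetic. It checks that 48 address bits give 256 Tbytes, and it lists the wrap-around neighbors of a node in a torus. The function names, the even split of memory across nodes, and the 16 x 16 x 16 layout are my own assumptions for illustration; they do not come from Numascale.

```python
def address_space_bytes(bits):
    """Number of bytes addressable with the given number of address bits."""
    return 2 ** bits


def torus_neighbors(coord, dims):
    """Distinct wrap-around neighbors of coord in a torus of shape dims."""
    coord = tuple(coord)
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size  # wrap at the edges
            n = tuple(n)
            if n != coord:
                neighbors.add(n)
    return sorted(neighbors)


TIB = 2 ** 40
GIB = 2 ** 30

total = address_space_bytes(48)
print(total // TIB)            # 256 (Tbytes in a 48-bit address space)

# Assumption: memory divided evenly over 4096 nodes (illustrative only).
print(total // 4096 // GIB)    # 64 (Gbytes per node)

# Assumption: 4096 nodes laid out as a 16 x 16 x 16 3D torus.
print(len(torus_neighbors((0, 0, 0), (16, 16, 16))))  # 6 links per node
```

A node in a 3D torus needs six links to its neighbors, so a seven-way switch would leave one port for the local host. That reading of the port count is my guess, not something taken from the Numascale documentation.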

Numascale addresses the cache coherency issue using a directory-based cache coherency scheme. Recall that traditional cache methods that employ “snooping” have limited scalability. With a directory-based scheme, each physical node has a directory that contains information about the state of the memory on that particular node. In addition, it has a second cache with a directory containing pointers to the nodes that share data with a particular cache line. Thus, a cache state change only needs to be sent to those nodes that actually share the data and not to any other nodes. This approach drastically reduces the amount of cache coherency information that needs to be exchanged in the system, with a corresponding reduction in the bandwidth requirements. Obviously, the effectiveness of this solution is application dependent (as are all cache-based solutions). Overhead is relatively small and is reported to be less than five percent.
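The difference between snooping and a directory is easy to see in a toy model. The sketch below is not Numascale's implementation; the `Directory` class, its methods, and the cache line address are hypothetical. It only shows the bookkeeping idea: record which nodes share a line, and on a write notify only those nodes.

```python
from collections import defaultdict


class Directory:
    """Toy directory: tracks which nodes hold a copy of each cache line."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = defaultdict(set)  # cache line address -> node ids

    def read(self, node, line):
        """A node reads a line and is recorded as a sharer."""
        self.sharers[line].add(node)

    def write(self, node, line):
        """Return the nodes to invalidate; the writer becomes sole holder."""
        targets = self.sharers[line] - {node}
        self.sharers[line] = {node}
        return sorted(targets)

    def broadcast_cost(self):
        """Messages a snooping-style broadcast would send for one write."""
        return self.num_nodes - 1


d = Directory(4096)
for n in (3, 17, 42):
    d.read(n, 0x1000)

invalidated = d.write(3, 0x1000)
print(invalidated)         # [17, 42] -> two targeted messages
print(d.broadcast_cost())  # 4095 -> what a broadcast would need
```

With three sharers on a 4096-node system, the write triggers two invalidations instead of 4095. Real hardware tracks more states than this toy does, but the scaling argument is the same.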

All this is done behind the scenes using coherent HyperTransport (cHT) and the Numascale adapter. If you have been around the HPC neighborhood for a while, you may think that Numascale seems to have some things in common with SCI (Scalable Coherent Interface), and you would be right. NumaConnect began at Dolphin, makers of SCI adapters, and was spun out as Numascale.

I think about plug-and-play SMP when I am working on clusters. Most nodes already have some kind of high speed interconnect (most likely InfiniBand). If these were NumaConnect, then I would get pretty much the same computing capabilities, but a lot less management overhead because my nodes have joined the collective, as it were. Seems like a good idea and I expect it will develop further in 2011.

In the meantime, I will continue my clustering efforts. All that fussing around with nodes, networks, RPMs, sheet metal, and software keeps me in my basement and out of trouble.
