Doug Eadline takes a break from his ranting about multi-core CPUs to rant about another technology that gives him fits: Virtualization.
This month, I thought I’d take a break from my usual multi-core ranting and talk about another new technology that’s starting to give me fits. No, it’s not the somewhat confusing General-Purpose Graphics Processing Unit (GP-GPU) acronym, which is not to be confused with the GPU that has something to do with putting colored dots on your screen. No, the thing that’s raising my dander is the virtualization thing, or “virt” for short.
Now, virt is a good idea, even if it’s become an annoying buzzword of late. It kind of makes sense with all the extra cores and all. And while I’m not expert enough to talk about how virt will play out in the larger server market, I do have some thoughts about virtualization in high-performance computing (HPC).
Before Virt Came to Town
Back in the good old days of single-core processors, when HPC clustering was in its infancy, getting the application as close to the hardware as possible was very important. In many cases, it still is. Communication between nodes can take place through the operating system using TCP/IP, or outside the operating system using a user space zero-copy protocol. With the exception of pinning down memory, a user space protocol removes the operating system from the communication, resulting in better latency and often better application performance.
If your application is sluggish over Fast Ethernet and TCP, you must get closer to the hardware and use a specialized interconnect like Myrinet. All high performance interconnects are as close as possible to the hardware. Indeed, the hardware actually assists in the communication.
Let’s recap how programs operate on a cluster. A parallel MPI program is essentially a collection of individual programs (processes) running on the same node or on different nodes. Messages are essentially a way to copy memory from one program to another. Programs are “placed” on specific idle nodes by a scheduling program (such as PBS, Sun Grid Engine, or Torque).
The programs are managed by the all-knowing scheduler and once a program is placed on a node, it remains there until complete. Communication takes place over an interconnect, but as far as the scheduler is concerned, what you do with your nodes is your business. This model has worked well for HPC clustering, although it is not the only scheme available.
The other important point is that MPI (or other parallel) applications must be written explicitly. Communication between processes is what makes a program “parallel”. The goal is to make your program run faster.
Recently, the OpenMosix project (http://openmosix.sourceforge.net/) announced it was shutting down. For those who don’t recall or have never heard of OpenMosix, it was an open implementation of the Mosix software originally developed by Amnon Barak. Mosix team member Moshe Bar started and led the OpenMosix effort. And what is Mosix? Very simply, it’s a method to migrate running processes to other computers.
For instance, if you had two computers running Mosix (or more properly, a Mosix-modified kernel), and one computer noticed that its load was very high, it could transparently migrate a process to the less-loaded node.
Now think of Mosix running on a cluster of nodes. A user logs into the head node, starts jobs, and as the load increases, the jobs migrate off the head node to less loaded nodes in the cluster. To the user, the job still looks as if it’s running on the head node, as it appears in the process table and can be manipulated as such.
The unique feature of Mosix is the ability to migrate jobs transparently. Mosix contains heuristics (rules) that determine when and where to move a process. A process can be moved several times during its execution as a means to equalize the workload among all nodes. In short, Mosix and OpenMosix are able to make a collection of nodes look like a big fat SMP machine, or as it is often called, a Single System Image (SSI). (There are other SSI efforts as well. If you are interested, have a look at OpenSSI, found online at http://wiki.openssi.org; Kerrighed, hosted at http://www.kerrighed.org; and Scyld, found at http://www.penguincomputing.com. Each does user-directed process migration as well.)
There are some issues with Mosix migration, however. Things like I/O and threads make migration difficult or in some cases impossible.
Jobs requiring I/O are often returned to the head node to access I/O directly. Recent versions of Mosix are considering global file systems to help solve this problem.
Jobs that are threaded cannot be migrated because managing threads across multiple nodes requires fine-grained synchronization (for instance, shared memory access) that would lead to large inefficiencies.
It is also possible for some MPI versions (and PVM) to work under Mosix, provided they use the kernel for communication. Migrating user space communication is not so clear as migration is kernel-directed. If you are bypassing the kernel – well, you’re kind of asking for trouble when Mr. Migrater comes to visit.
The point about Mosix is that process migration is done dynamically with no user control. There is no need for a global scheduler, as Mosix takes care of load balancing. Compared to a traditional cluster where processes are placed in queues and eventually nailed to specific nodes, Mosix kind of dumps all the processes in a bucket and sorts things out at run-time.
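To make the “bucket” idea a little more concrete, here is a minimal Python sketch of a load-equalizing loop. It is not the Mosix algorithm (the real heuristics live in the kernel and weigh much more than CPU load); the `rebalance` function, the `threshold` parameter, and the node names are all made up for illustration. It simply keeps moving a process from the busiest node to the idlest one until the gap between them is small.

```python
def rebalance(cluster, threshold=1.0):
    """Toy load balancer (hypothetical, not the real Mosix heuristics).

    `cluster` maps a node name to a list of positive per-process loads.
    Processes are moved from the busiest node to the idlest node until
    the load gap between them is at most `threshold`, or until no single
    move would narrow the gap. Returns the migrations performed as
    (process_load, source_node, destination_node) tuples.
    """
    migrations = []
    while True:
        busiest = max(cluster, key=lambda name: sum(cluster[name]))
        idlest = min(cluster, key=lambda name: sum(cluster[name]))
        gap = sum(cluster[busiest]) - sum(cluster[idlest])
        if gap <= threshold:
            break
        # Moving a process at least as large as the gap would not
        # narrow it, so only smaller processes are candidates.
        candidates = [p for p in cluster[busiest] if p < gap]
        if not candidates:
            break
        proc = max(candidates)
        cluster[busiest].remove(proc)
        cluster[idlest].append(proc)
        migrations.append((proc, busiest, idlest))
    return migrations


if __name__ == "__main__":
    # Everything starts on the head node, as in the login scenario above.
    cluster = {"head": [4, 3, 2, 1], "node1": [], "node2": []}
    for proc, src, dst in rebalance(cluster):
        print(f"migrated load {proc} from {src} to {dst}")
    print({name: sum(procs) for name, procs in cluster.items()})
```

A traditional scheduler would instead pick a node once, at job start, and never revisit the decision. The loop above is the run-time “sorting out” that replaces that step.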
While OpenMosix is used on many clusters, OpenMosix doesn’t make your applications run in parallel. That task is up to the programmer. According to project leader Moshe Bar, the advent of multi-core and virtualization has reduced the need for OpenMosix. Where once OpenMosix could unify sixteen nodes into a low-cost SMP resource, multi-core has been doing this for the past year. Today, servers with eight cores are not uncommon or that expensive. In the near future, sixteen cores will be the norm.
So what does all this have to do with virtualization? Nothing. Well, almost nothing. Virtualization is about moving operating system instances around. Cluster parallel computing is about placing or moving processes on or to nodes.
Yet interesting things happen when you mix the two concepts.
Virtualization allows the operating system to run on top of a “Super OS” or a hypervisor. This standardization or abstraction of the hardware presents some interesting possibilities. It allows multiple operating system instances to run at the same time on the same processor.
Remember all those cores you have? Using virtualization on such a machine makes sense if you want to run different distributions at the same time. Maybe you want to run Red Hat and Novell/Suse on the same machine. Or maybe you want to create a sandbox machine that runs a new or different kernel version. Or maybe you are selling co-location space and you want to charge each customer for an individual machine. Of course, you, the clever one, just bought an eight-core server and have eight virtual machines running on it. If a customer wants to reboot their virtual machine, no problem. You just saved a bundle on power and hardware costs.
Because the hypervisor insulates the operating system instance from the hardware, it allows the instance to be migrated from one hypervisor to another. Think about this for a minute. Need to take down a server to fix/add a hard drive? No problem. Start a new virtualization server, move the operating system instance(s) to the new server, fix the box, then move the instance(s) back. The operating system instance has no idea that this is happening. It is almost like the hypervisor is a light bulb socket. An operating system instance, like a light bulb, can vary in different ways, wattage or color, but all light bulbs fit into the same socket. So if you need to move a light bulb, you can do it easily.
In the virtualization world, there are two ways to move a light bulb. The first way is to turn it off, unscrew it, maybe even wait a bit, and then put it in a new socket. The other way is to move it while it’s on. (Stick with me – this is a thought experiment.) If you unscrew the bulb really fast, and put it in the new socket before the electrons stop moving (I did say “really fast”), no one notices the light going out.
What I’ve described are the two forms of migration that are available with virtualization. The first is kind of a “halt and move” while the second is live migration. There are practical applications for both. Halting an operating system instance, let’s call it check-pointing, allows the current image to be suspended and preserved until it is restarted. Live migration allows real-time movement of the operating system image. I don’t have space to go into all the interesting ways this could be used, so I invite you to let your mind wander a bit.
Done wandering? Just in case, here comes some cold water. For virtualization to work, it must add an abstraction layer over the hardware. If you are a true cluster geek, an alarm should have just gone off in your head. Layers add overhead. HPC requires minimal overhead. There is a cost for virtualization goodness. In addition, migrating a single operating system instance is a neat trick. Migrating operating system instances that are synchronously or asynchronously communicating adds another layer of difficulty.
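To put rough numbers on the two ways of moving the light bulb, here is a back-of-the-envelope Python model. It assumes live migration works by iterative “pre-copy” (send all memory while the instance keeps running, then re-send whatever was dirtied in the meantime, and pause only for the last small remainder). That mechanism is my assumption for the sketch, not something covered above, and the function names, parameters, and figures are invented for illustration.

```python
def halt_and_move_downtime(mem_mb, bandwidth_mb_s):
    """Downtime when the instance is stopped for the whole transfer."""
    return mem_mb / bandwidth_mb_s


def live_migration(mem_mb, bandwidth_mb_s, dirty_mb_s,
                   stop_threshold_mb=50, max_rounds=30):
    """Toy pre-copy model (assumed mechanism, illustrative numbers only).

    Each round sends the outstanding memory while the instance runs;
    memory dirtied during that round must be sent again. Once the
    outstanding amount drops to `stop_threshold_mb` (or `max_rounds`
    is reached), the instance pauses and the remainder is sent.
    """
    to_send = mem_mb
    copy_time = 0.0
    rounds = 0
    while to_send > stop_threshold_mb and rounds < max_rounds:
        round_time = to_send / bandwidth_mb_s
        copy_time += round_time
        # Memory dirtied while this round was in flight, capped at
        # the total size of the instance.
        to_send = min(mem_mb, dirty_mb_s * round_time)
        rounds += 1
    downtime = to_send / bandwidth_mb_s
    return {"rounds": rounds,
            "downtime_s": downtime,
            "total_s": copy_time + downtime}


if __name__ == "__main__":
    # A 1,000 MB instance over a 100 MB/s link, dirtying 20 MB/s.
    print(halt_and_move_downtime(1000, 100))
    print(live_migration(1000, 100, 20))
```

Under these made-up numbers the live move takes longer overall but the instance is paused for only a fraction of a second. Note what happens if `dirty_mb_s` approaches `bandwidth_mb_s`: the rounds stop shrinking, which is one way to picture why busy, chatty MPI nodes are harder to move than an idle server.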
Even with the overhead issue, there is still something alluring about virtualization in HPC. If you think about it, what is running on a cluster compute node? Well, let’s see: You have your MPI process(es), the operating system, and hopefully nothing else. If you provision your cluster in an efficient manner, the operating system instance should be pretty minimal and may even live in a RAM disk. Not that much to migrate.
The Mosix and OpenMosix approach heavily modified the kernel to allow process migration. With virtualization, the kernel still needs some modification, but it too is scooped up in the migration. This approach could be valuable to HPC in a number of ways:
Check-pointing node instances is one possibility. Just dump each operating system instance to disk.
Similarly, whole cluster hogging jobs could be swapped out of the cluster and run at a later time or on another cluster.
Another possibility is running unique node instances. Suppose one of your codes requires a specific kernel, libc, or distro version not used on the other nodes. No problem, start the specific operating system instance you need on the nodes you need.
Schedulers could be crafted to work with operating system instances, migrating an instance to help load balance the cluster in real time.
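The check-pointing idea in the list above can be shown in miniature. The Python sketch below is only an application-level analogy: a hypervisor saves an entire operating system instance (memory, CPU, and device state), whereas this toy saves a small dictionary. The function names and the `job.ckpt` file name are hypothetical.

```python
import json
import os
import tempfile


def run(state, steps):
    """Advance a toy job (summing 0..n-1) by at most `steps` iterations."""
    for _ in range(steps):
        if state["i"] >= state["n"]:
            break
        state["total"] += state["i"]
        state["i"] += 1
    return state


def checkpoint(state, path):
    """Dump the job state to disk."""
    with open(path, "w") as f:
        json.dump(state, f)


def restore(path):
    """Load a previously saved job state."""
    with open(path) as f:
        return json.load(f)


def demo():
    state = {"n": 10, "i": 0, "total": 0}
    run(state, 4)                      # do part of the work
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "job.ckpt")
        checkpoint(state, path)        # "halt": state now lives on disk
        resumed = restore(path)        # "move": could be another machine
    return run(resumed, 100)           # finish the job from where it stopped


if __name__ == "__main__":
    print(demo())
```

The resumed job finishes with the same answer it would have produced had it never stopped, which is the whole point. Doing this for a full operating system instance, let alone a set of instances exchanging messages, is where the hard part lives.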
It all starts to sound a little crazy at some point. The thought of moving a large N-way MPI code from one cluster to a hard drive (or at some point a USB stick), then on to another cluster definitely makes my head spin.
As you will keenly note, I intentionally glossed over many details and looked at the big picture. Details, I’ve learned, are important in the end and make a nice home for the devil.
Virtualization is expensive in overhead and still maturing. What we do know is that processors and networks will get faster, cores will multiply, and memory sizes will grow. Approaches that are overhead-intensive and expensive today may not be as costly tomorrow. Virtualization may act as a kind of glue that brings Mosix-like migration to MPI programs.
But then, my head still spins when I wonder how it will all work out in the end.