HPC Virtualization Fun

Back in the good old days of single-core processors, when HPC clustering was in its infancy, getting the application as close as possible to the hardware was very important. In many cases, it still is. Communication between nodes could take place through the operating system using TCP/IP, or outside the OS using a userspace zero-copy protocol. With the exception of pinning down memory, the userspace protocol removes the OS from the communication path entirely. The result is better application performance thanks to lower latency and higher throughput.
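
To make the pinning step concrete, here is a rough sketch of how one common userspace zero-copy stack, InfiniBand verbs via libibverbs, pins a buffer so the adapter can move data without the kernel in the data path. This is only an illustration of the idea, not necessarily the stack you run; the device choice, buffer size, and access flags are arbitrary, and most error handling is trimmed.

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA-capable devices found\n");
        return 1;
    }

    /* Open the first device and allocate a protection domain. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Pin ("register") a 1 MB buffer so the adapter can DMA to and
       from it directly; after this call the OS is out of the data path,
       it merely guarantees the pages stay resident. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("pinned %zu bytes (lkey 0x%x)\n", len, mr->lkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}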

To see how virtualization might play in HPC, let’s consider how programs operate on a cluster. A parallel MPI program is basically a collection of individual programs (processes) running on the same or different nodes. Messages are essentially a way to copy memory from one program to another. Programs are “placed” on specific idle nodes/cores by a scheduling program (such as PBS, Sun Grid Engine, or Torque). The programs are managed by the all-knowing scheduler and, once placed on a node, stay there until they complete. Communication takes place over an interconnect, but as far as the scheduler is concerned, what you do with your nodes is your business. This model has worked well for HPC clustering, although it is not the only one.
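
For readers who have not seen it, a bare-bones MPI program shows the model at work: each rank is an ordinary process, and a message is just a copy of one process’s memory into another’s. The sketch below assumes an MPI implementation such as MPICH or Open MPI and at least two ranks; the message contents are arbitrary.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    char buf[64] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "run with at least two ranks\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        /* Copy rank 0's memory into rank 1's address space. */
        strcpy(buf, "hello from rank 0");
        MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}

The scheduler never sees any of this traffic; it simply hands the job its nodes and gets out of the way.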

Virtualization allows an OS instance (and the programs running on it) to run on top of a hypervisor. Because the hypervisor insulates the OS instance from the hardware, the OS and its running programs can be migrated from one hypervisor to another.

Think about this for a minute. Need to take down a server to fix or add a hard drive? Start a new virtualization server, move the guest instance(s), fix the box, then move them back. The guest OS has no idea that this is happening. It’s almost as if the hypervisors are light bulb sockets. OS instances, like light bulbs, can vary in wattage or color, but they all fit into the same socket. So if you need to move a light bulb, you can do it easily.


In the virtualization world, there are two ways to move a light bulb. The first way is to turn it off, unscrew it, maybe even wait a bit, and then put it in a new socket. The other way is to move it while it is on. (Stick with me, this is a thought experiment. It doesn’t have to actually be possible.) If you unscrew the bulb really fast and put it in the new socket before the electrons stop moving (I did use the term “really fast”), then no one notices the light going out.

Recall that, in order for virtualization to work, it needs to add an abstraction layer over the hardware. If you are a true cluster geek, an alarm should have just gone off in your head. Layers add overhead. In HPC we want minimal overhead, because we want to live as close to the hardware as possible. Virtualization goodness comes at a cost. In addition, migrating a single OS instance is a neat trick, but migrating multiple OS instances that are synchronously or asynchronously communicating via MPI adds another layer of difficulty.
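
If you want to put a number on that overhead, the usual trick is a ping-pong microbenchmark: run it between two ranks on bare metal, then again between two guest OS instances, and compare the times. A rough sketch follows; the one-byte message size and iteration count are arbitrary choices for illustration.

#include <stdio.h>
#include <mpi.h>

#define ITERS 10000

int main(int argc, char **argv)
{
    int rank;
    char byte = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    /* Bounce a one-byte message back and forth between ranks 0 and 1. */
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("average one-way latency: %.2f microseconds\n",
               (t1 - t0) * 1.0e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}

In general, the more the guest has to go through emulated devices instead of talking to the real interconnect, the bigger that number gets.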

Even with the overhead issue, virtualization in HPC is still alluring. With the ability to migrate a running MPI/OS instance, some interesting things are possible.

First, check-pointing node instances becomes possible: just dump each OS instance to disk. Similarly, whole cluster-hogging jobs could be swapped out of the cluster and run at a later time, on other nodes, or on another cluster. Of course, check-pointing and job swapping are not new ideas, but they are not easy to do in a cluster environment.
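
As a rough illustration of the first idea, the libvirt C API (assuming a libvirt-managed hypervisor such as Xen or KVM, with the guest name and checkpoint file supplied on the command line) can dump a running guest to disk and bring it back later:

#include <stdio.h>
#include <libvirt/libvirt.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <guest-name> <checkpoint-file>\n", argv[0]);
        return 1;
    }

    virConnectPtr conn = virConnectOpen(NULL);  /* default hypervisor URI */
    if (!conn) {
        fprintf(stderr, "could not connect to the hypervisor\n");
        return 1;
    }

    virDomainPtr dom = virDomainLookupByName(conn, argv[1]);
    if (!dom) {
        fprintf(stderr, "no such guest: %s\n", argv[1]);
        virConnectClose(conn);
        return 1;
    }

    /* Suspend the guest and write its memory image to disk; the
       matching virDomainRestore() call brings it back later. */
    if (virDomainSave(dom, argv[2]) < 0)
        fprintf(stderr, "save failed\n");
    else
        printf("guest %s checkpointed to %s\n", argv[1], argv[2]);

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}

Coordinating those saves across every node of a running MPI job, so that no in-flight messages are lost, is of course the part that is not easy.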

Another possibility is running unique node instances. Suppose one of your codes requires a specific kernel, libc, or OS version not used on the other nodes. No problem! Start the specific OS instance you need on the nodes you need. Finally, schedulers could be crafted to work with OS instances, migrating an instance to help load balance the cluster in real time. It all starts to sound a little crazy at some point. The possibility of moving a large N-way MPI code from one cluster, to a hard drive (or, at some point, a USB stick), then on to another cluster definitely makes my head spin.

As you will keenly note, I intentionally glossed over many details and looked at the big picture. Details, I have learned, are important in the end and make a nice home for the devil. Virtualization is overhead-expensive and still maturing. What we do know is that processors and networks will get faster, cores will multiply, and memory sizes will grow. Approaches that are overhead-expensive today may not be tomorrow. Virtualization may open up a transport interface for HPC where computation effortlessly slides around an ever-changing array of computing resources.

Then the fun begins.
