
Over There Vs. Over Here

Matching the right solution with the right problem takes skill, flexibility, and a little luck.

It is funny how solutions develop. Very often, a “perceived problem” is identified and discussed, technical solutions are put together, and the problem is solved. Many times, however, the problem turns out to be not as big as you thought. Of course, along the way, some really nice technology may result, but it may not be the “killer solution” (i.e., a big success that everyone uses) you were expecting. Then there are other problems, often unforeseen, that find a solution from some other unforeseen area. Clusters “sort of” fall into this category. In fact, there were many who thought clusters were just some kind of DIY fad.

I was reminded of this scenario by a recent post to the Beowulf mailing list by Don Becker:

BProc is based around directed process migration — a more efficient technique than the common transparent process migration. You can do many cool things with process migration, but with experience we found that the costly parts weren’t really the valuable ones. What you really want is the guarantee that running a program *over there* returns the expected results — the same results as running it *here*. That means more than knowing the command line. You want the same executable, linked with exactly the same library versions in the same order, with the same environment and parameters.

In years past, process migration on clusters was of great interest because it brought a unified process space across the entire cluster (still a great feature, as implemented in Scyld’s BProc), but, as Don says, that is not what seems to be really important right now. I recommend you read the full post and jump over to our recent Interview with Don Becker. Don is a busy cluster kind of guy, and when he speaks up he usually has something interesting and important to share.

The “over there” vs. “here” problem is one of scale. With a small 32-node cluster it would seem like a non-issue. Bump that up by two or three orders of magnitude and you might begin to see how this could be a problem. The less experienced might suggest loading the same thing on all the nodes. Sure, that is a good plan at first, but if something were changed for any number of reasons, you could have a problem. Another question worth asking is how to verify that the execution environment is what you want.
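One simple way to think about verifying an execution environment is to fingerprint everything that matters — the executable, the libraries it links against, and the relevant environment variables — and compare digests between nodes. The sketch below is purely illustrative (it is not how BProc or Scyld do it); the function names are my own invention.

```python
# Sketch: verify that "over there" matches "here" by fingerprinting the
# execution environment. Hypothetical approach, not the BProc mechanism.
import hashlib
import os

def file_digest(path):
    """Return the SHA-256 digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def environment_fingerprint(executable, libraries, env_vars):
    """Combine the executable, its libraries, and selected environment
    variables into one digest; two nodes agree only if every component
    matches exactly (same files, same versions, same settings)."""
    h = hashlib.sha256()
    for path in [executable] + sorted(libraries):
        h.update(file_digest(path).encode())
    for name in sorted(env_vars):
        h.update(f"{name}={os.environ.get(name, '')}".encode())
    return h.hexdigest()
```

A head node could compute this once and reject any node whose fingerprint differs — a change to a single library byte changes the digest.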

To be clear, Don states that you don’t need directed process migration to ensure consistency, but BProc can be used to achieve that goal and provide other nice things. Which leads me to another thought.

One of the questions I am often asked is “What is the virtualization play for HPC?” I usually reply that there are issues that need to be resolved before virtualization and HPC walk hand in hand, but that process migration in the form of checkpointing would be a great thing to have. Thinking about the “over there” vs. “here” problem in terms of virtualization, however, may just be the killer HPC/virtualization application that solves a big problem.

Imagine creating a tested, working image of your application, operating system, and file system and running it on a virtual HPC machine. The “over there” vs. “here” problem goes away because “over there” is “what is here.” Of course, we are talking about scale, and pushing a large number of images out to thousands of nodes is an issue. And notice I threw in the file system part. I believe that before HPC can be virtualized (or “clouded”), the I/O issue (both compute and file system) needs to be resolved. I suspect this will be through some form of I/O specification that travels with the job image. The specification will allow the cloud to run the application on the right hardware. The current cloud definition is rather loose when it comes to I/O (i.e., it will be there, we just can’t say exactly how fast or consistent it will be).
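No such I/O specification exists today, but a rough sketch of what one might look like, and how a scheduler could match it against a node’s advertised capabilities, might go like this. Every field name here is invented for illustration, not drawn from any real standard.

```python
# Sketch of a hypothetical I/O specification that travels with a job
# image. The field names are invented for illustration only.

def make_io_spec(min_bandwidth_mb_s, min_iops, filesystem, interconnect):
    """Bundle the job's I/O requirements so a scheduler could place the
    image on hardware that meets them."""
    return {
        "compute_io": {"interconnect": interconnect},
        "file_io": {
            "filesystem": filesystem,
            "min_bandwidth_mb_s": min_bandwidth_mb_s,
            "min_iops": min_iops,
        },
    }

def node_satisfies(spec, node):
    """Check a node's advertised capabilities against the spec."""
    fio = spec["file_io"]
    return (node["bandwidth_mb_s"] >= fio["min_bandwidth_mb_s"]
            and node["iops"] >= fio["min_iops"]
            and node["filesystem"] == fio["filesystem"]
            and node["interconnect"] == spec["compute_io"]["interconnect"])
```

The point is that the requirement travels with the job: the cloud does not have to guess how fast or consistent the I/O needs to be, because the image says so.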

Speaking of cloud issues, I read two interesting articles recently. The first seems to be a possible solution to what I consider a thorny issue: cloud security. That is, as soon as my data leaves my walls, I do not have 100% control over it, and anything less than 100% means I cannot guarantee security. Of course you can encrypt it, but then to operate on it in the cloud you need to decrypt it in the cloud, which means it is still naked data. That is, until now. Recently, an IBM researcher solved the problem of fully homomorphic encryption, which to you and me means the ability to use encrypted information without decrypting it (i.e., the data always remains encrypted, which means the result is always encrypted). Problem solved. Nice work. When do we see the demo?
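To get a feel for what “computing on encrypted data” means, consider a much simpler, well-known partial example: textbook RSA is multiplicatively homomorphic, so multiplying two ciphertexts yields the ciphertext of the product. This is emphatically not fully homomorphic encryption (Gentry’s scheme supports both addition and multiplication and is far more involved), and the toy key below is hopelessly insecure, but it shows the core idea.

```python
# Toy illustration of homomorphic structure via textbook RSA's
# multiplicative property: E(a) * E(b) mod n decrypts to a * b.
# NOT fully homomorphic encryption, and these parameters are
# far too small for any real security.
p, q = 61, 53          # toy primes
n = p * q              # modulus, 3233
e, d = 17, 2753        # public / private exponents for this n

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

# Multiply two values without ever decrypting them: the "cloud" sees
# only ciphertexts, yet the product comes back correct.
c = (encrypt(7) * encrypt(3)) % n   # decrypt(c) == 21
```

Fully homomorphic encryption extends this so that arbitrary computations — additions and multiplications in any combination — can be carried out on ciphertexts, which is what makes the cloud-security claim so striking.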

The other issue I read about was the lack of entropy in the cloud (entropy is a measure of randomness). Basically, a virtualized instance does not have access to some of the physical means to build up its “entropy pool” and thus could become more predictable. Since randomness is the key to security, this might make virtualized servers more vulnerable. Of course there are ways to fix this; however, I thought first about HPC applications and how this could affect Monte Carlo results.
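The Monte Carlo worry is easy to demonstrate in miniature. If two virtualized instances end up seeding their generators from the same shallow entropy pool, their “random” sample streams are identical, so running the job twice adds no statistical information. The pi-estimation example below is my own illustration, not from either article.

```python
# Sketch: why a shallow entropy pool matters for Monte Carlo work.
# Instances that boot with the same seed produce identical sample
# streams, so their "independent" runs are perfectly correlated.
import random

def estimate_pi(seed, samples=10000):
    """Estimate pi by sampling points in the unit square and counting
    the fraction that fall inside the quarter circle."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

# Same seed on two "nodes": byte-for-byte identical results, zero
# additional statistical value from the second run.
run_a = estimate_pi(1234)
run_b = estimate_pi(1234)
```

Here `run_a == run_b` exactly, whereas properly seeded nodes would produce independent estimates whose average converges faster — which is the whole point of farming Monte Carlo work out across a cluster.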

To sum things up, it seems the HPC problem space is evolving. I have noticed that I am talking about virtualization and cloud much more than in the past, yet there is no big killer HPC service/application out there. One other thing I have noticed is that the more open the discussion, the more solutions seem to flow. I suppose that allows solutions to get from over there to over here, and vice versa.



Comments on "Over There Vs. Over Here"

laytonjb

Love the article! Once again a good timely insight.

The virtualization in HPC jihad seems to be gaining steam lately. It must be grant proposal time for NSF, so researchers are trying to capture all the buzzwords and new concepts together for their grants :) (hey – don’t kill me – I did the same thing for a little while, but I refused to be seduced into proposal writing by buzzwords). There are still a great many problems that need to be solved with virtualization before it becomes an attractive scenario for HPC. For example, in my day job, our engineering team tried to migrate a single process from a single node to a second node. It was done over GigE while the first node was performing local I/O. It took about 30 minutes to migrate the process. It worked and there was no data loss, but 30 minutes? Since one of the desires of process migration is the ability to move off a node that appears to be having trouble, 30 minutes is probably too long. (BTW – Garth Gibson has a great article that talks about the time between interruptions as systems get larger and larger. To reach Petascale or larger we need so many processors that we are getting very close to the 30-minute time before we have an interruption, which means the code dies.)

But to help anyone looking for cool new buzzwords or phrases to sling into a proposal, here are some that might help:

  • Cloud
  • Virtual or Virtualization or Virtualized
  • Accelerated
  • Heterogeneous
  • GPU
  • Intelligent (as in Business Intelligence)
  • Data-Intensive
  • Cohesive (always a favorite)
  • Grid (a bit overused but in combination with Cloud it’s very powerful)
  • Resilient
  • “As a Service”

Now if you use these words or phrases in conjunction with the following:

  • Processors or processing
  • Network (including the plural “network fabrics” really punches it up)
  • Storage and/or file system
  • Pick your favorite application area (e.g. chemistry, gene sequencing)

And you have yourself a potentially heavily funded research project courtesy of NSF. My favorite story was the guy who wrote a proposal or an article that described the wheel – it either got funded or published (can’t remember which).

Sorry for being so cynical on a Sunday, but lately I’ve just seen too much of this, and when I’m reminded to be honest on Sundays, it’s sometimes hard for me not to become cynical :)

However, there are some worthwhile goals in virtualization in HPC (I think). I just think they are really hard problems that may even be impossible to adequately solve.

Enjoy!

Jeff


