A Beowulf pioneer provides insights and experience from the HPC trenches.
Linux Magazine HPC Editor Douglas Eadline recently had a chance to discuss the current state of HPC clusters with Beowulf pioneer Don Becker, Founder and Chief Technical Officer of Scyld Software (now part of Penguin Computing). For those who may have come to the HPC party late, Don was a co-founder of the original Beowulf project, which is the cornerstone for commodity-based high-performance cluster computing. Don’s work in parallel and distributed computing began in 1983 at MIT’s Real Time Systems group. He is known throughout the international community of operating system developers for his contributions to networking software and as the driving force behind beowulf.org.
After MIT, Don was a researcher at the Institute for Defense Analyses Supercomputing Research Center, working on parallel compilers, specialized computational techniques, and various networking projects. Subsequently, he started the Beowulf Parallel Workstation project at NASA’s Goddard Space Flight Center. While at NASA, he led the technical development of the Beowulf project and made significant contributions to the Linux kernel, most visibly in providing very broad support for networking devices.
Don is also co-author of How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters and a co-editor of the Extreme.Linux CD-ROM, the first packaged Beowulf software distribution. With colleagues from the California Institute of Technology (Caltech) and the Los Alamos National Laboratory, Becker was the recipient of the IEEE Computer Society 1997 Gordon Bell Prize for Price/Performance. In 1999 Becker received the Dr. Dobb’s Excellence in Programming Award, which is presented annually to individuals who, “in the spirit of innovation and cooperation, have made significant contributions to the advancement of software development.” Don holds a Bachelor of Science degree from MIT.
Linux Magazine: It has been well over ten years since Beowulf clustering first hit the scene. What do you think about some of the trends we have seen in HPC over the years? Did they really pan out?
Don Becker: There was a point where some companies were designing their own motherboards. While this may have allowed some HPC features to be added, it also locked customers and vendors out of the rapid upgrade cycle that commodity systems enjoy. It also means vendors can stay out of the OS side as far as motherboards go. In addition, there are small boutique motherboard companies that can do a better job than most vendors.
Another thing that did not really pan out is LinuxBIOS (or coreboot, as it is now called). For HPC, coreboot is not a good thing. For commodity systems, it puts us back to depending on the intimate details of the BIOS. The current BIOS structure, while it could be improved, is workable. What we can hope for is that the BIOS is gone in less than a second. Right now it is gone in a few seconds, and I don’t see it as an important feature in HPC. Customers asking “why not coreboot?” may want to consider that at Penguin we were tasked with maintaining a coreboot machine and found that the best solution to get the machine usable was to burn a new standard BIOS. There is also the kernel issue. Try running a 2.6 kernel on a 2.4 LinuxBIOS system and you will have a real issue.
At Penguin we deal with getting the standard BIOS right for Linux and HPC. When you attempt to build or modify your own, it takes several years to get it right and it limits the available hardware options.
LM: What are customers looking for in today’s HPC market?
DB: We call it “stand up rate.” That is, how successful are you in getting a cluster operational in a reasonable amount of time (several weeks)? At Penguin we have a 100% stand up record on integrated clusters over the last 2-3 years. You can sweep some under the rug of course, but that does not make customers happy. As a company we do a full hardware and software install on the entire system. It sounds easy, but it is a hard thing to get right. At Penguin, we build the cluster entirely at our site and run burn-in tests before we ship it. In order to provide this level of service you need to control your own software and have the ability to adjust things based on the underlying hardware. You need to be a certain size to offer this level of service. If you just rack and stack hardware you are probably too thin to make it all happen. If you are too big you just add some professional services to the hardware. Neither approach gives the customer a “stand up cluster” that is functioning in less than two weeks.
One of the good things is that the industry has realized that services are a good-margin business. Penguin hires people who are knowledgeable in HPC and in software/hardware integration. Most vendors really don’t make that much on HPC hardware, nor do they deliver a working cluster in two weeks. Most companies don’t tell you how long it takes to get things plugged in and producing numbers or how successful they are in this task. At Penguin we have a 100% stand up rate and most customers are producing results within two weeks. After two weeks, an expensive resource that is not working is losing money. The market is starting to realize that getting things to work is an important part of putting in a successful HPC system.
LM: You mentioned software. What is new with the Scyld Cluster software?
DB: Our latest version has a focus on consistency, a considerably expanded nameservice, and a new GUI. The old GUI was good and very functional; the new one is polished and makes more information easily available.
One thing we have changed is our node environment. In the past we provided a clean minimalist environment with very lightweight compute nodes. Today, our default is a more complete environment because newer applications require a broad set of tools and a bunch of different languages and environments. The core of our system is still a lightweight environment that provides guaranteed consistency throughout the cluster. In the past, executables were a single binary written in C or Fortran. This has changed, as many users now need a complete environment. Even though the core of our system is the lightweight consistency, by default we provide more options to load a full install onto local disk. We still boot up the lightweight environment, then switch over to the full environment from which you can actually run the applications.
Before, we tried to convince people to start with a minimal environment and build from there, but now users are finding utility in a full, comfortable environment out of the box. Minimal environments are still available though.
No matter how easy or trivial it is to “mount everything” in our minimal environment, we now do it by default. We also make it easy to turn on rsh and ssh daemons, which are a problem when you try to maintain consistency of execution over a cluster, but so many people expect them or think they are the only way to run a cluster. We pretty much have to have them there by default. But I consider that a bad thing for the advance of cluster technology, and we provide it because that is what customers expect. I would like to see more people use advanced job starting systems, like bproc, that guarantee job starting consistency.
Actually, for the whole industry I would like to see more of a focus on consistent job starting. Not just because we have done that from the beginning, but because it reduces problems. If we found other tools that provided the same consistency that ours do, we would be happy to use them.
LM: Moving to another hot topic (no pun intended). What is your view on GP-GPU computing?
DB: The first thing to remember is that the design of video hardware is different from that of HPC hardware. For example, with video designs mistakes are gone in 1/60th of a second. Reliability issues are also important. That said, I think it is an interesting trend. As a company we have deployed big GPU-based clusters and are now on the leading edge helping customers stand up these types of systems. It has all the right elements for success: commodity hardware driven by the bulk market. It is fast, easy to get started, low cost of entry, and a low cost of learning. On the flip side, probably from too many years in the HPC industry, I have seen a lot of attached processors, none of which had long-term traction. This could change the game but I’m still on the fence about it. As a company Penguin is more than hedging, as we have a lot of professional expertise in this area.
LM: What about virtualization and HPC?
DB: Virtualization has abysmal I/O performance so I/O is going to be a challenge. I’ll tie that with clouds as well. Being able to know you have predictable and very high I/O performance when you lay out your job is a critical thing. And clouds generally have very poor communication and medium or even good disk I/O, but not predictably good.
In addition, one of the critical things right now in HPC is learning to deal with large memory machines. At a recent Linux Foundation Summit, one of the best HPC talks was about large pages and giant pages. A surprising number I heard was that for some applications if you use giant memory pages and avoid page table entry thrashing you get up to a 6 times performance improvement.
What we have seen, if you have a lightweight compute node, is that the very first job, if it is a big memory job like a large matrix computation, can get a 40-50% performance improvement because we don’t have a dirty virtual to physical mapping. We have a clean set of pages in the virtual to physical mapping because we have done essentially nothing at boot time. That is only one of the ways to accomplish performance, but even with our system that advantage is there only for that very first job. Kernel objects are not completely immovable, but as soon as the kernel grabs a page of memory for itself, you can’t shuffle pages around that allocation. So you get that bump only on the first run.
With virtualization you have just made this problem one level more difficult. The great thing about the large page approach is that it also showed performance improvements with virtual machines. Now you have to reconcile that with the virtualization people who claim there is no performance impact. In HPC, there is a large performance impact because of page table entry thrashing (in the translation lookaside buffer, or TLB, and the caches). That shows up as high memory bandwidth, but it is artificially high: you are doing lots of page table entry reloads and lots of memory accesses, but they are not application memory accesses, they are just for managing memory.
That is a big, critical, and relatively new issue in HPC. It was important before, but it matters more now that we are seeing regular deployments of 32 to 128GB of memory per physical machine. With the standard 4K page size, 128GB works out to roughly 32 million pages you have to manage. If you are stepping through memory, that is 32 million mappings that might be pushed in and out of the cache.
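The page-count arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `pages_needed` is made up for this example, and binary units (1 GB taken as 2^30 bytes, 4K as 4096 bytes) are assumed.

```python
# Illustrative arithmetic only: count the page mappings needed to cover a
# given amount of memory at the standard 4K page size.
# Assumption: binary units (1 GiB = 2**30 bytes, 4 KiB = 4096 bytes).

KIB = 1024
GIB = 1024 ** 3


def pages_needed(memory_bytes: int, page_bytes: int = 4 * KIB) -> int:
    """Return the number of pages required to map memory_bytes, rounded up."""
    return -(-memory_bytes // page_bytes)  # ceiling division


if __name__ == "__main__":
    # The memory sizes mentioned in the interview: 32 to 128GB per machine.
    for gib in (32, 64, 128):
        count = pages_needed(gib * GIB)
        print(f"{gib:>3} GiB at 4 KiB pages: {count:,} mappings")
```

At 128 GiB this gives 33,554,432 mappings, which is the "32 million" figure quoted above; a 32 GiB machine needs about 8 million.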
LM: What about cloud computing and HPC?
DB: Everyone is calling everything they have a cloud strategy, from renting computing time (time-sharing) to managing virtual machines and up-time with completely transparent assist. It is curious that a book store (Amazon) has the best virtual cloud computing environment. As a side note, we are starting to sell the services of a company that does that as well. It does cloud computing with much more of an HPC bent, one that has a high performance interconnect. We still don’t have the full model for being able to guarantee low latency connections. So we are doing something with cloud in the HPC space, but it is not a general purpose solution yet. If you want guaranteed performance you are renting a machine, not a cloud.
It is a big step forward from grid though. The computing community version of grid was: we have libraries to make all these different installs communicate, rather than doing machine virtualization. Think of it as library level virtualization. For every service they could think of, they provided a library function. For every service they did not think of, they spent years writing library functions to make different and potentially disparate operating systems, distributions, and versions interoperate. In my opinion, that turned out to be a huge failure. They could not guarantee consistency; that is, they could not guarantee you could run any executable anywhere, and never guaranteed that by running the same executable you would get the same results. That is one of the fundamental assumptions you have to make in HPC. If I run a program over there, I have to know what executable is running, what libraries it is linking to, and in what order. I need to reproduce that exact same result everywhere in my run. Cloud computing provides virtualization at the machine level. You have to do more work, and it is more of a synchronization than a guaranteed consistency, but it is a step better than what grids were.
In my opinion, your job starting system and name service have to provide consistency. You can make do with copying for consistency, using rsync, if you have to. It is more to manage and more work. It is harder to think about and harder to get right, but, unlike grids, it is workable.
LM: Can you elaborate on consistent job starting?
DB: When you rsh or ssh by name, you are not certain that what you start “over there” is what you intended to run, or even that it will produce the same result. If I walk into a completely different building and ask for John, I might get the John I’m thinking about, or I might get a John with two arms and two legs who looks pretty much like the John I want, but he may not be able to answer my questions the way I expect. Putting an executable out there and running it is a step better. But what happens if I run it days later? What if there have been updates? What if I have the wrong libraries, or they are linked in the wrong order? What if something in the environment changes? I’m likely to get the right result, but in the face of updates, changes, and variable environments it is not certain.
One of the areas where we provide a complete subsystem is user services: we ship user names and credentials out with their job. We do the network and host name look-ups basically all from the same name service, and we can handle both at execution time. This provides consistency in end results.
LM: You mentioned large pages. How will this and other new technologies affect HPC?
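One way to picture the consistency concern described above is a fingerprint over the executable and its libraries, taken in link order, that can be compared across nodes before a job runs. The sketch below is only an illustration of that idea, not how Scyld or bproc works; the function names `job_fingerprint` and `mismatched_nodes`, and the idea of nodes reporting a digest, are assumptions made for this example.

```python
# Hypothetical sketch: detect when "the same job" is not actually the same
# bytes on every node. Not Scyld's or bproc's mechanism; names are made up.
import hashlib
from pathlib import Path
from typing import Iterable, List, Mapping


def job_fingerprint(paths: Iterable[Path]) -> str:
    """Hash the executable and its libraries in the order given.

    Order matters on purpose: the same libraries linked in a different
    order produce a different fingerprint.
    """
    digest = hashlib.sha256()
    for path in paths:
        data = Path(path).read_bytes()
        # Length-prefix each file so file boundaries affect the digest.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def mismatched_nodes(reference: str, reported: Mapping[str, str]) -> List[str]:
    """Return the sorted names of nodes whose fingerprint differs."""
    return sorted(node for node, fp in reported.items() if fp != reference)
```

A launcher could compute the reference fingerprint on the head node and refuse to start the job if `mismatched_nodes` returns anything. Copying with rsync, as mentioned earlier, would be one way to bring a mismatched node back in line.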
DB: There are both large pages (2 or 4 MByte pages) and giant pages (4 GByte pages). That is an exciting area. I think 4MB pages are sufficient for right now. But if we could do 4GB pages, just a handful of Gigabyte pages would solve the problem for large memory jobs and provide predictable execution time and minimal traffic to memory, that is traffic to help manage memory rather than user code memory.
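To see why "just a handful" of very large pages would help, the following sketch counts how many mappings a 128GB machine needs at each of the page sizes mentioned. It is illustrative arithmetic only; the helper name `mappings_by_page_size` is made up for this example, binary units are assumed, and the page sizes are simply the ones quoted in the interview rather than a statement about any particular processor.

```python
# Illustrative arithmetic only: mappings needed to cover memory at the page
# sizes quoted in the interview. Assumption: binary units (KiB/MiB/GiB).

KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

PAGE_SIZES = {
    "4 KiB": 4 * KIB,
    "2 MiB": 2 * MIB,
    "4 MiB": 4 * MIB,
    "4 GiB": 4 * GIB,
}


def mappings_by_page_size(memory_bytes: int) -> dict:
    """Return {page size label: number of pages needed, rounded up}."""
    return {
        label: -(-memory_bytes // size)  # ceiling division
        for label, size in PAGE_SIZES.items()
    }


if __name__ == "__main__":
    for label, count in mappings_by_page_size(128 * GIB).items():
        print(f"{label:>6} pages: {count:,} mappings for 128 GiB")
```

For 128 GiB the count drops from 33,554,432 mappings with 4 KiB pages to 32,768 with 4 MiB pages and to 32 with 4 GiB pages.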
Another area where everything is going to change for HPC I/O is flash disk (Solid State Disk or SSD). Right now they are abysmally bad; Intel has recently updated the firmware on their SSD to solve some of the worst anomalous conditions. They will get better.
I talked to Ted Ts’o, a hard-core Linux developer. He has been helping with Linux development almost from the beginning. He is focusing on file system work and he is noticing pretty bad and anomalous performance from flash drives. The performance characteristics and specifications make it unlikely that the current generation of drives will do what we need for HPC.
But it changes the I/O expectations from being this very slowly growing curve. We went from 50 MB/sec to 70 MB/sec to 90 MB/sec sustained write rate over a period of 8-10 years. We did not even get a factor of 2 in sustained write rate on the best drives. So now we are going to see a semiconductor curve instead of the disk drive curve for write rates. That will change everything. It will also change how we have to deal with these rates in the file system structure, because the previous models do not apply. We can’t throw it all away because it took several decades to get file systems right, but some dramatic changes are needed.
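The disk drive curve described above can be put in numbers with a short sketch. It computes the compound annual growth implied by going from 50 to 90 MB/sec over 8 to 10 years; the function name `annual_growth` is made up for this example, and smooth compound growth is an assumption made only to summarize the figures quoted.

```python
# Illustrative arithmetic only: the yearly growth rate implied by the
# sustained write rates quoted above. Assumes smooth compound growth.


def annual_growth(start: float, end: float, years: float) -> float:
    """Return the compound annual growth rate as a fraction (0.07 = 7%)."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    start_mb_s, end_mb_s = 50.0, 90.0
    print(f"Overall factor: {end_mb_s / start_mb_s:.1f}x")
    for years in (8, 10):
        rate = annual_growth(start_mb_s, end_mb_s, years)
        print(f"Over {years} years: about {rate * 100:.1f}% per year")
```

Under that assumption the improvement is roughly 6 to 8 percent per year, a factor of 1.8 overall, which is the "not even a factor of 2" noted above.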
This ties in with cluster and parallel file systems, which have long been a real challenge. We still have very few options. In the commercial deployment space right now we are seeing Panasas as the best scalable option, with Lustre being an option as well. But there is going to need to be more than two options. Panasas is a hardware solution which starts out expensive. It is not a start-small-and-grow-big solution. It does provide great performance overall, and they (Panasas) are putting a lot of research into the I/O issue.
Lustre is a significant software configuration challenge, and a challenge to administer and use over the long term. It is a software solution but it is not a perfect solution. It cannot be delivered pre-packaged; you need to deliver it with professional services. And some vendors like that.
We are not seeing a lot of other file systems out there that can serve large scale HPC needs. Of course in the Linux community, people are talking about Btrfs as the one moving forward in the future. And we will see what falls out from Oracle’s purchase of Sun, how that impacts Btrfs and ZFS, and what spillover that has in the HPC space. Another interesting thing is that we have seen a lot of Hadoop customers lately.
LM: Thanks for your time, and I am pleased to hear that Penguin is doing well.
DB: Actually, we are in incredible shape: we are still in business! We are flying straight and level while others are having difficulties. In a healthy economy, I suspect we would be going gangbusters.