Under Deadline

Senior HPC Editor, Douglas Eadline, Ph.D, blogging on clusters, multicore, interconnects, and everything high-performance.

The HPC Software Conundrum
Wednesday, February 10th, 2010

In the computer world there are concepts that appear to be universal. The adage “software moves hardware” is one example. Talk to anyone about the history of computers in this context and they will nod their heads in agreement. Many of the same people, however, will also get whipped into a frenzy when new hardware is announced. Such enthusiasm was perfectly understandable when processor clocks were increasing each year. This is what I call the Free Lunch era of computing.

Lunchtime is over. Events in the processor market, and particularly the HPC market, have changed the game. Software applications that once enjoyed performance increases from increased clock speed have not seen any significant bumps in recent years. Celebrating and ranking processors based on clock speed is therefore of little value. Indeed, the current hardware advances in both multi-core and streaming-core (GP-GPU) do not apply to many single-threaded applications. And yet, there is always celebration of these new hardware advances.

Looking closer, only a small subset of software has actually been adapted to use these new hardware designs. Therefore, jumping up and down because your favorite processor vendor has added a core or two (or two hundred) is like getting all excited about hydrogen cars. Good idea, but not going to work until we get enough hydrogen stations. Likewise, multi-core and streaming-core need software in a big way. And there is no sense in bragging about how well the hardware works unless there is an adequate software base to drive the hardware to its intended performance.

As an aside, the other analogy I was considering was two neighbors having a bragging contest about the performance of the fighter jets they both kept in their back yards. What does it matter if they can’t fly the darn things?

The above argument is why I don’t put much credence in FLOPS numbers for multi-core and streaming-core. Of course the numbers are correct for that benchmark, but until that kind of performance is generally available to Joe Programmer, the extra hardware is, in a sense, superfluous.

Of course the hardware companies realize this fact and have made various tools available for programmers. For the HPC practitioner there is MPI, but MPI currently does not work in all cases. It has been the mainstay of “parallel computing” for years, but it does not map well to stream processors, although this situation may be changing (see the MIMD On GPU work at the University of Kentucky).

There are also OpenMP, CUDA, and OpenCL. Of course there are other solutions hitting the market, like Ct from Intel. None of these answers the big question, “How should I write/update my code to use all this great hardware in my cluster?” In order to answer this question a summary table may be helpful. I have included the current methods that are generally available. I should point out that there are edge cases for the table, but I have tried to indicate the general use of each method. Also, I assume that a cluster is a collection of SMP multi-core servers.

Hardware Environment                  | MPI | OpenMP | CUDA        | OpenCL
Cluster                               |     |        |             |
SMP multi-core server                 |     |        |             |
SMP multi-core server plus GP-GPU(s)  |     |        | NVidia Only |

As you can see, there is no one-size-fits-all solution, no silver bullet as it were. This situation is both a problem and an opportunity, and the answer to the question above is a bit unclear right now. There has been considerable work in the area of hybrid computing, that is, combining the above methods in a single program. While this works in many cases, I still get the feeling that it is a kludge of some sort and there should be a way to express what you want to do in a parallel cluster environment using a clear, concise, standard level of abstraction.

Thus, there is an opportunity for a solution. The hardware vendors are doing their part and of course pushing solutions that work best with their hardware. I also believe that any solution should be an open standard so that it can be freely implemented by anyone. Locking someone to your hardware through some software secret sauce is a recipe for failure. Open access to tools and standards is what grows hardware markets. Furthermore, I’m not so sure we can expect hardware companies to push solutions beyond their products; they are, after all, tasked with selling their stuff. And I am not sure how it is all going to turn out. On the one hand, the HPC market has seen huge growth based on an open software infrastructure. On the other, the issue of multi-core and streaming-core seems to be developing solution fiefdoms that don’t necessarily cooperate outside the hardware domain.

I started out wanting to write about CUDA, but somehow drifted into what I call the software conundrum, i.e. we need a solution that lifts us above all the hardware, but historically efforts that come from vendors cannot be expected to support this goal. Of course, a vendor (or vendors) who supports open cooperation as the way forward may actually have the fastest hardware in a practical sense, i.e. it is what gets used. By promising everyone a ride in their jet, they can get everybody to help them get it off the ground.

Cheap Stuff: Trends in Commodity HPC
Wednesday, February 3rd, 2010

Now that I am finished with my SC09 videos, I assume some of you are thinking “it is about time.” In my defense, I wanted to post and discuss the videos at a slow pace so that you can take the time to see my cutting edge work (cough, cough). By the way, if you like or dislike the video idea, please let me know in the comments. Although it may not look like it, doing the video is hard work. I spend a lot of time running around the show trying to make appointments and give everybody their 10 minutes (or less) of fame. Plus, it takes time to edit the video.

While I was writing about the videos, I happened on this somewhat sensational story about an inexpensive $100 Wii balance board working as well as a $17,885 medical device. I won’t bore you with the details, but comments about this story often seem to center around how the medical company is gouging customers. The naive view certainly would produce that opinion. Of course, HPC folk are cut from the rocket science cloth and a little more analysis is in order.

Consider yourself in the medical device market. You identify a product need, perform the research and development, gain the various certifications and approvals needed, hire a sales force, and set out to make your mark. Getting your product to market required a lot of capital and you price your product based on a competitive analysis and your development costs. In any case, your price is based on a highly focused market (i.e. the number of units you expect to sell is probably in the thousands).

Let’s change your job. Now you are working for a consumer gaming company. Smart people in your company propose a doohickey that adds to the console experience and works much like the medical device. You spend about as much money as the medical company did on their product, but now you price it 100 times less than the medical device because you plan on selling 100 times more of the product.
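The pricing argument above can be checked with a little arithmetic. The sketch below is only an illustration: the development cost and unit counts are made-up numbers (not figures from either company), and it assumes price is set simply by spreading development cost over the units sold.

```python
def price_to_recover(development_cost, units_sold):
    """Price per unit needed to recover the development cost.

    Simplifying assumption: no manufacturing, support, or sales costs.
    """
    return development_cost / units_sold


# Hypothetical numbers, chosen only to illustrate the 100x argument.
dev_cost = 10_000_000          # same R&D spend in both markets
medical_units = 1_000          # focused professional market
consumer_units = 100_000       # mass market, 100 times more units

medical_price = price_to_recover(dev_cost, medical_units)
consumer_price = price_to_recover(dev_cost, consumer_units)

print(medical_price, consumer_price, medical_price / consumer_price)
# prints: 10000.0 100.0 100.0
```

With the same development bill, 100 times the volume supports a price 100 times lower, which is the whole point of the comparison.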

There are other differences between the markets as well. Supporting a medical device is quite a bit different than supporting a consumer device. Consumers can often wait for repairs, while professionals often lose money when equipment does not work.

Looking at the details of the two markets gives some insight into why the costs are so different. And, if you are in the medical device company you may want to take a good look at the “toy” that works about as well as your professional version. You can bet an entrepreneur is thinking about this situation.

If you have not figured it out yet, this same trend is what has been fueling the HPC market for the past ten years (or more). Technologies from the bigger markets have found their way into the HPC market and redefined the price to performance curve. The x86 HPC story is well known at this point. Are there any new products entering the consumer markets that may have a play in HPC?

The obvious answer is GP-GPU technology. I believe the GP-GPU idea is going to play out in a big way this year. I also think software will still be an issue — just as it has been ever since someone got the idea to do more than two things at the same time in the same program. The mass market numbers for GP-GPU processors guarantee that low cost parts will be available, but I think that is only half the story.

One development I have been watching very closely is low powered CPUs. I consider low power to be in the 25 Watt or less range. There seems to be plenty of commercial incentive to produce these processors for things like netbooks, e-book readers, cell phones, and even something called an ibook. Before you chuckle, have a look at this announcement.

If you recall, a company called SiCortex produced supercomputers using lower power MIPS processors (MIPS processors are a commodity design found in many embedded devices). Their idea was to use large numbers of low power processors connected with a fast network.

The idea worked quite well and were it not for the economy falling over last year, I might still be writing about them. In any case, large numbers of small lower powered processors offer some advantages in terms of cost, power, and cooling. Using many cores also means that codes must be scalable and finer grained. A fast low latency network helps with this issue.

In addition to low powered processors you need things like motherboards. In terms of HPC, better densities can be had with smaller, cooler motherboards. Ideally a small low powered motherboard should have enough memory slots, at least one GigE port, and a PCIe slot for a possible high speed interconnect. These types of boards are starting to emerge in the market. For instance, SuperMicro has just introduced the X7SPA-HF, a Mini-ITX motherboard (6.75″ x 6.75″, 17.15cm x 17.15cm).

This motherboard uses the new Atom (Pinetrail-D) dual core 17 Watt processor, has two Intel GigE ports and a x4 PCIe slot, and can be passively cooled. Passive cooling means boards can be packed more densely with fewer fans and less noise.

If you were to combine smallish GP-GPU chips with these low powered processors, then you would be boosting your HPC credibility quite a bit.

AMD’s Fusion and the new Intel Clarkdale are examples of this type of integration. NVidia is also in the fray with its Ion2 chipset that is rumored to have at least 32 stream processors in a low power package. Now that puny little low power node doesn’t seem so useless.

Another interesting aspect of this approach is the potential for very low cost nodes. Of course, you will need a lot more nodes than you normally have in a standard cluster, but the nodes themselves would cost quite a bit less. It may be possible that the low price points make the nodes almost disposable (i.e. not really worth fixing, just add a new one).

In closing, there are reasons I am proffering this idea. The low cost, low power, integrated hardware is on its way. The commodity market is demanding it. I’m not sure it will work for HPC, but we should be taking a close look at this type of approach. Next time you are standing on your Wii balance board doing yoga (or ski jumping) remind yourself that it was not too long ago that using x86 processors for HPC was considered a daft idea. And, the first time you heard about some HPC hacker contorting a video card into solving linear equations or searching genome sequences you probably chuckled. Not so funny now, is it?

More SC09 Videos: IBM, ScaleMP, and The Great T-shirt Hunt
Tuesday, January 19th, 2010

When I was a kid, I used to go out for Halloween. That was back in the day when you got real big candy bars and all kinds of other goodies. I also remember collecting money for UNICEF. We would take around the little orange boxes and ask for donations. Almost everyone gave money. I mean, how do you say “no” to cute kids in costumes? The biggest problem was at the end of the night the combined candy haul and pile of change got real heavy, which of course was a good kind of problem.

I’m on the other side of Halloween now. Yes, I buy those little candy pieces to give out and I make sure I give to the kids collecting for UNICEF. Although out of respect for the little tykes I give them paper money instead of heavy change.

If you don’t know where I am going with this, it is about Haiti. The country was literally shaken to the ground and now they need our help. I just gave to both the Red Cross and to UNICEF. Both of these organizations make sure the money goes to those in need and not to overhead. Indeed, UNICEF has pledged that 100% of the funds gathered for Haiti will go to relief efforts.

Now it is your turn. If you have not donated, I invite you to do so. If you have already donated, consider another donation in the future because this is not going to be fixed in a week or two. We now return to the SC09 video parade.

In the past, I have written about the IBM iDataPlex, but it is great to have a first hand video tour of the real hardware. If you ever wondered about the name iDataPlex and what it actually looks like, then here is your chance to learn about a very efficient clustering system from IBM. I must say there is nothing like pulling a server and placing it on the floor and then pulling pieces out to show the camera. I’ll let Keith Olsen of IBM handle the tour that includes the field stripping of an iDataPlex node.



The next video is from my visit to the ScaleMP booth. In case you don’t know, ScaleMP aggregates SMP servers into larger virtual SMP servers. That is, a typical server node now has 8 cores that all work together as a Symmetric Multiprocessing (SMP) node. In a cluster, multiple SMP nodes are used together in a distributed fashion using MPI. There is no memory shared between nodes, only messages. ScaleMP is a software solution that allows multiple SMP nodes to share memory and provide one large SMP system image. To the user, it looks like a huge SMP machine. Amazingly, ScaleMP can provide up to 4TB of shared memory. Incidentally, this capability is used in the Gordon Cluster, which is designed to solve large data problems. Let’s take a look at the video.



One of the goals I have at any tech conference is to collect t-shirts. I pride myself in not buying t-shirts because I tend to harvest them at trade shows. This year the pickings were kind of thin, but nonetheless, I tried my best to discreetly ask for t-shirts. In the next video, you will see my attempt to get a free t-shirt, hear about SICORP, and learn about the “lore of the deadline.”



What can I say? The deadline has got it going on. My email address precedes me. (cough, cough) And, don’t forget to donate, Sam Hithead will be proud. More videos next week.

SC09 Videos: AMD, Penguin, and the HPC French Fryer
Wednesday, January 13th, 2010

In case you did not know, the fastest machine in the world (running the HPL benchmark) is ORNL Jaguar. This year at SC09, the 1.75 petaflop/s calculator placed at the top of the list. The achievement also gave AMD some bragging rights as Jaguar used their new six core AMD Opteron. A bit more interesting to me was the machine at the number 5 spot. Tianhe-1 is housed at the National SuperComputer Center in Tianjin/NUDT, China and combined Xeon E5540/E5450 processors and ATI Radeon HD 4870 GP-GPUs to achieve 563.1 teraflop/s. Not only did the Intel and AMD hardware work together without a lawsuit, this was a rather big win for the GP-GPU crowd.

In 2009, we heard a lot about Nehalem and Tesla. AMD and AMD/ATI were not taking a nap. Their efforts are paying off and, as the following video will illustrate, they have some great technology to offer the HPC crowd. Indeed, AMD has been a big supporter of OpenCL and has now released the ATI Stream Software Development Kit with OpenCL support. The promise of OpenCL is portability across processors and GP-GPUs (not cluster nodes, however). The SDK from AMD/ATI has beta support for both x86 processors and ATI video cards. As you can see in the second half of the video, OpenCL has arrived.




It seems each year I spend less time with people I know at SC. This situation may be due to the fact that I know so many people and I just don’t have time to talk with everyone. So before you think I’m some kind of social networking wannabe, let me just say, the first time I went to SC was when it was called SuperComputing and I was showing software in the nCUBE booth. I have been to SC for so many years my “HPC Rolodex” is actually quite full. Many people would consider this a good thing ™ and I do as well, but it leaves me missing the side conversations, the technical discussions, and general HPC scuttlebutt. A case in point is my friend Don Becker. The longest conversation I had with him this year was the interview below. We both helped organize and attended the Beowulf Bash, but I never got a chance to talk with him and some other friends. The huge crowd did not help either; free beer will do that. Check out the video. Don, as always, has interesting things to say.




This next video was from the disruptive technologies area. Each year there are a handful of companies that are showing technologies that may have “disruptive effects” on the industry. Of course, the word disruptive has buzz-word status among the marketing droids. For myself, I tend to think of it in the past tense. That is, what things really disrupted the market, rather than if something might disrupt the market. In any case, this display did catch my eye. Have a look at the video and hopefully the rest of what I say will make sense.




I’m not sure how to take contraptions like these. On the one hand, the technology is really interesting and makes a lot of sense. Indeed, I can envision entire data centers in huge vats of dielectric oil with the sysadmins moving about in some kind of scuba suit, or better yet, a completely automated system where any node can be extracted in true Matrix fashion. In all seriousness, it has the makings of a green technology and it is certainly efficient. On the other hand, it seems like a lot of expense and hassle to move heat. From what I was told, you can remove a blade, let it hang for 24 hours and it is perfectly dry. Adds a bit of latency to the old repair cycle. Plus higher heat capacity means more weight. That is the thing about air: poor heat conductor, but it does not weigh much and we need to have it around for people to breathe. I’m still wondering about the french fries.

Another week and another set of videos. I think I have enough for 2-3 more weeks; after that I’ll have to come up with some new topics to write about. As long as we are talking contraptions, I’m still convinced my coffee cup warmer-PC is a good idea. Take one of those 150 watt CPUs and attach it to a large metal plate and make it the top of the computer case. It will keep your coffee warm and maybe even cook your lunch. When you are not in need of hot food prep, why then just flip on the Stirling engine and charge your phone, iPod, or toothbrush. Genius I tell you.

From Ct to InfiniBand, The inside Scoop from SC09
Wednesday, January 6th, 2010

Before we jump with both feet into 2010, I want to finish up some issues from 2009. I still have more video footage from SC09 to post and discuss. Indeed, it may take me most of this month to finish up the video. Not to worry, it is still newsworthy and relevant to the coming year.

Let’s begin with a talk Jeff Layton and I had with Jim Reinders of Intel. Intel’s web page gives his title as “Director Software Products and Multi-core Evangelist, Intel Corporation.” I can say that Jim fills that role quite well. The conversation was quite informative. One of the main topics is the Intel Ct language.

If you have never heard of Ct, then allow me to provide a little background. As you know, there are two hardware trends in HPC. The first is multi-core computing. While this presents some issues, it still has the advantage that it provides some level of homogeneity, i.e. all the cores are the same. The latest trend is heterogeneous computing, where there are special purpose cores available for data parallel computations, i.e. GP-GPUs.

The heterogeneous hardware trend is upon us and will continue to offer “interesting” hardware to the HPC market. As an example, consider AMD Fusion and the recent introduction of the Arrandale and Clarkdale processors. The era of heterogeneous computing is here and the dividends can be huge for HPC. One of the hold-backs, however, is the software issue. No one wants to create code that only works on one family or type of architecture. Right now the GP-GPU hardware possibilities include NVidia and AMD/ATI. Many expect Intel to have a competing GP-GPU in the future (although the mass market Larrabee has been delayed). Combined with a multi-core CPU, which requires a new programming paradigm as well, the GP-GPU (a.k.a. Data Parallel Processing Unit) has created a new level of software questions.

Intel is well aware of this situation and if you listen to what Jim has to say, you will see why Ct may be a significant step in the right direction. In addition, there is Intel’s acquisition of RapidMind. I’ll leave the rest for the video. There is also some MPI news at the end. If you want to track Ct progress and other Intel software developments you can follow Jim’s blog as well.

I also had the chance to talk with Hitesh Chellani and Anand Babu who help bring us the Gluster filesystem. Although Gluster is a recent entry in the filesystem market, they have made quite a big splash with many HPC users. They also use an Open Source model for their software. I’ll let them tell the Gluster story in the video.

While on my travels at the SC show, I like to stop in to visit some of the pioneer companies. One such company is Microway. For those who can remember, Microway brought us HPC hardware that used everything from Inmos Transputers to Intel i860 processors. They know HPC and Richard Warner gave me the quick overview of what is new at Microway.


Our final stop is QLogic, the maker of high speed InfiniBand interconnects. One of the things I like about the current HPC market is that there is a true competitive landscape in the world of high performance interconnects. Ultimately we all benefit as companies push the limits of the technology. I’ll let QLogic’s Steve Zivanic take it from here.

That is all for this week. I’ll have another batch of videos next week. I’m sure many of you have the same reaction my 16-year-old daughter does when she sees the videos on YouTube: “Dad, is that you? I’m telling Mom.”

Eating Your Own Tail: HPC in 2009
Wednesday, December 23rd, 2009

At various times, I start to write science fiction stories in my head. You know, that place where all the books I have written reside. In any case, I wanted to write a year in review piece now that 2009 is winding down. And, not to worry, I’ll have more SC09 video in the new year. Rather than a bullet list of highlights (or lowlights) I thought I would spin a tale that, as far as I know, is accurate and a bit more interesting than the standard year-end fare.

They say the world is connected in ways we do not know. It has even been noted that our actions influence those outside of our local sphere, more than one degree as it were. Quantum entanglement notwithstanding, I believe simple Newtonian forces have created a Twilight Zone ending to a most pivotal year. I invite you to sit back, grab a cup of something, and ponder a tale that needs to be told.

I suspect 2009 will go down in history as the payback year. The payback for all the irrational exuberance for building a false economy. Of course, business failures are due to many reasons. The speed at which technology can make millionaires overnight can also bring down organizations almost as fast. This past year, however, we have seen the demise of many companies, contributors to the HPC market and community, through no large fault of their own. The economy dried up cash and put most customers on a spending freeze for the first part of the year. It is hard to make a living when the economy is passed out on the floor.

Perhaps the most notable event was the sale of Sun to Oracle. Of course, many people will say Sun was bleeding for a while and that the commodity steam roller was slowly crushing their proprietary products. All that may be true, but Sun was also a well entrenched company, notably in both the academic and Wall Street sectors. They were a very “open company” and contributed a large amount of open software to the HPC community. It seems they were not “too big to fail” and were eaten by Oracle. There are those that think “Oracle will do the right thing,” but in reality the Sun that was is gone.

Not long after the Sun/Oracle announcement, another bleeding UNIX company, SGI, is purchased in a fire sale by Rackable. Again, the name lives, but the old SGI is gone. Another stalwart of the HPC world has given in to the economy and the commodity avalanche that has been rolling over the HPC landscape.

Later in May, we learn SiCortex has closed. This news was different. SiCortex was not some established Silicon Valley company that was trying to adapt to a new market. No, SiCortex was a New England company with a new idea. And, it was growing, but not fast enough for the nervous venture capitalists it seems.

Rumors surfaced and were confirmed in June that interconnect vendor Quadrics was going away as well. Quadrics always seemed to be the Ferrari of interconnects. One could argue they were a victim of InfiniBand and you would probably be right. And yet, I have to wonder, were the economy not so far down, could they have survived?

And finally, just last week news of Verari closing down seemed to remind us that it is not over. Their absence at SC09 was telling and there are reports that they may re-emerge as a new re-organized company, which in my experience means a somewhat protracted good-bye.

Before I move on to the plot twist, I do want to mention that 2009 was not all bad news. It turns out that even with the casualties, the HPC sector fared rather well compared to other sectors. IDC seems to think so anyway. Those that lost their jobs may beg to differ.

One of the bright spots for HPC was, of course, the release of the Nehalem from Intel. It was finally a true quad-core that did away with the memory bottleneck of the past. Another highlight was that AMD and Intel finally settled their differences and AMD got a much needed cash infusion as a result. AMD kept their place at the head of the multi-core parade by introducing a six core processor this year. Another highlight was the announced Fermi processor by NVidia. Fermi engineers seemed to have the HPC wish list on hand when they designed this new processor. The coming year could be very interesting for the GP-GPU market. AMD/ATI and NVidia are the only two real players now that Cell for HPC and Larrabee are gone. By the way, we really can’t miss Larrabee because it was never here.

The final milestone for 2009 was the introduction of what I call Cluster 3.0 (a full article is forthcoming). That is the use of dynamic provisioning to allow application driven HPC. This technology is going to open up the HPC market because it changes the way applications are delivered to end users.

Amid the market demise, I often thought how ironic it was that the cause of the economic mayhem was due to large HPC clusters calculating the “risk” associated with various financial instruments called derivatives. The term derivative comes from the fact they are derived from other financial instruments that are at some point supposed to have a connection to something real like a mortgage. The very companies that sold the hardware to Wall Street may have indirectly contributed to their own demise. Those thousands of server sales may have helped the bottom line in the past, but this was the first glimpse the hungry snake got of its own tail.

The story, however, gets a bit more sinister. At one point I happened upon this post about the Intractability of Financial Derivatives that points to a paper called Computational Complexity and Information Asymmetry in Financial Products by Princeton computer scientists Sanjeev Arora, Boaz Barak, Markus Brunnermeier, and Rong Ge. I recommend reading the article and browsing the paper because it may be the reason you or someone you know is now unemployed. For brevity’s sake, I’ll give you the upshot. It is computationally intractable (i.e. there is not enough computing power in the world) to determine the risk of most derivatives sold today. If you can’t determine the risk, you have no idea what you are selling or buying. Sounds dangerous. Like it could possibly lead to a world-wide economic disaster.

This result raises several questions. Just what are the Wall Street firms calculating with their mountains of servers? According to the paper, they are solving a problem that cannot be solved. If the Wall Street quants did not know that, then maybe they should be in the unemployment line with all those who worked at the companies I mentioned above. But if they knew the problem was not solvable and sold derivatives anyway only because they could find a buyer, then that sounds a bit like a sinister movie villain. Massive data centers set about calculating garbage so you can say to your customer “trust me, I have the blinking lights.” And, on the other side, customers are buying things that are computationally impossible to understand. And where does that leave us mere humans? Was it just exuberant stupidity or grievous fraud or a little of both?

If I were to write a story about the high-tech irony of selling machines so advanced that they stupidly contribute to their own demise, I might just get my first novel published. But, that story, it seems, has already been told.

A Second Smattering of SC09 Videos
Thursday, December 17th, 2009

Most major cities have symbols or icons that identify them. For instance, Philadelphia has the Liberty Bell and New York has the Empire State Building. Portland has the majestic Mount Hood as its moniker. Of course, back east we don’t call those things mountains, we use their proper name — a volcano. Mount Hood is considered dormant, but in reality it has an estimated 3-7% chance of erupting in the next thirty years, which, by the way, is about 2.5 million times more likely than winning the Power Ball lottery. (A Power Ball ticket has a 1 in 80 million chance of winning). The USGS characterizes it as “potentially active.” Let’s keep that in mind next time we have the SuperComputing conference in Portland. Next year we head to New Orleans.
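For anyone who wants to check the Mount Hood versus Power Ball comparison above, here is a small sketch of the arithmetic. It uses only the figures quoted in the paragraph (a 3-7% eruption chance and 1 in 80 million lottery odds); the function name is just an illustrative choice, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def times_more_likely(p_event, p_reference):
    """How many times more likely p_event is than p_reference."""
    return p_event / p_reference


# Figures quoted in the text above.
lottery = Fraction(1, 80_000_000)    # 1 in 80 million
eruption_low = Fraction(3, 100)      # 3% over thirty years
eruption_high = Fraction(7, 100)     # 7% over thirty years

low = times_more_likely(eruption_low, lottery)
high = times_more_likely(eruption_high, lottery)

print(int(low), int(high))
# prints: 2400000 5600000
```

So the "about 2.5 million times" figure corresponds to the low end of the eruption estimate; at the high end the ratio is closer to 5.6 million.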

The videos for this week are from Intel, Numascale, and Mellanox. As I watched these videos, I realized something. I did not quite smack myself on the forehead, but I think we are approaching a fork in the road. If you read my Small HPC article, you may recall that multi-core is dramatically increasing core densities in a single box. It will soon be almost standard to get 16 cores per box. Using four socket motherboards, 24 and 32 cores will be possible as well. Why are these numbers important? If you look at survey data that reports scalability limits of applications, you find that over 55% of HPC users use fewer than 32 cores per run. As I mentioned previously, pretty soon half the HPC users can get the number of cores they need with a single motherboard. I believe this will change things. And, with the advent of SMP scaling methodologies (both hardware and software), it is conceivable that virtually all the HPC market could run on commodity shared memory platforms. Performance is another issue, but in theory, a large amount of HPC may never leave a single server.
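The core counts above are simple multiplication, and a quick sketch makes the "single box" threshold concrete. The socket and core-per-socket combinations below are my assumptions about how one gets to 16, 24, and 32 cores (two 8-core sockets, four 6-core sockets, four 8-core sockets); the helper names are illustrative only.

```python
def cores_per_box(sockets, cores_per_socket):
    """Total cores in one server: sockets times cores per socket."""
    return sockets * cores_per_socket


def fits_in_one_box(job_cores, sockets, cores_per_socket):
    """True if a job needing job_cores can run on a single server."""
    return job_cores <= cores_per_box(sockets, cores_per_socket)


# Assumed configurations (not vendor specifications).
configs = {
    "2 x 8-core": (2, 8),
    "4 x 6-core": (4, 6),
    "4 x 8-core": (4, 8),
}

totals = {name: cores_per_box(s, c) for name, (s, c) in configs.items()}
print(totals)
# prints: {'2 x 8-core': 16, '4 x 6-core': 24, '4 x 8-core': 32}
```

Any job that needs fewer than 32 cores, which the survey says covers over 55% of users, would pass `fits_in_one_box` on the four socket, 8-core configuration.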

With that in mind, listen carefully to the following videos. In the first, an interview with Richard Dracott, general manager of Intel’s High Performance Computing Group, we learn about the upcoming Intel EX (8-core Xeon) and a higher clocked six core variant for the HPC market. The un-named six core version is actually a good use of 8-core processors that have two flat tires. The video has much more, including the role HPC plays in Intel’s development cycle and the Ct language.


While gallivanting around the show floor, I sometimes run into people I know and grab an “on the spot” interview. In the video below, I meet Jim Cownie, and after ten years learn how to pronounce his name correctly.


Speaking of scalable SMP systems, I had a chance to talk to Einar Rustad from Numascale about their plug and play HTX card that transforms AMD servers into a full shared memory computer. An HTX port is needed of course, but think of it as extending the HyperTransport bus off the motherboard. I’ll let Einar explain further.


My final interview this week is with Mellanox. As you know, Mellanox is a leading provider of InfiniBand technology. They continue to push the performance envelope with new technologies. This year they announced CORE-Direct, which provides hardware assisted collectives for MPI programs. For example, a typical MPI program may have stages where data are broadcast, all processes synchronize globally, and data are collected across all nodes. By offloading the collective communication, ConnectX-2 adapters help to reduce communication time and CPU cycles. I’ll let Sanjay from Mellanox explain further:
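If the broadcast, synchronize, and collect stages sound abstract, here is a small sketch of the pattern. To be clear, this is not MPI and has nothing to do with how CORE-Direct works in hardware; it is a single-machine stand-in where threads play the role of ranks, and the function name `run_collectives` and the toy numbers are my own inventions.

```python
import threading

def run_collectives(num_ranks=4):
    """Emulate broadcast -> local work -> barrier -> gather with threads."""
    shared = {}                      # stands in for the broadcast buffer
    partial = [None] * num_ranks     # one result slot per "rank"
    barrier = threading.Barrier(num_ranks)

    def worker(rank):
        if rank == 0:
            shared["scale"] = 10     # root publishes the data to broadcast
        barrier.wait()               # after this, every rank can read it
        partial[rank] = rank * shared["scale"]   # independent local work
        barrier.wait()               # global synchronization point

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial, sum(partial)     # gather the pieces, then reduce them

gathered, total = run_collectives()
print(gathered, total)
```

Every rank has to stop and wait at each collective step, which is why moving that work off the CPU is attractive.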


I’ll put a bookmark in the video parade for this week. There are plenty more to come. Stay tuned for the high tech french fry machine and T-shirt hustle. I can promise that these two topics have virtually nothing to do with HPC, which is why you want to watch them.

HPC Gone Wild: The SC09 Video Parade Begins
Wednesday, December 9th, 2009

Most people are not comfortable in front of a camera. For some strange reason, I am. I don’t think I have any particular star quality and certainly I’m no threat to Brad Pitt or his dog, but I find it easier than writing. So rather than write a bunch about what is going on at SC09, I ran around the show floor with my camera man, Vien Hong, and did interviews and other interesting things.

In terms of camera work, I definitely fall into the Clint Eastwood style. One take, maybe a redo if someone falls over, but that is it. I also do things pretty much unscripted. I’ll ask companies, “What is new?” and invite them to give me a quick overview. Well, in most cases it is quick. I may also ask some questions or make some embarrassing comments, but this is the age of YouTube and reality TV. Why practice?

In all fairness, I do rely on my knowledge of HPC to ask some questions and I try to get unscripted natural answers from the people I interview. I like to do the interviews on the show floor rather than in a “quiet spot” because it tends to provide a more informal setting and discussion. In the coming weeks, I’ll be introducing new videos each week. I’ll provide some commentary as well — a little back story if you will. Let me know what you think as I am always trying to improve my craft (cough, cough).

Let’s begin with one event that is near and dear to my heart: the Beowulf Bash. I am proud to say that I was one of the founders of this event. Back in the day when HPC clustering was the bastard child of the show, I had the desire to get all those who conversed on the Beowulf Mailing List to talk to each other in person. Far fetched, I know, but it worked. The event has become a huge success, drawing over 400 people each year. There was even a line to get in the door. The Bash video starts out with some words from our sponsors, but you get to see Don Becker moonlighting as the MC and of course the band singing about supercomputers (at the end) is pretty funny. Oh, and don’t miss Tom Sterling’s comments in the middle.

With the big social event out of the way, let’s move on to the SC09 exhibits. This next video chronicles my entrance onto the show floor through a secret entrance so I would not be mobbed by fans. I also head over to one of the show highlights, SCinet. SCinet is designed and built entirely by volunteers from universities, government and industry and can push 300 Gigabits a second over the network. I get that half right in the first video.

As everyone knows, GP-GPUs are all the rage in HPC. I stopped by NVidia to get the latest and greatest from Andy Keane, General Manager of Tesla Supercomputing. In this video we see three NVidia products including real Fermi processors running in the booth. I also get to inquire about the curious tip jar in the NVidia booth (at the end).

This week’s final video is with Adaptive Computing, formerly Cluster Resources. I find the name change interesting and a sign that HPC is undergoing a significant change. I call it Cluster 3.0, where applications drive the provisioning and execution of jobs. That is, the entire software infrastructure is now dynamic and everything from the application to the OS is booted at runtime. They use the cloud word, but not in the sense of a virtualized cloud, but more like an HPC cloud with dynamic provisioning. I’ll let Trev Harmon and Peter Ffoulkes of Adaptive Computing explain further.

In the coming weeks, I’ll have more videos and some HPC hijinks as well. I enjoy interviewing people, but the limited amount of time only allowed me to visit a fraction of the companies and people I see at SC. Next week I’ll have new and exciting SC09 videos and I promise no more Gladiator Movie comments. And, not to worry, the booth babes, sorry, I mean the professional booth personnel are coming.

HPC Reflections: SC09 in Portland, OR
Wednesday, December 2nd, 2009

Arriving home from the annual SC show has become rather routine. My wife assumes I will be “out of sorts” until after Thanksgiving, my daughter usually wants to inspect my t-shirt haul, and I just want to lie on the sofa and watch some mindless television. As it is now the first week of December, I am basically recovered from my sojourn to Portland, although I find myself wanting to remain on the sofa.

I don’t know if busy is the right word to describe my week. Perhaps over-allocated is a better description. Let’s start with my last night in Portland (Thursday Nov. 19th). It has become somewhat of a tradition that fellow cluster geek and Linux Magazine writer Jeff Layton and I have dinner on this night. Our conversation usually goes something like:

Doug: I’m exhausted.

Jeff: So am I.

Silence

Doug: What was I saying?

Jeff: Something about car exhaust.

Doug: Oh yeah, so I was …

We did manage to make some sense eventually. At one point, I made the rather random comment, “You know, with all that application driven dynamic provisioning stuff now offered by Platform and Cluster Resources (now Adaptive Computing), we are basically done (with perfecting the clustering model).” Jeff asked for a little clarification. I continued, “Well, from a user’s perspective, they are now in the driver’s seat. Issues that the community used to fret over are gone. For instance, getting codes to compile for a specific OS. Now you can just automatically provision the nodes you need with the OS you need. It is almost like the application is now bundled with the entire run-time environment that gets dynamically loaded at run-time. It is kind of like grid was supposed to be, but cloud over-simplified. The OS, interconnect, and file system have now become schedulable resources.” Jeff replied that it was an interesting way of thinking about it. I’m not sure I made a lot of sense in any case.


I had the whole plane ride home to think about that discussion. I decided that I need to write a longer treatment of the topic as it is not easily explained in two or three paragraphs. I have called this Cluster 3.0: dynamic provisioning.


Jumping back to the beginning of the week, the Beowulf Bash was a big success. A big thank you to our sponsors. The event has become one of the highlights of the show. As Tom Sterling remarked, “It is a special vendor-neutral community driven event where you can meet people and talk about HPC.” In case you missed it, there are pictures over at Inside HPC. There will be some video posted real soon.

Speaking of video, I was in front of the camera again this year. I managed to get a pile of interviews and some other (cough, cough) commentary. I should mention that these are not the boring kind of public relations interviews. I pretty much show up with my trusty camera man, Vien Hong, and start asking questions. There are no scripted questions or retakes. If it looks awkward and real, it is. We should have the first batch loaded next week. I’ll supply some back story and links to the videos in my column. I’m still wondering if the gladiator movie comment was out of line. Running around doing video and helping out in the Appro booth made for some busy days. Plus I was showing my Limulus system across the aisle.

In terms of new technology, I think GP-GPUs have made a mark on the market. There was quite a lot of activity in this area. Between NVidia’s Fermi and ATI’s FireStream, the HPC market is in for some big changes. Heterogeneous computing is the new buzz-word. Recent news about IBM Deep Computing dumping the Cell processor confirms that it is now a two pony race (NVidia and AMD/ATI) for the best SPMD co-processor. Over on the x86 side of the market there was news of, get this, more cores.

Another piece of news was the SDSC Gordon Cluster. While the forthcoming cluster can crunch numbers like everyone else, it is designed specifically for data-mining operations. For the first time, a major supercomputing center is not touting where their cluster will land on the Top500 (although SDSC has provided estimates), but rather how well it will do in IOPS (Input/Output Operations per Second). Gordon will use a large amount of flash storage (get it? If you don’t, ask an old person about Flash Gordon). I’ll have more on this topic real soon. I consider it a new breed of cluster which will become an important tool in data heavy HPC.

One other piece of news was the debut of my single case Limulus Machine (four microATX motherboards in one case with a single power supply). It arrived in more pieces than when I shipped it, but I was able to repair it on Monday before the show started. As I am not one to single out any one company, I’ll be coy about it and mention that the shipper’s name rhymes with FedEx. If you hop over to the project page in the link above, you can see some pictures and view copies of the slides that were running at the show. I’ll have more detailed pictures in the coming weeks. Overall the response was very positive. Most people were surprised how quiet it was and how little power it used (I had my Kill-a-Watt meter attached one of the days). A special thanks to Jess Cannata for helping out.

And finally, my SC09 Twittering was kind of boring. Looking at the posts, I can’t figure why anyone would be interested in me searching for coffee. On the other hand, shooting around URLs seems like a good use for Twitter. I am inching closer to my 250 follower goal. The current tally is 183. That and a cup of coffee will get me, let’s see, a cup of coffee.

The Unofficial Guide To SC09 (Part Two)
Wednesday, November 11th, 2009

The week before SC is always a busy time for me. And, as usual, I’m behind with everything. I should get caught up sometime in January. Before I dive into some other SC topics, I wanted to provide a public health notice. As many of you know, SC is an international conference and most of us get there by sitting for hours in long aluminum tubes packed with other people. Then once we are there we all gather in rooms, breathe the same air, shake hands, talk, and generally exchange molecules. The molecules are what I want to talk about. There are some molecule groups that are organized better than others. Indeed, some of these very small molecules have devised ways to make copies of themselves using the cells in our bodies. As these molecules are so small we cannot see them, the best defense is to prevent them from getting to our cell machinery.

Of course I’m talking about H1N1 and other assorted influenza viruses that move through the population this time of year. From 1979 through 2001, an average of 41,400 people died each year in the United States from the flu. To give you some perspective on that number of people, imagine a full football (US) stadium on a Sunday afternoon. Now imagine every other seat is empty due to the flu. You get the idea.

So this year let’s be extra vigilant. Here are the suggestions from the professionals. Wash your hands. Keep your hands away from your eyes, ears, nose, and mouth and use hand sanitizer. I have one more suggestion. I want to introduce the SC fist bump. I propose that instead of shaking hands we do a friendly fist bump. Or if it is someone you don’t like, pretend to accidentally miss and punch them in the stomach. In any case, I’m going to try bumping instead of shaking as a means of slowing those pesky, but highly organized, molecules from spreading during the event. Let’s see how many people we can get bumping.

Back to SC proper. As I mentioned last week, there were a few other things I wanted to talk about. First, there is the Top500 announcement. There is always lively discussion about this topic and what the list actually means. My take is the following: the Top500 is a great idea and it measures how fast we can compute one type of problem. It provides a wealth of historic data that can be used to observe trends and changes in the industry. It is, however, a single data point in the overall performance picture. Keep that in mind when you hear about the latest Top500 list.

One of the reasons I am so busy is I just got some sheet metal back from the shop for my Limulus machine. Back in October 2008, I made a case (no pun indented) for a couple of pieces of steel bent the right way so that multiple motherboards could be placed in a single PC case. Well, I took my own advice and if all goes well you can come see the results in a corner of the SICORP booth (#1209) next week. I will have a small pedestal to show my LInux MULti-core Unified Supercomputer (Limulus). Stop by and take a look. I’m interested in what you think.

Finally, I want to invite everyone to the annual Beowulf Bash on Monday night. I have been involved in this community event for quite a while. It is a great time to say hello to many old friends and meet some of the people who started it all. This year we tried to pick a location so there would be some areas for casual discussions while the entertainment is playing. In addition, we have a “free as in beer” theme because a local HPC company (who wishes to remain anonymous) has donated five kegs of custom brewed beer for the event. The party begins at 9PM on Monday November 16th, 2009 (after the opening gala). We are holding it at The Game, at the Rose Quarter, One Center Court (one really long block + two short blocks from the convention center, at the Rose Quarter Max Stop). Click Here to see the map. For public transportation information, visit Tri-Met’s website.

Of course I would be remiss if I did not mention the Beobash sponsors: AMD, Econnectix, InsideHPC, Penguin, SICORP, Terascala, Xand Marketing, and ClusterMonkey. I hope to see you there. Stop by and bump me.

The Unofficial Guide To SC09 (Part One)
Wednesday, November 4th, 2009

The big HPC show is upon us. As a seasoned veteran, I thought I would provide some insights into the Supercomputing trade show experience — perhaps one of the coolest trade shows on the planet. First, let’s clear up one thing. While the show is called Supercomputing by many people, it is officially called “SC.” You will be hard pressed to find the word supercomputing on the SC09 web site. At one point in the past Supercomputing went away and was replaced by SC. I don’t know why. It seems to be one of those things that just happened.

If you are not attending SC, then stay tuned for my reports. I will be trying to twitter my way through the show. This is an experiment because normally I don’t post all that much. For the most part, I find the whole thing a little weird. Call me old school, but I think that you say something when you have something to say. I write what I want to say in columns like this, so brain droppings on twitter seem rather senseless to me. Plus, I can sum up what I do on a daily basis in a few short sentences. Drink coffee. Take the dogs out. Write stuff. Talk on the phone. Drink coffee. Make to-do lists. Help a handful of clients. Read my to-do lists. Drink coffee. Read stuff. Reply to email. That is pretty much it. At least when I go to a trade show there will be something new and exciting to twitter about (in theory). And, I’ll be doing video at the show, which depending upon whom you ask is not necessarily a good thing™.

Location
This year SC is in Portland, OR. I have been all over the country attending trade shows and when I tell people, “I’m going to (fill in the blank),” they politely say, “Oh (blank) is a great city, you have to go to (fill in the blank) and don’t miss (fill in the blank).” I usually sigh, then explain that although I am going to these (mostly) nice cities, it is not a vacation. “I actually spend a large amount of my time in a really big hall with lots of blinking lights and fans. I could be in Hoboken, NJ for all I know.” (Not that there is anything wrong with Hoboken.)

There is the after hours night life, which will be covered next week. For the most part, traveling to other parts of the country is nice, and I do manage to see some of the main attractions in the cities. Some of the social events try to integrate the local experience, but in general after a few beers, I could be in Hoboken.

Announcement Overwhelm
Any industry trade show generates its share of press releases. Indeed, there is often a “quiet period” in the months preceding the show where there is very little news. Then a week before the show, the emails and phone calls start. The news flow builds until Monday of the show where there is so much news I just get a complete sense of overwhelm. I have found it impossible to sift through all the news the week of the show. I often ask other people what they found interesting as there are just too many announcements to digest.

Which brings me to the other issue — useless press releases. Let me go on record and say that there are many important announcements at SC. Indeed, it seems that many of the “big companies” make “big” announcements that extend beyond the HPC world. These announcements give credence to the whole HPC market. Then there are the other announcements. I understand that companies need press releases at trade shows to help grab some portion of the limelight. And, I assume the plan is that someone like me will find and re-print your announcement about how your company now has Cloud enabled widgets (in blue). These announcements all show progress, which is great, but I’m always looking for the ones that offer something new or fundamental to the game.

Exhibit City
The exhibits are one of the more interesting parts of the show. SC is rather unique in that both vendors and customers (universities and national labs) have displays. There is, of course, a synergistic element in the HPC market, but I have always been puzzled as to why non-vendors (i.e. the customers) have displays. I do think it is a great idea because I believe conversations build community. One possible reason may be the competition for research dollars or just some good old fashioned public relations in the sexy HPC arena.

In any case, it takes some time to walk the exhibits. The big names are there along with some other interesting areas like the Disruptive Technologies Exhibits or the Student Cluster Competition. Keep an eye on some of the small vendors with big ideas in the low rent booth displays because sometimes they grow into big vendors (or get purchased by big vendors). One observation I have about SC is that almost all the companies bring members of their technical teams. If you have a question about a product, package, or benchmark, you can usually find a technically competent person with which to talk. You may have to come back or hunt them down, but you can finally meet some of the people who turn the gears.

Papers and Discussions
If I had more time, I would probably attend many of the papers because I like to hear about new ideas and the leading edge. In order to get into the conference portion of the show, you need a conference badge, which costs extra; a free exhibit badge won’t work. One of the best kept secrets of SC is the Birds Of a Feather (BOF) sessions, which are open to everyone. These are often informal presentations by individuals and groups. The discussions are open and the audience is invited to contribute. Al Gore is slated to present the key-note address on Thursday at 8:30-10:00 AM. A convenient time for most of us east coasters who have already been up for four hours.

I’ll end there for this week. Next week I’ll cover the Top500 list, cool booths, and the social aspects. (Yes, I know that is what you really want to hear about.) And, if everything fits together (literally), I hope to provide some news about where you can see the Limulus Project at SC09. Not to worry, I will prepare and distribute a press release in case you miss the show.

You Need A Cloud Based Grid Supercomputing Cluster
Tuesday, October 27th, 2009

As strange as it may seem, I sometimes do not know exactly what I will be writing about when I start a column. Okay, if you read my columns it may not seem so strange. In any case, I often try to think of a topic and then try and determine if it would be interesting to the hard core HPC audience. For instance, just today there was another silly headline, “… Releases Desktop-Sized Supercomputer.” Wow, I thought maybe someone is doing something interesting. Click to the article and find it is a PC with three Nvidia Tesla C1060 cards. Yawn. That is not news, that is plugging boards in some PCIe slots. Now, don’t get me wrong, I like what NVidia is doing, but at some point, plugging boards in a PC and running around yelling “Supercomputer” is just a bit much for an old HPC geek like me.

Again, let me say to those that may take this the wrong way, GPU computing is great stuff, NVidia is doing cool things and they are to be congratulated for putting big FLOPS in small places, but I find the PC vendor hype is a little too much. Let’s back up a little bit because you may be wondering why I’m all hot under the collar about this issue.

First, some of you may not know the word “hype” is short for “hyperbole” which means excess or exaggeration. It is a rather polite way to say male bison excrement, which is a polite way to say, well you know what I mean. Second, I will make the rest of this rant general enough that you can substitute clouds, grids, green, and clusters, for the word supercomputer at any place in the text.

Where to begin. Let’s take a look at electric heaters. Electric heaters are appliances that convert electricity to heat. If I buy an electric space heater, take it out of the box, turn it on, it gets hot. Clear? Moving on, if I buy a desktop PC with three GPU cards in it, take it out of the box, turn it on, it also gets hot and it sits there waiting. Heater heats, computer waits. I hope I’m not going too fast for you. Let’s take it a step further. If I buy one thousand servers each with eight cores and turn them all on, I have eight thousand waiting cores. The computer needs something else.

To help emphasize my point, I have designed a small quiz to see if you are an HPC geek or a marketing droid. Here is question number one.

If I purchased 1000 servers do I have a (select all that apply)

  A. space heater
  B. supercomputer
  C. grid
  D. cloud

If you answered A, you are a geek. If you selected B, C, and D, you are definitely a droid. If you only selected B, C, or D, then you might make a good droid, but you are thinking too narrowly. The HPC geek would realize that raw hardware does not become something until the right software is installed and configured properly. The clever marketer would realize that depending on the latest industry buzz, the “Cloud Ready” sticker could be placed over the “Grid Ready” sticker.

Which brings us to the next somewhat related issue. Floating point performance. Here is the second question.

How fast is your computer?

  A. really fast
  B. really, really fast
  C. really, really, really fast
  D. need more information

If you answered D, then you are a geek. Get back to work as you probably don’t care about the rest of this article. If you selected C, then you are a marketing droid extraordinaire. If you chose A or B, then we need to talk about upgrading your computer because you need a “really, really, really” fast system.

More to the point, when I read a FLOPS number, if I don’t see the name of the benchmark or the word peak within a few words I immediately stop reading because the hype machine is in full gear. I recall being told by a big company marketing droid not to use the results for the NAS parallel LU benchmark I ran because the GFLOPS number was not what everyone else was getting. He then went on to explain that everyone was getting XX GFLOPS and I should report that number. I assumed he was talking about the HPL number, but just to be sure I asked. He said something that I did not understand. I reported the LU number because I always thought credibility was a good thing and besides, why would I report the HPL number for one of the NAS tests?
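To show what an honestly labeled FLOPS number might look like, here is a minimal sketch. It assumes the usual back-of-the-envelope definition of theoretical peak (cores times clock rate times floating point operations per cycle), and every figure in it is made up for illustration; none of it comes from a real machine or a real benchmark run.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x floating point ops per cycle."""
    return cores * clock_ghz * flops_per_cycle

def describe(measured_gflops, benchmark, cores, clock_ghz, flops_per_cycle):
    """Report a measured number with its benchmark name and the peak beside it."""
    peak = peak_gflops(cores, clock_ghz, flops_per_cycle)
    efficiency = measured_gflops / peak
    return (f"{measured_gflops:.1f} GFLOPS ({benchmark}), "
            f"{efficiency:.0%} of {peak:.1f} GFLOPS peak")

# Hypothetical node: 8 cores at 2.5 GHz, 4 flops per cycle per core.
report = describe(20.0, "NAS LU", cores=8, clock_ghz=2.5, flops_per_cycle=4)
print(report)
```

The point is not the arithmetic, it is that the benchmark name and the word peak travel with the number.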

Here are some take-away points. First, a pile of hardware is just that. I have said repeatedly in the past, racks of hardware are not a supercomputer. HPC clusters are built from commodity parts, but that does not mean any commodity PC vendor can sell and support a supercomputer. Of course, it is easy to add the buzz words to web pages and brochures, but good HPC vendors are few and far between. Second, the term supercomputer has no strict definition other than “a type of computer you cannot afford”. The same goes for clouds, grids, clusters, green computing, and whatever the next big thing is. In the end, the over-selling and over-hyping of anything is counter productive, particularly if you have no clue about your customer’s needs, or what it really takes to deliver FLOPS on a consistent basis. As I tell vendors who ask about the HPC market, “You know those rocket scientists you always hear about, well they use HPC as do a bunch of other really smart people. They respond well to results, benchmarks, and new products (that work). You will lose big points by trying to over sell or hype products. These are smart people, give them the good data, understand their needs, and they will make the right decision.”

Finally, I realize marketing hype is a worn out topic, so I won’t continue droning on about what seems to be a side effect of the free market system. It just feels good to let off a little steam sometimes. And, no offense to the droids out there. You are just doing your job. Perhaps this column will help you refine your approach to the market. I also want to mention, SC09, which I believe is cloud enabled this year, is a few weeks away. I will be sending messages to the twitter cloud as I travel the show floor. In addition, the crack Linux Magazine video team will be there once again. Shine-up those supercomputers.

Parallel Programming: Non-optimal Is as Non-optimal Does
Wednesday, October 21st, 2009

Before I begin yet another discussion of “the ways we don’t know how to describe many things happening at the same time,” I feel obligated to point out the most non-optimal thing I have seen in a while. By “non-optimal” I really mean a politically correct way of saying stupid. I have told my daughter, never tell anyone they are being stupid, just use the phrase non-optimal, chances are, they will not understand it, in which case your intuition was right.

Getting back to today’s non-optimal tidbit. Even with spam filters, I usually get a handful of unwanted emails each day. Today I have seen a bunch of emails slipping through with the most hilarious From: and Subject: lines. The messages it seems are from United Parcel Service or as I like to say UPS. The subject line reads FedEx Tracking N5421062126 or as I like to say FedEx. Maybe I did not get the memo, but I don’t think UPS is delivering FedEx packages. There is a total impedance mismatch of the sender and the subject. This contrivance shows the spammer has either a basic misunderstanding of the current world-wide package delivery system or is just plain non-optimal (or both). And, I’m not even going to mention the contents of the email that references the postal package and October 18st. Of course, I assume the non-optimality will continue as Jane and Joe SixPack click and add yet another piece of crud to their computer.

As I was deleting this and other spam, I was watching a video about a new programming concept called Swarm. When I first read the description one thing jumped out at me; Swarm embodies the maxim “move the computation, not the data”. I thought, “Yes, now this is a good idea - fluid dynamic computation.” About mid-way through the presentation, Swarm designer Ian Clarke mentioned that he was using Scala 2.8 because it had something called continuations. I must admit I don’t know anything about Scala, but Ian’s description of continuations made some sense.

Continuations remind me of the C Blocks which I recently discussed in a column about Apple’s Grand Central Dispatch. The basic idea is to put a bookmark in your computation so you can come back to it later, send it off to be executed elsewhere, or just wait.

The idea with Swarm is that the computation can be moved to the data (from computer to computer) because continuations allow you to capture or freeze program state and run it at another time or place. I have talked about program state and how it is the bane of parallel computing in a previous installment. Anything that allows program state migration always piques my interest, but I find it non-optimal to try and capture state in procedural or imperative languages. Capturing state in procedural languages is a delicate procedure. In addition to preserving the state of the execution stack, keeping track of memory you have touched and will touch is not easy.
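To make the bookmark idea a little more concrete, here is a rough sketch. It is not how Swarm or Scala continuations work; Python has no first-class continuations, so I keep the program state in an explicit dictionary and freeze it with the standard `pickle` module. The function `run` and the field names `next_index` and `total` are hypothetical names of my own choosing.

```python
import pickle

def run(state, data, steps):
    """Advance a running sum by at most `steps` items and return the new state."""
    i, total = state["next_index"], state["total"]
    end = min(i + steps, len(data))
    for k in range(i, end):
        total += data[k]
    return {"next_index": end, "total": total}

data = list(range(1, 11))                 # the numbers 1 through 10

# Start the computation "here" and stop part way through.
state = run({"next_index": 0, "total": 0}, data, steps=4)

# Freeze the state into bytes. This is the bookmark that could be
# shipped to another machine sitting next to the data.
frozen = pickle.dumps(state)

# Thaw it "there" and finish the job.
thawed = pickle.loads(frozen)
final = run(thawed, data, steps=len(data))
print(state, final)
```

Notice how much bookkeeping I had to do by hand, even for a running sum. That is the delicate part: everything the computation needs must be in the captured state, and nothing it needs may be left behind.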

Managing state is the difficult issue with parallel computation. That is, in a procedural language, the programmer must make sure each parallel part of the code does not have side effects — changing (or not changing) memory values that are used by other parts of the code. If you program using MPI, you are managing state at a very low level where you are explicitly copying data or state from one machine to another and making sure each parallel procedure is independent. This approach works quite well for many of the big HPC codes, but can be a tedious and time consuming programming effort.

One possible pathway to reduce the complexity of parallel programming is the use of declarative languages. Pure declarative languages do not have state, or at least the programmer does not have to worry about it. It makes for a very different way to program, but also opens up a bunch of possibilities for parallel execution because the user is decoupled from managing state or program execution. For this reason, I have advocated looking at functional languages like Erlang or Haskell. I also want to be clear, these languages are not the whole solution, but they offer a fresh non-procedural approach to concurrent programming.
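Here is a small sketch of why the absence of state helps. Python is neither Erlang nor Haskell, so treat this only as an illustration of the principle: a function with no side effects can be handed to a pool of workers and evaluated in any order without changing the answer. The function `kinetic_energy` and the particle values are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def kinetic_energy(particle):
    """Pure function: the result depends only on the argument, nothing else."""
    mass, velocity = particle
    return 0.5 * mass * velocity ** 2

particles = [(2.0, 3.0), (1.0, 4.0), (4.0, 1.0)]   # (mass, velocity) pairs

# One at a time, in order.
serial = [kinetic_energy(p) for p in particles]

# Farmed out to workers. No locks or copies are needed because
# no call touches memory that another call uses.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(kinetic_energy, particles))

print(serial == parallel)
```

In a declarative language the runtime can make that decision for you, which is the whole attraction.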

If you have been paying attention, a few columns ago, I made a positive mention of C Blocks, which are a way to capture state in the grand-daddy of procedural languages. Indeed, I think C Blocks are a good idea because I consider C to be the universal assembly language. Plus, there is existing code that may benefit from C Blocks. What I think is a non-optimal idea is trying to capture state in many of the new age procedural languages. Of course, that is my opinion and I am sure some may disagree.

I’ll stop at this point as I’m sure I stirred up enough issues for one column. I still have not reached Best Buy Manager Status on Twitter, but then again maybe if I posted something I might get more followers. I was going to tweet from the NVidia GPU programming conference, but I came down with a cold and really did not feel like doing much of anything. Plus, I’m not sure if tweeting about non-optimal issues like placement of power outlets in airports or people in the back of the airplane who stand up and grab their bag as soon as the seat belt light goes off is worthwhile. In any case, I’m off to pick up my FedEx postal package at the UPS office.

The Return of the Vector Processor
Wednesday, October 14th, 2009

In last week’s installment, I talked about Nvidia and the subsequent announcement of their new Fermi GPU. I also invited you to take a look at the Fermi White Paper. I’m not going to rehash the white paper here, but I do suggest reading it.

Before I dive into Fermi the GPU, I wanted to take a moment and pay tribute to Enrico Fermi the person, for whom the new GPU was named. For you young whippersnappers out there, Enrico Fermi is one of the giants of physics. He was instrumental in advancing physics on many fronts including quantum theory, nuclear and particle physics, and statistical mechanics (a favorite subject of mine). He had a rare combination of talent that allowed him to be both an excellent theorist and experimentalist. His legacy is legendary as he has an element named after him (Fermium, a synthetic element created in 1952), a national lab (the Fermi National Accelerator Lab), and a class of particles that bear his name (fermions). No lightweight this Fermi fellow.

Which brings me to the Nvidia Fermi architecture. In terms of HPC it is also no lightweight. I believe it is going to be a game changer. Let me explain. My previous opinion of GP-GPU computing was certainly positive and I could see it changing the HPC game in some corners. Users were reporting fantastic speed-ups in some areas, new users could experiment with existing video cards, and the larger video market was going to keep the cost down. There were, however, some fundamental issues expressed by many of the more traditional HPC users. Until they were resolved, I assumed these issues would limit just how far GPU computing could go in the HPC world. Based on the Fermi technical material I have read, Nvidia has been listening and many of these issues have been addressed head-on. I predict this device and others like it will change the HPC landscape, but first a few details about the Fermi architecture.

One concern voiced by Michael Feldman of HPCwire was the lack of ECC memory. This opinion, combined with a recent University of Toronto/Google paper, DRAM Errors in the Wild: A Large-Scale Field Study, points to the need for ECC in the data center. (As an aside, I’ll have more to say on this topic in the coming months. I have been rigorously testing non-ECC systems and I find some of them quite robust.) Nvidia has been listening to the market and one of the new features offered by Fermi is support for ECC memory. This new feature is not an afterthought as ECC protection extends down from DRAM to L2 and L1 caches, shared memory, and register files. Scratch that one off the list.

Another issue was double precision floating point performance. The rule of thumb for current GT200 Tesla systems is a double precision FLOPS rate that is about one-eighth that of the single precision rate. This situation was due to the number of double precision units on the chip. The Fermi design team has worked to balance this mismatch and reports an eightfold increase in peak double precision floating point performance. Also of note is full IEEE 754-2008 32-bit and 64-bit precision.

Another feature of note is a total of 512 CUDA cores per chip. A CUDA core executes a floating point or integer instruction per clock for a thread. All memory (thread private local, block shared, and global) is now fully addressable, which allows for global pointers and clears the way for C++ applications on Fermi. Concurrent kernel execution is another new feature. In past designs CUDA kernels (sections of code that use the GPU) operated one at a time. If a program had multiple small kernels, each would have to wait its turn. With Fermi, multiple small kernels can run at the same time. This, combined with much improved context switching support and improved scheduling, makes for better utilization of the entire GPU. There are many other features in Fermi that I don’t have room to mention. Check out the white paper to get more detail.

Let’s step back and look at the big picture for a minute. When I asked a friend of mine what he thought of GPU computing he said, “Well almost everybody in HPC can use an array processor.” “That is quite true,” I replied, “And, now most everybody can afford them.” Then it dawned on me, one way to look at GPU computing is the return of the vector or array processor to HPC.

In the past, I have made the point that low volume specialized CPUs, like those found in very high end vector supercomputers, had become too expensive to manufacture. In response, the market swung to commodity processors and clusters. Great reductions in price-to-performance ensued, but now we have come to the point of asking just how big we can scale these clusters before power, space, and even the speed of light become an issue. We can continue to add cores to commodity processors, but in effect we are just adding more general purpose scalar nodes to the equation. Collections of general purpose scalar nodes (i.e. the commodity x86 core) can work on SIMD (Single Instruction Multiple Data) problems just fine, but a modern GPU has been optimized for this class of problem, which includes both graphics and HPC solution spaces. In fairness, the general purpose scalar node can also work on MIMD (Multiple Instruction Multiple Data) problems for which GPUs were not designed. (For those who like to break the rules, however, check out the MIMD on GPU project.)

The GPU, and in particular the Fermi architecture with its HPC features, is bringing vector processors back into supercomputing. Enhancing the commodity scalar core with a commodity vector processor makes a boat load of sense. And now that Fermi has planted the GPU-HPC flag firmly in the ground, I see others following. Like the physicist for whom it is named, I believe Fermi will contribute in big ways to many areas. I do have one request for NVidia, however. When you sit down to plan the next generation GPU, use the code name Boltzmann. Ludwig and the whole S=k*logW thing could use some of that GPU limelight.

When HPC Is Not An Afterthought
Friday, October 2nd, 2009

I’m heading out to the NVidia GPU Technology Conference this week. Notice that the title does not state whether it is a Video or HPC conference and as a matter of fact it will cover both. This concept fascinates me. Dual use hardware is how the whole cluster thing got started. One difference back then was the pioneers were told, “You can’t do that.” Good thing for us they did not listen. Today, there are many HPC products designed for the cluster market. The reality is, however, as technology gets more expensive to produce the only way to make a profit is to sell large quantities. Designed dual use is a great solution to this problem. I am not talking about pulling things from other markets as it was done in the early days, but rather, products that are specifically designed for two markets — in this case HPC and Video processing.

NVidia is a great model for this approach. Not that long ago, they added some HPC features to their products and created software tools to help people use these new features. They also made sure their products were screaming fast for the video market as well. I imagine a possible conversation that went something like this:

You know, if we add a little gizmo here and here the HPC people could use these.
Will it hurt the video performance?
No.
Can we sell more product?
Yes.
Then let’s do it.

And thus, the GP-GPU (General Purpose Graphics Processing Unit) was born.

One aspect of “designed dual use” with which designers must grapple is feature balance. If the HPC guys had their way, the GP-GPU would be a full blown data parallel computer on a chip. The market does not justify the cost and thus they need to negotiate chip real estate with the video side of things. In a way, designed dual use HPC will never have all the features users want, but most of what we want is better than none of what we want.

From a market perspective, the balancing act goes a bit deeper. The growth of cluster HPC was due in part to the ability of servers, Ethernet, and MPI to deliver pretty good performance to most people. This is a key point that is often missed. Let’s do a little math. Suppose there are 100 HPC users in the world and each year they each spend $1 million buying HPC hardware. Let’s further assume that for a certain period of time the market moves along with a handful of companies competing to sell large and fast big iron (million dollar) systems.

Along comes a technology (clusters) that can deliver 80% of the big iron performance for $100,000. Next, assume that 60% of the market is fine taking this reduced performance because it saves money. The traditional HPC ($100 million) market has been reduced to a $40 million market. The term that is often used in this case is “disruptive technology”. Note that better features or performance did not take away market from the incumbents; it was just the opposite. It was the “good enough crowd”, those that could live without the best.
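The arithmetic behind this example can be checked with a short sketch. It uses only the figures quoted above; the function and variable names are made up for illustration, and the cluster revenue it prints is simply derived from those same figures.

```python
# Toy market model using only the figures quoted in the text.
USERS = 100                  # HPC buyers in the hypothetical market
BIG_IRON_PRICE = 1_000_000   # dollars per big iron system
CLUSTER_PRICE = 100_000      # dollars per cluster
CLUSTER_SHARE = 0.60         # fraction of buyers happy with "good enough"


def market_split(users, big_iron_price, cluster_price, cluster_share):
    """Return (big iron revenue, cluster revenue) after the shift."""
    cluster_buyers = round(users * cluster_share)
    big_iron_buyers = users - cluster_buyers
    return big_iron_buyers * big_iron_price, cluster_buyers * cluster_price


big_iron, cluster = market_split(USERS, BIG_IRON_PRICE, CLUSTER_PRICE, CLUSTER_SHARE)
print(f"Big iron market: ${big_iron:,}")
print(f"Cluster market:  ${cluster:,}")
```

Under these assumptions the cluster vendors do not capture the $60 million the incumbents lost. Most of that spending simply leaves the market, which is part of what makes the shift so painful for the big iron vendors.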

When I look at the GP-GPU market, I see some similarities. Yes, it is not the best for all cases, but it may address a large portion of the market that can live with fewer features and lower cost. That is, a better price to performance ratio (i.e. dollars per TFLOP).

This trend is why I am watching the GP-GPU thing very carefully. And, like clusters, there is almost no barrier to trying. In the early days, you could cobble together a cluster and see what kind of “ball park” performance you could achieve. With GP-GPU computing, you probably already have a video card that can run CUDA applications. Unlike clusters, however, where MPI programs could be literally recompiled and executed, there is some reprogramming into CUDA required before you can run your codes on NVidia hardware. The good news is you can incrementally add “CUDA-ness” to your program; that is, a complete restructuring is not needed. In terms of cost, it is basically your time. And, if it works for your application set, then you may be looking at quite a performance bump.

As I still must pack my suitcase and finish some emails, I’ll end here for today. I’m sure I’ll have some things to report back from the conference next week, so stay tuned. Besides, my sidekick and fellow HPC scrivener Jeffrey Layton will be there. We will be rooting around for all things HPC. And speaking for myself, I promise we will do a “good enough job” and deliver to 60% of the readers. The rest of you, well we will cover that some other time.

Update: Due to some scheduling issues, this installment did not get posted until after I returned from the GPU Technology Conference. As I mentioned, I wrote it just before I jumped on the airplane heading to the conference. I think the announcements and the event attendance support my point. As a matter of fact, I was genuinely surprised and delighted to hear the features of the next generation Fermi GPU. Here are some of the key points: support for ECC memory, 512 cores, 8x double precision performance increase, concurrent (CUDA) kernels, support for C++, and more. In terms of HPC, these could be game changing features. I’ll have more next week. For now, here is your homework: grab the Fermi White Paper and study the new features so next week when I jump into my HPC is at a cross-roads rant you will be ready.

HPC on Wall Street: Report From the Front
Wednesday, September 23rd, 2009

Last week I attended the HPC on Wall Street meeting at the Roosevelt Hotel in New York City (Follow the link to the speaker slides). What a difference a year can make. If you recall last year, the September meeting was right in the middle of the Wall Street meltdown. Lehman and Merrill were recent casualties and everyone seemed to be waiting for the other shoe to drop. Attendance was down and the most asked question seemed to be “How are you guys doing?” (“guys” meaning a company). There was also some justified panic in the air as the “Wall Street: What could go wrong?” boat was taking on more water than most people suspected.

I don’t want to dwell in the past, other than to say, at the event last year, some attendees were of the opinion that IT, and in particular HPC on Wall Street, would take several years to rebound. The abrupt failures of Lehman and Merrill would mean lots of extra hardware floating around and reduce sales. Furthermore, HPC clusters are used to calculate the risk associated with those derivative things we were hearing so much about. Once you know the risk, then you can set a price. The only problem is those fancy Monte Carlo calculations never had a variable for those mortgages that were being sold to anyone who could hold a pen. Thus, some introspection and retrenching seemed to be in order.

Jump ahead to this year, and it seemed like nothing ever happened! The break is over, everybody back in the pool. Of course, this is based on my casual observations, but there was a huge crowd and I don’t think they were looking for jobs. Indeed, people were talking about HPC and all the goodies that go with it. There were also a large number of vendors, which I took as a positive sign. As for myself, I was pulling double duty. I was working as a booth geek at the Appro International table and as the Linux Magazine HPC dude. As a journalist, I did have a chance to sit down with Erik Troan (of Red Hat RPM fame) and talk about his new company rPath. I’ll have more on the rPath “deep version control” technology in the future.

There is a little more to the Appro gig, but first, I want to jump back to last week where I talked about how 10 GigE (like both Fast Ethernet and GigE) would come down in price and see wide use in HPC. My reason was the simplicity of Ethernet. Now, before all my InfiniBand friends jump all over me, I also said IB is the high performance leader and it is already faster than 10 GigE. I am rehashing this story because my buddies over at HPCwire posted an introduction to the story and then posted a rebuttal from reader Patrick Chevaux. And then, in a comment to Patrick’s post, Open MPI developer/leader Jeff Squyres indicated the number of lines of code in Open MPI for each different network transport.


  • Myricom MX (Myrinet 10G and Ethernet): 1,210 and 2,331
    (Open MPI has 2 different “flavors” of MX support)
  • Shared Memory: 2,671
  • TCP (used by Ethernet): 4,159
  • OpenFabrics (InfiniBand): 18,627

That is correct. The IB OpenFabrics interface requires about 4.5 times more code than using the TCP layer, about 7 times more than shared memory, and at least that much more than either flavor of the Myricom MX interface (which is available for GigE through the Open-MX Project). Again, IB is good stuff and the list above does not speak to performance, but it does give an indication of the complexity of various transport layers.
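For anyone who wants to check the ratios, here is a minimal sketch that works them out from the line counts in the list. The labels “flavor 1” and “flavor 2” are placeholders of mine, since the list does not say which MX flavor has which count.

```python
# Lines of code per Open MPI transport, copied from the list in the text.
loc = {
    "Myricom MX (flavor 1)": 1_210,   # placeholder label
    "Myricom MX (flavor 2)": 2_331,   # placeholder label
    "Shared Memory": 2_671,
    "TCP": 4_159,
    "OpenFabrics": 18_627,
}


def ratio_to_openfabrics(counts):
    """How many times larger the OpenFabrics code is than each other transport."""
    base = counts["OpenFabrics"]
    return {name: base / n for name, n in counts.items() if name != "OpenFabrics"}


ratios = ratio_to_openfabrics(loc)
for name, r in ratios.items():
    print(f"OpenFabrics is {r:.1f}x the size of {name}")
```

Line counts are a crude measure of complexity, of course, but they are cheap to compare.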

As I said, there are those that prefer simplicity to performance. Perhaps a car analogy will help. There are all types of cars. They all can get you from point A to point B. If you need speed, then a Formula 1 car is your best bet. Such cars are not simple to drive, only hold one person, require specialized parts, but boy can they go. On the other hand, if you are not as interested in speed, you choose a slow pickup truck. It may not have the speed, but it just works, it is dependable, can haul lots of stuff, parts are cheap, and many of your neighbors have the same model. It is all about your needs and budget. InfiniBand is like the F1 car and Ethernet is like the pickup truck. (And please, don’t quote this out of context.)

Coming back to HPC on Wall Street, the finance models that run on clusters are not as sensitive to interconnect speeds as many other codes, thus they are a good fit for GigE and 10 GigE. Prior to the event, Appro hired me to write a white paper about 10 GigE and clustering. In the paper, I discussed various cluster designs using Appro hardware and Arista 10GBASE-T switches. (10GBASE-T is 10 GigE with standard RJ-45 connectors and Cat 6 cabling). Part of my job at the show was to answer questions about my freshly minted white paper. The paper is currently available without registration from Appro. The upshot is, if Ethernet works for you, the comfort level you now enjoy can continue with your next cluster. The white paper helps you understand how to implement a cluster using 10GBASE-T (or SFP+) Ethernet.

I expect a few more chapters to the 10 GigE story this year. I find it rather comforting that discussions about HPC interconnect options are taking place. Last year at this time, the only options people seemed to be talking about were the now worthless ones.

Twitter update: Did you read my announcement about HPC for Dummies? If you were one of the faithful followers this would be old news.

Parallel Programming: I Told You So
Tuesday, September 15th, 2009

Not being one to gloat (cough), I would like to take this opportunity to say, “I told you so.” In case you are wondering what it was exactly I said, I will recap.

I periodically bemoan the state of parallel programming. My experience with parallel programming stretches back to the late 1980’s. At that time, I was working on various parallel programming ideas including Fortran conversion and Logic Programming. The lessons I learned were twofold. First, parallel computing is hard, and second, we need to re-think how we do things. This opinion is shared by virtually everyone who has worked in HPC over the years. The prognosis is not that great either. Take a look at Greg Pfister’s blog where he writes about comments like Parallelism Needs a Killer Application. By the way, Greg’s other blog posts are worth reading.

In a similar vein, Mike Wolfe of The Portland Group (producers of excellent optimizing compilers) writes in HPCwire about the recent OpenCL fanfare. His article entitled Compilers and More: OpenCL Promises and Potential provides a fair assessment of OpenCL. I like to think of these articles as “buckets of cold water thrown on the masses” who have been whipped up by various media types. Every time I hear the words “parallel programming”, “solution”, “simple”, and “breakthrough” used in a sentence, I usually know it is some kind of sales pitch. In the case of OpenCL, I have read too many times that OpenCL is a simple unifying way to program parallel GPUs and host processors. Of course, things like CUDA and OpenCL are welcome additions to any GPU programmer’s tool box. They help quite a bit and are important technologies. Do they solve the general parallel programming challenge? Not really and neither does anything else.

Before I move to the second issue, I want to state that my definition of the “parallel programming challenge” is to allow a programmer to write one program and have it run with reasonable efficiency on multi-core, GP-GPU, or clusters. A tall challenge if there ever was one. I don’t think it is impossible, but it might take re-thinking how we currently do things.

I believe part of the rethinking we need to do is about problem expression and execution. For instance, I have been waving the Functional Programming Flag for a while. In addition, I believe dynamic execution can solve a lot of problems. That is, the program will need to decide how to do things at run-time and not compile-time. The basic idea is to use a functional approach to naturally express parallel parts of a program and create a queue of parallel work. When the program runs, the queue is executed depending on the type of resources available. If there is one core, then the queue is resolved one task at a time; if there are many cores, the queue is resolved in parallel. I have explored this idea in a series of three articles I wrote several years ago (You Can’t Always Get What You Want, The Ignorance is Bliss Approach, and Explicit Implications of Cluster Computing).
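To make the queue idea concrete, here is a minimal sketch in Python. It is not taken from the articles mentioned above; the names run_queue and square are invented for illustration, and a thread pool stands in for “many cores”. The point is only that the same queue of independent tasks can be resolved one at a time or by a pool of workers without touching the tasks themselves.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_queue(tasks, workers=None):
    """Run every zero-argument callable in tasks; return results in queue order."""
    workers = workers or os.cpu_count() or 1
    if workers == 1:
        # One core: resolve the queue one task at a time.
        return [task() for task in tasks]
    # Many cores: hand the same queue to a pool of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]


def square(n):
    """Build an independent, side-effect-free task."""
    return lambda: n * n


queue = [square(n) for n in range(8)]
serial = run_queue(queue, workers=1)
parallel = run_queue(queue, workers=4)
print(serial == parallel)
```

Because the tasks have no side effects, the answer does not depend on how many workers resolve the queue. In standard Python, threads will not speed up CPU-bound work like this, so a process pool would be the more realistic choice there; the thread pool just keeps the sketch short.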

Of course, I don’t propose that I am the only one considering this approach to parallel computing. Indeed, I have always assumed there are those more adept than me working on variations of this idea. Recently, I was delighted to read that Apple has open-sourced a project called Grand Central Dispatch (GCD). Essentially, Apple believes GCD is a better way to manage multi-core parallelism than using threads and has included it in the latest version of Mac OS X (10.6, Snow Leopard).

What is GCD exactly? As always, a little background is needed. GCD is based on work queues and “C blocks”. Most people get the work queue idea, but “C blocks” are a somewhat new concept for most programmers. C blocks are not a standard feature of the language, but look like function definitions with one big difference — they can capture the state from their surrounding context and save it (read-only) within the block for later execution. Like C function definitions, they can take arguments, and declare their own variables internally.

When a block expression is executed, both a reference to the code within the block and a snapshot of the current state of local stack variables at the time of its invocation are created. The code is not executed, but a type-safe opaque value is created that can be assigned to variables, passed to functions, and otherwise treated like a normal language value. One way to use blocks is to place parallel blocks in a queue.
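The snapshot behavior is easier to see in a small example. What follows is a loose analogy in Python, not C block syntax: functools.partial is used because it stores the value of its argument at creation time, which is roughly what a block does with the local variables it captures. The names scaled, factor, and work are made up for illustration.

```python
from functools import partial


def scaled(factor, x):
    return factor * x


factor = 3
# partial() stores the current value of factor, much as a block
# snapshots local variables when the block expression is evaluated.
work = partial(scaled, factor)

factor = 10          # later changes do not affect the captured value
result = work(5)     # nothing ran until this call
print(result)
```

The value held in work can be stored, passed around, or placed in a queue, and it still computes with the factor that was in effect when it was created.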

GCD implements two types of queues, Dispatch and Operation. The dispatch queues are a C-based mechanism for executing custom tasks. A dispatch queue executes tasks either serially or concurrently but always in a First-in-First-out order. A serial dispatch queue runs only one task at a time, waiting until that task is complete before dequeuing and starting a new one. By contrast, a concurrent dispatch queue starts as many tasks as it can without waiting for already started tasks to finish.
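The difference between the two kinds of dispatch queue can be imitated with a thread pool. This is a sketch of the behavior only, not the GCD API: a pool with one worker plays the serial queue, a pool with several workers plays the concurrent queue, and the run function and sleep delays are invented for the demonstration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run(delays, workers):
    """Submit tasks in FIFO order; return the order in which they finished."""
    finished = []
    lock = threading.Lock()

    def task(index, delay):
        time.sleep(delay)            # stand-in for real work
        with lock:
            finished.append(index)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for index, delay in enumerate(delays):
            pool.submit(task, index, delay)
    return finished                  # the pool has drained by this point


delays = [0.3, 0.2, 0.1]
serial_order = run(delays, workers=1)
concurrent_order = run(delays, workers=len(delays))
print(serial_order)
print(concurrent_order)
```

With one worker the tasks finish in the order they were queued. With three workers they all start in queue order, but the shortest one will typically finish first, so the completion order usually comes out reversed.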

Operation queues are more complex than the First-in-First-out dispatch queues and take other factors into account when determining the execution order. Primary among these factors is whether a given task depends on the completion of other tasks. You can configure dependencies when defining your tasks and use them to create complex execution-order graphs for your tasks in operation queues.
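Dependency-driven ordering can be sketched with Python’s standard graphlib module (Python 3.9 or later). This is not Apple’s operation queue API; the task names and the dependency graph are hypothetical, and the sketch only shows how declared dependencies, rather than submission order, determine a valid execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key depends on the tasks in its set.
dependencies = {
    "load": set(),
    "filter": {"load"},
    "stats": {"filter"},
    "plot": {"filter"},
    "report": {"stats", "plot"},
}


def execution_order(deps):
    """Return one ordering in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Tasks with no dependency between them, such as stats and plot here, may appear in either order and are the ones a scheduler would be free to run at the same time.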

If you are a Mac user, you can play with GCD using libdispatch. The libdispatch project consists of a user space implementation of the Grand Central Dispatch API. In order to implement the full API for Grand Central Dispatch, C compiler support for blocks is required. Contributions to the project will be continually evaluated for possible inclusion in future releases of Mac OS X. The sources are available under the terms of the Apache License, Version 2.0 in the hope that they might serve as a launching point for porting GCD to other platforms.

If GCC gets block support, then libdispatch will probably be available to the Linux crowd as well. Apple’s motivation for opening up GCD is probably to gain more support for what has become a key, but non-standard, technology in their kernel.

Here is my take on these developments. Setting aside the fact that C is not a functional language (although blocks do retain state), recasting a program using a “work queue” model offers a huge advantage over traditional methods. Executing the queue can be decoupled from the program. Thus, a program can, in theory, run on one core, or one thousand, without having to change any code. Of course, it will run slower on one core, but it will still run. Or, consider a typical compute node with eight cores. Suppose a dynamic program is using all eight cores with a large number of jobs still remaining in the queue. Some simple algorithms can be developed that allow remote cores (other nodes) or even heterogeneous cores (different types of nodes) to be used by the program.

Finally, I’m going to predict that this approach will gain in popularity if C blocks become available in GCC and other compilers. Unfortunately, it will not be due to anything I have written (or will write). Nor will it be due to the queue model, dynamic execution, multi-core, or the open availability of GCD. The biggest driver will be a small, but powerful, addition to the C language called C blocks.

The Commodity Push
Tuesday, September 8th, 2009

I have been reading about 10 GigE (10 Gigabit Ethernet) lately. It seems 10 GigE is starting to enter the cluster picture. There are some things to consider, but in general, “Good Old Ethernet” is taking its next big jump. I am convinced that over the next year there will be a significant up-tick in 10 GigE HPC clusters.

Why am I certain about this prediction (vague as it may be)? It is quite simple: if history is any guide, Ethernet is going to keep on rolling. Before I begin my nostalgia-laced discussion of Ethernet, let me be the first to say, I do not believe in all-or-nothing scenarios (at least in HPC). InfiniBand (IB) is not going anywhere. There is room for IB and 10 GigE, just like there are different types of cars. They both get you where you want to go, but depending on your needs and budget, the one that is right for you may not be the best choice for the next guy. Therefore, just because I’m talking about 10 GigE does not mean I am predicting the demise of IB. It is more like I am predicting the demise of GigE use in clusters. You know, the interconnect used in 56% of the recent Top500 list. Okay, you in the back there, you can sit down now.

My 10 GigE prediction is based on the following rule of thumb: Speed, Simplicity, Cost, pick any two. I believe 10 GigE will win because of simplicity and cost. IB is already faster and has better latency, and if you need this level of performance you are not even looking in the Ethernet direction. The joy of clustering is that one size does not fit all and you can build your cluster around your needs.

Looking at cost, you might conclude that 10 GigE is expensive right now, and you would be right. Let’s jump in the WABAC machine and look at what a Fast Ethernet switch cost in the late 1990’s. I can see a 2U Foundry Networks Fast Iron Workgroup switch with lots of lights and 16 ports for $4995. That is Fast Ethernet. Jumping ahead, I see similarly priced GigE switches in the not too distant past. In each case they were big, hot, and built to last. Now it is possible to buy a 16-port GigE switch for less than $200. Smaller versions, encased in plastic no less, are available at a lower price. The same can be said for network interface cards (NICs). Very often the NIC ends up on the motherboard as well. Thus, the commodity pressure guarantees low cost.

The good news about the commoditization of technology is cost and availability. The bad news is it can also produce a lot of junk. Some of the plastic GigE switches I mentioned just don’t work. They may work for Joe or Jane SixPack, but when pushed they fall down. For this reason, swimming in the commodity stream requires testing some assumptions and/or paying for higher quality parts — again your choice. I have seen benchmarks improve by 25% just by swapping one inexpensive GigE switch for another. The lesson here, the gems are out there, test before you buy.

Let’s move on to the simplicity factor. For the most part, Ethernet is a plug-and-play technology. It just works. When you are trying to get a cluster up and running, having dependable networking makes life much easier. All the services you know and love (NFS, schedulers, MPI) run over Ethernet. Another nice thing about Ethernet is the simple, inexpensive cables. Click the cable in and presto, the link light goes on (unless something is broken). Just like there is a down side to low cost, there are some things to consider with the whole “plug-and-play” approach. Just because you can ping between nodes does not mean things are optimized. Indeed, many people are not aware that Ethernet connections can be tuned with either kernel module arguments or with ethtool. On almost all NICs you can also set the MTU size (the size of the Ethernet data packet). This feature becomes more important as the bandwidth increases because the standard 1500 byte Ethernet MTU creates a lot of overhead with GigE and 10 GigE. Tuning these settings can help (or sometimes hurt) performance. The good news is you can always fall back to the default mode if you goof up your settings.
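A rough calculation shows why the MTU matters more as links get faster. The sketch below counts how many frames per second a host must handle to keep a link full. The per-frame wire overhead is standard Ethernet framing; the 9000 byte “jumbo” MTU is a common setting I am assuming for comparison, not a value from the discussion above, and the function name is made up.

```python
# Fixed per-frame cost on the wire: preamble and start delimiter (8 bytes),
# Ethernet header (14), frame check sequence (4), inter-frame gap (12).
WIRE_OVERHEAD_BYTES = 8 + 14 + 4 + 12


def frames_per_second(link_bits_per_sec, mtu_bytes):
    """Frames needed per second to keep the link full at a given MTU."""
    bits_per_frame = (mtu_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bits_per_sec / bits_per_frame


for name, rate in [("GigE", 1e9), ("10 GigE", 10e9)]:
    for mtu in (1500, 9000):    # 9000 is an assumed jumbo frame size
        fps = frames_per_second(rate, mtu)
        print(f"{name:8s} MTU {mtu}: {fps:,.0f} frames/s")
```

At the standard MTU a saturated 10 GigE link means over 800,000 frames every second, each carrying some per-packet processing cost. A larger MTU cuts that count by almost a factor of six, provided every NIC and switch on the path supports it.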

The other nice thing about Ethernet is that it has the ability to do User Space communications. Once the province of the high end interconnects, Ethernet can now send messages without kernel overhead (copying and TCP/IP processing). A few projects are worth looking at in this regard. The first is the Genoa Active Message MAchine, or GAMMA, which is famous for achieving less than 10 µsecond latencies over GigE. It does require a patch to the Ethernet driver and only supports certain Intel Ethernet chip-sets. Another optimized communication protocol is Intel® Direct Ethernet Transport (DET), which works by providing a uDAPL-like InfiniBand interface over GigE. uDAPL is the User Direct Access Programming Library that defines a single set of user APIs for all RDMA-capable transports. DET includes a kernel module and a uDAPL library for Ethernet and will work on almost any Ethernet NIC. It can be linked with any software requiring the uDAPL library. Finally, there is the Open-MX project. Open-MX is based on the Myrinet MX protocol and can run over any Ethernet connection. Essentially, any software that links to the Myricom MX library should be able to link with Open-MX. Depending on the chip-set, Open-MX latencies as low as 10 µseconds for GigE have been reported.

With each iteration of Ethernet there are always some changes. Perhaps the biggest difference between 10 GigE and older Ethernet standards is the abandonment of half-duplex operation in favor of full-duplex. Therefore, backward compatibility should be considered before mixing Ethernet standards. In terms of cabling there are now 10GBASE-T cards and switches that use Cat 6 cables and the familiar RJ-45 connectors for distances up to 55 meters (Cat 6a can be used for 100 meters; Cat 5e is not part of the specification, but should work similarly to Cat 6 cable).

How long before 10 GigE becomes a total commodity and hits the desktop? I’m not sure. As in the past, it starts out as a backbone network connecting switches, then shows up on server motherboards, and then in the desktop environment. Of course, there will be those who say the desktop will never need 10 GigE. Just like they said about Fast Ethernet, and GigE. To those network naysayers, how is that coax 10Base-T hub working out for you these days? You did use Cat 6 cable for everything, right?

In case you missed it, I had a big announcement on Twitter. Something about an HPC for Dummies free ebook that is not a real book, but good enough. Check it out.

One Node For One Process
Wednesday, September 2nd, 2009

Multi-core is on my mind again. I can’t help it. The other day, I was thinking about benchmarking and what I wrote about in Good Enough Will Have To Do. Then it hit me, a possible way out of the multi-core (and GP-GPU) quagmire. Before I reveal my somewhat obvious solution, I need to set the stage.

The typical MPI program is a collection of processes that communicate via messages. These processes can live on the same multi-core node, on another node, or a combination of both. Before multi-core there were one or two processes per node. The user often had some control over where the MPI processes would go: either dispersed (one per node) or compact (two per node). And, more importantly, the user usually knew what arrangement worked best with his or her codes. With multi-core this has changed a bit.

The user is now pretty much at the mercy of wherever the processes land. Processor (or core) affinity can be used to pin a process to a specific core, but it introduces another set of problems on shared nodes. In my opinion, such fine-grained control should not be the user’s responsibility.

Let’s move on to GP-GPU and clusters. Even if you sweep the MPI/multi-core issues under the rug and just run on any available cores, there is the issue of GP-GPU in clusters. How does one adapt an MPI code to a node with GP-GPUs? If you enable each MPI process with the ability to use a GP-GPU, then you need to make sure that the processes are balanced so that the GP-GPU resources (which can vary from cluster to cluster) are used effectively (i.e. if one node has two GP-GPU processes and another has six, then things are not balanced). Languages like CUDA and OpenCL do not address the cluster model.

Having thought about this issue I believe there is a solution to this mess. It is not the best of solutions but it is workable.

It came to me when I was thinking about running HPL on a cluster. (HPL, High Performance Linpack, is the program used to rank computers on the Top500 List.) When you run HPL on an eight-core node, you do not run eight MPI processes; you run one multi-threaded MPI process. Therefore, if I have a cluster with 128 nodes (each with eight cores), my MPI job has 128 processes (i.e. mpirun -np 128 …). I don’t run 1024 MPI processes because the threaded implementation of HPL provides better performance. Then I got hit with the obvious stick. Threads (OpenMP or OpenCL/CUDA) on nodes, messages between nodes: that can solve the problem.
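To make the shape of that idea concrete, here is a minimal sketch in Python. It is only a structural stand-in: a real code would use MPI between nodes and OpenMP (or OpenCL/CUDA) inside a node, and Python threads will not actually speed up this arithmetic. The function names (node_sum, hybrid_sum, job_size) are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def node_sum(chunk, threads):
    """Work done inside ONE node: split the chunk across local threads."""
    step = max(1, -(-len(chunk) // threads))  # ceiling division
    pieces = [chunk[i:i + step] for i in range(0, len(chunk), step)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(sum, pieces))

def hybrid_sum(data, nodes, threads_per_node):
    """One process per node; this loop stands in for the MPI ranks."""
    step = max(1, -(-len(data) // nodes))
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    # Under MPI each chunk would be handled by a different rank at the same time.
    partials = [node_sum(chunk, threads_per_node) for chunk in chunks]
    return sum(partials)  # stands in for an MPI reduction

def job_size(nodes, cores_per_node, hybrid=True):
    """Number of MPI processes to request with mpirun -np."""
    return nodes if hybrid else nodes * cores_per_node

print(job_size(128, 8))                # hybrid: one process per node
print(job_size(128, 8, hybrid=False))  # flat MPI: one process per core
print(hybrid_sum(list(range(1000)), nodes=4, threads_per_node=8))
```

The point of the split is that only node_sum needs to know what the node looks like (eight cores, a GP-GPU, or both); the outer message-passing layer stays the same.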

The idea is simple in concept. If there is one process, then it can manage the node resources, whether they are cores or GP-GPUs. From a programming standpoint, the MPI structure of the code may not need to change. What would need to change are the inner loops, but it may not be that simple. If you take a look at Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems by Ashay Rane and Dan Stanzione of the Fulton High Performance Computing Initiative, Arizona State University, you can see some of the issues that are involved.

There are some drawbacks to this model. You are going to have to change some code as there is no compiler option. My guess is the changes are going to range from trivial to rip your program apart and rewrite it. The goal would be structuring your program so that if the node had lots of cores, then you could use them, or if the node had GP-GPUs available you could use them as well. Programs would need to be recompiled or have some kind of run-time switch. A tall order for sure, but I don’t see any other way out of this in the near term. Cores will continue to increase, GP-GPUs will continue to show up on nodes, so perhaps this approach will provide a path forward.

Of course, you can always try simple things. For instance, the Portland Group Compilers support NVidia GP-GPU parallelization. I am sure other compiler vendors are investigating this approach as well. Almost every compiler also supports OpenMP (including the GNU compilers). And, it may not take much to try a few things with your codes, as adding a few pragmas is not that difficult. It is worth trying, because I assume if it does not work out, you will let me know.

I should also mention that I do not care for mixed models. It takes a lot of programming effort and it makes optimization much more difficult. If you think about the MPI communication between nodes, it only makes sense if all the parallel parts can be done faster than doing them sequentially on one node. That is, the parallel computation time plus the communication overhead has to be less than the time to do it on a single node. In a straightforward MPI program, the inner loops are done on a single core. If there are now eight cores working on that chunk of code, the chunk you sent to a single core just got up to eight times faster, and maybe even faster if you are using a GP-GPU. From an MPI perspective the communication overhead now carries a much heavier weight. (The communication/compute ratio is what determines MPI efficiency.) You can gather some data and do a “back of the envelope” calculation to see how this will play out with your code, but often it is easier to figure this out with some trial and error testing.
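A back of the envelope version might look like the sketch below. It rests on two assumptions of mine that real codes rarely satisfy: the compute part divides perfectly across every core, and the communication overhead per run is a fixed number. The timings are invented for illustration.

```python
def parallel_time(t_seq, nodes, cores_per_node, t_comm):
    """Estimated run time: perfectly divided compute plus fixed communication."""
    return t_seq / (nodes * cores_per_node) + t_comm

def comm_fraction(t_seq, nodes, cores_per_node, t_comm):
    """Share of the run time spent communicating instead of computing."""
    return t_comm / parallel_time(t_seq, nodes, cores_per_node, t_comm)

T_SEQ = 1000.0   # seconds on one core (made-up number)
T_COMM = 5.0     # seconds of message overhead per run (made-up number)
NODES = 16

# One core per node: compute dominates, communication is roughly 7% of the run.
print(parallel_time(T_SEQ, NODES, 1, T_COMM), comm_fraction(T_SEQ, NODES, 1, T_COMM))

# Eight cores per node: same messages, but now roughly 39% of the run.
print(parallel_time(T_SEQ, NODES, 8, T_COMM), comm_fraction(T_SEQ, NODES, 8, T_COMM))
```

Nothing about the messages changed between the two cases; the compute simply shrank underneath them, which is why the interconnect matters more as the nodes get fatter.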

In closing, I would be interested to know what you think about this idea. I am sure there are those who have actually tried such a thing and have some good insights. As I am fond of stating, one benchmark is worth 100 opinions, unless you write about HPC; then you can just throw ideas and opinions at the wall and see what sticks.

By the way, I’m still doing my low-frequency twittering, and I’m up to 106 followers. If you are one of the chosen few, stay tuned because I may reveal some utterly useless mundane part of my life.

Eating, Smashing, and Mixing
Thursday, August 27th, 2009

The Oracle/Sun takeover is in the news. It was reported that the Justice Department gave the acquisition the go-ahead. Of course the powers that be must check things over, but most people will tell you that when you are up to your chest in water, it might be time to sell the boat. I say this with some sorrow. I hate to see companies that contribute go by the wayside. I am not saying Oracle does not contribute, but things will be different without Sun.

I have had similar concerns in the past when companies get eaten or smashed. The one doing the eating gains some weight, then looks in the mirror. “Seems I’m a little heavy here, I don’t want that, redundant over here, I’m going to have to do something about this.” It usually takes a year or so, but at some point the shake-out (shake-off) happens. There is also the “sword over my head” syndrome, where no one from the acquired company wants to make any decisions. If you make the wrong one, you are out. So there is often a period of running in the sand as it were.

Sometimes companies vanish, other times they retain some identity. I was around when they smashed up AT&T. The current “AT&T” is not the old AT&T, though the current holders of the name would like you to think so. I had the privilege of visiting Bell Labs (AT&T’s research center) in Murray Hill, NJ during graduate school. One of the researchers at the lab was doing work similar to mine, so I asked him to be on my committee. I always had a sense of awe when I visited. This is where UNIX, the transistor, the laser, and countless other breakthroughs took place. They had a huge pedigree of Nobel Laureates as well. This place was a national treasure. Like all good things, the hammer came down, smashed it apart, then let the pieces dry up and blow off into the wind. Nice work.

Which brings me to the Oracle/Sun deal. Before I lament any further, I want to share an HPC story. Back in the day, there was a company called nCUBE, which was founded in 1983 by a group of Intel employees. The nCUBE machine was a true parallel computer based on a hypercube design. It had no switches and communicated through direct processor-to-processor links. The dimension of the cube was the number of links connecting each node. Sandia National Labs had a 1024 processor system at one point.
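For those who have not met a hypercube before, the wiring rule is easy to sketch. The snippet below uses the usual textbook labeling, where each node gets a binary number and is linked to every node whose number differs in exactly one bit. I am assuming that labeling for illustration; the details of how nCUBE numbered or routed its nodes are not covered here.

```python
def neighbors(node, dim):
    """Nodes directly linked to `node` in a hypercube of dimension `dim`."""
    return [node ^ (1 << k) for k in range(dim)]

def hops(a, b):
    """Minimum number of links between two nodes: the bits in which they differ."""
    return bin(a ^ b).count("1")

DIM = 10                      # a 1024 processor machine: 2**10 nodes
print(2 ** DIM)               # number of nodes
print(len(neighbors(0, DIM))) # links per node equals the dimension
print(neighbors(5, 3))        # small 3-cube example: node 5 links to 4, 7, and 1
print(hops(0, 2 ** DIM - 1))  # worst case distance equals the dimension
```

Doubling the machine adds only one link per node and one hop to the worst case path, which is a big part of why the design scaled without switches.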

At the time (early 1990s), I was working on various parallel software projects and got to know some people at nCUBE (parallel computing was a small world back then). There was also a fairly large nCUBE over at Rutgers University, to which I was given access. As an architecture, the nCUBE worked rather well. During the course of my work, I had many discussions with the local sales representative, who found his job a little challenging. It seems that in 1988, Larry Ellison purchased a controlling chunk of nCUBE. The plan was to use the nCUBE for a parallel database machine. The sales rep told me he was hired to sell this machine into the database market. As things go in the computer industry, it took much longer to get the Oracle database working than was originally planned. This delay put my sales rep friend in a tough position. Instead of selling OLTP to the suits, he was selling HPC to the geeks. He adapted, but it was tough for a database guy to grok the HPC space, where the intricacies of parallel computing were often at the front of almost every discussion. Recall, MPI did not hit the scene until after 1994.

Surprisingly, nCUBE managed to stay afloat and was sold in 2005, but they had long since moved out of HPC. With respect to the Oracle/Sun deal, I have a concern for the oil and water mixture of database and HPC. Both run on the same hardware, as they did back in the early nineties. Both are real markets. The cultures could not be more different, however. As I saw first hand with nCUBE, the suit/geek impedance mismatch is rather large.

Sun has a big HPC presence. They also contribute quite a bit to the community, Sun Grid Engine being a good example. More importantly, there are people at Sun who understand and are successful in the HPC market. What is Oracle going to do with this asset? And, I do mean asset. In reality, however, the size of the HPC market compared to the database market is rather small, so I assume from an accounting perspective anything can happen. Letting Sun HPC die on the vine will be difficult to watch, as would a wholesale elimination of the division. Selling it may make some sense, but to whom? Keeping it would require mixing two very different mind-sets, neither of which is wrong; they are just different. For now, like everyone else, I’ll have to wait and see.

Before I close, I cannot help but mention the value of Open Source in situations like these. Stuff happens. Businesses grow, die, get eaten, change course, or just do stupid things. If the software you built a career on is tied to the destiny of one company, then you run the risk of going down with the ship. Sun is the caretaker of some very popular open source applications, MySQL, Open Office, Grid Engine, Lustre, to name a few. Each of these comes with an insurance policy that says something to the effect “if we go away, you don’t have to.” Thank you Sun, you will be missed.

Why You Should Touch MAGMA
Wednesday, August 19th, 2009

In all my worrying about multi-core, I kind of forgot about the whole GP-GPU thing. I knew GP-GPUs were coming and I thought the idea was pretty interesting. I was not sure at the time, however, that GP-GPU processing was going to become mainstream. At present, I am convinced GP-GPU computing is going to become a staple of HPC. My rationale is simple: the stream processors (the video card processors) are going to be everywhere in some form, and therefore they will get used. Just as HPC people don’t like to see cores idle, they won’t like to see the stream processors on the video hardware idle either.

As an example, consider the low end desktop motherboard chip-sets. If you use a GeForce 9400 chip-set for Intel Core processors you get sixteen stream processors (540 MHz) in your box. If you choose the AMD 790GX you now have forty stream processors (700 MHz) at your disposal. Expect to see GP-GPU capable hardware anywhere there is a video port, which will be just about everywhere.

From a hardware standpoint, multi-core and GP-GPU hardware is a bargain: lots of cores for cheap. But, there is the software issue. How does one program such a conglomeration? There is some good news as the OpenCL language is on its way. OpenCL was initially developed by Apple and is now maintained by the Khronos Group as a standard programming API for GP-GPU and multi-core hardware. It is based on the C language (C99), but adds some extensions to support parallel operations. At this point if you want to play with OpenCL, you can download an x86 version, as part of the ATI Stream Software Development Kit (SDK), from AMD. Both NVidia and AMD/ATI have pledged support for the standard. An important aspect of OpenCL is the ability to choose different resources at run-time. That is, if an OpenCL application is designed correctly, it can probe for available hardware and adjust execution based on the current environment (i.e. run-time binaries can be portable across many different hardware platforms). OpenCL also supports the memory hierarchy often found on GP-GPUs. Because of its complexity, OpenCL is considered a low-level interface and not the best choice for novice programmers. Indeed, although many HPC applications are already written in Fortran or C, only the C programs are ready candidates for a port to a GP-GPU. Which is a nice segue into one thing I’m excited about.

Anyone who has been in the HPC game should know about the Automatically Tuned Linear Algebra Software (ATLAS) libraries. This software project was developed by Jack Dongarra (the Top500 guy) and his crew at the Innovative Computing Laboratory at the University of Tennessee. ATLAS was needed because crafting optimized linear algebra routines for different processors was tedious (although both AMD and Intel provide hand-tuned libraries for their processors). The nice thing about ATLAS is that after running it on a target platform you end up with an optimized library. It has become so automated that the optimization process can be part of the rpm installation, although you may want to head down to the corner for a cup of coffee and a newspaper. In summary, ATLAS is a nice piece of work that solves a difficult problem.
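The general idea behind this kind of automatic tuning can be shown with a toy: try several variants of a routine on the target machine, time each one, and keep the winner. The sketch below does that for the block size of a pure Python matrix multiply. To be clear, this is not how ATLAS itself is implemented (it searches a far richer space and generates C code); the function names and candidate sizes are mine.

```python
import time

def blocked_matmul(a, b, n, bs):
    """Multiply two n x n matrices (lists of lists) using bs x bs blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + bs, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += aik * row_b[j]
    return c

def autotune(n=48, candidates=(4, 8, 16, 48)):
    """Time each candidate block size on this machine and return the fastest."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for bs in candidates:
        start = time.perf_counter()
        blocked_matmul(a, b, n, bs)
        timings[bs] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings

best, timings = autotune()
print("fastest block size on this machine:", best)
```

The answer will differ from machine to machine and even from run to run, which is exactly why the tuning is done empirically on the target platform instead of being decided once by hand.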

Given the success of ATLAS, I was excited to hear that Dongarra and team are working on the multi-core/GP-GPU issue. They have recently released the first version of MAGMA, or Matrix Algebra on GPU and multi-core Architectures. As stated on the web page, the project’s goal is to develop innovative linear algebra algorithms and to incorporate them into a library that is similar to LAPACK in functionality, data storage, and interface but targeting the next-generation of highly parallel, and heterogeneous processors.

For those that don’t know, LAPACK (Linear Algebra Package) is a set of widely used (and well written) subroutines for HPC. The optimized ATLAS routines are used by LAPACK. The goal of MAGMA is to allow LAPACK users to use subroutines optimized for simultaneous multi-core and GP-GPU execution. That is, as a user you don’t need to know about the details of your underlying hardware only that the LAPACK software is running optimally on your hardware.

There is no doubt the MAGMA project is tackling a difficult problem. I expect the results to take some time, but I am sure there will be great strides made with this project. As an end user, you can expect to benefit from this work in the future. And, as a member of the community, you can help the project right now. If you have the right hardware, why not pull down a version and play with it? Your feedback will be important to the MAGMA team and help them build a better package. Don’t worry though, your hands won’t get burned touching this MAGMA unless you are holding your video card.

The Core-Diameter
Wednesday, August 12th, 2009

Last week I moderated a webinar entitled Optimizing Performance for HPC: Part 2 - Interconnect with InfiniBand. It was a great presentation with a lot of practical information and good questions. If you missed it, it will be available for a few months, so you still have a chance to check it out. As part of the webinar, Vallard Benincosa of IBM mentioned that the speed of light was becoming an issue in network design. In engineering terms, that is referred to as a hard limit.

I started to think about this limit and how it would affect the size of clusters. I did some back of the envelope math to get an estimate of how c (the speed of light) will limit cluster size. I want to preface this discussion with a disclaimer that I thought about this for all of 20 minutes. I welcome variations or refinements on my ciphering.

Let’s first consider the speed of light (SOL) in a fiber cable. The number provided by the Qlogic crew for the webinar was 5 ns (nanoseconds) to travel one meter in a fiber cable. (It takes light 3.3 ns to travel one meter in a vacuum.) How can we translate that into a cluster diameter? Latency is measured in seconds and the SOL is measured in meters per second. Here is one way. First we have to define some terms:

LT is the total end to end latency
Lnode is the latency of the node (getting the data on/off the wire)
Lswitch is the latency of one switch chip (one hop)
Nhop is the number of switch chips (hops) a message crosses
Lcable is the latency of the cable, which is a function of length

A formula may be written for the total latency as follows:

(1) LT  =  (Lnode + Lswitch*Nhop + Lcable)

If we take equation 1 and solve for Lcable (in nanoseconds), then divide the right hand side by 5 ns per meter, we get what I call the core-diameter:

(2) dcore  =  (LT - (Lnode + Lswitch*Nhop)) / 5

The core-diameter is the maximum diameter of a cluster in meters. Let’s use some simple numbers. Suppose I need 2 μs (microseconds) latency for my application to run well (this is LT), my nodes contribute 1 μs, and I use a total of 6 switch chips with a latency of 140 ns (nanoseconds) each. That leaves 2000 - (1000 + 6*140) = 160 ns for the cable, so I get a core diameter of 32 meters. This diameter translates to a sphere of about 17 thousand cubic meters. If we take an average 1U server and assume its volume is 0.011 cubic meters, then we could fit about 1.6 million servers in our core diameter. In practical terms, the real amount is probably half, allowing for human access, cooling, racks, etc. So we are at about 780 thousand servers. If we assume 8 cores per server, then we come to a grand total of 6.2 million cores. If we run the numbers with LT of 3 μs the number explodes to almost 600 million servers and we can see why cable distance has not been an issue.
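For anyone who wants to plug in their own numbers, the arithmetic above fits in a few lines of Python. The function and parameter names are mine, all latencies are in nanoseconds, and the 0.011 cubic meter server volume and the "use only half the space" rule are the same rough guesses used in the text.

```python
import math

NS_PER_METER = 5.0  # fiber propagation delay quoted in the webinar

def core_diameter(lt_ns, lnode_ns, lswitch_ns, nhop, ns_per_meter=NS_PER_METER):
    """Equation 2: cable latency budget converted to meters."""
    return (lt_ns - (lnode_ns + lswitch_ns * nhop)) / ns_per_meter

def servers_in_sphere(diameter_m, server_volume_m3=0.011, usable_fraction=0.5):
    """Servers that fit in a sphere of the given diameter, keeping some space free."""
    volume = (4.0 / 3.0) * math.pi * (diameter_m / 2.0) ** 3
    return volume / server_volume_m3 * usable_fraction

d = core_diameter(lt_ns=2000, lnode_ns=1000, lswitch_ns=140, nhop=6)
servers = servers_in_sphere(d)
print(d)                 # 32.0 meters
print(round(servers))    # about 780 thousand servers
print(round(servers * 8))  # about 6.2 million cores at 8 cores per server

# Relaxing the latency target to 3 microseconds, before halving for access space:
print(round(servers_in_sphere(core_diameter(3000, 1000, 140, 6), usable_fraction=1.0)))
```

Because the server count grows with the cube of the diameter, even a small change in the latency budget moves the result by orders of magnitude.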

A few things about my analysis. Obviously my numbers could be refined a bit, but as a first pass they seem to work. Scaling an application to such large numbers may be a bigger challenge than the SOL, but the SOL does put some limits on just how big a cluster can become. Actually, it is a bit more limited than my simple analysis. There is a push-pull effect. Better scalability comes from lower latency, which decreases the diameter. Thus, in order to increase (push) the number of cores I can use, I need to reduce the latency, which due to the SOL reduces the diameter (pull) and thus the actual number of cores I can use. Perhaps some enterprising student could come up with a model that captures this effect. I should also mention that refining the assumptions can change the actual numbers, but the push-pull effect due to the SOL is the same.

I have run out of room on the back of my envelope, as I don’t think this analysis can be pushed much further without some refinements. I’ll leave it as an exercise for the reader to continue the analysis. I will coin the term core-diameter, however, as it sounds cool.

Moving on, I wanted to mention another type of progression in which parallel computers play a role. There are those that believe we are in a period of accelerating technological change. If you look at the Top500 as one example, in 1993 the top machine recorded 60 GFLOPS; this past June we hit 1.1 PFLOPS. That is more than four orders of magnitude in 16 years. There are those that are interested in discussing the effects and consequences of this type of progress. The main idea is that we are rushing toward a singularity of sorts that will result in a super-intelligence. Advances in Artificial Intelligence, Nano Technology, and Biology may be pushing us closer to a potential singularity. And, behind all of these technologies lies an HPC cluster. That would be where we, the cluster geeks, fit into the picture. Each year there is a Singularity Summit where these issues are discussed. This year it is in New York City. I think I’ll head over and see what the visionaries have to say, while there is still time.

Over There Vs. Over Here
Wednesday, August 5th, 2009

It is funny how solutions develop. Very often, a “perceived problem” is identified and discussed, technical solutions are put together, and the problem is solved. Many times, however, the problem may turn out to be not as big a problem as you may have thought. Of course, along the way, some really nice technology may result, but it may not be the “killer solution” (i.e. big success, everyone uses it, etc.) you were expecting. Then there are other problems, often unforeseen, that find a solution from some other unforeseen area. Clusters “sort of” fall into this category. In fact, there were many who thought clusters were just some kind of DIY fad.

I was reminded of this scenario by a recent post to the Beowulf Mailing list by Don Becker:

BProc is based around directed process migration — a more efficient technique than the common transparent process migration. You can do many cool things with process migration, but with experience we found that the costly parts weren’t really the valuable ones. What you really want is the guarantee that running a program *over there* returns the expected results — the same results as running it *here*. That means more than knowing the command line. You want the same executable, linked with exactly the same library versions in the same order, with the same environment and parameters.

In years past, process migration on clusters was of great interest because it brought a unified process space across the entire cluster (still a great feature as implemented in Scyld’s BProc), but, as Don says, it is not what seems to be really important right now. I recommend you read the full post and jump over to our recent Interview with Don Becker. Don is a busy cluster kind of guy and when he speaks up he usually has something interesting and important to share.

The “over there” vs “here” problem is one of scale. With a small 32 node cluster it would seem like a non-issue. Bump that up by two or three orders of magnitude and you might begin to see how this could be a problem. The less experienced might suggest loading the same thing on all the nodes. Sure, good plan at first, but if something were changed on some nodes, for any number of reasons, you could have a problem. Another question worth asking is how to verify that the execution environment is what you want.
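One low-tech way to approach the verification question is to fingerprint the things Don lists (the executable, the libraries in link order, the environment, and the parameters) and compare the fingerprint computed “here” with the one computed “over there.” The sketch below is my own illustration and has nothing to do with how BProc works; the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
import json
import os

def file_digest(path):
    """SHA-256 of a file's contents, read in blocks so large binaries are fine."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def environment_fingerprint(executable, libraries, env_names, args):
    """One hash covering executable, libraries (in order), environment, and parameters."""
    record = {
        "executable": file_digest(executable),
        "libraries": [file_digest(p) for p in libraries],  # list order is preserved
        "environment": {name: os.environ.get(name, "") for name in sorted(env_names)},
        "args": list(args),
    }
    blob = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Run environment_fingerprint with the same arguments on the head node and on each compute node; any node that returns a different string has a different executable, library, environment variable, or parameter. On a few thousand nodes, comparing one short string per node is a lot cheaper than comparing the files themselves.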

To be clear, Don states that you don’t need directed process migration to ensure consistency, but BProc can be used to achieve that goal and provide other nice things. Which leads me to another thought.

One of the questions I am often asked is “What is the virtualization play for HPC?” I usually reply that there are issues that need to be resolved before virtualization and HPC walk hand in hand, but process migration in the form of checkpointing would be a great thing to have. Solving the “over there” vs “here” problem with virtualization, however, may just be the killer HPC/Virtualization application.

Imagine creating a tested working image of your application, operating system, and file system and running it on a virtual HPC machine. The “over there” vs “here” problem goes away because “over there” is “what is here.” Of course, we are talking about scale, and pushing a large number of images out to thousands of nodes is an issue. And, notice I threw in the file system part. I believe before HPC can be virtualized (or “clouded”) the I/O issue (both compute and file system) needs to be resolved. I suspect this will be through some form of I/O specification that travels with the job image. The specification will allow the cloud to run the application on the right hardware. The current cloud definition is rather loose when it comes to I/O (i.e. it will be there, just can’t say exactly how fast or consistent it will be).

Speaking of cloud issues, I read two interesting articles recently. The first seems to be a possible solution to what I consider a thorny issue: cloud security. That is, as soon as my data leaves my walls, I do not have 100% control over it. And, anything less than 100% means I cannot guarantee security. Of course you can encrypt it, but then to operate on it in the cloud, you need to decrypt it in the cloud, which means it is still naked data. That is, until now. Recently, an IBM researcher solved the problem of fully homomorphic encryption, which to you and me means the ability to use encrypted information without decrypting it (i.e. data always remains encrypted, which means the result is always encrypted). Problem solved. Nice work. When do we see the demo?
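If computing on data you cannot read sounds like magic, a toy example may help. The sketch below is NOT fully homomorphic encryption and is not the IBM scheme. It uses unpadded textbook RSA with a tiny, completely insecure key, which happens to allow exactly one operation (multiplication) on encrypted values. Fully homomorphic schemes extend this kind of trick to arbitrary computation.

```python
# Toy textbook RSA key (a classic small example; far too small to be secure).
P, Q = 61, 53
N = P * Q    # 3233, the public modulus
E = 17       # public exponent
D = 2753     # private exponent: (E * D) % ((P - 1) * (Q - 1)) == 1

def encrypt(m):
    """Done by the data owner before anything leaves the building."""
    return pow(m, E, N)

def decrypt(c):
    """Only the data owner holds D, so only the owner can do this."""
    return pow(c, D, N)

def multiply_encrypted(c1, c2):
    """Done 'in the cloud': works on ciphertexts only, never sees 7 or 6."""
    return (c1 * c2) % N

c = multiply_encrypted(encrypt(7), encrypt(6))
print(decrypt(c))  # 42, as long as the true product is smaller than N
```

The cloud side only ever handles the ciphertexts and the public modulus; the product comes back encrypted and is opened at home.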

The other issue I read about was the lack of entropy in the cloud. (Entropy is a measure of randomness.) Basically, a virtualized instance does not have access to some of the physical means to build up its “entropy pool” and thus could become more predictable. Since randomness is the key to security, this might make virtualized servers more vulnerable. Of course there are some ways to fix this. However, I thought about HPC applications first and how this could have an effect on Monte Carlo results.

To sum things up, it seems the HPC problem space is evolving. I noticed that I am talking about virtualization and cloud much more than in the past, and yet there is no big killer HPC service/application out there. One other thing I have noticed is that the more open the discussion, the more solutions seem to flow. I suppose that allows solutions to get from over there to over here, and vice versa.


Commodity Software
Tuesday, July 28th, 2009

Rummaging through the HPC attic, I found some more material for my HPC Master Series. These are editorials written by the HPC leaders and pioneers for ClusterWorld Magazine. In case you missed it, I have already posted Beowulf in Chrysalis by Tom Sterling and The Grid by Ian Foster. Good stuff and worth reading at any time.

This week I want to add an excellent piece by Bill Gropp on Commodity Software. For those that don’t know, Bill and his crew at Argonne National Lab, have brought us the likes of MPICH, MPICH2, and PVFS2. In 2007, Bill left Argonne to join the University of Illinois at Urbana-Champaign as the Paul and Cynthia Saylor Professor in the Department of Computer Science.

Bill’s original editorial has the title Commodity Software? Is cluster software any good? and asks some very good questions. Reading it 5 years hence, there were several timely things that struck me about his post. His main point was that progress in software depends on both standards and the ability to innovate. I thought his example of NFS was right on the mark. (Sorry, you have to read the piece to get the full explanation.)

At the time it was written, the commodity hardware wave was just breaking on the HPC shore. Jump ahead and we all see that many of the proprietary sand castles were washed away by that wave. There may be a similar argument for software as well. It is well known that writing good parallel HPC software is hard. Indeed, if you have read anything I have written, you will recall that I toss at night thinking about these issues and multi-core just makes it all the more difficult. I recall hearing about ISV’s who chose not to “parallelize their codes” because it is too hard or expensive or both. (The too hard part means they can’t find people and the too expensive part means if they find the people it is going to take some time.) And, of course they want to make it as flexible as possible so that when the next hardware architecture arrives, they are not re-writing code again. This last goal is extremely difficult to achieve, by the way.

My favorite part of Bill’s discussion is the following: “Part of the solution is to emphasize commodity software. That is, software that is written to an agreed upon standard. Applications that use commodity software can pick and choose their software platform in much the same way that commodity hardware makes it possible to pick and choose the hardware platform. But there is danger here too. If we insist on the current set of standards, we stifle innovation and prevent the development of better standards.”

Commodity software is not necessarily a new idea, but it is not “just open software.” Commodity software is designed to run on a variety of commodity hardware platforms. As Bill points out, this is tricky because you need to balance innovation and standards. One of Bill’s progeny, MPICH, is a good example of this idea. This software was designed to bring the MPI standard (as defined by the MPI Forum) to as many hardware platforms as possible. Before clusters were everywhere, there were quite a few parallel machines supported by MPICH. As a matter of fact, when it was developed, the “P4 transport layer” used by clusters was almost an afterthought to placate those same people stringing together workstations with things like PVM.

We can learn about commodity software by looking at MPICH and its successor MPICH2. Of course the “open” factor played a big role, as the community helped add robustness to each package through suggested code, bug reports, and use cases. And like any good commodity implementation, there were other packages that implemented the same standard and yet offered a different implementation and feature set. I have test scripts that can run LAM/MPI, MPICH, MPICH2, and Open MPI by simply applying a few sed scripts to a Makefile. (It should be noted that both MPICH and LAM/MPI are in maintenance mode and you should be using MPICH2, Open MPI, or another newer MPI implementation for your applications.)

Commercial vendors may not like the idea of commodity software because it is hard to lock in customers. I would counter that the trend in HPC is toward commodity software and away from lock-in. The argument is not one of philosophy but of practicality. I believe that without community support it will become cost prohibitive to offer some packages in the HPC space. Currently, the cost of some software applications outweighs the cost of hardware by a large margin. Users may begin to weigh the cost of commercial software against the cost of contributing to a commodity/community software project that offers similar functionality. And, because they are helping develop the application, there is the potential to meet their needs better than a commercial product. Those vendors that choose not to port their applications to the HPC space may find the market has moved past them in support of community software. There is still a commercial angle for commodity/community software, however. Every single production cluster team I have worked with knows the value of application support and will gladly pay for it. At the end of the day, everyone needs results and delivering results is not a commodity process.

Fireflies and Ants
Tuesday, July 21st, 2009

Insects fascinate me. Not because they have six legs and pretty much dominate the earth, but because they have figured out ways to handle large numbers of themselves. In the past, I have marveled at how large numbers of ants and bees seem to operate as one organism. As core counts and clusters get larger, I always find it instructive to see how the problem of scale is solved by nature. Not that I sit around thinking about insects all day, but recently I watched a show about networks called “Connected: The Power of Six Degrees” on the Science Channel. Fascinating stuff, Kevin Bacon and all. The part that caught my interest, however, was the synchronized fireflies.

It turns out that some species of firefly will synchronize their flashes so that the entire population is flashing at the same time. Now that is some trick. After a small bit of googling, I found The Synchronous Fireflies of Elkmont. Before you think I have gone totally off the deep end, rest assured I’ll get back to clusters and HPC shortly.

The article states that “any system will tend toward synchronous behavior if the individuals involved follow some sufficient set of rules.” In the case of fireflies, those rules include:


  1. Individual fireflies possess an internal timer or oscillator.
  2. They can sense when their immediate neighbors are flashing.
  3. They tend to advance their cycle in order to flash before their neighbors flash.

Mathematician Steven Strogatz of Cornell University has shown that these conditions are sufficient to lead a whole forest of fireflies to flash in synchrony. No reason was given for why the rules came about; however, similar sets of rules are believed to exist throughout nature. Some other examples of natural synchronization include women’s menstrual cycles, the movement of electrical impulses between nerve cells, the onset of epileptic seizures, and the synchronous movements of flocks of birds and schools of fish.
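To make the three rules concrete, here is a toy simulation in Python. It is only a sketch: the ring layout, the tick size `dt`, the `nudge` strength, and every function name are assumptions made for illustration, and it is not the model from the article or from the mathematical result mentioned above. Each firefly keeps a timer between 0 and 1, flashes when the timer runs out, and pushes its own timer forward whenever a neighbor flashes.

```python
import cmath
import random


def ring_neighbors(n, k=1):
    """Each firefly sees only the k fireflies on either side of it."""
    return [[(i + s * d) % n for d in range(1, k + 1) for s in (-1, 1)]
            for i in range(n)]


def step(phases, neighbors, dt=0.01, nudge=0.1):
    """Advance all timers one tick and apply the three local rules."""
    # Rule 1: every firefly runs its own internal timer.
    new = [p + dt for p in phases]
    # A timer that reaches 1.0 means "flash now", then start over.
    flashers = [i for i, p in enumerate(new) if p >= 1.0]
    for i in flashers:
        new[i] = 0.0
    # Rules 2 and 3: a firefly that sees a neighbor flash advances its timer.
    for i in flashers:
        for j in neighbors[i]:
            if new[j] > 0.0:  # skip neighbors that flashed on this same tick
                new[j] = min(1.0, new[j] * (1.0 + nudge))
    return new, flashers


def synchrony(phases):
    """Order parameter: 1.0 when all timers agree, near 0 when scattered."""
    total = sum(cmath.exp(2j * cmath.pi * p) for p in phases)
    return abs(total) / len(phases)


def simulate(n=50, ticks=20000, seed=1):
    """Start from random timers and report synchrony before and after."""
    rng = random.Random(seed)
    phases = [rng.random() for _ in range(n)]
    neighbors = ring_neighbors(n, k=2)
    before = synchrony(phases)
    for _tick in range(ticks):
        phases, _flashed = step(phases, neighbors)
    return before, synchrony(phases)


if __name__ == "__main__":
    before, after = simulate()
    print(f"synchrony before: {before:.3f}  after: {after:.3f}")
```

Note that there is no master in this code. Nothing broadcasts to all fireflies, and each one reacts only to the flashes of its immediate neighbors. Whether and how quickly the population locks together depends on the assumed constants, so treat the printed numbers as something to experiment with rather than a result.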

Before we get back to clusters, let’s switch to ants for a moment. One of the problems with managing large amounts of anything is communication. For instance, an ant colony has a queen, but she does not direct each individual ant. Each ant has a job (dig, forage, defend, etc.) and follows a set of rules for that job. Ants communicate using chemicals. For instance, they can mark a trail to food or send alarm signals to each other. Communication is local. There is no way to broadcast to every ant.

Clusters with many cores may have a similar problem: central control takes too much time and overhead. Let’s consider a master/worker model used by some programs. As the number of workers gets large, the master can easily become overwhelmed trying to synchronize 10,000 cores. What if a more dynamic approach was used instead of central synchronization? What if, like the fireflies, cores could synchronize globally using simple local rules?

Taken further, what if programs were written in a way to self organize to solve a particular problem? That is, a program would consist of thousands of individual programs that would organize in such a way as to solve a particular problem. You might have IO programs, matrix multiplication programs, FFT programs, etc., which, through a set of rules, would self organize and provide you with a result. Or, a program’s behavior might change depending on what its close neighbors are producing or consuming.

At this point, you probably think I have spent too much time with the insects. Perhaps. It should be noted, however, that faster clocks have given way to multiple cores. In a similar way, insects, with their exoskeletons, can only get so big, so instead of becoming large, they became many. Like today’s processors, instead of scaling up, they scaled out. Now you know why I think about insects.

My twitter following has reached 108, but I have a new goal. I want to achieve Best Buy Manager Status. That is, I need 142 more followers. Help me out here. If I can’t qualify for selling TV sets, car stereos, and washing machines, then what am I doing writing about HPC? Oh the shame of it all.

It's About Time
Wednesday, July 15th, 2009

I was halfway through this week’s column when I noticed Rumor SGI Terminates PFLOPS Deal and then I read The Business Side of HPC on Joe Landman’s blog. I need to chime in on this story.

In case you did not read about it, I’ll summarize the reported rumor. If you recall, old SGI is gone. Rackable Systems bought some of the assets, but as a legal entity SGI is history. While it was alive, however, it signed a deal with the National Science Foundation and the Pittsburgh Supercomputing Center to deliver a 1 PFLOPS (1×10^15 Floating Point Operations per Second) system for $30 million. The new SGI/Rackable is under no obligation to honor the deal and it is rumored that they have decided to walk away.

Most people may find this incredibly hard to believe, but there is one big reason you walk away from a deal: you will lose money. And, if you do some math, you will find that price works out to $30 per GFLOPS ($30 million divided by the 1,000,000 GFLOPS in 1 PFLOPS). As someone who specializes in cheapskate supercomputing, I can tell you that is a too-good-to-be-true deal, kind of like finding a gas station selling gas at $1/gallon. Of course, you have to wonder how long that gas station or computer company would stay in business. Oh wait, old SGI is out of business.

One also must ask, why sell so cheap? The answer takes a little explaining, but basically, there is a lot of “buying business” at the high end of the HPC market. I like to call it “buying a press release”, but the idea is the same. In addition, there is what I call a “give us a gift” mentality at many educational and government institutions.

In the past, I have personally seen RFPs for systems that were asking for “below market pricing.” It was an obvious ploy to see if they could get a company or two to give them a cluster in return for all the HPC good will they are capable of producing. The vendor gets prestige, placement on the Top500, and of course copious press releases. In a way, I can’t blame the institutions for trying this. After all, if all the vendors say “No thanks” then they have to rework their RFP and live with the fact that they were not as important as they thought. In addition, in years past when margins on big iron supercomputers were large, many non-commercial institutions were used to seeing huge discounts from the vendors who, by the way, still made money.

For those institutions that still want to play “squeeze the vendor for a gift”, I have this to say. Be careful. You are playing a rather poor end game. Going back to our gas station analogy, if everyone goes for the $1/gallon station, then very soon the other gas stations go out of business. Then the “gas gift” comes to an end, because no one can lose money forever, leaving you with no gas. I offered a similar invitation last year when I went on a rant about cluster vendors. There are vendors who really help move the market and community forward and then there are those who want to sell you cheap hardware. The vendors you choose will have a big impact on the future of this market. And, for those who want to spout off “let the market decide” tripe, remember HPC is a specialized niche. If you squash all the good companies who provide good products and services, which usually cost more, in favor of cheap hardware, you may find yourself wanting in a few years.

On the other side of the fence are the vendors who believe that “buying business” is a good idea. I suppose that it is a kind of loss leader mentality. If we get this system in, then we can sell them more at higher prices. Or perhaps, it is that placement on the Top500 list that interests you. Guess what, I have a little secret: inside the cases, it is pretty much all the same stuff. If you want to differentiate, add value, not discounts. Oh and those Top500 press releases, everybody has them now. No big deal.

In closing, I would like to congratulate SGI/Rackable CEO Mark Barrenechea for “just saying No”, if the rumor is true. I hope others follow his lead. As for PSC and the NSF, look, I will still love you if you have less than 1 PFLOPS. After all, there is still work to be done.

Concurrent and Parallel Are Not The Same
Tuesday, July 7th, 2009

In case you did not have a chance to read the column from last week, I am taking my yearly vacation at the Jersey Shore. Please refrain from the jokes, lest I pull out the Bruce Springsteen trump card. I try to spend two continuous weeks with family and friends each year. I have found that one week is just too short. I need two weeks. The first week is used to try and forget about all the stuff I did not get done before I threw the laptop in the car and said “let’s go.” The second week is used to try and remember and organize all the stuff I have to do when I get back. My plan usually breaks down somewhere around 10 AM my first day back to work.

This year I had a bunch of writing to do (including this column), so it was kind of a working vacation. Not to worry, I’ll have my feet in the Atlantic Ocean in a few hours. In any case, my dilemma is as follows. Write an insightful column quickly and get to the beach. It may surprise some readers, but I do like to research some of the topics I write about. At a minimum, I like to include enough URLs so that if you actually want to investigate a topic further, more information is just a click away. As an aside, I am constantly amazed at how much content on the web has absolutely no external links to supporting material. I thought that was the whole idea. I mean, how hard is it to add a Wikipedia link to a discussion of Clos Networks or some other networking technology?

Back to my dilemma. What can I talk about that will get me to the beach before the water ice guy packs up for the day? Although I don’t like to rehash things I have written about in the past, I will be making an exception this week. Not necessarily because it is easy, but because I think some messages need reinforcing. Therefore, all I have to decide is what message I should hammer home on this July morning.

The answer is simple: understanding the difference between concurrent and parallel. I believe these two terms are often used interchangeably while, in my opinion, they represent two different concepts.

Let’s start with concurrency. A concurrent program or algorithm is one where operations can occur at the same time. Take, for instance, a simple integration, where numbers are summed over an interval. The interval can be broken into many concurrent sums of smaller sub-intervals. As I like to say, concurrency is a property of the program. Parallel execution is when the concurrent parts are executed at the same time on separate processors. The distinction is subtle, but important. And, parallel execution is a property of the machine, not the program.
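The integration example can be sketched in a few lines of Python. This is an illustration only: the midpoint rule, the function names, and the choice of four pieces are all assumptions. The point is that the program exposes independent partial sums, and the caller decides whether they run one after another or get handed to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(task):
    """Midpoint-rule sum of f over one sub-interval [a, b] using n slices."""
    f, a, b, n = task
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def integrate(f, a, b, pieces=4, n_per_piece=1000, mapper=map):
    """Split [a, b] into independent pieces; `mapper` decides how they run."""
    width = (b - a) / pieces
    tasks = [(f, a + k * width, a + (k + 1) * width, n_per_piece)
             for k in range(pieces)]
    # The pieces do not depend on each other: that is the concurrency.
    return sum(mapper(partial_sum, tasks))


def square(x):
    return x * x


if __name__ == "__main__":
    # Same concurrent program, executed serially...
    serial = integrate(square, 0.0, 3.0)
    # ...and handed to a pool of workers.
    with ThreadPoolExecutor(max_workers=4) as pool:
        pooled = integrate(square, 0.0, 3.0, mapper=pool.map)
    print(serial, pooled)  # both approximate the exact value, 9
```

The code that describes the concurrency never changes; only the `mapper` does. In standard CPython the thread pool will not make this pure-Python arithmetic any faster because of the global interpreter lock, which is itself a small reminder that parallel execution depends on the machine and runtime, not on the program.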

If execution efficiency is important (i.e., you want things to go faster by adding more cores), then the question you need to ask is “If I run everything that is concurrent in parallel, will my code run faster?” If the answer were always “yes” then we would not be having this discussion. And, since the answer is “no”, the question becomes “What should run in parallel?” The obvious answer is the portions of code that lower execution time when run in parallel.

This decision is one of the reasons cluster parallel computing is hard. It really does depend on the machine. Take our integration case. If the integration interval is small, then breaking it up into small sub-intervals and sending them out to other nodes will result in extending the execution time of the program due to parallel overhead. If the integration interval is huge, then parallel execution may make sense. Because parallel overhead can vary from cluster to cluster, there is no easy way to predict overhead beforehand. (For example, the parallel overhead is larger for GigE than for InfiniBand when sending small packets.)
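A back-of-the-envelope model shows why the size of the job matters. The formula and every number below are made up for illustration (they are not measurements of GigE, InfiniBand, or any real cluster): the work divides evenly across nodes, and each additional node costs a fixed amount of communication overhead.

```python
def run_time(work_s, nodes, overhead_s_per_node):
    """Toy model: work splits evenly, each extra node adds fixed overhead."""
    return work_s / nodes + overhead_s_per_node * (nodes - 1)


if __name__ == "__main__":
    overhead = 0.005  # assumed seconds of overhead per extra node
    for work in (0.01, 100.0):  # a tiny job and a big job, in seconds
        one = run_time(work, 1, overhead)
        eight = run_time(work, 8, overhead)
        print(f"work={work}s  1 node: {one:.4f}s  8 nodes: {eight:.4f}s")
```

With these assumed values the tiny job gets slower on eight nodes while the big job gets much faster. Change the overhead figure, as a different interconnect would, and the break-even point moves, which is why the decision cannot be made without knowing the hardware.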

The same applies to multi-core. The overhead for thread communication is lower, but there is still overhead (see my HPC Hopscotch for background on SMP memory). There is no free lunch — everyone has to deal with overhead.

In summary, the point I want to make is this: concurrency is a property of the program and parallel execution is a property of the machine. Which concurrent parts should and should not be executed in parallel can only be answered when the exact hardware is known. This, I might add, leads to the most unhappy conclusion when dealing with explicit parallel programming: there is no guarantee of both efficiency and portability with explicit parallel programs. Yes, I know, a sad state of affairs. I’ll let you wrestle with that for a while. In the meantime, I’m going to the beach.

HPC From the Beach
Wednesday, July 1st, 2009

Each year at this time my family and I head to the Jersey shore. I have to admit, I have been here for four days and have not yet been to the beach. That is not entirely true. Last night I watched the sun set over the water with requisite dog and cigar.

Wait a minute. Those who are not geographically challenged will recall that New Jersey is on the east coast and the sun sets on the west coast. True enough, but in certain parts of southern New Jersey you can see the sun set over the Delaware Bay. Looks like an ocean sunset and works for the tourists.

If you are wondering where I am, I can tell you in a single sentence. Go to the State of New Jersey and head south as far as you can go without going into the water. A nice short and concise algorithm. And, due to the geography that simple sentence will put you within 300 yards of where I am sitting right now.

While we are talking about algorithms, notice that I told you what to do, not how to do it. Such is the nice thing about declarative programming. You the traveler can choose to take a car, bicycle, personal helicopter, or whatever to get here. And your mode of transportation determines how you travel. I doubt you would take a bicycle on major highways, nor would you limit your flight path to roads were you in a helicopter.

Interestingly, this algorithm is executed in parallel all throughout the summer. That is, a large group of automobiles each head to the same destination executing their version of the “head south” algorithm. There is also communication between the travelers and, excluding the friendly hand gestures we all experience in traffic, subtle local communication between drivers keeps collisions few. There is also quite a bit of congestion, but this can be avoided by adding some intelligence to your algorithm. First, you can utilize traffic data now broadcast on radio, cell phone, and GPS unit. Second, by timing your travel you can avoid major congestion.

Obviously, the bulk of drivers use a basic algorithm even though there is quite a lot of congestion. Others use more intelligent algorithms and thus minimize their travel time. The big difference is the static vs dynamic algorithm. Those that travel the same way at the same time each year would seem to have a static algorithm. The drivers that adapt to conditions would seem to have a more dynamic algorithm.

Most people, I suppose, probably do not think about traffic jams in this way. I find it instructive in a sense. The overall declarative algorithm “head south to the beach” is executed by a large population of “processors” (drivers). The static algorithms, while successful, do not take into account changing conditions. The dynamic algorithms try to adapt to the available information.

When I think about programming clusters, multi-core, and GP-GPUs, I long for a declarative solution (e.g., multiply these two matrices). The declarative system would “understand” how to use the hardware at hand to give me a solution in the shortest time. Like my traffic analogy, I think that a dynamic approach would lead to the best results. For instance, I would have one binary that would adapt to whatever hardware it was run on (assuming it had knowledge of the hardware). Thus, how it gets to the answer depends on external conditions that cannot be predicted at compile time. Or, like my traffic analogy, the flat tire that causes a backup, that causes me to take the back way through the pine barrens and stop at that ice cream place. Ah the joys of optimization.

In closing, as I sit finishing up this column, my teenage daughter and her BFFL listened to me read it out loud. Their feedback was quite helpful. My daughter, Taylor, is questioning why anyone would pay me to write such a thing, while her friend, Carla, cannot stop ROFL. At least the dog is taking me seriously.

My twitter count is up to 97! However, I noticed some of my “followers” don’t seem so interested in HPC, as some of their posts are, how should I say, “biological in nature.” In any case, join my quest to have 100 followers and read my pithy posts at thedeadline on Twitter.

Scaling Bandwidth
Wednesday, June 24th, 2009

In my last two columns (Small HPC and HPC Hopscotch) I have been talking about multi-core, memory, and HPC programming. The recent release of the AMD six core Opteron got me thinking about this topic. It will soon be possible to buy a 12 core workstation (or even a 48 core version!) I recall the days when a 32 processor cluster (16 nodes with dual single core processors) was a nice addition to any lab or even computing center.

I also talked about memory locality and how multi-core has introduced a new hierarchy: near, near-by, and non-local memory. In the past an MPI programmer really thought about local memory and non-local (distant) memory on another node. Distant memory was only changed by sending a distant process a message using MPI. Near-by or SMP memory with a bunch of cores attached represents a different (although not all that new) programming paradigm for HPC users. You can still run MPI programs on SMP systems, but a threaded shared memory programming model is also attractive as you don’t need any of that “MPI stuff.” Seems like the cluster may become obsolete for all but those really big jobs.

Not so fast. Two things need to be considered. First, I invite you to read what I had to say back in 2006 about the programming issue. As we Wind On Down The Road there are two paths you can go by: message passing and threads. You may have to make a choice at some point. Realize that this is not a trivial decision. It is going to matter.

The second point I want to make is about memory bandwidth. Basically, it does not matter how many cores you have if you cannot keep them busy. There are limits to how fast memory can transmit data to and from cores. The placement of memory controllers on the processor allows parallel (simultaneous) memory access on an SMP motherboard. That is, instead of one memory controller for all the cores, there is now one for each processor socket. The memory controller is in charge of its own bank of memory. In this way, if a core in one socket is using local memory, it feels no effect when a core in the other socket is using its local memory (and vice versa). In effect the memory access has been parallelized. Note, if memory from another controller is needed, then QuickPath or HyperTransport step in.

This memory controller parallelization is why clusters are so powerful. If you consider that each node has at least one memory controller with an associated bandwidth, then N nodes have N times the memory bandwidth of a single node. Some numbers may help.

Consider the latest six core Opteron (Istanbul). I have read that a 24 core system (4 processor sockets) has demonstrated a Stream bandwidth of 41 GBytes/sec. (Note this is a vast improvement over a 16 core Shanghai system which gave 25 GBytes/sec. Google on “HT Assist” for more information.) That means we have 1.7 GBytes/sec of bandwidth per core.

Now let’s turn back the clock a bit. I dug up some old Stream numbers for dual core Opterons. A four core Opteron system (two dual core processors) was able to deliver 12 GBytes/sec or 3 GBytes/sec per core. If you were to run a memory bound MPI application on 24 cores, what would you want to use? A single 24 core SMP workstation (total memory bandwidth of 41 GBytes/sec) or six dual socket, dual core nodes (total combined memory bandwidth of 72 GBytes/sec)?
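The arithmetic behind those figures is simple enough to check in a few lines of Python. The bandwidth numbers are the ones quoted above; the variable names are just for illustration.

```python
def per_core(total_gbytes_per_s, cores):
    """Memory bandwidth available to each core if it is shared evenly."""
    return total_gbytes_per_s / cores


# One 24 core SMP workstation (4 sockets of six core Opterons).
smp_total = 41.0
smp_per_core = per_core(smp_total, 24)    # about 1.7 GBytes/sec

# One older node with two dual core Opterons (4 cores).
node_total = 12.0
node_per_core = per_core(node_total, 4)   # 3.0 GBytes/sec

# Six of those nodes also give 24 cores, and their bandwidth adds up.
cluster_total = 6 * node_total            # 72.0 GBytes/sec

if __name__ == "__main__":
    print(f"SMP:     {smp_total} total, {smp_per_core:.1f} per core")
    print(f"Cluster: {cluster_total} total, {node_per_core:.1f} per core")
```

Same core count, but the six small nodes have roughly 1.75 times the total memory bandwidth of the single large box, because each node brings its own memory controllers.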

In general, the more SMP cores you add to the mix, the lower the memory bandwidth per core. The more cluster nodes you add, the more the memory bandwidth scales. Note, I did not report processor or memory speeds or address the cost/power of six servers vs one workstation because I am making a “ballpark” argument to illustrate a point about clusters. While multi-core is packing more and more cores into a processor, memory bandwidth becomes a limiting factor. You may be able to pack 24 or even 48 cores into an SMP workstation, but the total memory bandwidth may never be as good as that of three or six 8-core cluster nodes.

Finally, if you follow my arguments, the best cluster node might be a single core with lots of memory and a fast interconnect. Given that single cores are hard to find, the next best thing might be a single socket dual-core server node, while they are still available. Now you could use a bunch of single socket desktop motherboards and build a cluster, but who would ever do anything like that?

I now have 71 people following me on Twitter. I’m impressed because I don’t say that much. A man of few, but choice words I suppose. I even figured out how to tweet from my new phone and still I don’t say that much. I just tweeted about writing (in this column) about not tweeting. Careful, a self-referential infinite loop is developing.

HPC Hopscotch
Wednesday, June 17th, 2009

In my last column, I discussed a possible future for HPC programming. Briefly, I assumed that as cores per socket continue to grow, HPC programming models would diverge. I assumed a small thread-based model would be used on multi-core systems (single memory domain) and a large model based on the current message passing (MPI) methodology would be used on clusters. It seems that I was not alone in thinking about this issue.

Continue reading HPC Hopscotch »

Small HPC
Tuesday, June 9th, 2009

Later this week I am heading off to my 30 year college reunion (yes I am that old). I attended Juniata College in central Pennsylvania. Juniata is one of the small liberal arts colleges for which Pennsylvania and the northeast are well known. I recall at one point in my freshman year an upperclassman telling me “You will never pass Organic Chemistry. It is where they weed out the pre-med students. Just forget it.”

Continue reading Small HPC »

A Brain in a Band-Aid Box
Wednesday, June 3rd, 2009

A recent story caught my attention. It was about the new Dell XS11-VX8. A few things took me aback. First, it is not an Intel or AMD platform, and second, it is not multi-core. To be more specific, it uses a VIA Nano processor. (The Nano is similar to the Intel Atom processor, which is found in many low power netbook designs.) Now the interesting part: Dell has designed server modules, each of which is about the size of a 3.5-inch disk storage bay (slightly longer). The module is an x86 64-bit Nano server that can hold up to 3 GB of main memory, an optional 2.5-inch disk or SD card, and provides two GigE ports. Dell reports an idle power draw of around 15 watts that can increase to a peak of 29 watts under load. Twelve of these can be housed in a 2U chassis. They are intended for China and India where power is at a premium.

Continue reading A Brain in a Band-Aid Box »

Is La Toya Jackson a Prime Number?
Tuesday, May 19th, 2009

There’s a new website in town. It is called Wolfram Alpha (W|A). I must admit I am fascinated by the concept. The tagline on the site is Making the world’s knowledge computable. Sounds different. However, before I discuss my experiences with the site, I would like to be clear about one thing. W|A is not Google, nor is it trying to be Google or some form of “Google Killer.” So going to Wolfram|Alpha and searching for recipes is not going to work as you would expect. I’ll get into this a little bit more below, but there is a tendency to compare everything to Google these days and to conclude Google is better and will smash these pesky sites in short order. The key difference is “computable knowledge” versus “searchable web”. In case you don’t know, Wolfram is the Stephen Wolfram who created Mathematica. He also wrote A New Kind Of Science. I have never met Wolfram. He seems like a bright guy, though I wonder if he went to the Donald Trump school of branding.

Continue reading Is La Toya Jackson a Prime Number? »

Clouding Of The Grid
Wednesday, May 13th, 2009

A while back, I read “Cloud computing is grid, but easier.” Maybe. I believe cloud computing is different and not all that close to grid. In addition, I believe the challenges presented by creating grids coupled with virtualization actually opened the door for cloud computing. I’ll back up my arguments in a minute, but first I invite you to take a look at a grid article from 2004 written by Ian Foster — the architect of the grid concept.

So what is a grid? Foster defined three criteria for a grid:

Continue reading Clouding Of The Grid »

Incremental Twiddling
Tuesday, May 5th, 2009

I have a theory. It goes something like this. At one point there was the primal program. Let’s call it hello.c. It does something very simple: it prints Hello World. My theory states that virtually every program is a variation of this program. From the BASIC 10 PRINT "Hello World!" or Fortran PROGRAM HELLO WRITE (*,100) STOP 100 FORMAT (' Hello World! ' /) END to the classic main() { puts("Hello World!"); return 0; }, most beginners started adding code to these simple examples.

Don’t believe me? Maybe you used some modern wimpy language. I’m sure you started with a simple example program. Let’s do a survey. How many of you started programming with a one or two line program? Raise your hands. Keep your hands up. How about the rest of you? How many of you took a simple existing program and modified it as the basis of a new program? I see lots of hands. Some have both hands raised. Okay, you can put your hands down now.

Continue reading Incremental Twiddling »

The Servtainer Has Arrived
Tuesday, April 21st, 2009

When I was a young buck, I worked in a commercial bakery the summer between college and graduate school. This place could sure turn out the bread. It was highly mechanized and ran, as far as I could tell, very efficiently. The first day on the job, I asked one of the workers about all the large empty vat type things along the wall. When I say large, I mean about five feet wide, by ten feet long, by about three feet high, stainless steel behemoths on wheels. The worker looked at me and said, “That’s where they mix the dough. You don’t think we make this like your mommy makes it at home, do you, college boy?” I just nodded and went back to pushing my broom.

Continue reading The Servtainer Has Arrived »

Expediting Synergistic Paradigms
Tuesday, April 14th, 2009

April 15, 2009: Linux Magazine, HPC writer Douglas Eadline, a market leading HPC hipster writer dude, reports on the SGI situation. SGI, formerly known as Silicon Graphics, is going down the tubes due to an industry leading April 1, 2009 bankruptcy filing (no joke). Tracing the bottom-up market trend, he recalls several distinctive user driven top-down press releases that shaped the industry.

June 10, 2005: Linux Networx Announces New CEO, ECLIPSE Certification

The Linux Networx board of directors announced the appointment of Robert H. “Bo” Ewald as chief executive officer (CEO).

“Linux Networx is recognized by the HPC industry as the leader in delivering the highest performance Linux cluster computing solutions and customer service worldwide,” said Ewald. “I’m looking forward to expanding into new international and domestic markets with supported configurations that meet our customers’ requirements.”

Continue reading Expediting Synergistic Paradigms »

The Clustering Way
Tuesday, April 7th, 2009

Nehalem, erh-ahh, the Intel Xeon 5500 series has arrived. First let me say, from what I have seen and read, Intel has done a stellar job with the new micro-architecture (the tock step in their “tick-tock” strategy). In addition to improved memory bandwidth, there are many nice performance, power-saving, and virtualization features as well. All hail Intel; nice job. That said, I have some observations that may deflate the Nehalem balloon a bit. Well, someone has to do it.

Continue reading The Clustering Way »

Multi-core Clip Show
Tuesday, March 31st, 2009

Over the years, I have developed a reputation for writing things down. It started when I was younger, but I really earned my stripes in the early days of HPC clustering. My motivations were self-serving. Rather than write the same response to a mailing list question for the tenth time, I figured I would write it down once and send a text document to those who were interested or, better yet, put it on the web. That made my responses easier: I could simply write “take a look at this URL” rather than compose a long reply. There are those, however, who have a knack for writing excellent, exhaustive answers to cluster issues many times over. I won’t mention any names, but some guy with the initials rgb has such a reputation on the Beowulf mailing list. Don’t tell Robert Brown I mentioned his initials. Thanks.

Continue reading Multi-core Clip Show »

Good Enough Will Have To Do
Wednesday, March 18th, 2009

In the past, being fast was a bit simpler. Of course, I’m talking about computers, clusters, and HPC and not runners, race cars, or anything else that moves. If you wanted to know if computer A was faster than computer B, you ran the same program and compared the results. In the same sense, if you wanted to know who was the fastest runner, you got a stopwatch and said “go.” The fastest time is still the fastest time, but how you got there seems to matter more these days. In the case of the runner, we are now required to test for the possibility of performance-enhancing drugs. In the case of computers, we need to be more diligent as well. I’m not talking about overclocking, which in a sense is like performance-enhancing drugs for computers: pushing a system to its limits and risking damage. At least overclockers brag about their accomplishments.

Continue reading Good Enough Will Have To Do »
