
Under Deadline

Senior HPC Editor, Douglas Eadline, Ph.D, blogging on clusters, multicore, interconnects, and everything high-performance.

Summer Lovin' Erlang
Wednesday, July 2nd, 2008

Last week I put my Erlang flag in the ground, as it were. I was expecting a raft of flaming comments. It did not happen. I doubt there was a collective head nod from my vast audience, so I wonder why it was so quiet. Perhaps everyone is on vacation, or getting ready for vacation and has put off reading my insightful column until they take one. Speaking of vacations and beaches, that includes me. Weekly columns are still required, however. I have set aside some beach time to continue reading Programming Erlang: Software for a Concurrent World by Joe Armstrong. I have not yet, however, brought myself to bring my Eee web-book to the beach to play with some code. Bright sun notwithstanding, I fear the union of silicon, both refined and shoreline, will not be a happy affair. Although this Asus photo might seem to indicate otherwise.

Continue reading Summer Lovin’ Erlang »

A Functional Stand
Wednesday, June 25th, 2008

This week I am going to climb back on my parallel programming soap box and make a proclamation. If you recall, I had started a discussion on declarative programming methods. Compared to “procedural” or imperative programming (C, Fortran, Perl, etc.), the declarative approach focuses more on the “what” instead of the “how.” HTML is a good example of this idea. When you create HTML, you are telling the browser “what” you want on the screen; the “how” is up to the browser. Because I have not told your computer how to render the information, you can choose to view it using any number of browsers, including a text-based browser like lynx or elinks.

When you write a procedural program (C, Fortran, Perl, etc.), you are “closer to the hardware” and farther away from your problem. When you add MPI, OpenMP, or Pthreads, you move even closer to the hardware and further away from the problem at hand. The programmer has more to manage, which also means more opportunities for bugs, but that is another issue. These additional parallel responsibilities are why procedural parallel computing is harder than sequential computing. It is why machine code is harder than a high level language. And, finally, it is why we are facing a bit of a crisis in the age of multi-core.

Continue reading A Functional Stand »

Can You Feel It?
Wednesday, June 18th, 2008

Back in the late ’90s there was a time when clustering was the rage. A few boxes, a fast Ethernet switch, some Linux software and presto, you had a cluster. As with every new technology, there were misconceptions, hype, and detractors. Eventually, the price performance numbers ushered in a new HPC computing paradigm. In some cases people cobbled together clusters just to build them; in others they built clusters and ran actual HPC programs on them. Indeed, there seemed to be story after story about a new “off the shelf system” built at low cost or from some systems that were “just lying around.” They would get an MPI or PVM program working and find that they now had their own personal supercomputer. There seemed to be several enabling factors that ran through the success stories. These included:


  • The performance is better than the existing system I was sharing with everyone else.

  • The hardware cost is really low (price-to-performance is high).

  • I did not have to buy any software.

  • I can do things I could never do before on my desktop.

  • I own this resource and have exclusive use.

Continue reading Can You Feel It? »

A Surging Rant
Wednesday, June 11th, 2008

This week I decided to take a break from my planned discussion of declarative languages. Mind you, this is an unplanned break as I’m at the point in my discussions where I’m about to reveal the answer to all your problems. I can’t do that this week because I’ve been overcome with a rant. It is much like that wave you see out in the ocean, slowly building as it moves closer. There is simply no stopping it.

Usually my rants have a trigger. I’ll read something or hear a comment about an important issue. Sometimes I agree with the opinion, sometimes not, but often it is that bump on the water’s surface. The wave has begun.

Recently on the Beowulf mailing list there was a discussion about vendors and cluster procurement. (The thread may still be active.) The upshot of the initial comments is that large vendors sometimes deliver servers in a state that is inconvenient for clustering. And they seem clueless about the customer’s needs. From my experience, issues such as BIOS configuration, PXE booting, and disk-full, disk-empty, or disk-less systems are often a problem for cluster customers.

Of course the big vendors want to sell to a majority of the market. Good economics, they say. Fine, then let’s be clear about what it is you are selling: racks of servers. A rack of servers is not the same thing as an HPC cluster. If you are new to the HPC thing, go back and read that again. There is a big difference between buying an “HPC Cluster” and “racks of servers.” Many customers and vendors think they are the same thing because they look the same. A perfectly understandable conclusion. But now you know they are different. Consider yourself enlightened. In general, the creation of an HPC cluster works in one of these three ways:

  1. Buy a rack of servers and invest the time (cost) to turn it into a cluster using open software packages or cluster distributions. Possibly the least expensive way to go.
  2. Buy a rack of servers, buy a cluster software stack, and set up the system yourself. Usually not as cheap as option 1, but usually less expensive than option 3 below, and there should be someone to help you get things working.
  3. Buy a turn-key system and start working right away. This option is usually the most expensive in terms of initial cost, but you are guaranteed a working system.

Options 2 and 3 usually have some kind of software support option. The thing to realize is that there is an additional cost above the hardware (racks of servers). Not until the racks of servers are working together as a system can you really call it a cluster. Also note that option 1 presents the largest possible variation in cost (a riskier bet, as it were). Should you buy that truck load of hardware only to find out that there is an “issue” with configuration, you may be performing a lot of extra work (even at graduate student wage rates). In addition, you have assumed the responsibility for maintaining and upgrading the software. For the seasoned cluster administrator, option 2 or 3 may be the lowest cost when you figure out the total cost of ownership over 3-4 years (and the number of headaches).

The above analysis is a bit simplified, as your situation may vary. It still makes the point that there is a “cluster cost” that must be paid if racks of servers are to be called a cluster. Understanding this distinction has got my ranting wave rolling thus far. The market should get the difference by now. In the past I recall certain vendors bragging about how many “clusters” they sell a month. I would listen, then sigh, and explain that selling a rack of servers is not selling a cluster; the cake still needs icing. Viewed only in terms of hardware, the ticket to the HPC ball got extremely cheap. Even the rackem-stackem-fly-by-night-want-to-be-HPC vendor could get in the door. And yet, most of these systems are not operational clusters.

The fact that anybody with a screwdriver can call themselves an HPC vendor is mildly disconcerting. What really brings my rant wave crashing down are the vendors, big and small, that have no clue as to what opened the door for them in the first place.

If you are a customer, next time you are thinking of making a hardware purchase, put this simple question on your RFP: “What has your company done to support the open HPC cluster community/market?” I invite you to take the response very seriously because your purchasing power is stronger than you think. Throwing business at those companies who have a seat on the HPC cluster clue-train will help ensure that much of the excellent “free” stuff that is used to build your cluster stays free. And the requisite phrase, “free as in speech,” is particularly important here. Of course there is free (openly available) software, and some vendors do contribute in a variety of ways. The other “free” things are the discussions (free as in speech) taking place over the Internet, at conferences, and most importantly over a glass of “free beer” at those hospitality parties. Vendors that support the community deserve your dollars. And there is usually a bonus: these types of vendors deliver HPC clusters.

If you are a vendor who has not joined the community, take heed; I’m not in a particularly nice mood today. So you want your slice of the ever-growing HPC cluster market. Fine, how do you think you got to the feeding trough? Here’s a hint: it is not so much what’s in your boxes, the color of your cables, or your keen business sense. In my opinion, you got here on the back of all those who built a market/community of open and freely available software, information, and conversations. You would do well to embrace the whole co-operation thing. Sharing helps make the pie bigger. The ways to get involved are numerous. Support (with time or money) a project that helps your customers, contribute to a mailing list, share your experiences and best practices. Think of it as a focus group and listen. You will get information about the HPC market that you cannot buy. You will learn more about this market than you ever thought possible. And then, maybe then, you will learn how to sell clusters instead of racks of servers. One word of advice: don’t try to turn every encounter with the HPC community into a sales opportunity. It will absolutely not work.

In closing, when I started this rant I told myself I was not going to single out any vendors, as I don’t like kicking down doors and taking names. You should be aware, however, that there are Linux Penguin companies that are highly skilled and often help streamline the understanding of things like scalable informatics. Those kinds of vendors know the difference between a rack of servers and a cluster because they live by the open source credo: give a little and get a lot. In my opinion, they give a lot and get very little, but that is another wave forming in the ocean.

The X=X+1 Issue
Tuesday, June 3rd, 2008

Like many people my age, my first programming experience was with the BASIC language. I recall using an ASR 33 teleprinter to create/edit/run BASIC programs. I would marvel at how the cylindrical print head would dance and bang across the paper, printing error messages from my ten line programs. Of course, at 300 BAUD you had plenty of time to watch things happen. Back in the day, BASIC was usually the first computer language you would learn. There were other more “advanced” languages like FORTRAN or COBOL that were deemed too difficult for the novice. By the way, we capitalized Fortran back then.

Continue reading The X=X+1 Issue »

The Lawnmower Law
Wednesday, May 21st, 2008


In my previous column I mentioned Amdahl’s Law. Before you click away, rest assured I have no intention to talk specifically about Amdahl’s Law and I promise not to place a single equation or derivation in this column. People are often put off by Amdahl’s Law. Such discussions usually start with an equation and talk of the limit as N goes to infinity. Not to worry. There are no formulas, no esoteric terms (sorry, no big words), just the skinny on the limits of parallel computing. I’ll even go one further: I’ll hardly mention parallel computers, multi-core, and other such overworked topics. This month I’ll discuss lawn care.


Like most home owners, I have a lawn. A most-of-the-year green thing that provides a large bathroom for my dog and lots of work for me. Having recently climbed the lawn care ladder, I am now the proud owner of a John Deere lawn tractor/mower. My new ride has given me the luxury of sitting while mowing and the opportunity for deep thought. Yes, there is much to ponder while riding upon the pinnacle of suburban achievement. Ah, but I digress.


While my green and yellow lawn Harley makes quick work of my grass and weeds, I often think, while mowing, how could I do this faster? Of course, the obvious answer is to get a bigger mower. There are those larger triple-blade units that would work much faster by cutting a bigger swath. They still only cut one swath at a time, so the speed-up would not be all that great. Then it hits me: what I need is a swarm of these green machines and a crew of experienced yard-men like myself. But wait, what about the edges? I use a push mower for the edges after I am done with the big area. And there is no sense in doing the edges until the big areas are done, just in case some tight spots were missed by the riding mower. I am also going to make a perfectly valid, but nonsensical, assumption that there is only one push mower available.


How fast can I mow the lawn with a team of riding mowers and one push mower? Let’s take a look at some numbers. Assume it takes me 60 minutes to do the entire yard (riding and pushing). The push mower takes 20 minutes and the riding takes 40 minutes. If I get ten riding mowers and drivers, I should be able to give everyone equal areas to cut and get the big area done in a remarkable 4 minutes. But then I have to do the edges with the push mower. That adds 20 minutes for a total of 24 minutes. Much better than 60 minutes, but still, I could do better. Let’s suppose I use 40 riding mowers. Using the same analysis, I get the yard done in 21 minutes. What if I use 100 mowers? As you can see, the slow step is limiting my speed-up. The result sounds similar to something that Gene Amdahl proposed: The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program.

Indeed, just as with parallel processors, there is a point of diminishing return. Adding the first 10 riding mowers reduced the time by 36 minutes (from 60 to 24). Adding another 30 only saved me 3 more minutes (from 24 to 21). Adding 100 mowers makes little sense since I’ll never get below 20 minutes. (Although I would love to see such a lawn mowing demolition derby, in my neighbor’s yard of course.)
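
For readers who would rather poke at the numbers than take my word for it, here is a minimal Python sketch of the mowing arithmetic above. The function name and defaults are my own invention for illustration; the only inputs taken from the column are the 40 minutes of riding work and the 20 minutes of push-mower work, and it assumes the riding work splits perfectly evenly.

```python
def mow_time(riding_mowers, riding_minutes=40.0, push_minutes=20.0):
    """Total minutes to mow: riding work splits evenly, push-mower work does not.

    Illustrative model only; the names and defaults are assumptions.
    """
    return riding_minutes / riding_mowers + push_minutes


if __name__ == "__main__":
    baseline = mow_time(1)
    for n in (1, 10, 40, 100):
        total = mow_time(n)
        # Speed-up is the one-mower time divided by the n-mower time.
        print(f"{n:>3} riding mowers: {total:5.1f} min, speed-up {baseline / total:4.2f}x")
```

No matter how large the number of riding mowers gets, the total never drops below the 20 minutes of push-mower work, so the speed-up in this model can never reach 3x.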


Taking a closer look, I think I have been too generous with my speed-up estimates. In reality, there is the task of co-ordinating the team of parallel lawnmowers. The co-ordination includes getting them onto the lawn and into the right starting position, telling them where to mow, and then getting them off the lawn. Thus, there is going to be some overhead required. The overhead takes time; let’s assume 1 minute per additional riding mower. Therefore, adding 10 mowers will require 10 minutes of overhead and using 40 mowers will require 40 minutes of overhead. The new numbers look like this:

Riding mowers                 1    10    40
Parallel mow time (min)      40     4     1
Parallel setup time (min)     0    10    40
Push mower time (min)        20    20    20
Total time (min)             60    34    61


Wait a minute (pun intended). Using 40 riding mowers takes longer than using one! How can this happen? Simple: the overhead adds to the sequential portion of the job. A sequential portion that was not there when one mower was used. Remember that little tidbit when thinking about parallel programming. Even though the 40 riding mowers get done in one minute, they still need to be set up to mow.
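
The same kind of sketch can reproduce the overhead table. As before, this is only an illustration under the stated assumptions: setup costs one minute per riding mower, a lone mower needs no setup (matching the first column of the table), and the function name is hypothetical.

```python
def mow_time_with_setup(riding_mowers, riding_minutes=40.0,
                        push_minutes=20.0, setup_per_mower=1.0):
    """Total minutes including co-ordination overhead (illustrative model).

    Assumption: a single mower has no setup cost; a team of n mowers
    costs n * setup_per_mower minutes, as in the table.
    """
    setup = 0.0 if riding_mowers == 1 else setup_per_mower * riding_mowers
    return riding_minutes / riding_mowers + setup + push_minutes


# Search team sizes from 1 to 40 for the shortest total time.
best = min(range(1, 41), key=mow_time_with_setup)

if __name__ == "__main__":
    for n in (1, 10, 40):
        print(f"{n:>2} riding mowers: {mow_time_with_setup(n):.0f} min")
    print(f"best team size in this model: {best} "
          f"({mow_time_with_setup(best):.1f} min)")
```

In this toy model the sweet spot is a modest team, and every mower added past it makes the job slower. Change the setup cost and the sweet spot moves, which is rather the point.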


Based on the above analysis, one may think that this whole parallel thing is a waste of time. In some cases it is. I doubt I’ll ever need more than one riding mower. If I had to mow an entire golf course, then 40 riding mowers might make sense because the problem is now bigger. (And, yes, they use huge gang mowers on golf courses. I used to pull such a device. It would take 2-3 days for two people with two tractors to do the entire course. We each pulled a 9-way gang mower unit. That is about the equivalent of 18 riding mowers working at the same time.)


Finally, I assume you are wondering, what is the Lawnmower Law? It has nothing to do with what I have been talking about. The above discussion is Amdahl’s Law. The amount of speed-up you can expect is limited by the sequential portion of your code plus the parallel overhead. It turns out it is a general rule that can be applied to almost any multi-worker situation. What about the Lawnmower Law? Glad you asked. Here it is: When you need to mow the lawn, get the neighbor kid to do it.

Parallel Gedanken
Tuesday, May 13th, 2008

Sit down, take a deep breath and relax. Let your mind wander for a bit. I’m going to talk about something that runs counter to one of the basic tenets of computer performance. Something you may believe to be true for ever and always. Last time I talked about how concurrent parts of a program are those that can be computed independently. I also explained how the parallel parts of a program are those concurrent parts that should be executed at the same time in a way that increases the speed of a program. As I discussed, the differentiator is overhead. It is now time to take the next step. By the end of this column, I hope to convince you that the following is sometimes true:

In a parallel computer, faster processors are not always better than slower processors.

This statement is kind of a trick because it makes no mention of how the processors are connected. And, that is the issue. The interconnect contributes to the overhead and thus the scalability — or how many processors you can add to your program before it stops getting faster. Scalability is also determined by Amdahl’s Law, which states that the overall performance of your parallel program will be determined by the amount of serial content. That is, the parts that cannot be executed in parallel. These parts are eventually going to limit how fast your program can run. We will come back to that later, but for now let’s talk about scaling parallel parts.

Let’s use a simple Gedanken-experiment (thought experiment) to see how slower processors may help in some situations. Assume we have a program that has a large concurrent loop with a certain amount of overhead. Let’s assume we have eight very fast processors and a medium speed network (i.e. the network is slower than the processors’ ability to eat and crunch on data). What will happen with our program? It doesn’t take too much Gedanken to figure out that the network is going to limit the ability of the shiny new processors to do work (i.e. they will be waiting on the data). At some point, our program will stop scaling and we will see less than an 8X speed up.

In a second experiment, let’s replace those eight processors with thirty-two processors that are 4 times slower, but are matched to the medium speed network (i.e. the processors do not wait on any data). In this case, our program scales perfectly and we will have reduced the run time on a single slow processor by a factor of 32. Since our processors are now 4 times slower, the time for our 32 processors would be equal to an 8X scaling for our 8 fast processors. Recall, we were unable to get an 8X scaling with fast processors. QED.
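
The thought experiment can be put into a few lines of Python. This is a sketch, not a measurement: the 100-unit job size and the 0.75 efficiency for the fast processors are made-up numbers standing in for "the fast processors wait on the network some of the time," and the function name is hypothetical.

```python
def run_time(single_fast_time, processors, relative_speed, efficiency):
    """Wall time for a fully concurrent job.

    single_fast_time: time on ONE fast processor (hypothetical units).
    relative_speed:   processor speed relative to the fast processor.
    efficiency:       fraction of time spent computing rather than
                      waiting on the network (an assumed value).
    """
    return single_fast_time / (processors * relative_speed * efficiency)


T = 100.0  # assumed single fast-processor run time

# Eight fast processors that stall on a medium speed network.
fast = run_time(T, processors=8, relative_speed=1.0, efficiency=0.75)

# Thirty-two processors, 4x slower, matched to the network (no waiting).
slow = run_time(T, processors=32, relative_speed=0.25, efficiency=1.0)

if __name__ == "__main__":
    print(f"8 fast processors:  {fast:.2f}")
    print(f"32 slow processors: {slow:.2f}")
```

Any efficiency below 1.0 for the fast processors gives the same qualitative result; the slow, balanced system finishes first. How far below 1.0 a real machine lands is exactly what benchmarking is for.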

If you understand this concept, you now understand why the interconnect is so important for many cluster applications, and why many people building clusters spend good money for a fast interconnect. Feeding processors is important. The latest and fastest processors need the latest and fastest interconnects. Unfortunately, the rate at which processors increase in speed often outpaces the available interconnects. Multi-core has changed this a bit, but the issue is still the same. Messaging rate and throughput have become important because there are multiple cores sharing the interconnect.

The “slow is better” approach is also employed in two commercial supercomputers. The venerable IBM Blue Gene, which consistently beats up any contender for the top spot on the Top 500 list, uses 700 MHz PowerPC 440 processors (plus a floating point unit) and a fast balanced interconnect. Another balanced system approach is from SiCortex, which uses 5832 500 MHz MIPS processors and a very fast balanced interconnect. The key word is balance. There is also a huge reduction in energy costs for these two systems. Something to keep in mind in the age of green.

Of course, applications vary and some applications are not as sensitive to the interconnect speed and do see benefits from using faster processors. Multi-core will certainly change this behavior, which is why benchmarking is important. As often quoted on the Beowulf Mailing list, “It all depends on the application. Benchmarking your application(s) is the best measure. YMMV (Your Mileage May Vary).” Words well worth remembering.

I will conclude this discussion with one more data point which comes to you first hand. Back in the day, a co-worker of mine, Anatoly Dedkov, and I were intrigued by this idea as it applies to disk based I/O, because getting data off of spinning disks always seems to keep processors waiting. We actually wrote and presented a paper at the Parallel and Distributed Processing Techniques and Applications conference in 1995 (PDPTA’95). We shamelessly proposed the Eadline-Dedkov Law which states:

For two given parallel computers with the same cumulative CPU performance index, the one which has slower processors (and probably a correspondingly slower inter-processor communication network) will have better performance for I/O-dominant applications.

The whole idea struck us as counter-intuitive at first, which is why we investigated it further. Next time I’ll talk more about Amdahl’s law. In the mean time, please feel free to enlighten your colleagues about the slow processors idea. Makes for great heated arguments and fist fights.

Not Necessarily Parallel
Wednesday, May 7th, 2008

Before I climb back onto my multi-core soapbox, I want to discuss some parallel computing fundamentals. Having been involved with parallel computing for almost 20 years, I often forget that those new to the cause lack a firm background in some of the basic concepts being discussed. It took hard lessons and several years for these concepts to sink into my head; hopefully your path to “fully parallel” will be far more gentle. So, let’s begin. If there is one thing to “take away” from this column, it is the following set of three rules:

  1. Concurrency is a property of the program.
  2. Concurrency does not necessarily imply parallel execution.
  3. Efficient parallel execution of concurrent programs is a property of the hardware.

Like many people, I often confused the words “concurrency” and “parallel,” and these terms are often used interchangeably. However, in the HPC universe they are not the same thing. Concurrency is a property of a program or algorithm: if parts of the program can run independently, then those parts are concurrent. They may be totally independent function calls or iterations of a do loop of some kind. Learning that parallelization can be compartmentalized is, for many people, a “Eureka, my program can run in parallel” moment.

Not so fast there, partner. If the independent parts are run on separate cores/nodes, then we will say the program is executed in parallel. And just because your program has some concurrency, it does not mean that it will run efficiently in parallel. The distinction is subtle, but very important when real hardware is used. Since the goal is to make your program run faster, we need to ask the question: does making all the concurrent parts of the program parallel increase execution speed? The answer to that question is a definite “maybe.” Why? Because in some cases running concurrent parts of your program in parallel may actually slow down the overall application!

How could this happen? Let’s introduce the concept of parallel overhead. In a threaded shared memory model, the overhead is access to memory (there is also thread set-up and tear-down). Recall that in a multi-core shared memory model all cores use the same memory and will often need exclusive or guarded access to memory locations. A thread will lock a memory location to make sure no other thread changes anything. If another thread wants to read/write to the same memory location, it must wait until the location is unlocked. And thus, you have overhead.

In a distributed memory model, the overhead is the time it takes to copy memory from one process to another (i.e. send a message). Even if the message passing program is executing on a shared memory SMP, there is still message overhead. If messages are sent over a network the overhead can get quite large.

When a program is executed in parallel, the expectation is that the overhead plus the parallel execution will produce a faster run time. In some cases, the overhead time is much larger than the parallel execution time and thus any gain is lost. For example take a simple pseudo-code loop:

for I = 0 to 7999
  X[I] = X[I] + 1
done

This loop is concurrent and could very easily be broken up into eight groups of 1000 iterations. The actual computation could be done in 1/8 the time, but only if the overhead does not outweigh the reduction in execution time. In a message passing environment, this loop would almost always run faster on a single core than it would on eight cores. My guess is that in a shared memory environment the same would be true. While this case may seem obvious, there are others that are not so obvious and looking at the code is only going to give you the smallest of hints.
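
To make “concurrent” concrete, here is a minimal Python sketch that breaks the 8000 iterations into eight groups of 1000 and hands each group to its own worker thread. The helper names are my own, and it assumes the array length divides evenly by the number of groups. All it demonstrates is that the groups are independent and give the same answer as the plain loop; it makes no claim about speed.

```python
from concurrent.futures import ThreadPoolExecutor


def add_one(chunk):
    """The loop body applied to one independent group of elements."""
    return [x + 1 for x in chunk]


def chunked_add_one(values, groups=8):
    """Split the loop into `groups` independent pieces and recombine.

    Assumption: len(values) is an exact multiple of `groups`.
    """
    size = len(values) // groups
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=groups) as pool:
        # map() returns results in the same order as the input chunks.
        results = list(pool.map(add_one, chunks))
    return [x for chunk in results for x in chunk]


if __name__ == "__main__":
    X = list(range(8000))
    sequential = [x + 1 for x in X]
    print(chunked_add_one(X) == sequential)
```

Time both versions on your own machine before drawing conclusions. For a loop body this small, the cost of creating workers and shuffling chunks around is the kind of overhead that can easily swamp the work itself.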

In my second rule the key word is “necessarily.” Some concurrent applications with large data sets are obvious candidates for parallel computing. Parallel rendering is a good example. It is almost trivial because there is very little communication/sharing (overhead) between the concurrent parts. There are other applications where things get a bit more complicated and finding a concurrent loop does not “necessarily” mean it should run in parallel.

The third rule has one important implication. That is, efficient parallel programs are not “necessarily” portable. Or, put another way, portability does not imply efficiency. For instance, on one hardware platform a loop may be efficient when executed in parallel, while on another it may actually slow down your program. Gulp. And you thought all it took was a little OpenMP or MPI.

That’s probably enough parallel weirdness for one week. Remember the rules and set your expectations accordingly. Next time I’ll explain why faster processors are not necessarily better than slower processors.

Equal Time in the TCOP Discussion
Wednesday, April 30th, 2008

My recent column on the Total Cost of Parallelization (TCOP) generated some interesting feedback. I like feedback because it means people are actually reading what I write. In particular, when a response to one of my columns is from an experienced practitioner of the art, I read it carefully. Informed discussion is a great thing and the multi-core world can certainly use more ideas. With that in mind, I would like to provide equal time to Brian Dobbins of Yale University. He emailed a response directly to me with what I feel are valid points. I inserted a few comments to help clarify some of my own points referenced by Brian’s comments, but did not alter any of his views or conclusions. (Brian’s comments are in italics.)

My counter point stems from a recent discussion I had with someone explaining that it’s almost always better (performance-wise), when given the choice between 150 processors at speed 1.0X vs. 100 processors at speed 1.5X, to go with the latter option. While both give the same aggregate computational “power”, the latter has fewer communication links, so more time is spent in useful computations.

The rule of thumb is a good one. There are, however, some subtleties which don’t affect my proposal but are worth mentioning. For instance, if the interconnect and the processors are balanced, then the smaller number of processors is better, but if the processors spend time waiting on processor-to-processor communication, then you may see better performance on more processors (provided the application scales).

In your column, you say that single core performance doesn’t matter if you can scale, and then give a figure of 80% scaling on 8 cores. This statement leads to another question: are we talking about 80% scaling regardless of how many cores we run on, or is it 80% going from 1-8, then another 80% going from 8-64, and so on?

Good question. I should have been more clear. When I say “80% scaling”, I mean the following: at 100% scaling on four cores, I assume a 4X speed-up, where speed-up is sequential-time/parallel-time. When I say “80% scaling”, I mean it gets 80% of the 4 times speedup or 3.2 times speed-up. The same goes for 8 as well (80% of 8 or a scaling factor of 6.4).

Let’s take a closer look at overhead - presumably there will be some overhead from our abstraction into this SL (Scalable Language) as opposed to the closer-to-the-metal C or Fortran. Assume there is a theoretical C version of this program that scales at 90% for every 8x increase in processors, and that the SL version scales in the same way but at 80%. Let’s also assume that the SL serial version is just as fast as (not 5x slower than!) the C version. The table below indicates how this would play out in terms of speed-up on a large range of cores:

Language    1 Core    8 Cores    64 Cores    512 Cores
SL          1.0X      6.4X       41.0X       262.1X
C           1.0X      7.2X       51.8X       373.2X

At 8 cores, the difference is negligible at about 12.5%, and considering the size of the infrastructure required for a mere 8 cores, that probably isn’t tremendously important from a financial point of view. If, however, the company or university for which this application is important is scaling out to 512 cores, then the difference is a factor of 1.42 times. At today’s prices, let’s say we can get 512 cores for $200K, meaning that a 42% difference in performance is worth $84K. I’m ignoring continuing operating costs, etc. The question now becomes whether that $84K is worth the time of the person or people coding the application. The answer, of course, is “it depends”. The complexity of the application, how many programmers are working on it, for how long, etc. need to be considered.
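
The table and the dollar figure can be reproduced with a short Python sketch. It assumes, as the discussion above does, that the scaling efficiency applies once per 8x jump in core count and that core counts are powers of 8; the function name is hypothetical, and the $200K price is the figure quoted above.

```python
def speedup(cores, efficiency_per_8x):
    """Speed-up when each 8x jump in cores delivers 8 * efficiency.

    Assumption: `cores` is a power of 8 (1, 8, 64, 512, ...).
    """
    result = 1.0
    while cores > 1:
        cores //= 8
        result *= 8 * efficiency_per_8x
    return result


if __name__ == "__main__":
    for name, eff in (("SL", 0.8), ("C", 0.9)):
        row = "  ".join(f"{speedup(n, eff):6.1f}X" for n in (1, 8, 64, 512))
        print(f"{name:<3}{row}")

    ratio = speedup(512, 0.9) / speedup(512, 0.8)
    cluster_cost = 200_000  # dollars for 512 cores, as quoted in the text
    print(f"C over SL at 512 cores: {ratio:.2f}x")
    print(f"value of the gap: ${round(ratio - 1, 2) * cluster_cost:,.0f}")
```

Because the per-step gap compounds, a 12.5% difference at 8 cores grows with every further 8x jump, which is why the choice matters far more at 512 cores than at 8.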

Which brings me to my second point - I don’t think “parallel programming” is difficult, not in the sense of writing the C or Fortran code to call MPI. That is the easy part. The harder parts are algorithmic and, frankly, just plain old serial software engineering. The algorithmic considerations might be moot when considering taking a serial code and making it parallel to solve the same problem, but many times the size of the problem one is tackling also increases, and that often requires new algorithms.

Consider a common example, the N-body simulation. For a very small number of particles, a basic particle-particle interaction algorithm is perfectly fine, but it scales at O(N^2), so once you start increasing the size of the problem, you need something better, and might switch to a hierarchical O(N log N) method, which has a higher constant for small values (and thus is slower for small N), but is far faster for larger N. Other algorithmic options include PM or ‘P3M’ methods, and many variants of each of these categories as well. I can’t envision an SL language being able to make such algorithmic decisions in any reasonable time frame, so this choice is still left to the programmer.
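
The crossover between the two N-body approaches can be illustrated with a toy cost model in Python. The constant of 20 on the hierarchical method is purely an assumption chosen to make the crossover visible; real constants depend on the code, the accuracy required, and the hardware, and the function names are hypothetical.

```python
import math


def direct_cost(n):
    """Particle-particle cost model: every particle interacts with every other."""
    return n * n


def tree_cost(n, constant=20.0):
    """Hierarchical cost model: N log N with a larger (assumed) constant."""
    return constant * n * math.log2(n)


def crossover(constant=20.0):
    """Smallest N (from 2 up) at which the hierarchical model becomes cheaper."""
    n = 2
    while direct_cost(n) <= tree_cost(n, constant):
        n += 1
    return n


if __name__ == "__main__":
    for n in (10, 100, 1_000, 100_000):
        cheaper = "direct" if direct_cost(n) < tree_cost(n) else "hierarchical"
        print(f"N = {n:>7}: {cheaper} is cheaper in this model")
    print(f"crossover near N = {crossover()}")
```

Below the crossover the simple method wins; above it the hierarchical method pulls away quickly. Where the crossover actually falls for a given code is something only measurement can tell you.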

The second real barrier that makes parallel programming seem hard is that it really shines a light on bad software design. I’ve been involved with a number of applications for which the majority of my effort has focused on reworking serial code so that it is simpler to understand and much more flexible in its design, which in turn lends itself easily to parallelization and algorithmic experimentation. With a well-designed serial code and a chosen parallel algorithm, the actual process of adding the code to parallelize a given application takes little time indeed – Joe Landman’s recent column and many MPI tutorials such as those available at the NCSA illustrate this quite well. As a side benefit, with a good design, the parallel constructs will be as transparent as possible to end users of the code by abstracting away all such MPI calls into their own routines, and will ideally keep as much of the main program code as similar to the serial process as possible. In this sense, I disagree with your comment that “as the core count continues to grow, so will the cost of programming with your current and comfortable software tools.” To me, the programming approach really doesn’t change much as core count increases.

In conclusion, for a given application type, if we assume that an SL introduces some small overhead, the optimal choice of tool (in this case, SL vs. C/Fortran) is going to depend upon the performance one needs at a given processor count. If we consider an open range of processor counts, the best tool for a given project, as determined through some cost function of performance vs. total cost of the solution, is most likely going to be a curve. Naturally, at some point in that curve will be a transition point between these options. My best guess, at this point in time, is that the ‘old guard’ (C/Fortran) are still going to rule the roost where performance matters most, but I definitely concede that for people aiming for small scale parallelism, something that ‘automagically’ gives them some improvement, even if sub-optimal, would be wonderful. Well, wonderful in the sense that, compared to walking, driving on the highway at 35 mph is wonderful, but you’re still going to get passed by most of the other people on the road.

I should point out that I was thinking more along the lines of “general programming” while Brian was thinking about HPC programming. Nonetheless, Brian raises some good points. And, I invite you to comment below as well. The multicore/parallel discussion has just started and I have no clue where it is going to lead!

The Total Cost of Parallelization
Wednesday, April 23rd, 2008

I have calmed down since my previous rant. Looking at the comments, it seems I have stepped on more than a few issues. While I can’t address all the comments, I will make a general statement. There were some suggestions that solutions already exist. And, indeed, I can’t argue with that; you can write parallel programs in shell if you so desire. The big issue is the abstraction layer.

As a developer, I don’t want to think about the number of cores or nodes any more than I have to think about the addressing modes, registers, and floating point units in the processor. If parallel computing is going to truly be cost effective, then reasonable abstraction layers need to exist. Many developers and most vendors will acknowledge an urgency to achieve the goal of simplified parallel programming lest we face some dire consequences going forward. The solution I have proposed is a shared GPL approach. If anyone has a different idea, I would love to hear it.

Moving on.

Continue reading The Total Cost of Parallelization »

The Scent of the HPC Market
Wednesday, January 23rd, 2008

This week I’d like to take a look at the latest survey results. While the participation was not all that I wanted, it is enough to make sufficiently vague statements about the HPC cluster market.

Whenever I talk about Web surveys I always throw in a proviso that these are not well-designed, scientific surveys performed by more sophisticated marketing organizations. Web surveys are more like asking the person on the street what they thought of a movie. When you look at the survey results, think of them as a “scent” of the market that may suggest some areas that bear watching. But, enough handwaving. Let’s look at some results.

The question that garnered the greatest response (141 votes) was “What will limit your use of HPC Clusters in the next 12 months?” In other words, what is getting in your way? The one choice that surprised me was that sixteen percent of the people said “nothing.” You go people! Buy, build, and compute. On the other hand, 25% said that lack of skilled people was the biggest hold-back.

Software and hardware, space and power each weighed in at about 16%. Interestingly, cluster management issues were only of concern to 10% of the respondents. In previous surveys, cluster management issues seemed a bit more prevalent. This is a good trend, if it is true.

The people issue has been with us since the beginning. This result is supported by the fact that I am often asked if I “know anyone” who would like a job working with HPC clusters. Unfortunately my answer is always, “Not really, everyone I know is busy and seemingly happily engaged in their current job.”

In this survey, I wanted to drill down and ask what kind of people were actually needed. The results (107 votes) were application specialists (31%), system administrators (27%), programmers (25%), none (11%), project managers (3%), hardware specialists (3%). The “none” result corresponds well with the “Nothing” result of the previous question. The top three needs are of course about software. The even spread across the top three responses indicates we need cluster knowledge across the board, as it were.

It seems the whole cluster thing snuck up on the market, and probably spread into markets that could not afford HPC in the past. The result is people who really understand this stuff (they have what I like to call “cluster scar tissue”) are hard to find. We need to fix this situation and things like Cluster Training at Georgetown University are helpful. (Full disclosure: I teach the fun filled Intermediate Beowulf Administration and Optimization course.)

Moving on to the other questions, I’m going to take a very cautious approach as the number of responses is quite low. Let’s take a look at the Infrastructure and the Parallel File Systems questions. Power and cooling seem to be the big issue for 75% of the 22 respondents. The rest considered cluster footprints to be their biggest concern. The parallel file system question had low participation as well (22 votes), but showed that parallel file systems were becoming more important for over half the respondents (55%), critical to project success for 36% of the respondents, and really not an issue for only 9% (NFS works fine).

Finally, there were two questions about hardware and software struggles. In terms of hardware, the biggest issues were compute node cost/performance (29%), interconnect cost/performance (24%), vendors don’t understand my needs (21%). Choosing the best hardware (13%) and component failures (13%) rounded out the 38 respondents. Other than noting an opportunity for vendors, I’m not putting much stock in these results, since there were five possible answers and only 38 respondents. A few more votes could easily change the result.

Turning to software, the results were as expected even with 22 respondents. The number one HPC software issue seems to be efficient use of multi-core processors (55%). The other issues came in pretty even, with application costs and cluster management tools both at 14%, and lack of software tools and lack of applications each at 9%. The only thing that surprised me was the paltry number of votes for software tools, but the low number of respondents limits any real conclusions.

You can check out the survey results for yourself (after you vote) at Today’s HPC Clusters. We will leave current surveys up for about a week at which point a new batch of HPC questions will appear. Then it’s your turn. I don’t need to remind anyone of the importance of voting in the coming months. Outcomes that will influence the nation’s and even the world’s future will be determined. Of course, I’m talking about American Idol.

Boxes and Bugs
Wednesday, January 16th, 2008

It seems after last week’s rant, the poll participation on Today’s HPC Cluster has increased. Great job everyone! If you haven’t applied your valuable expertise to these polls, please do so as soon as possible. I plan to discuss the results next week. How was that for positive reinforcement? Let’s move on to this week’s topic, boxes and bugs.

Let’s talk about the boxes first. There seems to be an interesting trend developing in the PC market. Wal-Mart has recently started selling Everex Linux computers for $200. There’s also news that Shuttle will be making a small “sub-$200” box as well.

Of course, these systems are a little thin on resources, but there seems to be this $200 price barrier that the industry is trying to crack. My interest is not in touting the “Linux makes this possible” story, but to point out that such systems are selling quickly. If the trend continues, a large number of these systems will be produced, and anytime a large amount of “computing anything” gets produced, my “Can you build a cluster with these?” antennae go up.

Of course, the $200 systems are mere toys when it comes to a modern-day multi-core cluster node, but bear with me for a moment. The $200 price point also means that there will be up-selling to better systems. So instead of $200, suppose for around $400 you could get such a system with 2GB of RAM, a reasonably fast dual-core processor and a good Gigabit networking card. The smallish hard disk drive can be considered optional. At this point you have a respectable, but not super fast, cluster node.

A perfectly logical question to ask is why build a cluster out of throw-away hardware? Well, why indeed? Which brings us to the second part of this discussion — bugs.

I’ve always been fascinated by colonizing insects. You know, those deforesting ants, or swarms of killer bees. After you get over the “wow! that is a lot of bugs” reaction, you may wonder how the heck they work together so well. Parallel processing at its finest, I figure. We assume the ants are not calculating next week’s weather, but they are solving some difficult logistics problems using large numbers of individual worker units.

What can we learn from the ants and bees? Well, first it seems everyone has their own job and a set of rules to follow, dare I say program to follow. Years ago when I played Sim-Ant on the computer I recall there were three types of ants, diggers, foragers, and soldiers. Each did its own thing and the survival of the colony was dependent upon the right balance. The other thing that I noticed was no individual ant was essential to survival of all the ants. In other words, ants were disposable, or there was redundancy in their tasks.

There is also the queen ant which was responsible for producing more ants, and could loosely be considered a central point of control or perhaps a single but highly effective ant foundry. The queen does not, however, direct the 20,000,000 ants reported to be in some colonies. There is no redundant queen, but there are redundant colonies.

Perhaps we can take something from the ants that may be helpful for clustering. I have long thought about the idea of disposable nodes. Well, not the throw-them-in-the-trash disposable, but redundant disposable. That is, if a node fails, it doesn’t matter. What if codes were written with a dynamic redundancy so they could tolerate one or even several nodes dropping out of sight? Or, what if low cost nodes were “compute mirrored”, kind of like a RAID 1 for cluster nodes? If nodes were cheap and plentiful, then who cares? It would be kind of like stepping on ants: there always seem to be more. There are other possible scenarios as well. When nodes are plentiful and cheap, rather than limited and expensive, then the rules of the game may change.

One of the advantages of clustering is that the loss of a node does not bring down the entire cluster. With today’s ever growing multi-core behemoth nodes (e.g 8 cores with 8GB of RAM), a failed node now brings down a big chunk of computation. Compare that to eight of the cheap dual core nodes I mentioned above (for a total of 16 cores and 16GB of RAM at about the same price) and you might be inclined to start thinking like an ant.
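A few lines of arithmetic make the ant argument easier to see. This is only a sketch: the helper name and the cluster sizes (four big nodes versus thirty-two small ones, following the eight-to-one price ratio above) are assumptions chosen for illustration.

```python
def loss_from_one_failure(node_count, cores_per_node):
    """Return (cores lost, fraction of all cores lost) when a single node fails."""
    total_cores = node_count * cores_per_node
    return cores_per_node, cores_per_node / total_cores

# Hypothetical budget: four 8-core nodes, or 32 dual-core nodes
# at roughly the same price (eight small nodes per big node).
big_nodes = loss_from_one_failure(node_count=4, cores_per_node=8)
small_nodes = loss_from_one_failure(node_count=32, cores_per_node=2)

print("8-core nodes:   lose", big_nodes[0], "cores,", f"{big_nodes[1]:.1%} of the cluster")
print("dual-core nodes: lose", small_nodes[0], "cores,", f"{small_nodes[1]:.1%} of the cluster")
```

Under these assumed sizes, one dead behemoth takes a quarter of the cluster with it, while one dead ant takes about three percent.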

HPC: It Is All About You
Wednesday, January 9th, 2008

Trying to get a handle on the HPC community and market has always been difficult. The professional market forecasters seem to paint a rosy picture for HPC over the next five years. But, when I talk to people in the market, I often get a different take. For this reason, I like to ask people about how they use clusters and in particular I like to ask them about their pain points (or challenges if you prefer a more positive spin).

To help get a feel for the community’s take on the state and future of HPC, the Today’s HPC Cluster site has a set of polls. I should mention that I don’t consider casual Web site polls to provide a definitive source, but rather a general feel for the issues. In addition, I like to see at least 100 responses to a poll.

As the current polls have been up for a while, I thought I would check the results (if you vote you see the results). The front page poll has two questions about physical issues and staffing. The response rate was within my “well I guess that is enough to draw a general conclusion” range.

Then I clicked over to the other sections, Building Clusters and Managing Clusters, and to my surprise, the polls on these pages had very low response rates. Indeed, the response rate drops off considerably as you move to the center tab and then to the right tab.

Before I go into some kind of rant and lose my audience, please go to the other two main pages Building Clusters and Managing Clusters and take the polls.

There are a total of six poll questions (two on each main page). I thought the questions were interesting, not because I thought them up, but because no one ever seems to ask these kinds of questions. I broke one of my cardinal grammar rules for the sake of space on a Web page! I ended two questions with prepositions, for crying out loud. I must be getting weak.

The lack of response posits several scenarios. My list is below. I am sure there are other reasons, but these are the ones that I will start with. (Did you catch that one?)


  • Visitors only read the stories on the first/front page.

  • Visitors can’t find the other pages.

  • Our front page gets its 30 seconds of web eyes and clicks and visitors move on.

  • The stories on the other pages suck, although the comments don’t seem to support this one.

  • Visitors don’t care about managing or building clusters.

  • The polls are assumed to be the same, so they are ignored.

  • The important questions are not in the polls; questions like “Do you think Britney will be on Dr. Phil?” are missing.

In any case, if you have comments, thoughts, or ideas about polls, article topics, the Web site, or Dr. Phil, add a comment below. It is all about you and building a resource to support this community.

I hope, after this polite prodding, we’ll get some more votes in the polls. Then, just maybe we can all get a better sense of where we are and where we need to go. If we scream loud enough, perhaps a vendor will notice and offer a product.

Or, if enough people share a common challenge, then maybe they can work together toward a solution. And remember, your input to the polls is very important. As I always say, “If you want to know how deep the manure is, ask the guy wearin’ boots.”

Adding Virtualization To My Anxiety Closet
Tuesday, January 8th, 2008

This month, I thought I’d take a break from my usual multi-core ranting and talk about another new technology that’s starting to give me fits. No, it’s not the somewhat confusing General Purpose Graphics Processing Units (GP-GPUs) acronym, which is not to be confused with the GPU that has something to do with putting colored dots on your screen. No, the thing that’s raising my dander is the virtualization thing, or “virt” for short.

Now, virt is a good idea, even if it’s become an annoying buzzword of late. It kind of makes sense with all the extra cores and all. And while I’m not expert enough to talk about how virt will play out in the larger server market, I do have some thoughts about virtualization in high-performance computing (HPC).

Before Virt Came to Town

Back in the good old days of single-core processors, when HPC clustering was in its infancy, getting the application as close to the hardware as possible was very important. In many cases, it still is. Communication between nodes can take place through the operating system using TCP/IP, or outside the operating system using a user space zero-copy protocol. With the exception of pinning down memory, a user space protocol removes the operating system from the communication, resulting in better latency and often better application performance.

If your application is sluggish over Fast Ethernet and TCP, you must get closer to the hardware and use a specialized interconnect like Myrinet. All high performance interconnects are as close as possible to the hardware. Indeed, the hardware actually assists in the communication.

Let’s recap how programs operate on a cluster. A parallel MPI program is essentially a collection of individual programs (processes) running on the same node or on different nodes. Messages are essentially a way to copy memory from one program to another. Programs are “placed” on specific idle nodes by a scheduling program (such as PBS, Sun Grid Engine, or Torque).

The programs are managed by the all-knowing scheduler and once a program is placed on a node, it remains there until complete. Communication takes place over an interconnect, but as far as the scheduler is concerned, what you do with your nodes is your business. This model has worked well for HPC clustering, although it is not the only scheme available.

The other important issue is that MPI or other parallel applications must be explicitly written as such. Communication between processes is what makes a program “parallel”. The goal is to make your program run faster.

R.I.P. OpenMosix

Recently, the OpenMosix project (http://openmosix.sourceforge.net/) announced it was shutting down. For those that don’t recall or have never heard of OpenMosix, it was an open implementation of the Mosix software originally developed by Amnon Barak. Mosix team member Moshe Bar started and led the OpenMosix effort. And what is Mosix? Very simply, it’s a method to migrate running processes to other computers.

For instance, if you had two computers running Mosix (or more properly, a Mosix-modified kernel), and one computer noticed that its load was very high, it could transparently migrate a process to the less-loaded node.

Now think of Mosix running on a cluster of nodes. A user logs into the head node, starts jobs, and as the load increases, the jobs migrate off the head node to less loaded nodes in the cluster. To the user, the job still looks as if it’s running on the head node, as it appears in the process table and can be manipulated as such.

The unique feature of Mosix is the ability to migrate jobs transparently. Mosix contained heuristics (rules) that determine when and where to move a process. A process could be moved several times during its execution as a means to equalize the workload among all nodes. In short, Mosix and OpenMosix are able to make a collection of nodes look like a big fat SMP machine, or as it is often called, a Single System Image (SSI). (There are other SSI efforts as well. If you are interested, have a look at OpenSSI, found online at http://wiki.openssi.org; Kerrighed, hosted at http://www.kerrighed.org; and Scyld, found at http://www.penguincomputing.com. Each does user-directed process migration as well.)

There are some issues with Mosix migration, however. Things like I/O and threads make migration difficult or in some cases impossible.

  • Jobs requiring I/O are often returned to the head node to access I/O directly. Recent versions of Mosix are considering global file systems to help solve this problem.

  • Jobs that are threaded cannot be migrated because managing threads across multiple nodes requires fine-grained synchronization (for instance, shared memory access) that would lead to large inefficiencies.

It is also possible for some MPI versions (and PVM) to work under Mosix, provided they use the kernel for communication. Migrating user space communication is not so clear as migration is kernel-directed. If you are bypassing the kernel - well, you’re kind of asking for trouble when Mr. Migrater comes to visit.

The point about Mosix is that process migration is done dynamically with no user control. There is no need for a global scheduler, as Mosix takes care of load balancing. Compared to a traditional cluster where processes are placed in queues and eventually nailed to specific nodes, Mosix kind of dumps all the processes in a bucket and sorts things out at run-time.
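To give a feel for what “sorts things out at run-time” could mean, here is a toy load balancer. To be clear, this is not the actual Mosix heuristic, which lived in the kernel and weighed far more than a process count; the function names and the threshold rule are assumptions made purely for illustration.

```python
def migrate_once(loads, threshold=1):
    """Move one process from the busiest node to the idlest node.

    `loads` is a list of per-node process counts, modified in place.
    Returns True if a migration happened, False if the cluster is
    already balanced to within `threshold`.
    """
    busiest = max(range(len(loads)), key=loads.__getitem__)
    idlest = min(range(len(loads)), key=loads.__getitem__)
    if loads[busiest] - loads[idlest] <= threshold:
        return False
    loads[busiest] -= 1
    loads[idlest] += 1
    return True

def balance(loads, threshold=1):
    """Repeat single migrations until no node is overloaded; return (loads, moves)."""
    loads = list(loads)  # work on a copy so the caller's list is untouched
    moves = 0
    while migrate_once(loads, threshold):
        moves += 1
    return loads, moves

# Eight jobs started on the head node of a hypothetical four-node cluster.
final_loads, moves = balance([8, 0, 0, 0])
```

Notice that nobody told any job where to run. The user dumps everything on the head node and the rule spreads it out, which is the bucket-versus-queue contrast in miniature.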

While OpenMosix is used on many clusters, OpenMosix doesn’t make your applications run in parallel. That task is up to the programmer. According to project leader Moshe Bar, the advent of multi-core and virtualization has reduced the need for OpenMosix. Where once OpenMosix could unify sixteen nodes into a low-cost SMP resource, multi-core has been doing this for the past year. Today, servers with eight cores are not uncommon or that expensive. In the near future, sixteen cores will be the norm.

Virtualization Joy

So what does all this have to do with virtualization? Nothing. Well almost nothing. Virtualization is about moving operating system instances around. Cluster parallel computing is about placing or moving processes on or to nodes.

Yet interesting things happen when you mix the two concepts.

Virtualization allows the operating system to run on top of a “Super OS” or a hypervisor. This standardization or abstraction of the hardware presents some interesting possibilities. It allows multiple operating system instances to run at the same time on the same processor.

Remember all those cores you have? Using virtualization on such a machine makes sense if you want to run different distributions at the same time. Maybe you want to run Red Hat and Novell/Suse on the same machine. Or maybe you want to create a sandbox machine that runs a new or different kernel version. Or maybe you are selling co-location space and you want to charge each customer for an individual machine. Of course, you, the clever one, just bought an eight-core server and have eight virtual machines running on it. If a customer wants to reboot their virtual machine, no problem. You just saved a bundle on power and hardware costs.

Because the hypervisor insulates the operating system instance from the hardware, it allows the instance to be migrated from one hypervisor to another. Think about this for a minute. Need to take down a server to fix/add a hard drive? No problem. Start a new virtualization server, move the operating system instance(s) to the new server, fix the box, then move the instance(s) back. The operating system instance has no idea that this is happening. It is almost like the hypervisor is a light bulb socket. An operating system instance, like a light bulb, can vary in different ways, wattage or color, but all light bulbs fit into the same socket. So if you need to move a light bulb, you can do it easily.

In the virtualization world, there are two ways to move a light bulb. The first way is to turn it off, unscrew it, maybe even wait a bit, and then put it in a new socket. The other way is to move it while it’s on. (Stick with me - this is a thought experiment.) If you unscrew the bulb really fast, and put it in the new socket before the electrons stop moving (I did say “really fast”), no one notices the light going out.

What I’ve described are the two forms of migration that are available with virtualization. The first is kind of a “halt and move” while the second is live migration. There are practical applications for both. Halting an operating system instance, let’s call it check-pointing, allows the current image to be suspended and preserved until it is restarted. Live migration allows real-time movement of the operating system image. I don’t have space to go into all the interesting ways this could be used, so I invite you to let your mind wander a bit.

Done wandering? Just in case, here comes some cold water. For virtualization to work, it must add an abstraction layer over the hardware. If you are a true cluster geek, an alarm should have just gone off in your head. Layers add overhead. HPC requires minimal overhead. There is a cost for virtualization goodness. In addition, migrating a single operating system instance is a neat trick. Migrating operating system instances that are synchronously or asynchronously communicating adds another layer of difficulty.

Virtual HPC

Even with the overhead issue, there is still something alluring about virtualization in HPC. If you think about it, what is running on a cluster compute node? Well, let’s see: You have your MPI process(es), the operating system, and hopefully nothing else. If you provision your cluster in an efficient manner, the operating system instance should be pretty minimal and maybe even live in a RAM disk. Not that much to migrate.

The Mosix and OpenMosix approach heavily modified the kernel to allow process migration. With virtualization, the kernel still needs some modification, but it too is scooped up in the migration. This approach could be valuable to HPC in a number of ways:

  • Check-pointing node instances is one possibility. Just dump each operating system instance to disk.

  • Similarly, whole cluster hogging jobs could be swapped out of the cluster and run at a later time or on another cluster.

  • Another possibility is running unique node instances. Suppose one of your codes requires a specific kernel, libc, or distro version not used on the other nodes. No problem, start the specific operating system instance you need on the nodes you need.

  • Schedulers could be crafted to work with operating system instances, migrating an instance to help load balance the cluster in real time.

It all starts to sound a little crazy at some point. The thought of moving a large N-way MPI code from one cluster to a hard drive (or at some point a USB stick), then on to another cluster definitely makes my head spin.

As you will keenly note, I intentionally glossed over many details and looked at the big picture. Details, I’ve learned, are important in the end and make a nice home for the devil.

Virtualization is expensive in overhead and still maturing. What we do know is that processors and networks will get faster, cores will multiply, and memory sizes will grow. Approaches that are overhead-intensive and expensive today may not be as costly tomorrow. Virtualization may act as a kind of glue that brings Mosix-like migration to MPI programs.

But then, my head still spins when I wonder how it will all work out in the end.

Go Ask ALICE
Wednesday, January 2nd, 2008

It’s time for the yearly batch of retrospectives and predictions. Count me in! Let’s see, the big thing of 2007? Well, that had to be multi-core. And, the big prediction for 2008? Why that would be multi-core, once again. There, I’m done. Enjoy your year.

I’m not really done… Being a “God does play dice” (and pinball) kind of guy, I find predicting the future to be a somewhat worthless endeavor.

Allow me to elaborate. Most predictions are based on extrapolating a line of some sort. If you extrapolated the game console line, the PS3 and Xbox 360 were destined to crush the Wii. Similarly, Microsoft’s Vista was going to usher in a new era of computing and those $200 Linux desktops from Wal-Mart were never going to sell.

Because the universe is non-linear and pretty much random, one never really knows what the future holds. New developments usually come from outside the box because everyone else is sitting inside the box drinking the market driven Kool-Aid.

Therefore, instead of trying to make some lucky linear predictions to prove I’m some kind of oracle, I’ll make some wild guesses. If they turn out to be wrong, then I can say I told you so. If they are right, then I suppose I got lucky with the dice. So here goes.

First, I believe the time is right for a personal cluster. I’m not talking about a desk-side box with built-in coffee maker, but one that sits on your desk. Why a cluster design and not an SMP box? For one, I believe that the same commodity economics that pushed rack mount clusters past big SMP systems will translate to the desktop. In addition, the cluster approach provides redundancy and power management opportunities not available with the SMP approach.

Second, a desktop cluster will provide one GFLOP of HPL performance for less than fifty dollars. That is, expect to achieve 50 GFLOPS or more running HPL for a cost of about $2,500. And if you want to spend a little extra, you will probably have a terabyte of RAID1 storage.

Third, in addition to HPC, the desktop cluster will be used for (sit down for this one) Artificial Intelligence (AI). That is right, HAL 9000 is on the way. Of course AI has been oversold in the past, but it has made steady progress. Clusters are great for genetic algorithms, speech recognition, machine learning, neural nets, and all sorts of AI-type things.

And there is a need for AI. The more connections we have, the more simultaneous conversations we must maintain. The bigger the World Wide Web, the harder it is to find the haystack with your needle in it. AI offers a way to manage your corner of an expanding hyperspace your way.

Still not sure about how AI is going to show up in our ever-growing connected world? Go ask ALICE. Not everyone needs to fold proteins or predict the weather on their desktop just yet, but everyone needs a friend. Welcome to 2008.

Parallel Programming is No Cheap Date
Wednesday, December 19th, 2007

As 2007 fades away, I thought I would reflect on some of the HPC events of the last twelve months. Of course there was plenty of news, new products, mergers, acquisitions, and all the other kinds of normal stuff that one would expect from any market. Having thought about it, though, nothing really stands out in my mind as a big breakthrough or new paradigm-shifting technology.

Of course when you read the press releases they all seem to imply that the world will never be the same when you use some latest and greatest gizmo, software, or service.

Before the people who have worked so hard in various corners of the market say bad things about me, let me first say, Thank You. In my opinion, we are further along than we were last year at this time due to your efforts. I include everyone from the Linux driver writers, to the guys taping out multi-core processors. We are moving forward. This past year you could buy more FLOPS (and use less power) for your dollar than in the past.

My disappointment is based on my belief that HPC is still hard and will continue to be this way until we figure out how to create cost-effective turn-key software for this market.

The hardness is in part due to parallel programming and the multitude of ways in which one can decide to express the parallel execution in their code. Multicore is forcing the issue in the mainstream markets, but there still does not seem to be a single hilltop where one can plant their flag and say, “This is it, We start from here. And, our efforts will not be wasted.”

Perhaps I suffer from a bad case of wishful thinking. Or just maybe, the lack of killer applications in the HPC market is due to the lack of a clear direction (or two, or even three).

Of course, using MPI (Message Passing Interface), OpenMP, or pthreads, is a perfectly viable and hard way to create parallel codes. Indeed, the recent launch of our Multicore Cookbook is designed to help programmers get started quickly with multicore projects. I’m looking for something a bit easier, however. A solution that sits above the low level minutia of most parallel applications. And, yes I do have hope.

A recent IDC report covered the HPC server market growth in the third quarter of 2007. The most interesting finding was that over the last year or so, non-HPC server growth has been slowing down. If it were not for HPC systems, the entire server market would be shrinking. According to IDC, the HPC market has seen revenue growth of 20% over the last four years.

Revenue from clusters represented 68% of the overall HPC server revenue for the third quarter of 2007. The fastest-growing area is the work group segment for systems priced under $50,000, which is projected to have 11.4% CAGR through 2011.
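For anyone who does not juggle compound annual growth rates (CAGR) in their head, a short sketch shows what an 11.4% rate compounds to. The starting value of 100 and the four-year horizon (2007 through 2011) are assumptions for illustration, not IDC figures.

```python
def project_with_cagr(start_value, cagr, years):
    """Compound `start_value` at a constant annual growth rate for `years` years."""
    return start_value * (1.0 + cagr) ** years

# Index the segment's 2007 revenue to 100 (an assumed baseline, not an IDC number).
for year in range(5):
    print(2007 + year, round(project_with_cagr(100.0, 0.114, year), 1))
```

Under those assumptions the segment ends up a bit more than half again as large by 2011, which is healthy growth for systems priced under $50,000.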

One of the reasons for such growth was the lower entry prices that make HPC systems affordable for smaller organizations and business units. HPC is headed for the desktop and there is a huge opportunity for those who make it easy. Microsoft will have a play here as they seem to own the desktop at this point.

They also seem to recognize that they will have to solve the same problem we all face: parallel programming. Open software has a huge opportunity here as well. Open (source) efforts have proved to be an effective way to focus the best and the brightest on problems that are too costly for a single entity to fund.

The Linux kernel is a perfect example of this kind of corporate cost sharing (without lawyers). And, in a similar vein, because parallel programming is no cheap date, cost sharing seems like a good idea. Something to consider as 2008 unfolds.

In closing, when I think about 2007 a Russian saying comes to mind, “I wish things were better, but I am glad they are not worse.” Enjoy your holidays. We have more work to do next year.

The Cost of Multi-core: Faster is as Faster Does
Tuesday, December 11th, 2007

With all due respect to Forrest Gump, defining fast is becoming a bit harder these days. And, yes, it has to do with multicore. There certainly is no argument that a faster execution takes less time. For instance, if my program ran in 10 minutes on the old processor and now runs in five minutes on the new processor, I would be pleased. Now let’s see how this plays out in multicore land.

First, let’s assume your code(s) are “multicore” ready as they may be written using threads, OpenMP, or even MPI. Second, you’re in the market for new compute nodes and a decision must be made about the type of processor. Should you buy a larger number of faster dual-core nodes, or a smaller number of slightly slower quad-cores? (Note: more cores create more heat, so quad-core processors tend to come in at lower frequencies than dual-cores.) Of course, we know it all depends on the application.

In the end, the application performance is easily measured, so a little benchmarking is in order. Let’s further suppose the following performance. Running your code on one dual-core, you find you can get a 1.8 times speed up over a single core. If you then run the same code on a quad-core, you find the speed up is 2.2 times that of a single core. So which is faster? Of course the quad-core is faster, but some would suggest that you should be getting close to four times the performance since you’re using a quad-core system.

In a sense, the quad-core is faster, but less efficient as you are not achieving full performance. HPC users prefer linear speedup. If one doubles the number of cores, and the program only speeds up by 20% or so, then it seems like your scalability is leveling off. In the case of the dual vs. the quad, one might conclude that the quad is a little faster, but many of the extra cores are wasted because they are not being used, so I’ll stick with the dual-core solution.
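The dual-versus-quad argument comes down to parallel efficiency, which is just speedup divided by the number of cores. Here is a minimal Python sketch using the supposed speedups from above (the function name is mine, purely for illustration):

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / cores

# Speedups over a single core, as supposed in the text.
dual = parallel_efficiency(1.8, 2)
quad = parallel_efficiency(2.2, 4)

# Extra speedup gained by doubling the core count (dual -> quad).
gain = 2.2 / 1.8 - 1.0

print(f"dual-core efficiency: {dual:.0%}")
print(f"quad-core efficiency: {quad:.0%}")
print(f"gain from doubling cores: {gain:.0%}")
```

The quad-core wins on raw speed but delivers a much smaller fraction of its theoretical peak, which is exactly the tension described here.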

Of course, achieving peak performance is the desired goal, but how often is every part of a single core running at its peak rate? Do all your codes keep the floating point unit and the integer units busy at the same time?

It is reasonable to expect some codes can use four cores more effectively than other codes, just as some codes can use a single core more effectively than others. If we forget about the number of cores for a moment, isn’t the processor that runs 2.2 times faster the better of the two? Assuming you may spend about the same amount of money for processors, why should we care if it has four, eight or one hundred cores, as long as our code runs faster than before?

Perhaps we need to stop thinking about cores as though they are “nodes,” or extra processors, and focus on system-wide performance increases.

Fair enough, but there is one other issue that keeps cropping up — software licensing. In the past, most commercial software vendors licensed programs on a per CPU basis. When multicore hit the market, some commercial vendors decided to license per core. This scheme makes sense as multiple cores look like discrete processors to the OS. Some vendors have continued to license software on a per-socket basis. In those cases, a higher core to socket ratio may make more sense. There are also a number of methods used by companies like IBM and Oracle that try and land somewhere in the middle accounting for both cores and sockets. Other than customer confusion, most of these efforts really don’t seem to be addressing the issue. Of course, we have not even brought virtualization into the discussion.

It should also be noted that, either way, proprietary license fees often outweigh the cost of the hardware by far. If you decide to use quad cores, and your commercial application that’s licensed per-core is only showing a three times speed-up, then you have effectively wasted the license fee for the fourth core. In this case, it may be better to use dual-cores because even with lower aggregate performance, their core utilization is higher.
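To put rough numbers on the licensing argument, here is a small Python sketch. The per-core fee is a made-up figure and the helper name is hypothetical; the speedups are the three-times quad-core case above and the 1.8 times dual-core case from the earlier example.

```python
def cost_per_unit_speedup(fee_per_core, licensed_cores, speedup):
    """License dollars paid for each 1x of single-core performance."""
    return fee_per_core * licensed_cores / speedup

FEE = 1000.0  # hypothetical per-core license fee, in dollars

# Quad-core with a 3x speedup: four licenses buy three cores' worth of work.
quad_wasted = 4 - 3.0
quad_cost = cost_per_unit_speedup(FEE, 4, 3.0)

# Dual-core with a 1.8x speedup: less performance, but less waste.
dual_wasted = 2 - 1.8
dual_cost = cost_per_unit_speedup(FEE, 2, 1.8)

print(f"quad: {quad_wasted:.1f} licenses idle, ${quad_cost:.0f} per 1x")
print(f"dual: {dual_wasted:.1f} licenses idle, ${dual_cost:.0f} per 1x")
```

Under these assumptions the dual-core costs less in license fees per unit of delivered performance, even though the quad-core finishes sooner.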

It seems expectations of performance may not match the economics of software licensing. At the moment, there do not seem to be any answers to these and other issues surrounding multicore. The best advice for now: just remember, “Multicore is like a box of chocolates. You never know what you’re gonna get.”

MPI on Multicore, an OpenMP Alternative?
Tuesday, December 11th, 2007

No matter how you cut it, coding for multicore is really just parallel programming. Once you’ve realized that, it’s time to look at the options, whether your existing codebase will scale, or if you need to rewrite your code and how.

As stated in The Multicore Programming Challenge, parallel programming can be difficult. It moves the programmer closer to the hardware and further from their application space or problem. Fortunately, people like rocket scientists have been writing parallel software for quite some time in the HPC (High Performance Computing) sector.

As any good programmer knows, an existing code base can be valuable to current programming projects. First, the possibility of re-using existing code is a major incentive. Also, learning how someone else attacked a similar problem is very valuable.

In the HPC sector, most parallel programs are written using Message Passing Interface (MPI). While MPI is normally used on large computing systems (clusters), it can also be used on a multicore processor. The “MPI proposition” may seem counter to conventional wisdom as MPI was designed for distributed memory (i.e. each core/processor has its own private memory), whereas OpenMP was designed for shared memory.

The lazy assumption suggests that OpenMP is a better solution because it was designed for shared memory. However, the possibility of re-using an existing MPI code base is worth considering before you spend a month (or months) re-inventing the software wheel. Ultimately, the question is really about efficiency. Namely, how does the performance of MPI compare to OpenMP on a multicore system?

The answer to this question is important. If I can re-use MPI codes that work well enough on multicore, then there is no need to (re)write my application using OpenMP. If, on the other hand, OpenMP or threads provide scaling benefits sufficient to justify re-writing the code, then investing the time in re-coding might be in order.

Although your application(s) are always the ultimate test of hardware, a comparison of the same program written in MPI and OpenMP would be interesting. Fortunately for us, the people at NASA (the rocket science guys) have an interest in such things as well. The venerable NAS Parallel Suite is now available in MPI, OpenMP, Java, and HPF.

This enhancement means a head to head comparison of MPI and OpenMP is possible. (I’ll leave the Java and HPF runs as an exercise for the reader). Before we get to the main event however, some background on how OpenMP and MPI differ may be helpful.

OpenMP and MPI Primer

Because native Pthread programming can be cumbersome, a higher level of abstraction has been developed called OpenMP. As with all higher level approaches, OpenMP sacrifices flexibility for the ease of writing code. At its core, OpenMP uses threads, but the details are hidden from the programmer.

OpenMP is implemented as compiler directives in program comments. Typically, computationally heavy loops are augmented with OpenMP directives that the compiler uses to automatically “thread the loop”. This type of approach has the distinct advantage that it may be possible to leave the original program “untouched” (except for comment-directives) and provide simple recompilation for a sequential (non-threaded) version where the OpenMP directives are ignored. (Read the OpenMP Web site to get the complete picture.)

For those who don’t follow software trends, but instead rely on the crack Linux Magazine columnists to provide them with all the important advances, GCC 4.2 (and later) has support for OpenMP. This is important for the open source crowd, because OpenMP was only available in commercial compilers before GCC 4.2 was released.

GCC 4.2 has not found its way into all distributions, so you may need to download and build it from source if you want to play along with this article. Of course if you have a commercial compiler, it probably already has OpenMP support.

For gcc and gfortran, OpenMP programs can be compiled by including the -fopenmp option. In order to test this new capability, I found an OpenMP version of the ubiquitous matrix multiplication program. I built two versions of the program, one with OpenMP enabled and one without:

$ gfortran -fopenmp -o matmult_omp matmult.f
$ gfortran -o matmult matmult.f

Then I ran the sequential version on an Intel Core 2 Duo system (two cores):

$ time ./matmult

real    0m9.079s
user    0m8.988s
sys     0m0.012s

The OpenMP version was run as well. Note that there is an environment variable called OMP_NUM_THREADS that will tell OpenMP binaries how many threads to use. If this is not defined, one thread per CPU (core) is used. Ultimately, however, the maximum number of threads may be defined by the program. The OpenMP results for two cores are shown below.

$ time ./matmult_omp

real    0m4.967s
user    0m9.783s
sys     0m0.018s

The OpenMP version reduced the wall clock time by forty-five percent. Astute readers may be wondering why the user time is almost double the real time. This effect is due to using two cores, i.e. your total CPU time is the sum over the cores your application uses. As we will see below, the user time can be quite a bit higher than the real time for eight cores.
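The arithmetic behind those statements is easy to check. A quick Python sketch using only the timings printed above (the variable names are mine):

```python
# Timings from the two runs above, in seconds.
seq_real = 9.079
omp_real, omp_user = 4.967, 9.783

reduction = 1.0 - omp_real / seq_real   # fraction of wall clock time saved
speedup = seq_real / omp_real           # sequential time over parallel time
busy_cores = omp_user / omp_real        # average number of cores kept busy

print(f"wall clock reduction: {reduction:.0%}")
print(f"speedup: {speedup:.2f}x")
print(f"cores kept busy: {busy_cores:.2f}")
```

The user-to-real ratio comes out just under two, which is what you would expect when both cores of the Core 2 Duo are working for nearly the whole run.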

In contrast to OpenMP, MPI uses a software library to send data from one process to another. Each process has its own memory space and thus MPI is basically a message copying methodology. In addition, MPI makes no distinction where a process runs. It can run on the same machine or on another machine. If one were to time an 8-way OpenMP and MPI program, the following would result (OpenMP is run first.):

$ time bin/cg.B
real    1m11.735s
user    9m23.287s
sys     0m2.012s

$ time mpirun -np 8 bin/cg.B.8
real    1m16.138s
user    0m0.000s
sys     0m0.004s

In the first case, OpenMP shows a real time of about one minute with user time of almost 9 and a half minutes indicating a good speed up. In the second case, the MPI run shows a comparable real time, but zero user time. This result is easily understood in terms of how MPI jobs are run. The mpirun command starts each separate MPI process and then waits until they are finished, thus no user time. OpenMP jobs, however, share a process space which makes them tractable to the OS.
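If you want to turn the output of time into numbers, a few lines of Python will do. This is a minimal sketch; the to_seconds helper is hypothetical and only handles the minutes-and-seconds format shown above.

```python
import re

def to_seconds(stamp):
    """Convert a time-style stamp such as '9m23.287s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Values copied from the two eight-way runs above.
omp_real = to_seconds("1m11.735s")
omp_user = to_seconds("9m23.287s")
mpi_real = to_seconds("1m16.138s")

cores_busy = omp_user / omp_real        # average cores kept busy by OpenMP
mpi_penalty = mpi_real / omp_real - 1   # how much longer the MPI run took

print(f"OpenMP kept {cores_busy:.2f} cores busy on average")
print(f"MPI wall clock was {mpi_penalty:.1%} longer")
```

The same ratio cannot be computed for the MPI run, because the user time reported for mpirun does not include the work done by the processes it launched.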

The Process View

While we are talking about OpenMP and MPI, there’s one big difference between these programming methods in terms of the OS process space. OpenMP programs run as a single process and the parallelism is expressed as threads. This behavior can be viewed quite clearly when using an eight core server (two quad-core processors). For instance, examining a running OpenMP program using top shows only a single process running. (See Figure One)

Figure One: OpenMP program (cg.B) running on eight cores.

In contrast to OpenMP, MPI actually starts one process per core using the mpirun -np 8 ... command. This situation is shown in Figure Two where an MPI version of the same program is now running. Note the number of processes is now eight. The processor (core) loads are about the same for both, however.

Figure Two: MPI program (cg.B.8) running on eight cores.

One final and subtle point. In OpenMP, communication is through shared memory, which means threads share access to a memory location. With MPI programs on SMP systems, communication is also through shared memory, but processes send messages by writing from private to shared memory.

Obviously, sharing memory locations seems more efficient than sending copies of memory locations to other processes, but it all depends. In the MPI process model, single processes have exclusive access to all their process memory. For some programs this situation may be more efficient because it is better to copy data (send a message) than to wait for shared memory access. On the other hand, in the OpenMP model, threads can share access to all memory in the process space. In this case, some programs may be much more efficient as the large overhead of copying memory is not needed.
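The two communication styles can be mimicked in a few lines of Python. To be clear, this is only an analogy: both halves use Python threads from the standard library as stand-ins (real MPI ranks are separate processes), and it says nothing about performance. It just shows workers reading one shared array in place versus workers that receive a private copy and send a result back.

```python
import queue
import threading

WORKERS = 4
data = list(range(1_000))

# Shared-memory style: each thread reads `data` in place and writes
# its partial sum into its own slot of one shared list.
partial = [0] * WORKERS

def shared_worker(rank):
    partial[rank] = sum(data[i] for i in range(rank, len(data), WORKERS))

# Message-passing style: each worker waits for a private copy of its
# chunk to arrive in an inbox, then sends its result back as a message.
inboxes = [queue.Queue() for _ in range(WORKERS)]
results = queue.Queue()

def message_worker(rank):
    chunk = inboxes[rank].get()   # receive a private copy
    results.put(sum(chunk))       # send the partial sum back

threads = [threading.Thread(target=shared_worker, args=(r,)) for r in range(WORKERS)]
threads += [threading.Thread(target=message_worker, args=(r,)) for r in range(WORKERS)]
for t in threads:
    t.start()
for rank in range(WORKERS):
    inboxes[rank].put(data[rank::WORKERS])   # the slice is the copied "message"
for t in threads:
    t.join()

shared_total = sum(partial)
message_total = sum(results.get() for _ in range(WORKERS))
print(shared_total, message_total)
```

Both totals agree. The difference is where the data lives while the work is being done, which is the trade-off described above.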

Looking at the Numbers

An eight-core Intel server (two four core Clovertown processors) was used to run the tests. The OpenMP tests used gcc/gfortran version 4.2. The MPI tests used LAM version 7.1.2. The OpenMP and MPI suites have six programs in common and each of these was run five times and averaged (Class B problem sizes were used). The results are given in Mops (million operations per second) in Table One. The percent difference is also shown.

Test   OpenMP (gcc/gfortran 4.2)   MPI (LAM 7.1.2)   Percent Difference
CG           790.6 *                    739.1               7%
EP           166.5 *                    162.8               2%
FT          3535.9 *                   2090.8              69%
IS            51.1                      122.5 *           139%
LU          5620.5 *                   5168.8               9%
MG          1616.0                     2046.2 *            27%

Table One: Results for the OpenMP/MPI benchmarks in Mops (the winning result in each row is marked with *)
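The article does not spell out how the percent difference was calculated. A reasonable guess, and it is only a guess, is the faster result divided by the slower one, minus one. This Python sketch applies that assumed definition to the Mops figures in Table One:

```python
# Mops figures from Table One: (OpenMP, MPI).
results = {
    "CG": (790.6, 739.1),
    "EP": (166.5, 162.8),
    "FT": (3535.9, 2090.8),
    "IS": (51.1, 122.5),
    "LU": (5620.5, 5168.8),
    "MG": (1616.0, 2046.2),
}

def percent_difference(openmp, mpi):
    """Assumed definition: how much faster the winner is than the loser."""
    fast, slow = max(openmp, mpi), min(openmp, mpi)
    return (fast / slow - 1.0) * 100.0

for test, (openmp, mpi) in results.items():
    winner = "OpenMP" if openmp > mpi else "MPI"
    print(f"{test}: {winner} ahead by {percent_difference(openmp, mpi):.1f}%")
```

With this definition every row lands within a percentage point of the published column, so it is at least consistent with the table.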

Tests CG and EP are about the same. Indeed, EP is a good check as both methods should produce a similar result because there is very little communication. OpenMP is the clear winner with FT performance, but MPI does surprisingly better with the latency sensitive IS benchmark. In the fifth test, OpenMP does best with the LU benchmark, while MPI does best with MG. Overall the comparison is a bit of a draw.

The results are clear on one point: there is not a definitive winner in this match-up. This result may come as a surprise to those who would assume OpenMP would easily beat MPI on a multicore machine. (Or any SMP machine for that matter.) Maybe MPI is good enough to stand toe-to-toe with OpenMP for many applications.

In only one case (FT) did OpenMP run away from MPI. In other cases, MPI was a clear winner, and taking the time to convert your code to OpenMP would actually result in a performance loss. The story is far from over; more benchmarks are in order using other hardware and commercial compilers.

Other Things to Consider

Getting back to our question, “do I need to re-code my MPI programs for these multicore thingies?,” the answer is a resounding maybe not. MPI may just be good enough in many cases. Again, more data, and results for your application are needed for more solid recommendations.

Another important question to ask is how scalable your application is. As more processors are added, parallel execution will always hit a point of diminishing returns. This situation means that creating more threads or processes will not improve performance and it may actually hurt performance. The size of your data set may also come into play. One of the advantages of distributed MPI programs is the ability to distribute large data sets over many processors thereby solving problems that would never fit in an SMP memory space.
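One common way to picture the point of diminishing returns is Amdahl's law, which the article does not mention by name, so treat this as my illustration rather than the author's method. The 90% parallel fraction is a made-up figure.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when only part of the run time parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Hypothetical code in which 90% of the run time can be parallelized.
for cores in (1, 2, 4, 8, 16, 64):
    print(f"{cores:3d} cores -> {amdahl_speedup(0.9, cores):.2f}x")
```

Each doubling of cores buys less than the one before, and the speedup can never exceed ten times for this example. Note that this simple model only flattens out; it does not capture the case where extra threads or processes actually slow things down.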

If you’re considering writing a new application from scratch, the choice of OpenMP or MPI includes other considerations. OpenMP is designed for shared memory (SMP) machines. As multicore continues to grow, the number of processors on an SMP will continue to grow, but OpenMP is not designed to run across multiple machines like MPI.

If you want your application to be portable on clusters and SMP machines, MPI might be the best solution. If, however, you do not envision using more than eight or sixteen cores, then OpenMP is probably one of your best choices if the benchmarks point in that direction. From a conceptual standpoint, those with experience in both paradigms state that OpenMP and MPI present a similar learning curve and nuance level. There are no shortcuts or free lunches with OpenMP, or MPI for that matter.

Dodging The Boot Heel of Technology
Wednesday, November 28th, 2007

Earlier this year, my wife and I decided to finish our basement. Great! I thought I could finally set up a real office. I had just one thing to do before my office would become a reality — deal with the ghosts of systems past.

Over the years I have collected a substantial pile of old hardware. Vintage Pentium II servers in 2U cases, and workstations in big desk-side cases. At one point I was sure I was going to build a cluster out of these systems, but somehow I never got around to it. When you have a big empty basement, why not fill it with old hardware and build a cluster? Isn’t that what people do these days?

In any case, these systems were eBay orphans — not even worth the shipping cost, not worth testing, etc. I had checked into donating them, but they were just a little too old and slow. In addition, the systems were “white box,” unbranded systems. (Why are they still called “white box” systems when they’re in black cases?) The only solution was the recycle bin, but that meant separating the metal from the electronics. Not a problem, with a few screw drivers, some wire cutters, and the help of my teenage daughter, we quickly had a pile of metal and a pile of components.

While we were dismantling these systems, I started thinking about the fact that these systems were pricey and fast not so long ago, and now they were getting pulled apart like cheap toys headed for the trash heap.

I guess that’s the way of cluster technology. The new generation processor forces multiple old ones to become obsolete. A single new quad-core processor can easily replace four older processors, motherboards, power supplies, etc., and use less power! Out with the old and in with the new. Like most old computer hardware, the boot heel of technology finally landed on these systems.

But will the same thing be true of the bright and shiny multi-core systems we’re buying today?

Let’s look at a hypothetical situation. The systems I recycled had 400MHz Pentium II processors (single core). Of course, one of today’s 2.3 GHz dual-core processors is running at a clock speed that is nearly six times faster. When it comes time to get the latest and greatest, will the future bring a 14 GHz dual core processor or will it bring a 3.0 GHz eight-core processor?

From what I know about physics, I’m betting on the eight-core processor. Notice that the core count is increasing quickly and the clock speeds are creeping up slowly. Therefore, the actual performance of an MPI program running on eight older dual-core processors (16 cores total) may not be that much slower than two new eight-core systems. (And yes, it all does depend on the application.)
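Here is the back-of-the-envelope version in Python. The cores-times-clock figure is a deliberately crude yardstick of my own choosing; it ignores memory, cache, and interconnect, all of which matter for a real MPI program.

```python
old_clock_ghz = 0.4   # 400 MHz Pentium II
new_clock_ghz = 2.3   # today's dual-core

# The clock jump so far, and where the same jump would land next time.
clock_ratio = new_clock_ghz / old_clock_ghz   # a bit under six
same_jump_ghz = new_clock_ghz * clock_ratio   # roughly 13 to 14 GHz

# Crude aggregate: cores x clock (GHz).
old_cluster = 8 * 2 * 2.3   # eight dual-core processors, 16 cores
new_cluster = 2 * 8 * 3.0   # two eight-core processors, 16 cores

print(f"clock ratio: {clock_ratio:.2f}x, same jump again: {same_jump_ghz:.1f} GHz")
print(f"new over old by clock alone: {new_cluster / old_cluster:.2f}x")
```

By this rough measure the newer machines come out about 30% ahead, a far smaller gap than the near six-fold clock jump that retired the Pentium II boxes.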

In other words, newer processors will be adding cores similar to what I already have. My old cluster is kind of like the new cluster, only the new one has more cores. In the past, it was about clock speed, now it is about cores. Multi-core actually gives clusters a longer effective lifetime and keeps the boot heel of technology at bay just a bit longer.

In a strange kind of way, I find this to be one of the major advantages of multi-core. I doubt people are going to stop buying new hardware, but the old hardware will continue to have some utility — much more so than in the past.

Basically it means more cycles for more problems. I like that option. My only remaining issue is where to start stockpiling those dual-core machines. I can’t keep them in the refinished basement and expect to stay married.

I do have some room in the shed out back. I just have to move the boxes of fans, power supplies, cables, power cords, and PCI cards I just pulled from old systems. Maybe I’ll start a museum.

The Big Show
Wednesday, November 14th, 2007

I am standing in the middle of SC07 (Supercomputing 2007). SC07 is THE HPC event of the year. 318 exhibitors have made the trek to Reno, Nevada for SC07, and scores of attendees are here for the week-long conference. If you’re attending the show, you’re probably not reading this because you’re either completely exhausted, back-logged, depressed from losing your money at the casinos, still trying to get your shampoo back from the TSA agent at the airport, or some combination of the above.

In any case, I thought I might try my hand at real-time blogging. As many of you know I write quite a bit about clusters, but not in the real-time sense. I normally try to write clever and {insight/incite}ful articles with some take-away for the reader. Not this week. I’m blogging, baby. Plus, Linux Magazine editor-in-chief Joe ‘Zonker’ Brockmeier just came by and asked me when he can expect this week’s column…

11:00 a.m. I’m on the trade show floor standing in a vendor’s booth (Appro) as part of my duties as a “booth geek”. When not trying to invent TSA-safe shampoo, I like to test and benchmark new technology. When Appro asked me to test some new Harpertown processors and write a white paper I jumped at the chance.

For Harpertown, the news is good. My results show a much improved quad-core processor from Intel. How much improvement, you ask? How about overall improvements of at least 40% on my 16-core MPI runs (two 8-core servers with InfiniBand). I’ll have more to say about this in the weeks to come, but let me just say, I was late finishing the paper because the test results were so good, I had to run the benchmarks twice just to be sure I did not make a mistake.

11:30 a.m. Here’s something new, the entire tradeshow just went dark for a few seconds. Systems are rebooting all over the show floor. The SCinet wireless is down. Heavens! No Internet in this big room full of computer geeks!

Speaking of no Internet, many hotels that are used for the SC shows often have a bandwidth issue when the hordes of laptop toting computer jocks show up and take down their network at least once a day. The problem is the attendees are too used to fast and ubiquitous networking and while at the show have SCinet. No, not the one from the Terminator movies, but close.

Every year at SC they build one of the most powerful networks in the world. SCinet serves as a way for show exhibitors — including government labs, academia, and vendors — to demonstrate the advanced computing resources from their home institutions and elsewhere by supporting Supercomputing and grid computing applications.

SCinet is designed and built entirely by volunteers from universities, government, and industry. SCinet connects multiple 10-gigabit per second (Gbps) circuits to the show floor. To put it in perspective, you could download two DVD movies in one second.
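The DVD comparison is easy to sanity check. In this Python sketch the 4.7 GB single-layer DVD capacity is my assumption, and the article does not say exactly how many circuits were in place.

```python
DVD_GB = 4.7          # assumption: single-layer DVD, decimal gigabytes
CIRCUIT_GBPS = 10.0   # one SCinet circuit, from the text

gigabits_needed = 2 * DVD_GB * 8                    # two DVDs, in gigabits
circuits_needed = gigabits_needed / CIRCUIT_GBPS    # to move them in one second

print(f"two DVDs: {gigabits_needed:.1f} gigabits")
print(f"circuits needed for one second: {circuits_needed:.1f}")
```

So two movies per second works out to roughly eight 10 Gbps circuits running flat out, which fits the description of multiple circuits feeding the show floor.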

SCinet has three major components. First, according to the Web page, “it provides a high-performance production-quality network with direct wide area connectivity which enables attendees and exhibitors to connect to the Internet and other networks around the world.” In other words, free and fast WiFi all week. Additionally, SCinet includes a show-wide Open InfiniBand (OpenIB) Network. Perhaps more impressive than a bunch of words is the control center in the picture.

12:30 p.m. Wireless is back up. Time for lunch, so I’ll take a break. Nice to get away from the server fans for a while. Off to grab a quick lunch. Convention food… well, that is another blog altogether.

1:30 p.m. I have been talking about my Intel Harpertown, sorry Xeon 5400, white paper to several people. Speaking of Multi-core, we just launched the Multi-core Cookbook. Take a look and learn to cook the multi-core way.

3:30 p.m. I have a little time, I think I’ll check the new Top500 results. And this year’s winner is, surprise, IBM BlueGene coming in at 596 TeraFLOPS! That is 596 times 10 to the 12th power floating point operations per second, or the equivalent of 50-60 thousand desktop machines all working on the same problem. In the case of BlueGene there are 212,992 PowerPC 440 (700 MHz) processors chugging away under the hood. If you are wondering why the processors are only 700 MHz, just remember more Hz means more heat. Using a large number of slower but cooler processors can have advantages.
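For the numerically inclined, here is what those figures imply per processor. This Python sketch uses only the numbers quoted above; the per-desktop rate is simply what the 50-60 thousand comparison works out to, not a measured value.

```python
total_flops = 596e12     # 596 TeraFLOPS
processors = 212_992     # PowerPC 440 processors
clock_hz = 700e6         # 700 MHz

per_processor = total_flops / processors      # about 2.8 GFLOPS each
flops_per_cycle = per_processor / clock_hz    # about 4 per clock tick

# Desktop rate implied by the "50-60 thousand desktops" comparison.
desktop_low = total_flops / 60_000
desktop_high = total_flops / 50_000

print(f"per processor: {per_processor / 1e9:.2f} GFLOPS")
print(f"per clock tick: {flops_per_cycle:.2f} floating point operations")
print(f"implied desktop: {desktop_low / 1e9:.1f} to {desktop_high / 1e9:.1f} GFLOPS")
```

Each slow, cool processor contributes under 3 GFLOPS. The headline number comes from having more than two hundred thousand of them.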

6:00 p.m. The first full day of the show is winding down. So am I. Last night I attended the sort-of-annual LECCIBG. For an East-coaster, you can only take so many of those late (2 a.m.) events on the West side of the country. I must soldier on, however. Tonight is the Beowulf Bash, a yearly anniversary of the early Beowulf Alumni. It is also a great place to associate a face with an email address. The crack Linux Magazine HPC media team will be there collecting interviews and candid footage. Stay tuned to Today’s HPC Clusters for the videos in coming weeks.

So, blogging is kind of cool. I managed to capture a fraction of the SC experience. I wish I had more time to write about other thoughts I had today. Like TSA brand shampoo. A safe alternative for problem hair. I need some sleep.

Hamburgers, Beer, and Clusters
Wednesday, November 7th, 2007

I find Pablo Picasso’s famous response on computers, “Computers are useless. They can only give you answers,” interesting and provocative. I also believe there is a grain of truth in that statement — namely, that it’s really important to ask the right questions of computers. I have a hard time with the useless part, however.

We ask important and world-changing questions of computers (and clusters) every day. I call these the what questions. For instance, “What is the lowest energy state of this molecule?” or “What happens when I try to align these two genome sequences?” or “What does this seismic data tell us about possible oil in the ground?”. These questions yield important information.

The How Questions

When discussing HPC clusters, and computing in general, other questions arise. I call these the how questions. These questions often need to be answered before computation begins. For instance, “How do I solve this problem quickly?” or “How do I solve this problem within my research budget?” or “How do I program 128 cores?”

The answers often begin with hardware choices and end up with software issues (i.e. telling the computer how to solve a problem). The how questions often include many dimensions and can have a lasting effect on future decisions and costs. That decision to write everything in Turbo Pascal probably seemed like a good idea at the time. Thinking carefully about how we do things in HPC is important and, as I found out last week, goes well with hamburgers and beer.

Hamburgers and Beer

If you recall, I had announced the formation of the NYCA-HUG (New York City Area HPC Users Group). We held our first meeting last Thursday. Seven people, including myself, attended and after some hamburgers and beer the discussions moved to some of the most important how questions about clusters and HPC.

The first question is a classic, “How does one define a supercomputer?” Should it be based on cost or performance or some other metric? We had no clear answer, but all agreed that the target is moving rapidly. Next there was a question about power usage, such as how one could turn off unused nodes and then turn them back on when needed, and other such strategies. Most academic sites seem to concede that their clusters are not fully utilized, but are still using plenty of electricity. Perhaps the most provocative question was “How do we teach high school students about cluster programming so that what they learn today will work with tomorrow’s technology?”

Such great open-ended questions, and so little time. Discussing questions that don’t have a single answer (non-deterministic in programming terms) is a great exercise. As a group we did not really have any solid answers, but I think we came away more enlightened and better informed.

So what did I learn? I gleaned some interesting insights from the discussion. Everyone seemed to agree that in academic/university situations power and cooling is often considered Someone Else’s Problem. The facilities budgets are often different from the research budgets (because “university overhead” is a big up-front part of most research grants). The consensus was that because power and cooling are out of the equation, the number of nodes/cores is often the goal of the procurement process.

My views on software were confirmed as well. Everyone agrees it is a big issue as we move forward. Nothing new here, except that multi-core architecture does not seem to play big in procurement decisions while some of my tests have shown CPU architecture can be important in scaling. It seems the more nodes/cores the better.

And finally, after discussions like these, I like to think of how I better understand the how questions about clusters. Putting the self reference issue aside for now, the answer is simple — conversations. Whether it be small groups around the table or a mailing list, the two way conversations and even good arguments are at the core of the HPC revolution we are now experiencing. The fact that much of the HPC infrastructure is open helps quite a bit, but I’ll leave that for another column.

Our next meeting is December 6, if you are in the NYC area stop by. Check the NYCA-HUG page for more information.

In regards to Picasso, dear Pablo, there is only one answer really (42). And, to borrow a phrase from Deep Thought, a supercomputer character from Douglas Adams’ Hitchhiker’s Guide to the Galaxy series:

I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.

More Conversations (and Beer)

It is that time again. SC07 is here. This year the HPC hordes descend upon Reno, Nevada. In the name of shameless promotion I wanted to mention two events that will probably feature a few HPC conversations as well as beer and maybe a hamburger.

Monday Night (November 12) after the Gala you don’t want to miss the SC07 LECCIBG pool tournament (pool playing is not required). Those of you that have attended these in the past know the drill. Most people usually don’t get arrested. Check out the cool invite.

Tuesday Night (November 13) The equally infamous Annual Beowulf Bash will be held at Third Street Blues in downtown Reno from 6-8 p.m. Most of the early Beowulf pioneers show up at this event. Plenty of conversations for everyone. The club is close to several of the conference hotels, but you’ll need one of the conference shuttle buses from the Sparks convention center. The address is 3rd Street Blues, 125 W 3rd St, Reno, NV, next to the ElDorado Casino (on-street parking readily available).

If you can’t make the SC07 show, stay tuned to Today’s HPC Clusters for plenty of coverage.

Jonathan vs. The Roman Beauty
Wednesday, November 7th, 2007

In the past, I’ve put Intel quad-cores through some very simple tests. Namely, I spent quite a bit of time trying to understand the results of my “effective processor” script. Now it’s AMD’s turn.

I’ve been testing an 8-core (four-socket) Opteron machine from Appro International (http://www.appro.com). The machine can be configured as a 3U cabinet or as a deskside tower. More important, however, I now have an AMD box with 8 cores (four dual-core, 2.8 GHz Opteron processors) to compare to an Intel 8-core box (two quad-core 2.67 GHz Clovertowns) — a very apples-to-apples comparison.

However, exact parity in clock speed is not essential for my comparison. I am interested in how well the system scales when running multiple copies of the same program, so the actual clock speed is normalized out of the results. For each system, the Linux kernel finds eight processors and that is at the core of all the tests (obligatory core pun when writing about apples).

Before showing the results, I’d like to present my conclusions: In the taxonomy of apples, there are two kinds that seem to represent the results. The first is like the Jonathan — deep red, mildly tart, rich in flavor, versatile, and excellent for snacking or baking. The other is more like the Roman Beauty, slightly tart, best for baking. Of course, I will let you try and figure out which is which.

The Contenders

I configured the two systems to be as similar as possible. Table One is a comparison of the systems’ specifications. I am most interested in the number of cores. The difference in kernels will have little effect, as the programs are crunching numbers. The compilers could almost be considered identical. The compiler options in both cases were -O3 -ffast-math and the -march option was set to the respective processor family (nocona or opteron).

(As an aside, gcc/g77 3.X are part of both the RHEL4 and SLES9 distributions. gcc versions 4 and later include gfortran, which is Fortran 95 compliant, instead of g77. The absence of the gcc 4+ toolset has made RHEL4/SLES9 users a little envious, because they’re stuck with g77. Not to worry, you can build the 4.X series yourself or you can head over to AMD and grab a set of gcc/gfortran 4.1.2 RPMs (http://developer.amd.com/gcc.jsp). The new version of gcc/gfortran can coexist with the older compilers as well.)

TABLE ONE:

The Intel and AMD specifications

Option               Intel Platform    AMD Platform
Number of CPUs       2                 4
Number of Cores      8                 8
Clock Speed (GHz)    2.67              2.8
CPU Model            5355              8220
Memory Type          FBDIMM            DDR2
Memory Amount (GB)   8                 16
Motherboard          S5000XAL          Appro
OS                   Fedora 6          RHEL
Kernel               2.6.18.7          2.6.9-42
gcc/gfortran         4.1.1             4.1.2

In past columns, I presented a script that determined what I call “effective processors,” or how many processors your application actually sees when it is using all the cores. The script simply measures how long it takes to run one copy of your program, then how long it takes to run eight copies at the same time. If the times are the same, you have eight effective processors. If the eight-copy run takes eight times longer than the single copy, you have one effective processor.
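The arithmetic behind that description can be sketched in a few lines of Python. This is not the original script (which is linked from the column); the function name and the timings are hypothetical, and the linear scaling between the two endpoints described above is my assumption.

```python
def effective_processors(single_copy_seconds, multi_copy_seconds, copies):
    """Estimate effective processors from two wall-clock timings.

    Assumption: the figure scales as copies * T(1 copy) / T(N copies),
    which reproduces the two endpoints described in the text.
    """
    if single_copy_seconds <= 0 or multi_copy_seconds <= 0 or copies < 1:
        raise ValueError("timings must be positive and copies must be >= 1")
    return copies * single_copy_seconds / multi_copy_seconds


# Hypothetical timings, in seconds, for illustration only.
print(effective_processors(100.0, 100.0, 8))  # same time for 8 copies: 8.0
print(effective_processors(100.0, 800.0, 8))  # 8 times slower: 1.0
```

Anything between those two extremes indicates that the copies are competing for a shared resource, typically memory bandwidth or cache.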

The actual programs are part of the NAS Parallel test suite. The suite contains eight programs that represent real application programs. The programs were run in single process mode for the purposes of the test script; that is to say, each program was a single process running on one core. See http://www.linux-mag.com/id/2868/ for the actual scripts and NAS suite.

Round One

For reference, Table Two provides the previously reported results for the Intel platform. While running the tests, I found the results would vary quite a bit, so I ran them five times and reported the average and the standard deviation. The tests indicate the number of “effective cores” you can achieve when running 2, 4, and 8 copies of a given program. As you can see, the variation in performance could almost amount to a whole core in some cases.
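For anyone repeating the experiment, the five-run summary is easy to reproduce. The sketch below uses Python's standard statistics module; the function name is hypothetical and the sample values are made up, not taken from the tables. It uses the sample standard deviation, which is an assumption, since the column does not say which form was used.

```python
import statistics


def summarize_runs(effective_cores_per_run):
    """Return (mean, standard deviation) rounded to one decimal place.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    mean = statistics.mean(effective_cores_per_run)
    std_dev = statistics.stdev(effective_cores_per_run)
    return round(mean, 1), round(std_dev, 1)


# Made-up effective-core results from five hypothetical runs.
runs = [1.0, 2.0, 3.0, 4.0, 5.0]
average, deviation = summarize_runs(runs)
print(average, deviation)
```

A standard deviation approaching 1.0 means that two otherwise identical runs can differ by roughly a whole core's worth of throughput.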

The Opteron platform results are given in Table Three. In comparison to the Intel platform, the number of effective cores is much better. In the four copy tests, the number of effective cores scales almost perfectly with the number of jobs. When the number of copies is eight, there is some drop-off but nothing lower than six effective cores. The standard deviation is also much better, which means process placement is not as critical as on the Clovertown.

TABLE TWO:

Previously reported average speed-up data (effective processors) for 2, 4, and 8 copies of the same program on an eight-way Intel Clovertown system. Each test was run five times.

Test   2 Copies   Std Dev   4 Copies   Std Dev   8 Copies   Std Dev
bt     1.5        0.2       2.4        0.0       3.5        0.0
cg     1.7        0.1       2.3        0.1       2.3        0.0
ep     2.0        0.7       3.3        0.2       8.0        0.0
ft     1.7        0.2       3.1        0.1       7.1        0.8
is     1.7        0.2       3.2        0.1       4.6        0.5
lu     1.7        0.2       3.5        0.8       4.4        0.0
mg     1.7        0.2       3.1        0.8       3.1        0.7
sp     1.5        0.2       2.3        0.3       2.8        0.0

TABLE THREE:

Average speed-up data (effective processors) for 2, 4, and 8 copies of the same program on an eight-way AMD Opteron system. Each test was run five times.

Test   2 Copies   Std Dev   4 Copies   Std Dev   8 Copies   Std Dev
bt     1.5        0.0       3.8        0.0       6.1        0.0
cg     1.9        0.0       3.9        0.1       7.4        0.0
ep     2.0        0.0       4.0        0.0       8.0        0.0
ft     1.9        0.1       3.9        0.0       7.5        0.0
is     2.0        0.1       4.0        0.0       7.5        0.0
lu     1.8        0.2       3.9        0.1       6.5        0.1
mg     1.7        0.2       3.8        0.0       6.1        0.0
sp