Equal Time in the TCOP Discussion

Readers respond to Doug Eadline's approach to determining the "Total Cost of Parallelization" for a given system. Join the discussion.

My recent column on the Total Cost of Parallelization (TCOP) generated some interesting feedback. I like feedback because it means people are actually reading what I write. In particular, when a response to one of my columns comes from an experienced practitioner of the art, I read it carefully. Informed discussion is a great thing, and the multi-core world can certainly use more ideas. With that in mind, I would like to provide equal time to Brian Dobbins of Yale University. He emailed a response directly to me with what I feel are valid points. I have inserted a few comments to help clarify some of my own points referenced by Brian’s remarks, but did not alter any of his views or conclusions. (Brian’s comments are in italics.)

My counterpoint stems from a recent discussion I had with someone explaining that it’s almost always better (performance-wise), when given the choice between 150 processors at speed 1.0X and 100 processors at speed 1.5X, to go with the latter option. While both give the same aggregate computational “power”, the latter has fewer communication links, so more time is spent in useful computation.

The rule of thumb is a good one; however, there are some subtleties, which don’t affect my proposal but are worth mentioning. For instance, if the interconnect and the processors are balanced, then the smaller number of processors is better, but if the processors spend time waiting on processor-to-processor communication, then you may see better performance on more processors (provided the application scales).
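
To make the trade-off concrete, here is a toy C sketch. The model and the constants in it are purely illustrative (they are not from Brian’s note or my column): compute time shrinks with aggregate speed, while a simple communication term grows with the number of processes.

```c
/* Toy model (illustrative only): total time = compute + communication,
 * where compute shrinks with processor count N and per-core speed s,
 * and communication overhead grows with N.  The work and overhead
 * constants below are arbitrary, not measurements. */
#include <stdio.h>

static double run_time(double work, int nprocs, double speed, double comm_cost)
{
    double compute = work / (nprocs * speed);   /* perfectly divisible work     */
    double comm    = comm_cost * nprocs;        /* overhead grows with processes */
    return compute + comm;
}

int main(void)
{
    double work = 150000.0;   /* arbitrary units of computation */
    double comm = 0.05;       /* arbitrary per-process overhead */

    printf("150 procs @ 1.0x: %.1f time units\n", run_time(work, 150, 1.0, comm));
    printf("100 procs @ 1.5x: %.1f time units\n", run_time(work, 100, 1.5, comm));
    return 0;
}
```

Both configurations spend the same time on useful computation in this model, but the 100-processor run carries less communication overhead and finishes first; change the balance between the two terms and the answer can flip, which is exactly the subtlety above.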

In your column, you say that single-core performance doesn’t matter if you can scale, and then give a figure of 80% scaling on 8 cores. This statement leads to another question: are we talking about 80% scaling regardless of how many cores we run on, or is it 80% going from 1 to 8, then another 80% going from 8 to 64, and so on?

Good question. I should have been clearer. When I say “80% scaling”, I mean the following: at 100% scaling on four cores, I assume a 4X speed-up, where speed-up is sequential time divided by parallel time. At 80% scaling, the program gets 80% of that 4X, or a 3.2X speed-up. The same goes for 8 cores (80% of 8, or a speed-up of 6.4).

Let’s take a closer look at overhead – presumably there will be some overhead from our abstraction into this SL (Scalable Language) as opposed to the closer-to-the-metal C or Fortran. Assume there is a theoretical C version of this program that scales at 90% for every 8x increase in processors, and that the SL version scales in the same way but at 80%. Let’s also assume that the SL serial version is just as fast as the C version (not 5x slower!). The table below indicates how this would play out in terms of speed-up over a large range of cores:

Language   1 Core   8 Cores   64 Cores   512 Cores
SL         1.0      6.4X      41.0X      262.1X
C          1.0      7.2X      51.8X      373.2X

At 8 cores, the difference is negligible, at about 12.5%, and considering the size of the infrastructure required for a mere 8 cores, that probably isn’t tremendously important from a financial point of view. If, however, the company or university for which this application is important is scaling out to 512 cores, then the difference is a factor of 1.42. At today’s prices, let’s say we can get 512 cores for $200K, meaning that a 42% difference in performance is worth roughly $84K. I’m ignoring continuing operating costs, etc. The question now becomes whether that $84K is worth the time of the person or people coding the application. The answer, of course, is “it depends”. The complexity of the application, how many programmers are working on it, for how long, and so on all need to be considered.
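
For readers who want to check the arithmetic, here is a minimal C sketch of how these numbers fall out, assuming speed-up compounds by the stated efficiency (80% for SL, 90% for C) for every 8x jump in core count, and pricing the 512-core gap against a $200K system:

```c
/* Reproduces the table above and the rough dollar figure, assuming
 * speed-up compounds by the given efficiency for every 8x increase
 * in core count (the column rounds the final gap to 42% and $84K). */
#include <stdio.h>
#include <math.h>

static double speedup(int cores, double eff_per_8x)
{
    double steps = log((double)cores) / log(8.0);   /* number of 8x jumps */
    return cores * pow(eff_per_8x, steps);
}

int main(void)
{
    int cores[] = {1, 8, 64, 512};

    printf("Cores     SL       C\n");
    for (int i = 0; i < 4; i++)
        printf("%5d  %6.1fX %6.1fX\n",
               cores[i], speedup(cores[i], 0.8), speedup(cores[i], 0.9));

    double ratio = speedup(512, 0.9) / speedup(512, 0.8);   /* ~1.42 */
    printf("512-core gap: %.2fx, worth about $%.0fK of a $200K system\n",
           ratio, (ratio - 1.0) * 200.0);
    return 0;
}
```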

Which brings me to my second point – I don’t think “parallel programming” is difficult, at least not in the sense of writing the C or Fortran code to call MPI. That is the easy part. The harder parts are algorithmic and, frankly, just plain old serial software engineering. The algorithmic considerations might be moot when taking a serial code and making it parallel to solve the same problem, but many times the size of the problem being tackled also increases, and that often requires new algorithms.

Consider a common example, the N-body simulation. For a very small number of particles, a basic particle-particle interaction algorithm is perfectly fine, but it scales at O(N²), so once you start increasing the size of the problem, you need something better and might switch to a hierarchical O(N log N) method, which has a higher constant (and thus is slower for small N) but is far faster for large N. Other algorithmic options include PM or P3M methods, and many variants of each of these categories as well. I can’t envision an SL being able to make such algorithmic decisions in any reasonable time frame, so this choice is still left to the programmer.
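
For reference, the basic particle-particle kernel looks something like this in C. It is a minimal sketch (the gravitational constant and softening are placeholders), but the nested loops make the O(N²) cost plain:

```c
/* Direct particle-particle force sum: every pair interacts once,
 * hence roughly N*(N-1)/2 interactions -- the O(N^2) cost that
 * motivates hierarchical O(N log N) methods at larger N. */
#include <math.h>

typedef struct { double x, y, z, mass, fx, fy, fz; } Particle;

void direct_forces(Particle *p, int n)
{
    const double G = 1.0, soft = 1e-9;        /* placeholder constants */

    for (int i = 0; i < n; i++)
        p[i].fx = p[i].fy = p[i].fz = 0.0;

    for (int i = 0; i < n; i++)               /* outer loop: N iterations      */
        for (int j = i + 1; j < n; j++) {     /* inner loop: ~N/2 on average   */
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + soft;
            double f  = G * p[i].mass * p[j].mass / (r2 * sqrt(r2));
            p[i].fx += f*dx;  p[i].fy += f*dy;  p[i].fz += f*dz;
            p[j].fx -= f*dx;  p[j].fy -= f*dy;  p[j].fz -= f*dz;
        }
}
```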

The second real barrier that makes parallel programming seem hard is that it really shines a light on bad software design. I’ve been involved with a number of applications for which the majority of my effort has focused on reworking serial code so that it is simpler to understand and much more flexible in its design, which in turn lends itself easily to parallelization and algorithmic experimentation. With a well-designed serial code and a chosen parallel algorithm, the actual process of adding the code to parallelize a given application takes little time indeed – Joe Landman’s recent column and many MPI tutorials such as those available at the NCSA illustrate this quite well. As a side benefit, with a good design, the parallel constructs will be as transparent as possible to end users of the code by abstracting away all such MPI calls into their own routines, and will ideally keep as much of the main program code as similar to the serial process as possible. In this sense, I disagree with your comment that “as the core count continues to grow, so will the cost of programming with your current and comfortable software tools.” To me, the programming approach really doesn’t change much as core count increases.
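
As a small illustration of that kind of abstraction (a hypothetical sketch, not code from Joe Landman’s column or the NCSA tutorials): wrap the MPI call in its own routine and the main program reads almost like the serial version.

```c
/* Hide the MPI machinery behind a small, domain-level routine so the
 * main program stays close to the serial code.  The function and
 * variable names here are hypothetical. */
#include <mpi.h>

double global_sum(double local)
{
    double total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;
}

/* In the main code, the parallel constructs barely show:
 *
 *     double local_energy = compute_energy(my_particles, my_count);
 *     double energy = global_sum(local_energy);   // serial version: energy = local_energy
 */
```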

In conclusion, for a given application type, if we assume that an SL introduces some small overhead, the optimal choice of tool (in this case, SL vs. C/Fortran) is going to depend upon the performance one needs at a given processor count. If we consider an open range of processor counts, the best tool for a given project, as determined through some cost function of performance vs. total cost of the solution, is most likely going to trace out a curve, and somewhere along that curve there will be a transition point between these options. My best guess, at this point in time, is that the ‘old guard’ (C/Fortran) is still going to rule the roost where performance matters most, but I definitely concede that for people aiming for small-scale parallelism, something that ‘automagically’ gives them some improvement, even if sub-optimal, would be wonderful. Well, wonderful in the sense that, compared to walking, driving on the highway at 35 mph is wonderful, but you’re still going to get passed by most of the other people on the road.

I should point out that I was thinking more along the lines of “general programming” while Brian was thinking about HPC programming. Nonetheless, Brian raises some good points, and I invite you to comment below as well. The multicore/parallel discussion has just started, and I have no clue where it is going to lead!

Comments on "Equal Time in the TCOP Discussion"

Major thanks for the article.Thanks Again.

Hi there just wanted to give you a brief heads up and let you know a few of the pictures aren’t loading correctly. I’m not sure why but I think its a linking issue. I’ve tried it in two different web browsers and both show the same outcome.

I don’t even know how I ended up here, but I thought this post was great. I do not know who you are but definitely you’re going to a famous blogger if you are not already ;) Cheers!

Leave a Reply