Multi-core Malaise

Confronting multi-core anxiety and what the new processors mean for the future of commodity clusters.

I must have a weird disease. I lie awake at night vexed by this whole multi-core thing. While others celebrate the latest hardware revolution, I often find myself pensively staring at the bedroom ceiling. As any reader of my columns knows, I have looked rationally at how multi-core will affect the world of commodity clustering.

This month, however, rationality goes out the window. I am going to pay a visit to my multi-core anxiety closet, the home of those fearful thoughts that keep me tossing and turning at night, in hopes of finding a happy ending. Gloom and doom aren’t optional.

To understand my fear, I should first explain my vision of
cluster and parallel computing. My model is quite simple: a
processor, local memory, and a communication link. To a programmer,
this design, while tedious, can produce results and lends itself to
optimization. Multi-core shatters this vision and presents some
interesting challenges.

Intel recently announced that it will introduce a quad-core processor by the end of the year. AMD announced its dual-core, dual-socket, quad graphical processing unit (GPU) desktop under the trendy “4×4” moniker (four cores plus four GPUs). My first cluster had four processors! Not only can you pack four cores on a single motherboard, but the price is expected to be in line with what you would pay for a high-end desktop. In other words, a new four-headed beastie isn’t four times the price of a single core. Cheap, fast hardware. Why should I worry?

Fear Number One: The “More Is” Law

One of the things that frightens me is that multi-core is now being positioned as the “differentiator” by those who wear marketing hats at the big companies. Tasked with selling silicon, the wonks are constantly looking for a single metric that says, “We are better than the other guys because we give you more… ” Or, as I like to call it, the “More Is” Law (not to be confused with the famous “Moore’s Law”).

My law is about marketing. Until recently, my law applied to clock speed, where more hertz is obviously always better. My irrational side wants me to think that the primary reason Intel developed the failed NetBurst architecture was to ramp the clock speed so far past AMD that Intel would win conversations like the following, between Mr. and Mrs. Joe Sixpack at the local electronics barn:

Look honey, this machine says it has 5.2 gigerhurts [sic], but this one over there is only 3.0 gigerhurts. We better get the one with more gigahurts; never know when you might run out. More is always better.

I suspect that in the corporate world the arguments were
basically the same. Of course, the only reliable metric that can be
directly related to gigahertz is the amount of heat generated.
Whether all those extra cycles really translate into faster,
cheaper, and better seemed to be a secondary concern.

Unfortunately for the marketeers, physics put an end to selling clock speed. My fear tells me that with gigahertz gone, there is room for a new differentiator. Enter the multi-core. And, much to the marketers’ delight, multiple cores actually work better than clock speed as a selling point. That funny gigahertz word is not needed, just plain, simple arithmetic. One core is for losers; two cores are better than one; and four cores — well, you are the bee’s knees. More is better.

While I’m not one to turn down cores, I have to ask, “What will the market do with all this goodness?” After all, the commodity cluster market relies on the “real markets” to provide low-cost hardware. Beyond the obvious things, like multi-tasking (preventing all that adware from slowing down your computer), how are the other cores going to be used?

In my irrational world, I see the chip vendors ignoring this question. Just keep putting cores on the chip, and by 2010, sixteen-core chips will be all the rage. After all, sixteen is bigger than eight (and so 2008). “We don’t care how you use them; you just want to have more than the other guy.”

So now comes the scary part. In the consumer market, I believe multi-cores will have little effect on the performance of what consumers actually do with their computers. Those extra cores really become high-tech hood ornaments. The cores are there, but possibly not implemented in an optimal fashion. In other words, the multi-cores are there at the expense of other improvements.

Another avenue the AMD 4×4 model has opened up is the move toward system-on-a-chip integration. AMD’s recent purchase of ATI further supports this idea. Adding specialized cores like GPUs is another way to improve performance. Fine for the game players and 3D CAD users who live at home, but for a cluster, I worry.

In the commercial world, the story I create is a bit different. Now that there are “desktop” and “server” processors, the server vendors will initially implement pure CPU multi-cores to keep up with the other guys, but in a way they will be cannibalizing their own market. One way I see the market eating itself is the move to operating system virtualization.

Consider how this will change the way servers are purchased. Several years ago, if you needed a Web server, you bought a machine, installed the operating system, and set it to work. Today, you can purchase a server with a dual-core/dual-socket motherboard and use virtualization software to make that system look like, and perform similarly to, four (or more) small servers, each with a different purpose or owner. When the quad-core rolls out next year, that single system can look and perform like eight servers. So if I’m Joe Corporate, instead of buying eight new servers, I’m buying one. Ouch. I doubt the industry can tolerate an eight-fold reduction in sales. The only solution will be to increase the prices for multi-core servers.

My multi-core nightmare plays out as follows. On the desktop, or the low end, cores are selling points, but they become very specialized and redundant: good for watching movies, but maybe not for folding proteins. I will concede that using GPUs as computational accelerators can be a good thing, albeit non-standard. More on this later. At the high end, in the server room, a multi-core server now starts at five figures. The market has bifurcated, and there is no middle ground in which to find the cheap, general-purpose commodity hardware that fits my vision of a cluster. And this is not the worst of it.

Fear Number Two: Software

I have talked about this problem before. Here is the hard truth:
Writing and debugging parallel software is hard no matter how you
do it. Threads and messages are a PITA (in
the spirit of keeping this column G-rated, the acronym stands for
“Parallel Instructions Take Adjustment”).

The problem is not necessarily writing multi-core software; the problem is the huge disconnect between hardware schedules and software schedules. As we all know, there is no production-line process for software like there is for hardware. Software takes people time, which is expensive and cannot be scaled easily. So while the software writers are still scratching their heads over getting their applications to run on a dual-core chip, the hardware vendors are rolling out the quad-core samples and talking about the next “octi-cores” in development.

The net effect is hardware distancing itself from software faster than ever before. There is no compiler option like “–mc 4” (multi-core, four CPUs) that magically compiles a program for your new quad-core system. Getting performance out of multi-core processors is going to take real work, and the time frame will be measured in years. What will happen when the hardware is so far ahead of the software that it does no one any good?
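To be fair, the closest thing we have is OpenMP, where a compiler flag does exist, but even there the flag only turns on the machinery; you still have to mark up your loops by hand. A minimal sketch (the file name and the toy loop are my own invention; the -fopenmp flag is gcc’s real OpenMP switch):

/* toy OpenMP example: the compiler flag alone does nothing;
 * the programmer must still mark the parallel regions by hand.
 * Build: gcc -fopenmp sum.c -o sum
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* Without this pragma, the loop runs on one core no matter
     * how many cores the chip has. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1.0);

    printf("sum = %f (using up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}

Leave out the pragma and the flag buys you exactly nothing, which is my point: the work lives in the code, not on the compile line.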

Fear Number Three: Breaking the Symmetry

At the beginning of the column, I mentioned my “ideal” cluster design: every node a simple processor, local memory, and communication link. The advent of multi-core breaks this symmetry.

There are now two levels of communication: on-chip and off-chip. Some may argue that clusters have been using dual-CPU motherboards for a long time, to which my response is: the dual-processor systems were kind of a kludge. For all intents and purposes, programmers treated dual-CPU motherboards like two separate systems. There are cases where packing two copies of the same program on the same motherboard could take advantage of the fast local communication, but most cluster programs seem to do better with a maximally distributed approach, because of contention for the interconnect and memory.

Consider the new multi-core environment facing the programmer: If I have eight cores on a motherboard, do I treat them like eight nodes? Or do I try to optimize my MPI processes to use the cores as a single node? Who knows.

If you treat them like eight separate nodes, then you have two issues: variable communication speeds (local in-memory and non-local over the interconnect) and resource contention. In terms of communication speeds, MPI does not support a two-tiered interconnect architecture. You may, with some careful twiddling, write and place your MPI applications to optimize local and non-local communication, but this approach is definitely off the beaten path. It is probably best to assume that all message speeds are no faster than the slowest connection, presumably the interconnect. Even with this “safe” assumption, there are still contention issues that may redefine what “slow” means for your cluster.
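For the curious, the first step in any such twiddling is simply finding out which ranks share a motherboard. A small sketch (the file and program names are mine; the MPI calls are standard):

/* report which physical node each MPI rank landed on; ranks
 * that share a node can talk through memory, while the rest
 * must go over the interconnect.
 * Build: mpicc where.c -o where
 * Run:   mpirun -np 8 ./where
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d is on node %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

Once you know the layout, you could in principle place the chattiest ranks together, but as I said, you are then off the beaten path.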

With the traditional approach of one MPI process per node, there was one process using the interconnect. With two or more MPI processes on a node, the interconnect does extra duty. If many processes are hitting the link hard, communication times will increase. A partial solution to this problem is a high-performance interconnect. Bigger pipes help, but your results depend on your application’s behavior more than ever before. One mix of codes may run well on a group of multi-core nodes, while another may run much worse.

The other issue is memory contention. The only way to really test this effect is to run real programs and see if there are slowdowns. (Memory contention on dual cores was the topic of last month’s column, where I provided a simple script to test this kind of thing.)
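In the same spirit as last month’s script, here is a minimal sketch of the idea (the file name and array sizes are mine to pick): a loop that mostly exercises memory bandwidth. Time one copy, then launch one copy per core and watch the per-copy time.

/* crude memory-bandwidth burner: run one copy and note the
 * time, then run one copy per core at once; if each copy slows
 * down, the cores are fighting over the memory system.
 * Build: gcc -O2 memtest.c -o memtest
 * Try:   time ./memtest
 *        ./memtest & ./memtest &   (two at once)
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)  /* big enough to blow out the caches */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) { perror("malloc"); return 1; }

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    clock_t start = clock();
    for (int rep = 0; rep < 10; rep++)
        for (long i = 0; i < N; i++)
            a[i] = a[i] + 3.0 * b[i];   /* STREAM-style update */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("finished in %.2f CPU seconds (a[0] = %f)\n", secs, a[0]);
    free(a);
    free(b);
    return 0;
}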

Another way to approach multi-core is to use threads. For instance, if you use threads for your inner loops (so they execute across the cores on the node) and use MPI for the outer loops (between motherboards), you can garner the best of both worlds, but there is a cost. You now have a less portable version of your program. You have, in essence, committed your program to specific hardware.

As any HPC veteran will tell you, this approach is not uncommon, but it is definitely something to avoid if you can, as it may not work in all cases. Furthermore, the threads-plus-MPI approach may change the dynamics of your program. What was once a computationally heavy inner loop may now complete quickly, and the MPI overheads of the outer loop may become much more significant.
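For the record, the hybrid approach looks something like the following sketch (assuming an MPI with thread support and an OpenMP-capable compiler; the file name and the toy sum are mine):

/* hybrid skeleton: OpenMP threads across the cores within a
 * motherboard, MPI between motherboards.
 * Build: mpicc -fopenmp hybrid.c -o hybrid
 * Run:   mpirun -np <number-of-nodes> ./hybrid
 */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define CHUNK 1000000

int main(int argc, char *argv[])
{
    int rank, size, provided;
    double local = 0.0, total = 0.0;

    /* FUNNELED: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* inner loop: threads spread the work across the local cores */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < CHUNK; i++)
        local += (double)rank * CHUNK + i;

    /* outer loop: MPI combines the per-node results */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("grand total = %e from %d MPI processes\n", total, size);

    MPI_Finalize();
    return 0;
}

Notice the commitment: the pragma ties the inner loop to threads on one box, and the MPI calls tie the outer loop to the interconnect. Move to different hardware and the balance between the two may need retuning.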

If there is a best approach, it is not obvious to me. Are we
about to make something that is hard, much harder?

Is There Hope?

As the morning sunlight creeps into my anxiety closet, I do see a glimmer of hope. After all, I tell myself, “Those cluster people are rugged individualists. They will think of something. The community will adapt to whatever hardware is on the market.” Let’s hope.

In the meantime, here are some of my predictions:

* First, I expect to see the deskside cluster emerge as a real product. There have been some false starts, but multi-core makes it possible to put 8 to 16 cores next to a desk without giving anyone heatstroke. These systems will be further aided by the entry of Microsoft into the HPC cluster market. Vendors like Tyan are showing deskside systems in the full faith that the Microsoft marketing machine will make their hardware necessary. Low-cost deskside hardware will be good for Linux clusters as well.

* The move to virtual servers will create a bigger demand for high-speed interconnects — take your choice, 10GigE or InfiniBand. A larger demand will create larger volumes and thus lower prices.

* Finally, the most intriguing thing for me is the move toward specialized cores for the desktop. Although a bit radical in design, the IBM/Sony Cell processor is an example of this approach. In the commodity sector, the growth of GPU capability has not gone unnoticed. The BrookGPU project (http://brook.sourceforge.net/) and sites like General-Purpose Computation Using Graphics Hardware (http://www.gpgpu.org/) are getting more active each month.

Using specialized hardware has always been the purview of high-performance computing practitioners. The concern always centers on portability and longevity. More than one company that produced whiz-bang hardware for the HPC industry has vanished, leaving us a little wiser. In the emerging multi-core world, I am hoping that open software and the commodity production of specialized multi-cores by the boatload will make this a much safer proposition.

Now that I’ve talked about my fears, I feel a bit better. I’ll shut the multi-core anxiety closet for a while. I’ll check back in the future, but in the meantime, I’ll try to get a good night’s sleep.

Doug has enlisted more monkeys to help him randomly type a book on clusters. More is better. A preview is available at http://www.clustermonkey.net/content/view/128/53/.
