Will You Still Need Me When I’m Sixty-Core? HPC and Multi-core
Now that all of the major processor vendors have introduced multi-core chips, the impact of this relatively new technology on high-performance computing should be addressed. What is the immediate impact on HPC application development? And what will “many-core” ultimately mean for the future of the HPC cluster?
In case you haven’t noticed, there is a revolution (or evolution) occurring in the computer industry. Recently, all the major processor vendors have introduced multi-core chips composed of multiple processing units in a single package. Instead of having one central processor, computers now have multiple “brains” with which to run programs.
While this technique is not necessarily new, it is the first time these types of architectures have been mass-produced and sold to the commodity PC and server markets. Put simply, the multi-core revolution stands to affect everyone who uses a computer. From laptops, to game consoles, to large servers, the age of multi-core has begun, and the trend is expected to continue. To an end user, the change itself remains hidden; the expectation of continued price-to-performance gains like those of the past twenty years, however, remains.
Alas, programmers using traditional programming methods will find that delivering additional price-to-performance on multi-core designs is a challenging task. There is no silver bullet or automated technology that adapts current software to multi-core systems.
For the HPC practitioner, however, many of the programming issues facing the rest of the industry are non-issues. The cluster HPC community has, in a sense, been there, done that. Instead, for the ever-growing HPC crowd, the technology road map raises two crucial questions: how do you program multi-core systems, and what implications does multi-core hold for clusters? Before we look at these issues, though, a bit of background may help set the stage.
The Road to Multi-core
The computer market has long enjoyed the steady growth of processor speeds. A processor’s speed is largely determined by how fast a clock tells the processor to perform instructions; the faster the clock, the more instructions can be performed in a given time frame. The physics of semiconductors, however, places real constraints on the rate at which clock speeds can be increased. This trend is shown quite clearly in Figure One, where the average clock speed and heat dissipation for Intel and AMD processors are plotted over time.
Figure One: Growth of clock speed and heat dissipation for Intel and AMD processors
Power consumption has clearly become a problem. The continued climb in power consumption (and thus heat generation) requires additional cooling and electrical service to keep the processor operating.
The solution was to scale out processor cores instead of scaling up the clock rate. The drop-off in clock speed on the graph indicates the delivery of the first dual-core processors from AMD and Intel. These processors are designed to run at a slower clock rate than single core designs due to heat issues. However, these dual-core chips can, in theory, deliver twice the performance of a single-core chip and thus help continue the processor performance march.
Multi-core Road Maps
Both Intel and AMD are selling multi-core processors today. From publicly available documents, the companies expect to release quad-cores in the 2007 time frame, and speculation is that eight-way cores will be introduced in the 2009-2010 time frame.
For servers and workstations that traditionally have two processor sockets available, this means the total number of cores per motherboard can easily reach sixteen in the near future. Extrapolating to the expected eight-way cores means that 64-core servers are not an unreasonable expectation. Intel has been developing next-generation processors that may contain up to 80 cores.
Answering The Programming Question
The challenge facing the industry then is how to use the sudden doubling of processor power. Fortunately, modern operating systems are equipped to take advantage of multiple processors and may extend some immediate benefits to the end users in the near term.
Using dual-core or quad-core processors to their fullest potential on a per-application basis is harder (it requires re-programming) and is considered a longer-term benefit. From a software standpoint, the move to multi-core is clearly a move to parallel computing. For the larger markets, parallel computing is new and difficult; for the HPC cluster market, it is the way things are done. And in the HPC community, the question being debated right now is: what do I have to do to my existing parallel programs (those written to use MPI message passing libraries) to run efficiently on multi-core? There are many opinions and no definitive answers (yet). The short answer, however, is nothing, and surprisingly, the longer-term answer may be the same. Good news for the HPC crowd.
Parallel Programming Methods
Dealing with multiple CPUs is not a new idea. The practice has been around for many years and has been studied quite extensively. There is no general consensus, however, on how to program multiple processors. There are two general methods the programmer may use: the first is threaded programming and the second is message passing (i.e., MPI). Both have their advantages and disadvantages, yet message passing has been the method of choice in the HPC industry for over a decade.
The thread model is a way for a program to split itself into two or more concurrent tasks. These tasks can be run on a single processor in a time-shared mode, or on separate processors, such as the two cores of a dual-core processor. The term thread comes from “thread of execution” and is similar to how a fabric (computer program) can be pulled apart into threads (concurrent parts). Threads are different from individual processes (or independent programs) because they inherit much of the state information and memory from the parent process.
On Linux and Unix systems, threads are often implemented using the POSIX Thread Library (pthreads). There are several other thread models (e.g., Windows threads) from which the programmer can choose, but using a standards-based implementation like POSIX is highly recommended. As a low-level library, pthreads can be easily included in almost all programming applications.
Threads provide the ability to share memory (on the same motherboard) and offer very fine-grained synchronization with other sibling threads. These low-level features can provide very fast and flexible approaches to parallel execution.
Software coding at the thread level is not without its challenges. Threaded applications require attention to detail and considerable amounts of extra code. Still, threaded applications are well suited to multi-core designs because the cores share local memory.
Because native thread programming can be cumbersome, a higher-level abstraction called OpenMP has been developed. As with all higher-level approaches, flexibility is sacrificed for ease of coding.
At its core, OpenMP uses threads, but the details are hidden from the programmer. OpenMP is most often implemented as compiler directives (pragmas in C/C++, special comments in Fortran). Typically, computationally heavy loops are augmented with OpenMP directives that the compiler uses to automatically “thread the loop”. This approach has the distinct advantage that the original program can be left “untouched” (except for directives): a simple recompilation that ignores the OpenMP directives produces a sequential (non-threaded) version.
There are several commercial and open source (C/C++, Fortran) OpenMP compilers available. Like pthreads, OpenMP is ideal for multi-core designs that live on a single motherboard.
MPI (Message Passing Interface)
In the High Performance Computing (HPC) market, parallelism is almost always expressed using the MPI programming interface. In contrast to threaded approaches, MPI uses messages to copy memory from one process space (program) to another. This approach is very effective when the processors do not share local memory — that is when the processors are located on separate motherboards. MPI can be used, however, for multi-core programming as well, as almost all modern day MPI implementations can recognize where a sibling process lives (i.e. on the motherboard or off the motherboard) and use the appropriate method to communicate.
For off-motherboard communications, TCP or a user-space transfer protocol is used. Conversely, for on-motherboard transfers (those between cores) a shared-memory transfer is used. Shared memory is used only to pass the message, which means each parallel process still maintains its own private memory space (i.e., parallel MPI processes on the same motherboard can only affect other MPI processes through MPI messages). The lazy assumption that threads are always faster than messages may not always hold on modern multi-core systems. Just as messages have overhead, so does sharing memory. The multi-core lunch is far from free.
The advantage of the MPI approach is that programs can scale and exceed the number of processors and memory available on any one multi-core SMP motherboard. MPI is available as a library for most languages (C/C++, Fortran) and is available in both commercial and open source packages.
For large HPC applications, a hybrid approach to multi-core suggests itself: for example, using threads (or OpenMP) to communicate on the motherboard and MPI to communicate between motherboards. While this approach sounds enticing and uses standard programming components, there are two drawbacks to consider. First, it relies on the assumption that threads are always faster than messages, an assumption that should be tested whenever possible. Quite simply, adding programming complexity via OpenMP or threads may not yield the expected results. Second, hybrid methods tend to create non-portable, platform-specific applications. As any HPC veteran can attest, investing time in non-portable approaches runs the risk of wasted effort when the hardware environment changes.
Nothing Like MPI
For those deciding how to use multi-core in an HPC application, the following table may be a useful summary of the above discussion. For the HPC practitioner, the message is clear: stay with MPI. Indeed, there is almost zero cost associated with running parallel MPI programs on multi-core platforms right now, and the benefits of converting to a threaded approach are simply unknown at this point. The “do nothing” approach may be the best. Of course, there are run-time optimization strategies with MPI (i.e., process placement and affinity) that may help as well.
Table One: MPI vs. Threads
Approach To Parallelism           Threads/OpenMP             MPI
Scaling beyond one motherboard    poor (can be difficult)    good
Use on a single SMP motherboard   good (for single SMP)      good (shared-memory transfers)
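The process placement and affinity options mentioned above can be sketched from the command line. The lines below are illustrative only: `my_mpi_program` is a hypothetical binary, `taskset` comes from util-linux, and `--bind-to core` is an Open MPI launcher option.

```shell
# Pin a single process to core 0 (taskset is part of util-linux;
# my_mpi_program is a hypothetical binary name)
taskset -c 0 ./my_mpi_program

# Ask the Open MPI launcher to bind each of four ranks to its own core
mpirun -np 4 --bind-to core ./my_mpi_program
```

Binding ranks to cores prevents the operating system from migrating processes between cores, which can otherwise disturb cache locality and on-motherboard shared-memory transfers.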
If you are considering threads or OpenMP, think about how large your problem may grow. If you do not believe you will ever need to move off a single SMP motherboard (both in terms of cores and memory size), then a threaded approach may serve you well.
Is The Cluster Doomed?
Growing core counts certainly make one think that a “cluster-on-a-chip” is not too far off and that the HPC cluster’s future is dubious. If clusters were just piles of processors, this would certainly be the case. There is, however, the issue of memory access that tends to get in the way of such grand plans. Computer memory, for all its speed and density, has difficulty sharing itself among multiple processors. In other words, there are limits to how many cores you can attach to a bank of memory. This architectural issue is not going to go away any time soon. Indeed, the AMD multi-core memory design uses an approach that looks a lot like a “cluster” (HyperTransport is the Ethernet of the AMD multi-core cluster, but much faster!).
Clusters by their nature have memory distributed across multiple compute nodes. This approach has established scalability, performance, and reliability advantages over large SMP systems, which must resort to expensive NUMA (non-uniform memory access) technologies to share memory.
The final argument for HPC cluster longevity is simply that clusters take what is fast (a single commodity server) and scale it up to create something even faster. Problems that took weeks on a single computer now take minutes on a cluster, yet the need for faster computation continues for two reasons: clustered systems allow larger problem sizes than are possible on single systems, and there is no dearth of Grand Challenges in both the research and industrial sectors of the HPC world. The ability to apply more processing power to a problem allows better solutions and finer-grained models to be developed. We do need more than 640K of memory.
Taking advantage of multi-core technologies is the next step in software development for many markets. The HPC sector has an advantage because many of the important applications are already cast in a parallel mold. Multi-core to the HPC practitioner represents a real “plug and play” advantage. Others may do well to follow their message passing approach to parallel computing.