Harnessing the power of multicore processors is one of the largest challenges facing the computer industry today. Here we look at the challenges and some of the programming methods we can use to solve the problem.
Most commodity servers have used two discrete processors. Multiple processors have also been used in many large Symmetric Multiprocessing (SMP) systems for years. Many modern operating systems (OS) are therefore equipped to take advantage of multiple processors. Indeed, from an OS and programmer’s point of view, a multicore processor “looks like” a traditional multi-processor system with lots of processors.
In a server environment, there was an immediate benefit to multi-user, multi-processor systems. Quite simply, with extra processors the server could support more jobs or users than with a single processor. But using multiple processors, and now multiple cores, for a single application is harder because it requires reprogramming. To help sort out how applications can be reprogrammed, let’s develop an analogy.
The Multi-processor Store
We’ve all waited in line at the grocery store. In general, the speed with which we’re checked out is related to the number of cash registers and cashiers working at that time.
A store with one cash register is like a modern-day single processor computer. Each customer has a cart full of items (program) that is to be tabulated (computed) by a cash register (processor). Modern operating systems use a trick, called time sharing (or multitasking), to make it look like there are multiple programs running at the same time. For instance, in the store analogy, if an extremely efficient cashier with a smart cash register processes some of your order, then processes some of the next customer’s order, you both would appear to be moving through the line at the same time. Using this method, customers get the illusion that they are moving through the line, but in reality, they will always go faster if they are the only customer.
The obvious solution, to anyone waiting in line, is to have more than one cash register going at one time. Indeed, many stores do have more than one going to improve the flow of customers through the checkout line. The same is true for computers. Adding more processors, and now multicore processors, will also speed up the workload. More customers (programs) can be serviced (run) at the same time, but you will never get through the line any faster than you would if there were only your order and one cash register. In computer terminology, this is referred to as Symmetric Multiprocessing (SMP).
The market has grown accustomed to faster and faster “cashiers” over the last twenty years. Thus, orders that once took minutes to tabulate now take seconds, and customers (programs) move faster than before. As mentioned in The Multicore Programming Challenge, processor technology is having trouble making the processors (cashiers) faster. So instead, chip makers have introduced more cash registers.
In the near term, more processors (cash registers) means more of the users’ programs can run at the same time without impacting each other’s performance. Using modern SMP-enabled operating systems, this benefit will be immediate and transparent to all users. The longer term challenge facing software developers is how to make individual programs go faster using more than one processor.
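To make the checkout analogy concrete, here is a small simulation sketch in Python. The function names, the one-item time slice, and the two sample customers are our own assumptions for illustration; a real OS scheduler is far more sophisticated than this round-robin loop.

```python
from collections import deque


def round_robin_finish_times(orders, time_slice=1):
    """One register, time-shared: scan up to `time_slice` items per turn.

    `orders` maps a customer name to a number of items. The result maps each
    name to the time it finishes, measured in items scanned by the register.
    """
    line = deque(orders.items())
    clock = 0
    finish = {}
    while line:
        name, remaining = line.popleft()
        scanned = min(time_slice, remaining)
        clock += scanned
        remaining -= scanned
        if remaining:
            line.append((name, remaining))  # back of the line for another turn
        else:
            finish[name] = clock
    return finish


def dedicated_finish_times(orders):
    """One register per customer: nobody waits for anyone else."""
    return {name: items for name, items in orders.items()}


orders = {"alice": 3, "bob": 3}
shared = round_robin_finish_times(orders)
dedicated = dedicated_finish_times(orders)
print("one shared register:", shared)
print("one register each:  ", dedicated)
```

With these sample orders, both customers make steady progress at the shared register, yet they finish at times 5 and 6. With a register each, both finish at time 3, which is exactly as fast as either would have been alone and no faster.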
Meeting The Long Term Performance Challenge
Going back to our store analogy, it is obvious that breaking your order into smaller orders and distributing them over two or more cash registers allows you to get out of the store faster.
The same applies to programs. If the program is amenable to distribution, it can use multiple processors and execute faster. Commonly referred to as parallel computing, this method will be responsible for many of the large performance gains in the immediate future. Parallel computing almost always requires reprogramming existing sequential applications to execute in parallel.
The amount of reprogramming can be trivial or monumental, depending on the application. The choice of tools and techniques for this task will be critical for success in the future. Fortunately, software methods and tools for exploiting parallelism are already available. Many of these techniques are already used successfully in the High Performance Computing (HPC) market.
Dealing with multiple CPUs isn’t a new problem. It has been around for years and has been studied quite extensively, though there is no consensus on exactly how best to program multiple CPUs and multicore CPUs.
Programmers can choose from two general methods. The first is threaded programming, the second is message passing. Both have advantages and disadvantages, and both are rather low-level approaches. The correct choice depends largely on the application and target hardware.
The thread model is a way for a program to split itself into two or more concurrent tasks. The tasks can be run on a single processor in a time shared mode, or on separate processors. For example, the two cores on a dual-core processor can each run threads. The term thread comes from “thread of execution” and is similar to how a fabric (computer program) can be pulled apart into threads (concurrent parts). In the cash register analogy, it would be similar to breaking your order up into components and using separate cash registers. Threads are different from individual processes (or independent programs) because they share much of the state information and memory of the parent process.
On Linux and Unix systems, threads are often implemented using a POSIX Thread Library (pthreads). Programmers can choose other thread models (for example, Windows threads). However, using a standards-based implementation, like POSIX, is highly recommended. As a low level library, pthreads can be easily included in almost all programming applications.
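As a rough sketch of the thread model, the following Python example splits one order across two threads. It uses Python's standard `threading` module as a stand-in for a native thread library; the names `tabulate`, `cart`, and `subtotals` are our own. Note that in the standard CPython interpreter a global lock prevents CPU-bound threads from running on two cores at once, so this shows the structure of a threaded program rather than a real speed-up.

```python
import threading


def tabulate(cart, subtotals, slot, start, stop):
    """Add up cart[start:stop] and store the result in this thread's slot.

    `cart` and `subtotals` are the parent's own lists, not copies: every
    thread sees the same memory.
    """
    subtotals[slot] = sum(cart[i] for i in range(start, stop))


cart = list(range(1, 101))   # item prices owned by the parent
subtotals = [0, 0]           # one result slot per thread
half = len(cart) // 2

threads = [
    threading.Thread(target=tabulate, args=(cart, subtotals, 0, 0, half)),
    threading.Thread(target=tabulate, args=(cart, subtotals, 1, half, len(cart))),
]
for t in threads:
    t.start()
for t in threads:
    t.join()                 # wait for both "cash registers" to finish

total = sum(subtotals)
print(subtotals, total)
```

Because each thread writes only to its own slot, no locking is needed here. The parent simply waits for both threads and combines the two subtotals.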
Threads provide the ability to share memory and offer very fine grained synchronization with other sibling threads. These low-level features can provide very fast and flexible approaches to parallel execution. Software coding at the thread level is not without its challenges. Threaded applications require attention to detail and considerable amounts of extra code to be added to the application. Finally, threaded applications are ideal for multicore designs because the processors share local memory.
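The extra code that threaded applications need is mostly synchronization. The sketch below, again in Python as a stand-in for a native thread library, has four threads update one shared counter. The names `scan`, `counter`, and `lock` are illustrative assumptions, not part of any particular API beyond Python's `threading` module.

```python
import threading


def scan(counter, lock, n):
    """Increment the shared counter n times, one protected update at a time."""
    for _ in range(n):
        with lock:                   # only one thread may update at a time
            counter["items"] += 1


counter = {"items": 0}               # memory shared by every thread
lock = threading.Lock()

threads = [
    threading.Thread(target=scan, args=(counter, lock, 10_000))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["items"])
```

The read-modify-write on the counter is the detail that demands attention: without the lock, two threads can read the same old value and one update can be lost. In pthreads the same role is played by a mutex.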
Because native thread programming can be cumbersome, a higher level abstraction called OpenMP has been developed. As with all higher level approaches, flexibility is sacrificed for ease of coding. At its core, OpenMP uses threads, but the details are hidden from the programmer. OpenMP is most often implemented as compiler directives: pragmas in C/C++ and specially formatted comments in Fortran. Typically, computationally heavy loops are augmented with OpenMP directives that the compiler uses to automatically “thread the loop”. This type of approach has the distinct advantage that it may be possible to leave the original program “untouched” (except for directives) and provide simple recompilation for a sequential (non-threaded) version where the OpenMP directives are ignored.
There are several commercial and open source (C/C++, Fortran) OpenMP compilers available (GNU Compilers 4.2+ now support OpenMP). Like pthreads, OpenMP is ideal for multicore designs.
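OpenMP itself is a directive-based interface for C/C++ and Fortran, so it cannot be shown directly in Python. The sketch below only imitates the idea of “threading a loop” while keeping the sequential version intact, using a thread pool from Python's standard library. The function names and the choice of four workers are our own assumptions, and CPython's global interpreter lock means this illustrates the structure rather than a real multicore speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def heavy(x):
    """Stand-in for the computationally heavy loop body."""
    return x * x


def sequential_sum(values):
    """The original, untouched loop."""
    total = 0
    for x in values:
        total += heavy(x)
    return total


def chunk_sum(chunk):
    """The same loop body, applied to one worker's share of the iterations."""
    return sum(heavy(x) for x in chunk)


def threaded_sum(values, workers=4):
    """Split the iterations into chunks, run them in a pool, combine results."""
    size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, chunks))


values = list(range(1000))
print(sequential_sum(values), threaded_sum(values))
```

Splitting the iterations and combining the partial sums is the bookkeeping that an OpenMP loop directive with a reduction asks the compiler to generate for you.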
MPI (Message Passing Interface)
In the High Performance Computing (HPC) sector, parallelism is often expressed using the MPI programming interface. In contrast to threaded approaches, MPI uses messages to copy memory from one process space (program) to another. This approach is very effective when the processors do not share local memory (for example, when they are located on different motherboards). It can be used, however, for multicore programming as well. In particular, many programs that have already been ported to MPI can take advantage of multiple cores without any reprogramming.
In our cash register analogy, an “MPI cashier” would call other stores on the phone and tell them the items in the shopping cart that they would have to tabulate. The advantage of this is that the size (scale) of the order can get very large and exceed the capacity of the cash registers of any one store (computer). MPI is available as a library for most languages (C/C++, Fortran) and is available in both commercial and open source packages.
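The message passing style can be sketched without MPI itself. In the Python example below, each “store” receives a copy of its share of the cart through a queue and sends back a subtotal; nothing else is shared. This is only an imitation on one machine using threads and queues from the standard library. Real MPI programs run as separate processes, often on separate computers, and use the MPI library's own send and receive calls. The names `store` and `tabulate_by_message` are hypothetical.

```python
import queue
import threading


def store(inbox, outbox):
    """A remote 'store': receive a list of prices, send back the subtotal."""
    items = inbox.get()          # blocking receive
    outbox.put(sum(items))       # send the result back to the caller


def tabulate_by_message(cart, n_stores=2):
    """Phone each store with its share of the cart and add up the replies."""
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(n_stores)]
    workers = [
        threading.Thread(target=store, args=(inbox, outbox))
        for inbox in inboxes
    ]
    for w in workers:
        w.start()
    for rank, inbox in enumerate(inboxes):
        # Slicing builds a new list, so each store works on its own copy.
        inbox.put(cart[rank::n_stores])
    total = sum(outbox.get() for _ in range(n_stores))
    for w in workers:
        w.join()
    return total


total = tabulate_by_message(list(range(1, 101)))
print(total)
```

The cost of this style is the copying: every item travels in a message. The benefit is that the stores need not share any memory, which is what lets the approach scale beyond a single computer.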
Both the threaded and message passing approaches are somewhat low level. While they provide a level of control and performance needed by programmers, they introduce a level of programming minutiae that can make programming tedious. Much research and development has been done on more efficient ways for programmers to express parallelism in a program. While some results are promising, no breakthroughs have been made to afford programmers the ability to quickly and efficiently harness the power of multiple processors.
The Multicore Cookbook will introduce and discuss several of the more promising approaches. In general, new methods attempt to lift the programmer above the details of parallel programming and closer to the application space.
There is an old joke that goes, “every program works just fine, getting it to work the way you want it to is the trick.” As they say, it’s funny because it’s true. Parallel programs, like their sequential program cousins, are no exception. Indeed, parallel programs represent a much harder proposition because, unlike serial programs, they involve synchronization and data sharing. These properties can make it difficult to fully understand program behavior in a real world environment where the program execution may not be easily replicated.
The use of a parallel debugger can be critical to the success of any multicore programming project. (Print statements are still a good first step, but they often change the dynamics of the execution.) Without the ability to see what the program is doing in real time, multicore or parallel applications can be difficult or even impossible to complete in any meaningful time frame.
Recommendations For Moving Forward
The multicore revolution requires software developers to evaluate their codes in terms of multicore performance now and as the number of cores increases in the future. The following recommendations are designed to aid the transition to multicore architectures.
Assess the level of concurrency in your application: Concurrency is not present (or necessary) in all applications. Some applications, by virtue of their algorithms, can only be executed in a serial fashion. Examining the algorithm is the only effective way to assess the presence of concurrency in your application. Careful attention to memory access patterns by various parts of the program is important, as dependencies will inhibit the ability to operate concurrently.
Assess, if possible, whether concurrency will improve performance: If you can identify concurrent parts of your application, look carefully to see if executing them in parallel will result in application speed-up (reduced execution time). It’s common that parts of a program could be executed concurrently, but will have no effect on the execution time. There’s no point in spending time on these sections of code. Typically, you should focus your attention on computationally heavy parts of your application.
Assess the scalability of your applications: Another important question to ask is, “how scalable is my application?” As more processors are added, parallel execution will always hit a point of diminishing returns. Beyond that point, creating more threads will not improve performance (it may actually hurt performance). If you find that your application can be scaled to a large number of processors, then using MPI may be a good choice. If, however, you do not envision using more than four processors/cores, then pthreads or OpenMP are probably your best choices.
Make sure there is an adequate tool chain for your application: This last recommendation is critical to the success of your application. There are many methods for using multiple processors; some are mature (like pthreads), while others may still need time to mature. Regardless of method, if your project is to produce production-level code, make sure there are adequate tool chains (compilers, debuggers, profilers) that provide the capabilities and support needed to produce software in a reasonable time and at a reasonable cost.
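One common way to put numbers on the speed-up and scalability questions above is Amdahl's law, a standard model that the recommendations do not name but that captures the same point of diminishing returns. If a fraction p of the run time can be executed in parallel on n cores, the best possible speed-up is 1 / ((1 - p) + p / n). The sketch below is a minimal Python calculation; the function name and the 90% parallel fraction are assumptions chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speed-up when only part of a program runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# How much does a program that is 90% parallel gain as cores are added?
for cores in (1, 2, 4, 8, 16, 1024):
    print(cores, round(amdahl_speedup(0.90, cores), 2))
```

For a program that is 90% parallel, four cores give a speed-up of about 3, while even a thousand cores cannot reach 10, because the serial 10% never shrinks. The model ignores thread creation, synchronization, and communication costs, so measured results will be lower, and past some point adding threads can make things slower.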
To further assist in your decision making process, be sure to check out the other resources available on our Multicore Cookbook.
Douglas Eadline is the Senior HPC Editor for Linux Magazine.