If you develop software, then sometime in the near future you’ll be forced to make a decision. The decision will be a fork in the road, and once you choose a direction, you won’t be able to turn back. No developer will be spared. And no, it’s not another Perl Apocalypse.
As I noted in my premiere “Cluster Rant” column (available online at http://www.linux-mag.com/2005-09/cluster_01.html), multi-core processors are going to change everything. To use multiple processors, applications will need to be poured into a concurrent framework of some sort, and once a framework is chosen, switching to another will be difficult (read: expensive). Now is the time to consider the implications.
This month and next, let’s look at concurrency, starting with the whole idea of doing two or more things at once.
Concurrent Versus Parallel
The terms concurrent and parallel are often used interchangeably, but the two aren’t the same thing. Concurrency is a property of a program or algorithm. If parts of a program can run independently, those parts are considered concurrent. If the independent parts can run on separate processors, then the program can further be called parallel.
Admittedly, the distinction between concurrent and parallel is subtle, so it may help to keep these three rules in mind:
1. Concurrency doesn’t necessarily imply parallel execution.
2. Concurrency is a property of the program.
3. Efficient parallel execution of concurrent programs depends on the hardware.
Not all programs can be made concurrent, and it is not necessarily better to run concurrent sections of code in parallel. In some cases, running concurrent portions in parallel may actually slow your application.
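To make the distinction concrete, here is a minimal sketch in Python (not from the column; the four-way chunking scheme is an illustrative assumption). The partial sums are concurrent because each chunk is independent of the others; whether they actually run in parallel depends on having more than one CPU.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each chunk is independent of the others, so the partial
    # sums are concurrent parts of the program.
    return sum(chunk)

def total(data, n_chunks=4):
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    # The concurrent parts can run one after another on a single CPU...
    serial = sum(partial_sum(c) for c in chunks)
    # ...or on separate processors (here, separate processes under
    # Linux); the answer is the same either way.
    with Pool(n_chunks) as pool:
        parallel = sum(pool.map(partial_sum, chunks))
    assert serial == parallel
    return parallel
```

Whether the parallel version is actually faster depends on the hardware and on the cost of starting the workers (rule 3 above).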
Add CPUs Here
However, having multiple CPUs is beneficial nonetheless:
*If you run a multi-tasking operating system on a multi-processor system, it’s possible to run two or more processes on different CPUs at the same time. This trick is a natural “concurrency” that is easily exploited by systems with more than one low-cost CPU.
*Processor speeds have been doubling every eighteen months, but RAM and hard disk speeds have been lagging behind. Doing things in parallel is one way to get around hardware subsystem limitations.
*Predictions indicate that processor speeds cannot continue to double every eighteen months after 2005. Indeed, the switch to multi-core has been prompted by an increasingly insurmountable “GHz wall.” Thus, instead of increasing clock speeds at greater and greater expense, chip makers are offering multiple-core CPUs at lower clock speeds.
*Depending on the application, parallel computing can speed things up anywhere from two to N times faster, where N is the number of processors. Such performance is not achievable using a single processor. To wit, most supercomputers that at one time used very fast custom processors are now built from multiple, “commodity-off-the-shelf” CPUs.
If you need speed because your problem is compute-bound and/or I/O-bound, concurrency may be your only option. Using eight CPUs to run your word processor is overkill, but a web server, a database, a rendering program, or a scheduling program might benefit from extra CPUs. And certainly, complex simulations, fluid dynamics applications, and data mining programs are accelerated with extra CPUs.
I Hate to Wait
Because parallel computing is implemented in a variety of ways, solving a problem in parallel requires some thoughtful decisions. These decisions may dramatically affect the portability, performance, and cost of your application.
To illustrate the differences in parallel designs, consider a real “parallel computing problem” using a familiar scenario: waiting in long lines at a grocery store.
In the (modern day) grocery, everyone (still) has to wait in line to check out. At a busy hour or during the holidays, you’ve likely thought, “I could be home already if these morons would just run this place more efficiently.” Such an experience is very similar to concurrent programming, only the cashiers are commodity CPUs.
Assume that there are a maximum of eight cashiers (CPUs) and each customer is a computer program. The size of the computer program (the amount of work) is the size of each customer’s order. The following analogies are helpful to illustrate parallel computing concepts:
*Single-tasking operating system with a single CPU. One cashier is working (is in use) and must process each customer one at a time. (A computer example is MS DOS.)
*Multitasking operating system with a single CPU. Only one cashier is working, but only a part of each customer’s order is processed at a time. The cashier moves from one person to the next, giving each customer a limited amount of attention each time, and everyone “seems” to be moving through the line together. If there’s only one customer in line, that customer checks out very quickly. Conversely, if there are a lot of customers, the wait for any one customer can increase substantially. (A computer example is Linux or Windows on a single CPU.)
*Multitasking operating systems with multiple CPUs. Here, several cashiers are working, orders are still processed in parts, but any single part can be processed by any available cashier. Overall, customers check out much faster. However, although some cashiers may be idle, no customer ever checks out any faster than a single customer can be processed by a single cashier. This technique is called Symmetric Multi-processing (SMP). (A computer example is Linux and newer versions of Windows running on a multiprocessor system.)
*Threads on a multitasking operating system with multiple CPUs. If the items in a customer’s order can be separated and processed by multiple cashiers at one time — and if the time to divide the order can be recouped — the customer could be expedited through check out. In theory, a customer should be able to move through the line N times faster than before, where N is the number of cashiers. When the cashiers need to compute a sub-total or otherwise share data, information can be exchanged verbally. There is a limit, however, to how many cashiers the store can effectively locate in any one place, because the number of “cross conversations” tends to slow things down. For instance, if you were a cashier at such a store, you’d have to contend with all of the conversations — even those you don’t care about — before you could ask for your data. (A computer example is Linux or modern Windows running a multi-threaded program on a multiprocessor system.)
*Sending messages on multitasking operating systems with multiple CPUs. To improve performance, the store adds an additional eight cashiers at the back of the store. Because the new rear cashiers are far away from the front cashiers, communication over a telephone takes a little longer, but if communication is minimized, the delay does not cause problems. Hence, if a customer has a really big order, one that requires all of the cashiers — and assuming the extra overhead isn’t too great — that customer can be expedited. (A Linux cluster and a distributed memory parallel computer are examples of this configuration.)
The above scenarios, although not perfect analogies, do represent the constraints placed on parallel systems. Communication overhead is a necessary evil of multiprocessor systems. As you build your parallel program, you must decide how you want your CPUs to communicate: through messages or through threads. Interestingly, that choice does not necessarily depend on the underlying hardware.
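The two communication styles from the checkout analogy can be sketched side by side. This is an illustrative Python sketch, not anything from the column: the four-way split, the lock, and the queue are my assumptions, and the process version relies on Linux-style fork() so the nested worker function is visible to the child processes.

```python
import threading
from multiprocessing import Process, Queue

# Threads: workers share one memory space; a lock serializes
# updates (the "cross conversations" of the cashier analogy).
def threaded_sum(data, n_workers=4):
    total = 0
    lock = threading.Lock()
    def worker(chunk):
        nonlocal total
        s = sum(chunk)      # independent work, no coordination
        with lock:          # shared-memory communication
            total += s
    threads = [threading.Thread(target=worker, args=(data[i::n_workers],))
               for i in range(n_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return total

# Messages: workers have private memory and send results back,
# like the rear cashiers phoning in their sub-totals.
def message_sum(data, n_workers=4):
    q = Queue()
    def worker(chunk, q):
        q.put(sum(chunk))   # a message is a remote memory copy
    procs = [Process(target=worker, args=(data[i::n_workers], q))
             for i in range(n_workers)]
    for p in procs: p.start()
    total = sum(q.get() for _ in procs)
    for p in procs: p.join()
    return total
```

Both functions compute the same answer; they differ only in how the workers communicate, which is exactly the fork in the road the column describes.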
To Share or Not to Share
There are two fundamental designs for parallel computer hardware:
*Local memory machine. Systems of this ilk communicate by passing messages between processors. A message can be thought of as a remote memory copy, as data is copied to a remote processor.
*Shared memory machine. This kind of system communicates through a shared memory. There’s no memory copying, as all data is shared “in place” between processors.
Of course, the distinction is a bit blurry, because many clusters use dual-CPU nodes where each pair of processors shares a private memory space. On almost all of these systems, however, communication between the “dual CPUs” is still accomplished via messages (copying data). (More about this in the next installment.)
A typical HPC cluster is a collection of single-, dual-, and quad-CPU machines connected using Gigabit Ethernet (or some other high-performance interconnect), and is therefore a local memory machine. A 4-, 8-, 16-, or 32-way SMP system is a shared memory machine, and can be used for parallel computing if parallel applications communicate using shared memory.
A local memory machine can be scaled greatly because the communication between processors and memory is private. Conversely, the scaling of a shared memory machine is limited because the conversation between processors and memory is in a public space, and all CPUs must contend for the memory bus. Communication between processors must wait to get a turn at the memory pool.
As an aside, it’s possible to connect many multiprocessor shared memory machines to create a “hybrid” shared memory machine. These hybrid machines “look” like a single, large SMP machine to the user and are often called non-uniform memory access (NUMA) machines because the global memory seen by the programmer and shared by all the CPUs can have different latencies. At some level, however, a NUMA machine must “pass messages” between local shared memory pools.
As mentioned, it’s also possible and common to connect SMP machines to form a cluster. For a dual-CPU node, two separate processes (communicating through message passing) are run and the Linux internal scheduler determines how these CPUs get shared. There are some ways to assign a specific task to a specific processor, but this is not normally done.
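Under Linux, the “ways to assign a specific task to a specific processor” are the scheduler-affinity calls, which Python exposes as os.sched_getaffinity and os.sched_setaffinity (Linux-specific; the pinning below is purely illustrative and is undone at the end).

```python
import os

# Ask which CPUs this process is allowed to run on; normally the
# Linux scheduler balances processes across all of them.
allowed = os.sched_getaffinity(0)   # 0 means "the calling process"
print(sorted(allowed))

# Pin the process to a single CPU -- the unusual case noted above.
one_cpu = {min(allowed)}
os.sched_setaffinity(0, one_cpu)
assert os.sched_getaffinity(0) == one_cpu

# Undo the pinning so the scheduler is free again.
os.sched_setaffinity(0, allowed)
```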
In the next installment, we’ll take a look at how you can choose to “see” the hardware described above — either as running threads or as passing messages. And, this choice will be a fork in the concurrency road with no middle ground.
Douglas Eadline is the Senior HPC Editor for Linux Magazine.