Parallel fundamentals, what constitutes an actual parallel program, and why some applications may or may not run faster on multi-core systems.
Before I climb back onto my multi-core soapbox, I wanted discuss some parallel computing fundamentals. Having been involved with parallel computing for almost 20 years, I often forget that those new to the cause lack a firm background in some of the basic concepts being discussed. It took hard lessons and several years for these concepts to sink into my head, hopefully your path to “fully parallel” will be far more gentle. So, let’s begin; if there is one thing to “take away” from this column, it is the following set of three rules:
- Concurrency is a property of the program.
- Concurrency does not necessarily imply parallel execution.
- Efficient parallel execution of concurrent programs is a property of the hardware.
Like many people, I often confused the words “concurrency” and “parallel” and these terms are often used interchangeably. However, in HPC universe they are not the same thing. Concurrency is a property of a program or algorithm and if parts of the program can run independently, then those parts are concurrent. That is, they may be totally independent function calls or lines in a
do loop of some kind. Learning that parallelization can be compartmentalized, for many people, is a “Eureka, my program can run in parallel,” moment.
Not so fast there partner. If the independent parts are run on separate cores/nodes then we will say the program is executed in parallel. And, just because your program has some concurrency, it does not mean that it will run efficiently in parallel. The distinction is subtle, but very important when real hardware is used. Since the goal is to make your program run faster, we need to ask the question, Does making all the concurrent parts of the program parallel increase execution speed? The answer to that question is a definite “maybe.” Why? Because in some cases running concurrent parts of your program in parallel may actually slow down your the overall application!
How could this happen? Let’s introduce the concept of parallel overhead. In a threaded shared memory model, overhead is access to memory (there is also thread set-up and tear-down as well). Recall, that in a multi-core shared memory model all cores use the same memory and will often need exclusive or guarded access to memory locations. A thread will lock a memory location to make sure no other thread changes anything. If another thread wants to read/write to the same memory location, it must wait until the location is unlocked. And thus, you have overhead.
In a distributed memory model, the overhead is the time it takes to copy memory from one process to another (i.e. send a message). Even if the message passing program is executing on a shared memory SMP, there is still message overhead. If messages are sent over a network the overhead can get quite large.
When a program is executed in parallel, the expectation is that the overhead plus the parallel execution will produce a faster run time. In some cases, the overhead time is much larger than the parallel execution time and thus any gain is lost. For example take a simple pseudo-code loop:
for I = 0 to I = 8000
X[I] = X[I] + 1
I = I + 1
This loop is concurrent and could very easily be broken up into eight groups of 1000 iterations. The actual computation could be done in 1/8 the time, but only if the overhead does not outweigh the reduction in execution time. In a message passing environment, this loop would almost always run faster on a single core than it would on eight cores. My guess is that in a shared memory environment the same would be true. While this case may seem obvious, there are others that are not so obvious and looking at the code is only going to give you the smallest of hints.
In my second rule the key word is “necessarily.” Some concurrent applications with large data sets are obvious candidates for parallel computing. Parallel rendering is a good example. It is almost trivial because there is very little communication/sharing (overhead) between the concurrent parts. There are other applications where things get a bit more complicated and finding a concurrent loop does not “necessarily” mean it should run in parallel.
The third rule has one important implication. That is, efficient parallel programs are not “necessarily” portable. Or, put another way, portability does not imply efficiency. For instance, on one hardware platform a loop may be efficient when executed in parallel while on another, it may actually slow down your program. Gulp. And, you thought all it takes was a little OpenMP or MPI.
That’s probably enough parallel weirdness for one week. Remember the rules and set your expectations accordingly. Next time I explain why faster processors are not necessarily better than slower processors.