An Introduction to Parallel Programming

Having built a Beowulf cluster using instructions found on the Internet or in popular magazines, some zealous individuals are disgusted to discover that their favorite word processor and spreadsheet packages will not run on their powerful new creation.

Having built a Beowulf cluster using instructions found on the Internet or in popular magazines, some zealous individuals are disgusted to discover that their favorite word processor and spreadsheet packages will not run on their powerful new creation.

This is the case because unless a particular application has been specifically programmed to use multiple processors, it must be rewritten, or parallelized, before it will be able to run on either a parallel supercomputer or a cluster of PCs.

Moreover, since Beowulf clusters consist of computational and network elements that may be combined to optimally solve a particular problem or run a particular algorithm, the target application should play a significant role in the design of any new cluster.

Parallel computing is accomplished by splitting up a large computational problem into smaller tasks that may be performed simultaneously by multiple processors. For example, the addition of two very long vectors of numbers, A and B, can be performed by two processors if one of them adds the first half of vector A to the first half of vector B, while the second adds the second half of vector A to the second half of vector B. While this theoretically halves the time needed to solve the problem, the resulting vector, C, is now split across two different processor memories.

Because of this split, communication must occur to get the entire solution in one place. The distribution of the initial data, A and B, and the collection of the result, C, adds overhead to the computational problem and reduces the actual speedup of the entire task to less than double.

Symmetric multi-processor (SMP) and other shared memory computers can be used to reduce the amount and cost (in terms of time) of this communication; however, these systems typically have only a small number of processors or are very expensive, custom-built supercomputers. On the other hand, distributed memory platforms — including Beowulf clusters — are relatively inexpensive and can be scaled to hundreds or thousands of processors.

Most clusters built today are hybrids; they consist of many nodes (i.e., individual computers), each having two or more processors.

Decomposition and Granularity

Computational problems may be parallelized in a variety of ways. Parallelization may be accomplished by decomposing the data (as in the vector addition example above), by decomposing functionality so that one processor performs one type of operation while other processors simultaneously perform different operations, or by decomposing both the data and the functionality.

This decomposition may be established a priori or, more often, is performed dynamically once the program is running. Good parallel code most often automatically decomposes the problem at hand and allows the processors to communicate with each other when necessary — this is called “message passing” — while performing individual tasks.

Not all computational problems are amenable to parallel computing. If an algorithm cannot be restructured so that sub-tasks can be performed simultaneously or if the model components are highly interdependent, attempts to parallelize these codes may result in increased time-to-solution. Such “fine grained” problems do not scale well as more processors are applied to the computation. Performance of the finest-grained problems is limited by the speed of the fastest single CPU that is available.

Other computational problems, such as image processing where which each pixel may be manipulated independently, are “coarse grained.” Image processing is a good example of this type of problem. These problems are generally easier to parallelize and tend to benefit the most from parallel processing, particularly in distributed memory environments. The coarsest grained problems are often referred to as “embarrassingly parallel.”

Fortunately, most complex scientific problems may be decomposed by performing separate tasks independently and simultaneously on multiple processors or by splitting up the space and/or time coordinates of the system being modeled. These problems tend to fall somewhere between coarse and fine granularity and usually require a moderate amount of interprocessor communication to coordinate activities and share data.

For example, values for cells on a map may depend on neighboring cell values. If the map is decomposed into two pieces, each being processed on a separate CPU, the processors must exchange cell values along the adjacent edges of the map.

Problem decomposition is very important for successful parallel processing. A balance must be struck between computation and communication so that what was a computational problem on a single processor does not become a communications problem in a parallel environment. Writing good parallel code is actually more of an art than a science; practitioners must be able to think about algorithms in novel ways.

Message Passing

Many alternatives for code parallelization have been developed, but explicit message-passing strategies have met with the most success on a wide variety of applications and platforms. The most popular parallel environments and applications programming interfaces (APIs) are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine). Because these two APIs are widely used, parallel code that performs message passing using these libraries can be run on every platform from laptops to the largest commercial supercomputers without changing the source code.

Both PVM and MPI are available on a wide range of computer platforms and have bindings for C, C++, and FORTRAN. The two most popular MPI implementations are LAM (Local Area Multicomputer) and MPICH (MPI Chameleon). While PVM has task scheduling and advanced features that are not available in MPI, MPI is increasingly used for code development because it’s based on a community standard and has adequate features for most parallel applications.

All three of these implementations work well on Beowulf clusters and are easy to install under Linux. One or more of these libraries is often included in standard Linux distributions. RPMs and other types of installation packages are available on their respective Web sites: for PVM, for LAM, and MPICH.

Using MPI and PVM

The desired message passing environment should be downloaded and installed on all the computers that will be used for parallel processing. On a Beowulf cluster with a shared filesystem, a single installation can be made accessible to all nodes (for more information on Beowulf cluster file system topologies, please consult the February 2002 issue).

In the case of MPICH, the installation will create the file /usr/local/mpich/share/machines. This file specifies which nodes in the cluster are available for use and should usually contain the names of the all the nodes in the cluster (a machines file may also be specified at runtime for finer control). If LAM is used, the lamboot command should be executed to initiate a daemon on each node.

If PVM is used, the PVM shell can be used to construct the desired virtual machine and start up daemons on each node. The PVM_ROOT environment variable should be set for each host, and user-created binaries should be placed in a special directory, pvm3/ bin/ARCH/, below the user’s home directory, where ARCH should be replaced with the appropriate architecture name.

MPI programs can be compiled in many ways, but most MPI implementations provide an easy-to-use script that will set desired compiler flags, point the compiler at the right directory for MPI header files, and include the necessary libraries for the linker. MPICH and LAM both provide a script called mpicc that is used for compiling MPI codes.

Figure One shows how to compile and run a simple “Hello World!” program called mpi_hello (see Listing One, ). After compiling with mpicc, the mpirun command executes the program. The -np flag tells mpirun how many processes to start. mpirun starts one process on the local node (no matter how many processors are contained in the node), then initiates one process on each node listed in the machines file. If the number of processes that are specified exceeds the number of nodes available, additional processes are created in round-robin fashion until all processes have been started.

Figure One: Building and Running with MPI

[forrest@node01 mpi]$ mpicc -O -o mpi_hello mpi_hello.c
[forrest@node01 mpi]$ mpirun -np 6 mpi_hello
Hello World! I’m rank 4 of 6 on node05
Hello World! I’m rank 1 of 6 on node02
Hello World! I’m rank 5 of 6 on node06
Hello World! I’m rank 2 of 6 on node03
Hello World! I’m rank 3 of 6 on node04
Hello World! I’m rank 0 of 6 on node01

Listing One: mpi_hello.c

#include <stdio.h>
#include “mpi.h”

int main(int argc, char **argv)
int myrank, nprocs, namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Get_processor_name(processor_name, &namelen);

printf(“Hello World! I’m rank %d of %d on %s\n”, myrank,
nprocs, processor_name);

return 0;

Each process is assigned a rank: a numeric identifier starting at zero that indicates the order in which processes were initiated. The name of the executable file and any command-line arguments are the final arguments to mpirun.

In this example, six processes were started on different nodes (node01 through node06), and each process determined its own rank (0 through 5) and printed out the total number of processes involved in the job (6). While each process started at the same time, the printed output appears in no particular order. This is normal behavior when multiple processes all print at the same time.

In order to successfully compile the code, the MPI header file (mpi.h) must be included at the top. Just inside main(), MPI_Init() must be called and handed the command-line arguments so that the environment is set up correctly for the program to be able to run in parallel.

The next three MPI routines return information about the parallel environment. Although this example program merely prints out the information, it is usually used to do automatic problem decomposition and to set up communication between processes.

The MPI_Comm_size() routine returns the number of processes (subsequently stored in nprocs) in the communicator group MPI_COMM_WORLD. MPI_COMM_WORLD is a special communicator that denotes all of the processes available at initialization. The rank of the calling process (ranging from 0 to nprocs – 1) is provided by MPI_Comm_rank(). The rank is stored in myrank. MPI_Get_processor_ name() provides the hostname of the node (not the individual processor) that is being used, which is stored in processor_name, and the length of this hostname, which is stored in namelen.

Next, the code prints “Hello World!” and the values of the variables obtained in the three previous MPI calls. Finally, MPI_Finalize() is called to terminate the parallel environment.

A similar program can be written using PVM. In this case, the PVM shell is used to construct a virtual machine of the desired size by adding hosts using the add command (see Figure Two). The virtual machine configuration may be verified using the conf command.

Figure Two: Building and Running with PVM

[forrest@node01 forrest]$ cd pvm3/bin/LINUX
[forrest@node01 LINUX]$ gcc -O -I/usr/share/pvm3/include /
-L/usr/share/pvm3/lib/LINUX -o pvm_hello pvm_hello.c -lpvm3
[forrest@node01 LINUX]$ pvm
pvm> add node02 node03 node04
add node02 node03 node04
3 successful
node02 80000
node03 c0000
node04 100000
pvm> spawn -4 -> pvm_hello
spawn -4 -> pvm_hello
4 successful
[1:t40002] Hello World! My task id is t40002 on node01 with 4 hosts.
[1:t40002] EOF
[1:t80001] Hello World! My task id is t80001 on node02 with 4 hosts.
[1:t80001] EOF
[1:tc0001] Hello World! My task id is tc0001 on node03 with 4 hosts.
[1:tc0001] EOF
[1:t100001] Hello World! My task id is t100001 on node04 with 4 hosts.
[1:t100001] EOF
[1] finished
pvm> halt

Listing Two: pvm_hello.c

#include <stdio.h>
#include “pvm3.h”
int main(int argc, char **argv)
int i, mytid, dtid, info, nhost, narch;
struct pvmhostinfo *hostp;

mytid = pvm_mytid();
dtid = pvm_tidtohost(mytid);
info = pvm_config(&nhost, &narch, &hostp);

for (i = 0; i < nhost && hostp[i].hi_tid != dtid; i++);
printf(“Hello World! My task id is t%x on %s with %d hosts.\n”, mytid,
hostp[i].hi_name, nhost);

return 0;

Figure Two shows how the pvm_ hello program in Listing Two is compiled and run. In this example, the PVM shell is used to execute the program by invoking it with the spawn command. The number of tasks is specified via the -4 flag, and the output is sent to the display (hence the -> flag). In this example, one process is spawned on each of the four nodes included in the virtual machine (again, regardless of how many processors that node may have), and each process prints its task identifier and host name. The PVM shell is exited using the halt command so that the virtual machine is disassembled (i.e., the PVM daemons on each of the hosts are terminated).

Listing Two shows the PVM calls necessary to start and end a simple program. The PVM header file (pvm3.h) is included at the top of the code. The pvm_mytid() call returns the task ID of the calling process, and the pvm_ tidtohost() call returns the task ID of the daemon running on the host used by the task mytid. The pvm_ config() call returns information about the configuration of the virtual machine, including the total number of hosts, the number of different architectures, as well as some additional host-specific information. After calling these routines, the code prints “Hello World!” and the values obtained from these library calls. Finally, pvm_ exit() is called to terminate the parallel environment.

The Tip of the Iceberg

This brief introduction to MPI and PVM has merely demonstrated the routines used for parallel program initiation and termination. Message passing and data reduction routines are needed for most useful parallel codes. Many of these routines will be presented in the next Extreme Linux column.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at

Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/ on line 62