
Next Generation Parallel Computing

A new, federally funded program called the High Productivity Computing Systems Initiative aims to make high-performance computing more affordable and more accessible to more scientists. Here’s a look at the goals of the project, including an example of a better way to program in parallel today.

As you read this, there’s a revolution in parallel computing going on. Every major CPU manufacturer is releasing multi-core or multi-threaded processors that can execute multiple instruction streams simultaneously; advanced hardware of all sorts is becoming a commodity; and clusters of thousands of off-the-shelf processors are the norm. But more important, the software used to harness all that raw hardware power is changing for the better. Recognizing that solutions such as MPI and OpenMP are difficult and time-consuming to use, projects such as the High Productivity Computing Systems Initiative (HPCS, http://www.highproductivity.org/) are determined to create new technologies and techniques that make supercomputing more affordable and approachable. While it’s likely you’ll never see Supercomputing for Dummies at the bookstore or a Cray at Fry’s, the next generation of rocket scientists may not need a rocket scientist to design, program, and operate the next-generation supercomputer.

It’s no secret that parallel programming is difficult and time-consuming. Yet, however arduous the task, it’s also very necessary: while all computers solve problems, high-performance computers — clusters, mega-clusters, and supercomputers — running complex codes solve problems much faster.

For example, the United States Army’s workhorse, the HMMWV (or “Hummer”), needed more armor to defend against novel insurgent tactics, namely improvised explosive devices (IEDs). Instead of the months and even years normally required for retrofitting deployed equipment, models and supercomputer simulations designed by the Army Research Lab (http://www.arl.army.mil/main/main/default.htm) yielded bolt-on “Armor Survivability Kits” in a matter of weeks. In this case, as in many others, minimizing the so-called time-to-solution was critical.

Time-to-Solution

In some cases, the time-to-solution — the time to develop a solution plus the computational time necessary to yield results — can be abbreviated by simply applying more hardware. If a working application exists but doesn’t run fast enough, adding more memory, processors, or bandwidth, singly or in combination, can readily lead to palpable improvements.

But in other cases, when an application doesn’t yet exist, the time-to-solution can be hard to predict. Moreover, it’s particularly onerous to assess the difficulty of developing a high-performance computing (HPC) application. Clearly, though, simpler is better in software development. Given better tools, powerful abstractions, and more time to focus on the problem at hand, solutions can be created more quickly and efficiently. Indeed, as its name implies, facilitating the application of high-performance computing to more of the sciences is the goal of the High Productivity Computing Systems project.

HPCS is a small project by government standards — only $100 million — but the program is proving to be a huge catalyst for change in parallel computing. Ultimately, HPCS hopes to create standards for high-performance computing that ease the transfer of knowledge, code, and practices between all United States government and national security supercomputers. To that end, HPCS is already closely examining and positively influencing how high-performance computers are built, deployed, maintained, and programmed.

For instance, novel hardware designs under consideration — such as Sun Microsystems’s Proximity Communication technology, which links neighboring CPUs without the use of wires — would reduce latency and minimize system footprints. Novel candidate software designs, especially expressive parallel programming languages, would allow codes to be readily ported and tuned to new hardware platforms. The software designs are also intended to be more robust: program failures should not destroy intermediate results and should not impact other applications running concurrently on the same system. Of course, the new programming languages should also boost programmer and scientist productivity: make interprocess communication easier or invisible and, failing that, at least more intuitive to use.

The Chapel and the Fortress

Historically, parallel programs solve problems by splintering data into subsets and sending each subset to a remote CPU for processing. But this is a somewhat fractured and artificial approach, beholden to the structure of the supercomputer and the (limited) capabilities of the programming system. It’s processor-centric.

In contrast, the languages being researched by HPCS subscribe to a holistic view of the problem. For example, MPI programs often include statements that divide the number of elements in the problem by the number of processors in the system to control how much work each processor performs. However, a program written in an HPCS language merely defines the total size of the data and the operation to perform — the rest is carried out in parallel automatically.
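To make the contrast concrete, here is a minimal sketch (not taken from any particular application; the size and variable names are illustrative) of the bookkeeping a typical MPI program carries just to divide up the work:

#include <mpi.h>

#define SIZE 100

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each process computes which slice of the data is "its own"
    int chunk = SIZE / nprocs;
    int start = rank * chunk;
    int end   = (rank == nprocs - 1) ? SIZE : start + chunk;

    // ... work only on elements [start, end), then combine results ...

    MPI_Finalize();
    return 0;
}

An HPCS-style language is meant to eliminate the chunk arithmetic entirely: the programmer declares the whole data set and the operation, and the runtime decides which processor handles which elements.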

HPCS is currently considering the merits of at least three specialized parallel programming languages: Cray’s Chapel, IBM’s X10, and Sun Microsystems’s Fortress. (See http://www.ahpcrc.org/conferences/PGAS/ for papers on each language.) All three languages aim to make the programmer more productive, and it’s likely that the best features of each will be merged into the language that HPCS ultimately creates.

All three languages have very compact syntax for data parallelism, task parallelism, and atomic blocks. For data parallelism, each language provides a keyword to denote that a loop should be run in parallel (or in Sun’s case, because it defaults to parallel loops, a keyword makes the loop run serially). Task parallelism is also accomplished via a simple keyword: collect the statements that can run in parallel, put them in a code block, and specify a single keyword. Atomic blocks work in a similar fashion to task parallelism: put the code in a block and label it as atomic.

The three languages also provide several interesting abstractions to try to control data locality, which is extremely important to performance. If a processor is performing a computation with remote data, the latency to fetch or push data is likely to be many orders of magnitude greater than the time to perform the computation. Therefore, all three of the languages allow a programmer to organize the data (and the computation).

All three languages make communication between nodes easier. All do away with the two-sided (send and receive) communication required by MPI. Furthermore, memory is globally addressable, allowing remote memory to be referenced directly.

Unified Parallel C

But it will be a while before HPCS releases its standard language. In the meantime, there are several variations of well-known programming languages that have been tweaked specifically for parallel computing. A short list includes Unified Parallel C (UPC), Co-Array Fortran (CAF, http://www.co-array.org/), the Java-based Titanium (http://titanium.cs.berkeley.edu/papers.html), Matlab*P (http://interactivesupercomputing.com/), and pMatlab (http://www.astro.princeton.edu/~jvkepner/).

The first three are similar because each maps a data structure (usually an array) across multiple processors, each makes its data structures globally addressable (every processor can see them), and each has simple semantics for executing a statement across the machines that make up the larger logical computer. Languages with these characteristics belong to a family called Partitioned Global Address Space (PGAS) languages.

Let’s look at UPC to get a flavor of how the entire PGAS group works. (The Matlab variants won’t be discussed in this feature.)

With some very small additions to the C language, UPC adds a great deal of power. In fact, the changes required to write a functional parallel program are minimal. Even better, you can still call your existing C libraries from a UPC program. The most significant change is the notion of how programs execute.

There are several ways to organize parallel programs. UPC uses a model called Single Program, Multiple Data (SPMD), where a single program (code base) processes a great deal of different data. A physics simulation is a typical SPMD application: there is a large volume of physics data (the multiple data), and because the laws of physics are constant, the same laws can be applied to all of the data in the same way (the single program). Each CPU runs the same program and has its own subset of the data, yet can see all other shared data. When a UPC thread needs access to a piece of remote memory, either for reading or writing, it simply references the data.
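As a rough sketch of that last point (the array name here is made up, and the shared keyword is explained in the next section), a possibly-remote read looks like ordinary C; the UPC runtime performs any communication behind the scenes:

shared int data[1024];     // distributed across all UPC threads

int read_example(void)
{
    // Element 937 may live on another node; UPC fetches it transparently
    return data[937];
}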

Let’s look at an example UPC application.

How UPC Works

There are a few keywords that you need to know to create UPC programs.

*The first is the UPC constant THREADS, the number of threads the program is currently running on.

*The second is MYTHREAD, an integer in the range 0 to THREADS - 1 that serves as each thread’s unique identifier.

*The statement upc_barrier halts each thread until every thread has reached the barrier.

*The shared keyword is used to allocate variables that should be accessible to all of the threads. But how are shared variables distributed among the threads? The default method assigns the first element of the array to thread 0, the second element to thread 1, and so on. If the number of elements is greater than the number of threads, the process continues in a round-robin fashion — that is to say, element THREADS + 1 is assigned to thread 0. How many consecutive elements are placed on each thread is controlled by the block size, a user-defined parameter. The default block size is one. If the block size were two, the first two elements of the array would be placed on thread 0, the next two elements on thread 1, and so on.

An example shared array of ten elements with block size three looks like:

shared [3] int array[10];

If there are three threads, this declaration places four elements on the first thread (due to the round-robin rule) and three elements on each of the second and third.
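Spelled out, with three threads the blocks land like this:

shared [3] int array[10];

// thread 0: array[0], array[1], array[2], and array[9] (wrap-around)
// thread 1: array[3], array[4], array[5]
// thread 2: array[6], array[7], array[8]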

There is one last UPC construct, the upc_forall loop, which will be discussed in detail momentarily.

Listing One is a serial implementation of a reduction of two vectors to a single number via addition. The code reduces the two initial vectors into a third and then reduces that combined vector into the result via addition. (In practice, there is a more efficient solution, but it isn’t as illustrative.)

Listing One: Use serial code to reduce two vectors into a single number via addition

#include <stdio.h>

#define SIZE 100

int vector1[SIZE], vector2[SIZE], vectorSum[SIZE];

int main()
{
    int i, sum = 0;

    // Initialize both vectors with data
    for (i = 0; i < SIZE; i++)
    { vector1[i] = vector2[i] = i % 10; }

    // Sum the two vectors into a third
    for (i = 0; i < SIZE; i++)
    { vectorSum[i] = vector1[i] + vector2[i]; }

    // Sum the third array into the final number
    for (i = 0; i < SIZE; i++)
    { sum += vectorSum[i]; }

    printf("The sum is %i\n", sum);
    return 0;
}

Transforming the serial code to parallel code involves very few changes. The new code is shown in Listing Two. Each thread runs every line of code in Listing Two.

Listing Two: Reduce two vectors to a number via addition in parallel using Unified Parallel C

#include <upc.h>    // UPC runtime declarations
#include <stdio.h>

#define SIZE 100

shared [1] int vector1[SIZE],
               vector2[SIZE], vectorSum[SIZE];

int main()
{
    int i, sum = 0;

    // Every thread (redundantly) initializes the shared arrays
    for (i = 0; i < SIZE; i++)
    {
        vector1[i] = vector2[i] = i % 10;
    }

    upc_barrier;

    // Each thread sums only the elements it owns (block size one)
    for (i = MYTHREAD; i < SIZE; i += THREADS)
    {
        vectorSum[i] = vector1[i] + vector2[i];
    }

    upc_barrier;

    // Every thread computes the final sum (inefficient; see below)
    for (i = 0; i < SIZE; i++)
    {
        sum += vectorSum[i];
    }

    printf("The sum is %i\n", sum);
    return 0;
}

Let’s walk through the code:

*Adding the keyword shared distributes the arrays across the CPUs using a block size of one.

*upc_barrier provides synchronization. Without it, one thread can start reducing the vectors into vectorSum before all of the other processors are done populating the vector arrays. This can cause incorrect answers.

*The second loop has been modified so that each thread only references elements that are local to it. If the block size in the shared declaration were something other than one, the bounds and stride of this loop would have to change (see the sketch after this list).

*As is, the third loop is executed by every thread, making it terribly inefficient: all threads touch all of the data in vectorSum, incurring a large amount of communication. The overhead can be avoided by wrapping the loop in if (MYTHREAD == 0) { } so that only the first thread runs this code.
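As mentioned in the second bullet, here is a sketch of how the second loop would have to change if the arrays were declared with some larger block size, say shared [BLOCK] with a hypothetical constant BLOCK: each thread must walk its own blocks rather than every THREADS-th element.

#define BLOCK 5    // hypothetical block size matching the shared declaration

// Visit only the blocks this thread owns, then every element inside each block
for (i = MYTHREAD * BLOCK; i < SIZE; i += THREADS * BLOCK)
{
    int j;
    for (j = i; j < i + BLOCK && j < SIZE; j++)
    {
        vectorSum[j] = vector1[j] + vector2[j];
    }
}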

In addition to the previous enhancement, the UPC upc_forall construct can further reduce the amount of code that must change when the block size changes. Listing Three demonstrates its use.

Listing Three: The parallel reduction rewritten using upc_forall, with the final loop guarded

#include <upc.h>    // UPC runtime declarations
#include <stdio.h>

#define SIZE 100

shared [1] int vector1[SIZE],
               vector2[SIZE], vectorSum[SIZE];

int main()
{
    int i, sum = 0;

    for (i = 0; i < SIZE; i++)
    {
        vector1[i] = vector2[i] = i % 10;
    }

    upc_barrier;

    // The fourth parameter (the affinity expression) assigns
    // iteration i to thread i % THREADS
    upc_forall (i = 0; i < SIZE; i++; i)
    {
        vectorSum[i] = vector1[i] + vector2[i];
    }

    upc_barrier;

    // Only thread 0 computes and prints the final sum
    if (MYTHREAD == 0)
    {
        for (i = 0; i < SIZE; i++)
        {
            sum += vectorSum[i];
        }

        printf("The sum is %i\n", sum);
    }

    return 0;
}

upc_forall adds a fourth parameter to the loop: the affinity expression. upc_forall assigns each iteration to the appropriate thread by evaluating that expression, in this case i, and computing i % THREADS.
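A useful variation (standard UPC, though not used in Listing Three) is to make the affinity expression the address of a shared element; each iteration then runs on whichever thread owns that element, so the loop no longer depends on the block size chosen in the declaration:

upc_forall (i = 0; i < SIZE; i++; &vectorSum[i])
{
    // Runs on the thread that owns vectorSum[i], whatever the block size
    vectorSum[i] = vector1[i] + vector2[i];
}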

Looking for a Better Hammer

Because UPC is based on C, you can easily and incrementally modify a legacy MPI or serial application to run in parallel. UPC avoids total rewrites and lessens the expense of “porting” existing code. As shown in the example, it is fairly easy to add UPC constructs to existing C code to speed up core loops. In addition, UPC code does not incur the latencies associated with calls to MPI libraries. Many UPC compilers are available, and vendors offer software, documentation, and support on platforms ranging from the PC to high-end Cray supercomputers.

UPC is still young and is but one of the languages HPCS is studying. However, it’s obvious that something better than MPI is needed, certainly as multi-threaded computers become as commonplace as laptops. UPC points in the right direction — to the future.

Benjamin Mayer is a Ph.D. student at the University of Minnesota, where he studies data mining and parallel computing. He can be reached at bmayer@gmail.com.
