A new, federally-funded program called the High Productivity Computing Systems Initiative aims to make high-performance computing more affordable and more accessible to more scientists. Here’s a look at the goals of the project, including an example of a better way to program in parallel today.
As you read this, there’s a revolution in parallel computing going on. Every major CPU manufacturer is releasing multi-core or multi-threaded processors that can execute multiple instruction streams at any given time; advanced hardware of all sorts is becoming a commodity; and clusters of thousands of off-the-shelf processors are the norm. But more important, the software used to harness all that raw hardware power is changing for the better. Recognizing that solutions such as MPI and OpenMP are difficult and time-consuming to use, projects such as the High Productivity Computing Systems Initiative (HPCS, http://www.highproductivity.org/) are determined to create new technologies and techniques that make supercomputing more affordable and approachable. While it’s likely you’ll never see Supercomputing for Dummies at the bookstore or a Cray at Fry’s, the next generation of rocket scientists may not need a rocket scientist to design, program, and operate the next-generation supercomputer.
It’s no secret that parallel programming is difficult and time-consuming. Yet, however arduous the task, it’s also necessary: while all computers solve problems, high-performance computers — clusters, mega-clusters, and supercomputers — running complex codes solve problems much faster.
For example, the United States Army’s workhorse, the HMMWV (or “Hummer”), needed more armor to defend against novel insurgent tactics, namely improvised explosive devices (IEDs). Instead of the months and even years normally required for retrofitting deployed equipment, models and supercomputer simulations designed by the Army Research Lab (http://www.arl.army.mil/main/main/default.htm) yielded bolt-on “Armor Survivability Kits” in a matter of weeks. In this case, as in many others, minimizing the so-called time-to-solution was critical.
In some cases, the time-to-solution — the time to develop a solution plus the computational time necessary to yield results — can be abbreviated by simply applying more hardware. If a working application exists, but doesn’t run fast enough, adding more memory, processors, bandwidth, and more, singly or in combination, can readily lead to palpable improvements.
But in other cases, when an application doesn’t yet exist, a time-to-solution can be hard to predict. Moreover, it’s particularly onerous to assess the difficulty of developing a high-performance computing (HPC) application. Clearly, though, simpler is better in software development. Given better tools, powerful abstractions, and more time to focus on the problem at hand, solutions can be created more quickly and efficiently. Indeed, as its name implies, facilitating the application of high-performance computing to more of the sciences is the goal of the High Productivity Computing Systems project.
HPCS is a small project by government standards — only $100 million — but the program is proving to be a huge catalyst for change in parallel computing. Ultimately, HPCS hopes to create standards for high-performance computing that ease the transfer of knowledge, code, and practices between all United States government and national security supercomputers. To that end, HPCS is already closely examining and positively influencing how high-performance computers are built, deployed, maintained, and programmed.
For instance, novel hardware designs under consideration — such as Sun Microsystems’s Proximity Communications specification, which links neighboring CPUs without the use of wires — would reduce latency and minimize system footprints. Novel candidate software designs, especially expressive parallel programming languages, would allow codes to be readily ported and tuned to new hardware platforms. The software designs are also intended to be more robust: program failures should not destroy intermediate results and should not impact other applications running concurrently on the same system. Of course, the new programming languages should also boost programmer and scientist productivity: they should make interprocess communication easier or invisible or, failing that, at least more intuitive to use.
The Chapel and the Fortress
Historically, parallel programs solve problems by splintering data into subsets and sending each subset to a remote CPU for processing. But this is a somewhat fractured and artificial approach, beholden to the structure of the supercomputer and the (limited) capabilities of the programming system. It’s processor-centric.
In contrast, the languages being researched by HPCS subscribe to a holistic view of the problem. For example, MPI programs often include statements that divide the number of elements in the problem by the number of processors in the system to control how much work each processor performs. However, a program written in an HPCS language merely defines the total size of the data and the operation to perform — the rest is carried out in parallel automatically.
HPCS is currently considering the merits of at least three specialized parallel programming languages: Cray’s Chapel, IBM’s X10, and Sun Microsystems’s Fortress. (See http://www.ahpcrc.org/conferences/PGAS/ for papers on each language.) All three languages aim to make the programmer more productive, and it’s likely that the best features of each language will be merged into the language that HPCS creates.
All three languages have very compact syntax for data parallelism, task parallelism, and atomic blocks. For data parallelism, each language provides a keyword to denote that a loop should be run in parallel (or in Sun’s case, because it defaults to parallel loops, a keyword makes the loop run serially). Task parallelism is also accomplished via a simple keyword: collect the statements that can run in parallel, put them in a code block, and specify a single keyword. Atomic blocks work in a similar fashion to task parallelism: put the code in a block and label it as atomic.
The three languages also provide several interesting abstractions to try to control data locality, which is extremely important to performance. If a processor is performing a computation with remote data, the latency to fetch or push data is likely to be many orders of magnitude greater than the time to perform the computation. Therefore, all three of the languages allow a programmer to organize the data (and the computation).
All three languages make communication between nodes easier. All do away with the two-sided communication required by MPI. Furthermore, memory is globally addressable, allowing remote memory to be directly referenced.
Unified Parallel C
But it will be a while before HPCS releases its standard language. In the meantime, there are several variations of well-known programming languages that have been tweaked specifically for parallel computing. A short list includes Unified Parallel C (UPC), Co-Array Fortran (CAF, http://www.co-array.org/), the Java-based Titanium (http://titanium.cs.berkeley.edu/papers.html), Matlab*P (http://interactivesupercomputing.com/), and pMatlab (http://www.astro.princeton.edu/~jvkepner/).
The first three are similar because each maps a data structure (usually an array) across multiple processors, each makes data structures globally addressable (every processor can see the structures), and each has simple semantics to execute a statement across the machines that are part of the larger logical computer. Languages with such characteristics belong to a family of programming languages called partitioned global address space, or PGAS, languages.
Let’s look at UPC to get a flavor of how the entire PGAS group works. (The Matlab variants won’t be discussed in this feature.)
With some very small additions to the C language, UPC adds a great deal of power. In fact, the changes required to write a functional parallel program are very minimal. Even better, you can call your C libraries from a UPC program. The most significant change is the notion of how programs execute.
There are several different ways to organize parallel programs. UPC uses a model called Single Program, Multiple Data (SPMD), where a single function (code base) processes a great deal of different data. An example of an SPMD application is a physics simulation, where there is a large volume of physics data (the multiple data). Because the laws of physics are constant, those same laws can be applied to all of the data in the same way (the single program). Each CPU runs the same program and has its own subset of the data, yet can see all other shared data. When a UPC thread needs access to a part of remote memory, either for reading or writing, it simply references the data.
Let’s look at an example UPC application.
How UPC works
There are a few keywords that you need to know to create UPC programs.
* The first is the UPC constant THREADS, the number of threads the program is currently running on.
* The second is MYTHREAD, an integer in the range of 0 to THREADS – 1. Each number in that range is used to assign a unique number to each thread.
* The instruction upc_barrier blocks each thread until all threads have executed the instruction.
* The shared keyword is used to allocate variables that should be accessible to all of the threads. But how are shared variables allocated to each processor? The default method assigns the first element of the array to thread 0, the second element to thread 1, and so on. If the number of elements is greater than the number of threads, the process continues in a round-robin fashion — that is to say, element THREADS + 1 is assigned to thread 0. How many consecutive elements are placed on a thread is controlled by the block size, a user-defined parameter. The default block size is one. If the block size were two, the first two elements in the array would be placed on thread 0, the next two elements on thread 1, and so on.
An example shared array of ten elements with block size three looks like:

shared [3] int array[10];
If there are three CPUs, this declaration places four elements on the first processor (due to the round-robin rule) and three elements on the second and third.
There is one last UPC construct, the upc_forall loop, which will be discussed in detail momentarily.
Listing One is a serial implementation of a reduction of two vectors to a single number via addition. The code reduces the two initial vectors into a third and then reduces that combined vector into the result via addition. (In practice, there is a more efficient solution, but it isn’t as illustrative.)