A confession is in order. The last two installments of “Cluster Rant” have been a sales pitch of sorts. If you believe some of the things I talked about — large clusters will break, applications must both tolerate failure and be easy to write — then you may agree that dynamic programming algorithms are one workable solution.
The next question is, how do implement these ideas?
The answer is the part many may find distasteful. If you are one of the brave few who take pause to think about how we are going achieve pervasive cluster computing, then take a bite of this column. The rest of you weenies should at least nibble at the edges.
SUBHEAD Think About It
Figure One summarizes the current state of parallel programming. It describes what programmers tend to think about at various stages of writing, compiling, and running programs. Certainly, other things can be added, but you get the idea.
Additionally, compilers for sequential applications do a fair amount of optimization, scheduling, and communication for a single processor. In a parallel environment, these issues are often too difficult for the compiler to handle, and the work is instead shifted to the developer. Indeed, parallel programming is hard because the programmer must manage many more interconnected parameters.
If parallel, multi-core clusters are going to become mainstream, many of these responsibilities must somehow be pushed back into the compiler. In addition, as I outlined previously, we need applications to tolerate failure, because the statistics of hardware failure will rule the day on large clusters.
The goal, or my wish as it where, is a programming method that lets me create programs that are fault tolerant, dynamically scalable (from one to thousands of processors), and co-operative with no central point of control. Moreover and most important, the method keeps me focused on the problem and not the architecture.
Moving in the Right Direction
The “making parallel easier” challenge is by no means new and has been taken up by many people.
At one end of the spectrum are the “message passing library” solutions. These libraries include things like MPI and PVM. The main advantage of message passing is that it’s relatively easy to implement and can be added as an extension to existing standard languages. The extension feature is very important. For historical reasons, much of the HPC code base is written in Fortran. When you consider that many of the early and still useful HPC codes may contain over a million lines of functioning code, extending these codes to work in parallel by adding library calls instead of re-writing the entire program makes a good economic argument.
A similar approach using threaded programming is accomplished by the OpenMP standard. OpenMP defines a set of directives that tell the compiler where to automatically add parallel OpenMP libraries at compile time. This method can be effective in preserving existing codes, as the directives exist as comments and do not change sequential program behavior.
Extensions to the Fortran standard, F90, F95,
and High-Performance Fortran
) have included features to aid in parallel optimization by compilers. However, these improvements still rely on the programmer having some knowledge of how well various parts of a program map to a particular parallel environment.
Originally developed for shared memory computers by David Gelernter and Nicolas Carriero at Yale University, the Linda
) have seen some success in the HPC world. (Perhaps the most notable is the use by the Gassiun software package.) Linda can extend almost any programming language by introducing a shared tuple space.
Using just four additional commands, the programmer dumps work into the tuplespace where other parts of the program, running on the same or another processor, can retrieve, calculate and return the object to tuple space where it is then retrieved. This approach has many advantages: it’s dynamic, scalable, and very clean. The biggest drawback for clusters is the use of a central, shared tuple space.
Unified Parallel C
) is an extension of the C programming language covered previously in Linux Magazine (March 2006, available online at http://www.linux-mag.com/2006-03/extreme_01.html
). Parallelism is expressed using an explicitly parallel execution model, a shared address space, synchronization primitives, a memory consistency model, and memory management primitives. UPC is relatively new and not used widely. It can run on clusters, but the words explicitly parallel
aren’t what I want as part of my idealized programming experience.
At the other end of the language spectrum is a start from scratch
approach called Z-level Programming Language
). ZPL is an array programming language designed to replace common programming languages used in engineering and scientific applications. The thing I like about ZPL is that it relies on implicit parallelism
(parallelism that is implied by your program and not explicitly
expressed by you). The design goal for ZPL was to obtain machine-independent high performance. As a result, ZPL programs run fast on both sequential and parallel computers. The clean sheet of paper approach offered by ZPL is interesting, but often distasteful to many programmers because of the need to start from scratch. The ZPL team gets extra points, however, because the have created a comic book on how to use ZPL (http://www.cs.washington.edu/research/zpl/comicbook/comicbook.html
). Figure Two
shows page one.
This admittedly brief survey of parallel programing languages is not intended to be exhaustive, but rather shows some examples of how people have approached the problem of making parallel programming easier.
Now, before we jump off the deep end, there is one more approach that deserves mentioning.
Push the Button, Max
Many have suggested, some have tried, and few have succeeded with automatically converting existing codes to parallel codes. On paper, the approach sounds like a good idea, but reality is quite another story.
There are two very big problems (among a raft of smaller problems). The first problem is pointers. Pointers are great aids for programmers, but when analyzing a code for parallelism, they create difficulties. Automatically breaking a program into many parts means that careful attention must be given to shared variables. Tracing data dependencies is hard enough in static structures, but with pointers, all bets are off.
The other large problem is that of the algorithm. Parallel execution is often accomplished by reworking the algorithm. Automatic tools can understand and restructure code blocks, but they cannot understand and restructure algorithms. There is more of an interesting story here, but that will have to wait for another column.
I Do Declare
This is the part of the column where you’ll need to call up those few remaining neurons that have stubbornly remained outside the box of conventional wisdom.
To achieve the goal of moving the programming experience closer to problem and further away from the computer, you will need to lift the programmer to a higher level of abstraction. That is, the programmer must not be allowed to think about how the program runs on the computer, just like Fortran lifted the original HPC programmer above the minutiae of assembly language programming and closer to the problem to solve. (Recall that Fortran was originally an acronym for FORmula TRANslation language).
In general, approaches to programming can be broken into two main groups, imperative and declarative.
Imperative programming is what most people have done in the past. Fortran, C, BASIC, Java, Perl, and Python all fall into the imperative camp, where the computation retains knowledge of the system state. In an imperative program, you tell the computer “how to solve” your problem. For instance, you assign data to a variable and continue until you reach a limit. The programmer explicitly tells the computer how to solve the problem. Program flow is assumed to be from one line to the next.
Declarative programming (http://en.wikipedia.org/wiki/Declarative_programming/
), on the other hand, allows the program to say what is to be done, but does not say how. The declarative compiler and interpreter use implied knowledge from the program to figure out how to solve the problem. The program assumes nothing about the system state when writing the program. Instead of moving from one line to the next, declarative programs often resolve or satisfy the current line of the program, by using other lines of the program. This distinction allows the underlying language to determine “how” to resolve the current line.
In a way, declarative programming insulates the programmer from the internals of the machine (the “how” to solve it part). For all the declarative programmer knows, the innards of the computer could be little monkeys and typewriters. The insulation has tremendous advantages. It lets the compiler figure out the best way to optimize for a given computing environment.
The ZPL language is great example of this method, because the programmer doesn’t express “how” to solve a matrix operation, but describes what he or she wants to do, allowing the compiler to optimize for a particular environment. If they had explicitly told the computer how to do the matrix operation, the compiler is constrained to work within the context of implied (implicit) program flow.
Declarative approaches sound nice, and in fact, are very elegant. Some examples you may have encountered before are SQL, Lisp, Scheme, Prolog, Haskell, and others. A declarative application can often be written in less lines of code than its imperative counterpart, as it doesn’t expend resources to control the machine. For example, from the programmers point of view, SQL doesn’t dictate how the computer resolves queries, which are often done in parallel. Less lines of code results in fewer bugs and better productivity.
The problem with declarative approaches is often efficiency. The abstraction layer has a cost. To be fair, some declarative approaches often have minimal imperative features to aid in programming.
The Explicit Implications
Now that I have breezed through enough computer science topics to make anyone’s head spin, let me try and tie this up in neat little package.
My experience has shown that parallel computing includes a large cost for application development. In addition, as clusters get bigger, the traditional methodology to create and run applications may not work well in “fault expected” environments. Existing applications are probably not going to be re-written any time soon, so my remaining comments have to do with those killer applications that have yet to be written.
To solve the problems I have outlined at a reasonable cost, I propose that a declarative approach to parallel computing is one of the promising avenues in which to proceed. As distasteful as it sounds, I see no other way out of our current conundrum of hard-to-program cantankerous beasts we call clusters. If programmers can create and test ideas quickly on any computer (one node or ten thousand nodes) and not worry about how the program gets executed, then clusters will become more useful to everyone.
There’s no doubt that declarative approaches will hurt efficiency and lead to many dismissing this approach to parallel computing. But, let’s take a look at the economics for a minute.
The cost of computing power is going down. Dual cores now cost as much as single cores did last year. The effective computing power has doubled in the last year. This trend will continue. Now consider what costs will increase. The cost of people will always rise, which means the cost to create software will rise commensurately (at a minimum) as well. Where should the effort to enhance efficiency be focused?
Two columns ago I started with my “Cluster Rant” wish list. My solution to this list looks like a sandwich with three distinct parts, many of which do not exist just yet. Start off with a layer of inexpensive and massively parallel hardware, slap on a layer of dynamic execution methods, and top it off with a declarative programming interface. Goes great with side of Linux.