The Future of High Performance Computing

Read the HPC tea leaves and see what the future of high-performance computing holds in store.

Over the past few years, this column has covered a wide variety
of topics in high-performance computing (HPC), primarily with an
eye toward Beowulf-style, Linux clusters.
Since the first such commodity clusters were built in the mid- to
late-1990s, they have increasingly served as the model for
commercial supercomputers. And as supercomputer vendors have died
off, the often-struggling vendors that remain are using more
commodity components and processors. While the cluster architecture
was in use long before the first Beowulfs, clusters now make up
almost 73 percent of the fastest supercomputers in the famed
“Top 500” list.

This adoption of commoditized components has been reflected in
software as well. Supercomputer vendors are increasingly relying on
third-party system software and compilers to support their
supercomputing systems. For example, Silicon Graphics sells the
Intel compilers for their Intel Itanium 2-based
Altix systems, and Cray sells the Portland Group
compilers for their AMD Opteron-based XT3 and
XD1 clusters. Moreover, Linux has played a
significant role in HPC commoditization, since it’s
frequently used as the starting point for operating systems on
large commercial supercomputer clusters. Although most vendors must
make significant changes to Linux for their own hardware, some of
the more general purpose improvements and additions to the kernel
are contributed back for potential inclusion in future releases of
the kernel.

While this commoditization transformation certainly saves money
for the vendor (as well as resulting in improvements to Linux), it
has some negative consequences as well. Even though most vendors
serve as an interface to other third-party vendors, supercomputing
centers and their users must often work with more than one company
to seek improvements or bug fixes for compilers, job schedulers,
and some system software components. More work is required of the
users and supercomputing center consultants when problems arise.

As the industry pushes toward delivery of a petaflop (1,000
trillion floating point operations per second) system, what other
software and hardware changes can be expected? Systems with enough
processor elements to achieve a petaflop must be designed
differently, and will require new software tools and application
development paradigms. A few of the current software and hardware
advances and their expected implications are discussed below.

Software Advances

It appears that little is changing with regard to the message
passing paradigm. MPI-2.0 (version 2.0 of the
Message Passing Interface) with MPI-IO
(MPI Input/Output) is being deployed on all current systems.
MPI-2.0 provides single-sided communications, while MPI-IO provides
an interface layer to I/O subsystems, including parallel I/O
subsystems. Both application programming interfaces (APIs) have
been discussed with accompanying examples here previously. APIs for
“SHared MEMory” (SHMEM) and Parallel
Virtual Machine
(PVM) are still available on most HPC
systems as well.

The most significant advances in application development appear
to be new productivity-oriented languages and APIs that remove at
least some of the burden of message passing from the developer.
Unified Parallel C (UPC) and
Co-Array Fortran (CAF) extend the C
and Fortran languages,
respectively, and offer inherent parallelism in a shared memory
style of programming. These languages can provide performance equal
to or better than direct message passing, while improving code
readability and maintainability. However, a limited number of
applications can benefit from either language, and care must be
taken by the developer to exploit data locality whenever possible.

Parallel libraries have also evolved and extended their
functionality. The Global Arrays Toolkit
(GAT), recently
discussed here, can be employed to work with large, distributed
arrays of data. Moreover, GAT can be used in conjunction with
traditional message passing APIs to offer both shared memory and
message passing paradigms in the same program. It also offers
interfaces to numerical libraries like ScaLAPACK and PeIGS. The toolkit
handles the data distribution and communication required, so the
programmer need not worry about these details.

The ScaLAPACK effort has grown to include not only the dense and
band matrix software (called ScaLAPACK), but also large,
sparse, eigenvalue software (PARPACK and
ARPACK), sparse direct systems software
and preconditioners for large, sparse, iterative solvers
(ParPre). This collection of libraries has
been funded by Defense Advanced Research Projects Agency (DARPA),
the U.S. Department of Energy (DOE), the National Science
Foundation (NSF), and the Center for Research on Parallel
Computation (CRPC).

ScaLAPACK includes a subset of the Linear
Algebra PACKage
(LAPACK) routines redesigned for distributed
memory MIMD parallel computers. It is presently written in
“single program, multiple data” (SPMD) style, using
explicit message passing for interprocessor communication.
ScaLAPACK assumes matrices are laid out in a two-dimensional block
cyclic decomposition. ScaLAPACK is portable to any system that
supports MPI or PVM. The library is built on top of distributed
memory versions of the Parallel Basic Linear Algebra
Subprograms (PBLAS) and a set of Basic
Linear Algebra Communication Subprograms
(BLACS), in which
all communications occur.

A more general and increasingly useful package is
PETSc (pronounced “pet-see”), the
Portable, Extensible Toolkit for Scientific
Computation. It is a suite of data structures and routines
for the scalable (parallel) solution of scientific applications
modeled by partial differential equations. Employing MPI for all
message passing communication, PETSc is designed to be easy to use
for beginners, but powerful enough for advanced users to have
control over the solution process. It includes a large suite of
parallel linear and nonlinear equation solvers that may be used in
C, C++, Fortran, and Python application codes, and often eliminates the need
for developers to write their own MPI code.

PETSc is intended for use in large-scale application projects,
and a wide variety of computational science projects are built
around PETSc. Developed and distributed by Argonne National
Laboratory, it provides many of the mechanisms needed in parallel
codes, including simple parallel matrix and vector assembly
routines that allow overlap of communication and computation to
hide network latency. In addition, PETSc includes support for
parallel distributed arrays, which are often useful for finite
difference methods.

PETSc includes or provides interfaces to a wide range of linear
and nonlinear equation solvers that employ a variety of Newton
techniques and Krylov subspace methods. PETSc provides several
parallel sparse matrix formats, including compressed row, block
compressed row, and block diagonal storage.

Figure One provides an overview of the
main numerical components of the PETSc library. Development of the
PETSc library is ongoing, and it is likely to be adopted more
widely since it handles the primary scientific computation without
requiring the programmer to write message passing code or implement
his own solver.

Hardware Advances

Processor technology is quickly approaching its maximum clock
speed, barring significant technological breakthroughs. The main
problems are leakage current, which causes silicon components to
generate heat, and power density. At some temperature, the
components would become molten and would no longer function, and
could also cause other parts of a system to burst into flames. We
can expect a little more performance as the die size is decreased,
but large increases in clock speed are not expected with current
technology. As a result, all manufacturers are now moving to
multi-core technology.

By putting multiple processor cores on a single processor unit,
system performance can continue to increase. For servers that run
many applications or serve many users at once, multi-core
processors can deliver improved overall performance. However,
applications that have no concurrent components don’t run any
faster. Parallel or threaded applications are required to exploit
the performance advantages of multi-core technology. Moreover, in a
large cluster of such processors, parallel application codes may
require some adaptation to take best advantage of it.

When deployed in cluster systems, multi-core processors add
another level of parallelism that can be exploited. It is now
possible to cluster nodes containing multiple processors consisting
of multiple cores. For example, a single cluster node can now have
four quad-core processors in a symmetric multi-processor (SMP)
configuration, yielding sixteen simultaneous execution units. While
it may be most convenient to simply treat the cores as if they were
independent, distributed memory processors by initiating MPI
processes for each core, it may be advantageous to exploit the
shared memory or shared cache aspect of the processor elements, as
is often done using OpenMP on single-core SMP nodes.

The disadvantage of multi-core processors is the fact that each
core on a processor must share bandwidth to memory. For many
applications, this may not present a problem, but some HPC
applications require very high memory bandwidth and may not be able
to realize all the performance increase expected from multi-core
systems. In future HPC application development, it may be
worthwhile to consider memory bandwidth contention among processor
cores.

IBM, Sony, Toshiba, and others have taken the idea of multi-core
processors another step further in the development of
“cell” processors. Essentially, these cell chips
consist of a traditional processor core connected to eight
special-purpose digital signal processor (DSP) cores, which IBM
calls synergistic processing elements
(SPEs). These SPEs provide SIMD (single instruction multiple data)
capabilities and incorporate large local storage areas. By
employing less capable processor elements, much of the instruction
overhead is reduced, resulting in higher overall performance. IBM
is designing such chips to be operating system-neutral and to run
many operating systems at once.

Field Programmable Gate Arrays (FPGAs) are also being
reconsidered as a method of offering improved performance. The
specialized logic circuits tend to be in embedded devices for
signal processing, aerospace and defense systems, medical imaging,
computer vision, and more. Some vendors are working to integrate
them into clusters and apply them to HPC applications. In
particular, the Cray XD1 can support FPGAs within its node
architecture. These chips have traditionally been applied for
integer arithmetic, but new programming tools are extending their
capabilities to floating point, and even double-precision floating
point, operations.

Petascale Systems

The U.S. Department of Energy is working to build a petaflop
system at Oak Ridge National Laboratory by 2008, and Berkeley
National Laboratory expects to scale up to a petaflop soon
afterward. Both systems are expected to be follow-on systems to the
AMD Opteron-based Cray XT3 cluster. In addition, the National
Science Foundation is currently considering the need for a
petascale system of its own. Such a system would contain on the
order of 100,000 processor elements, and the annual power costs
will exceed the purchase cost. A system that large poses many
challenges for HPC applications.

In particular, scaling to 100,000 to 200,000 processor elements
will be difficult for many HPC applications. Those employing
spatial domain decomposition will need to work with very large
domains or work at a very high spatial resolution. However, this is
an opportunity in some areas of computational science where spatial
domains are restricted by machine size and memory limitations. In
any case, many applications will need refactoring to achieve
maximum performance on a petascale system.

These first petascale systems will employ multi-core AMD Opteron
processors. In a system with so many multi-core processors, a
limited amount of memory may be available for each core. To scale,
parallel applications must make efficient use of memory. That means
that each process will not be able to keep track of the entire
domain decomposition, for instance. The per core memory footprint
must not increase as cores are added.

Input/output is another potentially large problem. It
won’t be possible for a single process to do I/O for 100,000
processor elements. Parallel I/O must be used to achieve
scalability. In general, some smaller number of processors must
work in concert to perform I/O operations for the larger number of
compute processor elements. Infrastructure for parallel I/O is
still evolving and improving too. Products like Lustre are now
being deployed to offer system-level parallel I/O across multiple
RAID arrays for large supercomputers. I/O may be one of the most
significant challenges for developers on these new petascale
systems.

To Infinity and Beyond!

It’s an exciting time to be on the bleeding-edge of high
performance computing. Failing some technological breakthrough,
Moore’s Law (as currently defined) is reaching the limits of
physics, so alternative designs are providing both opportunities
and challenges. Hopefully, some of the discussion here will prove
thought-provoking. It’s time for developers to take a fresh
look at application codes that will be deployed on really large
parallel systems.

Forrest Hoffman is a computer modeling and
simulation researcher at Oak Ridge National Laboratory.
