Forrest Hoffman Archive

The Global Arrays Toolkit, Part Three
A powerful little package to eliminate the details of communication and data distribution on distributed memory systems -- like, say, Linux clusters.
Global Arrays Toolkit, Part Two
Tackle a more complex and realistic Global Arrays Toolkit program, one that performs matrix-matrix multiplication.
The Global Arrays Toolkit
This month’s column introduces the Global Arrays Toolkit (GA, http://www.emsl.pnl.gov/docs/global/), a suite of application programming interfaces (APIs) for handling distributed data structures.
Unified Parallel C 101
Designed for high-performance computing on large-scale parallel machines, including Beowulf-style clusters, Unified Parallel C provides a uniform programming model for both shared and distributed memory hardware.
A Conference for Extremophiles
Supercomputing extremists converged on Seattle, Washington, last fall to share their experiences, exhibit their research, flaunt their wares, and award their pioneers.
Logging with syslog-ng, Part Two
Trace your cluster nodes from a central log using syslog-ng. A hands-on guide continues here.
HPC Logging with syslog-ng, Part One
If you need to monitor and manage logs across the nodes of a cluster, try syslog-ng (syslog, next generation), a drop-in replacement for syslogd. syslog-ng provides more sophisticated log management capabilities and enables log transfers over the Internet.
Designing a Cluster Solution
Explore some of the factors involved in designing a cluster from scratch. A case study serves as a pedagogical example.
Scyld Beowulf
Scyld Beowulf, a commercial cluster operating system, rounds out our coverage of the most popular Linux cluster software distributions.
OSCAR
Our coverage of Linux-based cluster distributions continues this month with OSCAR, the Open Source Cluster Application Resource software bundle available free from the Open Cluster Group.
The Rocks Cluster Distribution
We continue our coverage of Linux-based cluster distributions by delving into Rocks, a free and customizable distribution for commodity platforms funded by the National Science Foundation and distributed by the San Diego Supercomputing Center.
Clustermatic: A Complete Cluster Solution
The Clustermatic Linux distribution, produced by the Cluster Research Lab at Los Alamos National Laboratory, is a collection of software packages that provides an infrastructure for small- to large-scale cluster computing.
Using BProc
This month’s column focuses on building and using Beowulf Distributed Process Space (BProc) software used by the commercial Scyld Beowulf and the Clustermatic Linux distributions for high performance computing (HPC) clusters.
A Super Computing Show
Forrest Hoffman attended the recent Supercomputing 2004 conference and trade show. Here's his report.
One-Sided Communications with MPI-2
Traditional interprocess communication requires cooperation and synchronization between sender and receiver. MPI-2's new remote memory access features allow one process to update or interrogate the memory of another, hence the name one-sided communication. Here's a hands-on guide.
Using MPI-2
Last month's "Extreme Linux" introduced MPI-2, the latest Message Passing Interface (MPI) standard. MPI has become the preferred programming interface for data exchange -- called message passing -- for parallel, scientific programs. MPI has evolved since the MPI-1.0 standard was released in May 1994. The MPI-1.1 standard, produced in 1995, was a significant advance, and the MPI-2 standard clarifies and corrects the MPI-1.1 standard while preserving forward compatibility with MPI-1.1. A valid MPI-1.1 program is a valid MPI-2 program.
MPI-2: The Future of Message Passing
The Message Passing Interface (MPI) has become the application programming interface (API) of choice for data exchange among processes in parallel scientific programs. While Parallel Virtual Machine (PVM) is still a viable message passing system offering features not available in MPI, it's often not the first choice for developers seeking vendor-supported APIs based on open standards. Of course, standards evolve, and the MPI standard is no different.
High Performance I/O: Parallel netCDF
netCDF provides application programming interfaces (APIs) and a self-describing file format that stores metadata and data together in a single file.
64-bit Computing with SGI’s Altix
A little over a year ago, Silicon Graphics, Inc. (SGI, http://www.sgi.com) announced a new 64-bit supercomputing platform called the Altix 3000. In a break from its tradition of building large machines with MIPS processors running the IRIX operating system, the Altix uses Intel's Itanium 2 processor and runs -- you guessed it -- Linux. Unlike Beowulf-style Linux clusters, SGI's cache-coherent, shared-memory, multi-processor system is built on NUMAflex, SGI's third-generation, non-uniform memory access (NUMA) architecture, a highly scalable, global shared memory design proven in SGI's Origin 3000 systems.
Writing Hybrid MPI/OpenMP Code
The last few "Extreme Linux" columns have focused on multiprocessing using OpenMP. While often used in scientific models for shared memory parallelism on symmetric multi-processor (SMP) machines, OpenMP can also be used in conjunction with the Message Passing Interface (MPI) to provide a second level of parallelism for improved performance on Linux clusters having SMP compute nodes. Programs that mix OpenMP and MPI are often referred to as hybrid codes.
Using OpenMP, Part 3
This is the third and final column in a series on shared memory parallelization using OpenMP. Often used to improve performance of scientific models on symmetric multi-processor (SMP) machines or SMP nodes in a Linux cluster, OpenMP consists of a portable set of compiler directives, library calls, and environment variables. It's supported by a wide range of FORTRAN and C/C++ compilers for Linux and commercial supercomputers.
Multi-Processing with OpenMP, Part 2
This month, we continue our focus on shared-memory parallelism using OpenMP. As a quick review, remember that OpenMP consists of a set of compiler directives, a handful of library calls, and a set of environment variables that can be used to specify run-time parameters. Available for both FORTRAN and C/C++ languages, OpenMP can often be used to improve performance on symmetric multi-processor (SMP) machines or SMP nodes in a cluster by simply (and carefully) adding a few compiler directives to the code. Most commercial compilers for Linux provide support for OpenMP, as do compilers for commercial supercomputers.
Multi-Processing with OpenMP
In this column's previous discussions of parallel programming, the focus has been on distributed memory parallelism, since most Linux clusters are best suited to this programming model. Nevertheless, today's clusters often contain two or four (or more) processors per node. While one could simply start multiple MPI processes on such nodes to use these processors, taking best advantage of the hardware requires a different approach. Processors within a node typically share all the memory within that node, and they can communicate much more quickly with each other than with processors on other nodes.
Cluster Management with Condor, Part 3
The last two Extreme Linux columns provided an introduction to the Condor workload management system, gave detailed installation and configuration instructions for Beowulf clusters, and showed the details of managing and running MPI jobs (parallel programs that use the Message Passing Interface) with Condor. This month, let's continue looking at Condor, explore some of its advanced features, and check out its powerful queuing capabilities for lots of serial tasks.
Cluster Management with Condor, Part 2
Last month's column introduced Condor and presented a sample installation of the software package in a cluster environment. Condor is a system that creates a "high-throughput computing" environment by effectively utilizing computing resources from a pool of cluster nodes and disparate workstations distributed around a network. Like many batch queuing systems, Condor provides a queuing mechanism, scheduling policy, job priority scheme, and resource classification. Unlike most other batch systems, Condor doesn't require dedicated compute servers.
Cluster Management with Condor, Part I
A good job queuing and scheduling system is required whenever more than a couple of researchers share a Beowulf cluster. Coordinating with other users about when and where to run jobs on a shared cluster isn't impossible, but cluster administrators quickly realize the importance of having a robust batch system once users begin competing for resources.
It’s All About Speed
Every cluster builder wants to know how fast his or her computer is. After all, speed is the primary reason to build a Linux cluster -- aside from the gains in data capacity and resource redundancy. But how do you measure speed? CPU clock speed is one thing, but using those cycles to actually get work done is quite another. And in a cluster, the network connections between nodes can quickly become a crippling bottleneck.
Cluster Monitoring with Ganglia
Monitoring the status of a Beowulf-style cluster can be a daunting task for any system administrator, especially if the cluster consists of more than a dozen nodes. While Linux is extremely stable, hardware problems can cause nodes to crash or become inaccessible, and chasing down problem nodes in a 500-node cluster is painful. Luckily, some sort of statistical resource monitoring can often yield early warnings of impending hardware failures.
Programming with MPI Communicators and Groups, Part 2
Last month's issue of Linux Magazine was dedicated to cluster computing, allowing leaders in the field to present a wide range of topics about Beowulf-style clusters. Last month's issue also introduced model coupling with some example code. This month we return our attention to advanced Message Passing Interface (MPI) features by continuing the discussion of MPI groups and communicators begun in May.
Coupling Parallel Models
Scientific model developers often need to combine or couple existing models. In this feature story, Linux Magazine's "Extreme Linux" columnist Forrest Hoffman shows how to couple parallel models without a toolkit.
MPI Communicators and Groups
In the previous two columns, we discussed the master/slave programming idiom and how to pass derived data types in messages. This month, we continue our exploration of advanced uses of the Message Passing Interface (MPI) with a look at MPI communicators and groups, two MPI features that provide communications contexts and modularity for parallel codes.
Using Derived Data Types with MPI
Most programs written for distributed memory, parallel computers, including Beowulf clusters, utilize the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM) programming interfaces to exchange data or messages among processes. In the past, this column has presented many of the fundamentals of message passing and has shown a number of programming examples using both MPI and PVM. Last month's column focused on the master/slave model of parallelism using MPI, and introduced the MPI_Probe() routine. This month, let's discuss another advanced feature of MPI: how to use derived data types.
Message Passing for Master/Slave Programs
While building a Beowulf cluster presents some interesting technical challenges, the only reason to build such a cluster is to gain the "horsepower" required to solve some large computational problem or perform some repetitive processing task in a practical amount of time. As such, the requirements of your software and model should always dictate your choice of hardware, node interconnects, and hardware and software configuration. Indeed, the correct answer to nearly every cluster design question is, "It depends on your application."
Kickstarting Compute Nodes, Part 2
Installing Linux on more than a few nodes in a cluster is time consuming, boring, and potentially fraught with error. Since it's typically desirable to have a nearly identical operating system on each node, selfsame, full installs and configurations must be performed repeatedly. While disk cloning is a good solution for clusters with the exact same hardware, cloning hard drives can be problematic on nodes with a variety of different disks, network interfaces, and processors.
Kickstarting Cluster Nodes, Part 1
Designing, building, installing, and configuring a Beowulf-style cluster presents a number of challenges, even for very capable system designers and system administrators. Once decisions about topology, layout, hardware, and interconnect technologies have been made (based primarily on the needs of the software and models you intend to run), a considerable amount of work must still be done to forge a working cluster.
Running Commands on Many Nodes
Since this column became a regular part of Linux Magazine in January (2002), it's covered a wide variety of Linux high-performance computing topics. In the past year, we've configured network topologies, deployed job schedulers, explored high-bandwidth, low-latency networking hardware, and used open source tools such as MPI and PVM to develop parallel software applications.
Job Scheduling with Maui
Scheduling jobs and allocating resources on a Beowulf cluster quickly becomes a challenge once more than a few users start running codes on the system. Manual coordination of runs is tedious, particularly when different codes have very different resource requirements. A job queuing and scheduling facility solves these problems by automatically executing jobs as resources become available, ensuring optimal utilization of the cluster. Moreover, a good job scheduler can be configured to enforce operational policies about when and where jobs belonging to different users may be run.
Job Scheduling and Batch Systems
One of the measures of success of a Beowulf cluster is the number of people waiting in line to run their code on the system. Build your own low-cost supercomputer, and your cycle-starved colleagues will quickly become your new best friends. But, when they all get accounts and start running jobs, they'll soon find themselves battling each other for the limited resources of the machine.
High Performance Interconnects
As Beowulf-style clusters have proliferated, many computational scientists have discovered that, although clusters provide adequate CPU performance for most applications, more finely-grained models are often limited by the performance of the network that interconnects nodes. While the network is typically Fast Ethernet (100 Mb/s) or Gigabit Ethernet (1.0 Gb/s), it's still not fast enough to run applications originally developed for shared memory parallel systems or commercial systems featuring high-performance, custom-designed switched interconnects.
Scalable I/O on Clusters, Part 2
Linux clusters are an inexpensive yet very effective parallel computing solution. Indeed, the low cost and ease of building Linux clusters has encouraged researchers all around the world to build larger and faster collections of machines.
Scalable I/O on Clusters, Part I
Linux clusters have become so successful that they've proliferated internationally through research labs, universities, and large industries that require an inexpensive source of high performance computing cycles. Developers and users have pushed the technology by scaling their applications to more and more processors so that larger problems can be solved more quickly. This has resulted in clusters where some applications can actually become I/O bound -- the input/output of data to/from a large number of processors limits the performance of the application.
In the past few columns, we've discussed parallel computing, a methodology that divides and distributes the work of solving a significant computing task. Parallel computing can solve large problems more quickly because it focuses the processing power of many computers working in tandem (called a cluster) on a single task. A master node in the cluster breaks the problem into discrete parts, and passes tasks and computational data to slave nodes. The master node receives results from the slave nodes and generates a final answer. Coordination of the entire process is controlled with message-passing APIs.
Advanced Message Passing
In the past two columns, we've looked at the theory and basics of parallel computing. This month we get to the real world with a program that performs matrix multiplication.
Message Passing with MPI and PVM
Parallel computing can be utilized to perform numerous computations quickly and solve problems by making many processors work simultaneously on smaller subtasks or subsets of the data. Parallel computing is often used on problems that cannot be solved by more conventional means (such as using a serial algorithm running on a PC or workstation).
An Introduction to Parallel Programming
Having built a Beowulf cluster using instructions found on the Internet or in popular magazines, some zealous individuals are disgusted to discover that their favorite word processor and spreadsheet packages will not run on their powerful new creation.
Configuring a Beowulf Cluster
Without serious Linux savvy, installation and administration of a Beowulf cluster can be cumbersome and time consuming, particularly when the cluster consists of more than a handful of nodes. Lack of a single system image -- one operating system controlling all nodes simultaneously -- makes day-to-day administration challenging. The result is that, without software tools, you must maintain each node individually.
Concepts in Beowulfery
Computational scientists were early adopters of Linux because of its reliability and efficiency. These folks needed a powerful yet stable computing platform to run their complex scientific simulations. Linux provided a solid development platform and had the reliability and stability they required. Additionally, the open source aspect of Linux was appealing because the open source development process paralleled the scientific method; the code is widely published and reviewed by others prior to acceptance.