In the past few weeks, I've received two new titles on Linux performance Tuning; Performance Tuning for Linux Servers from IBM Press and Optimizing Linux Performance: A Hands-On Guide to Linux Performance Tools from Prentice Hall Professional Technical Reference.
I'm quite happy to see books of a more advanced nature coming out about Linux. Beginner's books are nice, but the real need for documentation is on the more advanced topics, where man pages and HOWTOs aren't quite sufficient to get it done. Performance tuning, in particular, is of heavy interest to admins who are deploying or thinking about deploying Linux, and they need to get the most bang for their buck.
Our coverage of Linux-based cluster distributions continues this month with OSCAR, the Open Source Cluster Application Resource software bundle available free from the Open Cluster Group.
We continue our coverage of Linux-based cluster distributions by delving into Rocks, a free and customizable distribution for commodity platforms funded by the National Science Foundation and distributed by the San Diego Supercomputing Center.
The Clustermatic Linux distribution, produced by the Cluster Research Lab at Los Alamos National Laboratory, is a collection of software packages that provides an infrastructure for small- to large-scale cluster computing.
This month’s column focuses on building and using Beowulf Distributed Process Space (BProc) software used by the commercial Scyld Beowulf and the Clustermatic Linux distributions for high performance computing (HPC) clusters.
PolyServe Matrix Server overwhelmed competitive cluster file system (CFS) products in a performance benchmark conducted by a non-profit computing consortium for Italian universities, PolyServe, Inc. announced today. It is the first known independent performance comparison of major commercial and open-source Linux cluster file systems.
Clusters of every size experience failures: processors can die, hard disks often crash, and interface cards have been known to produce spurious errors. Of course, software can fail, too, for any number of reasons. Prevention is a necessity, but the next best option is to react and respond to faults as they occur. If you're a cluster developer, Fault Tolerant MPI (FT-MPI) can help keep your compute jobs humming.
Traditional interprocess communication requires cooperation and synchronization between sender and receiver. MPI-2's new remote memory access features allow one process to update or interrogate the memory of another, hence the name one-sided communication. Here's a hands-on guide.
Imagine all of the processing power within your enterprise - from every large and small server and cluster in every datacenter, to every networked personal computer - all available to work on solving the day's business problems. That's the notion of an enterprise grid, and if the Enterprise Grid Alliance (EGA) fulfills its mission, a company-wide computing farm will be a reality.
The Message Passing Interface (MPI) has become the application programming interface (API) of choice for data exchange among processes in parallel scientific programs. While Parallel Virtual Machine (PVM) is still a viable message passing system offering features not available in MPI, it's often not the first choice for developers seeking vendor-supported APIs based on open standards. Of course, standards evolve, and the MPI standard is no different.
A little over a year ago, Silicon Graphics, Inc. (SGI, http://www.sgi.com) announced a new 64-bit supercomputing platform called the Altix 3000. In a break from its tradition of building large machines with MIPS processors running the IRIX operating system, the Altix uses Intel's Itanium 2 processor and runs -- you guessed it -- Linux. Unlike Beowulf-style Linux clusters, SGI's cache-coherent, shared-memory, multi-processor system is based on NUMAflex, SGI's third-generation, non-uniform memory access (NUMA) architecture, which has proven to be a highly-scalable, global shared memory architecture based on SGI's Origin 3000 systems.
The last few "Extreme Linux" columns have focused on multiprocessing using OpenMP. While often used in scientific models for shared memory parallelism on symmetric multi-processor (SMP) machines, OpenMP can also be used in conjunction with the Message Passing Interface (MPI) to provide a second level of parallelism for improved performance on Linux clusters having SMP compute nodes. Programs that mix OpenMP and MPI are often referred to as hybrid codes.
This is the third and final column in a series on shared memory parallelization using OpenMP. Often used to improve performance of scientific models on symmetric multi-processor (SMP) machines or SMP nodes in a Linux cluster, OpenMP consists of a portable set of compiler directives, library calls, and environment variables. It's supported by a wide range of FORTRAN and C/C++ compilers for Linux and commercial supercomputers.
This month, we continue our focus on shared-memory parallelism using OpenMP. As a quick review, remember that OpenMP consists of a set of compiler directives, a handful of library calls, and a set of environment variables that can be used to specify run-time parameters. Available for both FORTRAN and C/C++ languages, OpenMP can often be used to improve performance on symmetric multi-processor (SMP) machines or SMP nodes in a cluster by simply (and carefully) adding a few compiler directives to the code. Most commercial compilers for Linux provide support for OpenMP, as do compilers for commercial supercomputers.
In this column's previous discussions of parallel programming, the focus has been on distributed memory parallelism, since most Linux clusters are best suited to this programming model. Nevertheless, today's clusters often contain two or four (or more) processors per node. While one could simply start multiple MPI processes on such nodes to use these processors, taking best advantage of the hardware requires a different approach. Processors within a node typically share all the memory within that node, and they can communicate much more quickly with each other than with processors on other nodes.
The last two Extreme Linux columns provided an introduction to the Condor workload management system, gave detailed installation and configuration instructions for Beowulf clusters, and showed the details of managing and running MPI jobs (parallel programs that use the Message Passing Interface) with Condor. This month, let's continue looking at Condor, explore some of its advanced features, and check out its powerful queuing capabilities for lots of serial tasks.
A good job queuing and scheduling system is required whenever more than a couple of researchers share a Beowulf cluster. Coordinating with other users about when and where to run jobs on a shared cluster isn't impossible, but cluster administrators quickly realize the importance of having a robust batch system once users begin competing for resources.
Every cluster builder wants to know how fast his or her computer is. After all, speed is the primary reason to build a Linux cluster -- aside from the gains in data capacity and resource redundancy. But how do you measure speed? CPU clock speed is one thing, but using those cycles to actually get work done is quite another. And in a cluster, the network connections between nodes can quickly become a crippling bottleneck.
Monitoring the status of a Beowulf-style cluster can be a daunting task for any system administrator, especially if the cluster consists of more than a dozen nodes. While Linux is extremely stable, hardware problems can cause nodes to crash or become inaccessible, and chasing down problem nodes in a 500-node cluster is painful. Luckily, some sort of statistical resource monitoring can often yield early warnings of impending hardware failures.
Last month's issue of Linux Magazine was dedicated to cluster computing, allowing leaders in the field to present a wide range of topics about Beowulf-style clusters. Last month's issue also introduced model coupling with some example code. This month we return our attention to advanced Message Passing Interface (MPI) features by continuing the discussion of MPI groups and communicators begun in May.
While building a Beowulf cluster is cheap, estimating the true costs of acquiring an entire cluster can sometimes be a headache. Duke University's Dr. Robert G. Brown describes what you need to know before writing a proposal -- or a check -- for your first Beowulf.
Scientific model developers often need to combine or couple existing models. In this feature story, Linux Magazine's "Extreme Linux" columnist Forrest Hoffman shows how to couple parallel models without a toolkit.
In the past several months, a good number of corporate, government, academic, and research institutions -- Pixar, the Lawrence Livermore National Lab (LLNL), the Centers for Disease Control (CDC), Shell E&P, and others -- have announced the installation of substantial, high performance Linux computing clusters. In the case of LLNL, for example, the largest of its three new clusters (built by Linux Networx) is composed of 252 Pentium 4 processors, capable of a theoretical peak of 857 gigaflops, making it one of the fastest clusters ever built. In the case of Pixar, a 1,024-processor blade cluster (using 2.8 GHz Xeons) from RackSaver is replacing the company's existing Sparc-based render farm.
In the previous two columns, we discussed the master/slave programming idiom and how to pass derived data types in messages. This month, we continue our exploration of advanced uses of the Message Passing Interface (MPI) with a look at MPI communicators and groups, two MPI features that provide communications contexts and modularity for parallel codes.
Most programs written for distributed memory, parallel computers, including Beowulf clusters, utilize the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM) programming interfaces to exchange data or messages among processes. In the past, this column has presented many of the fundamentals of message passing and has shown a number of programming examples using both MPI and PVM. Last month's column focused on the master/slave model of parallelism using MPI, and introduced the MPI_Probe() routine. This month, let's discuss another advanced feature of MPI: how to use derived data types.
Installing Linux on more than a few nodes in a cluster is time consuming, boring, and potentially fraught with error. Since it's typically desirable to have a nearly identical operating system on each node, selfsame, full installs and configurations must be performed repeatedly. While disk cloning is a good solution for clusters with the exact same hardware, cloning hard drives can be problematic on nodes with a variety of different disks, network interfaces, and processors.
Designing, building, installing, and configuring a Beowulf-style cluster presents a number of challenges, even for very capable system designers and system administrators. Once decisions about topology, layout, hardware, and interconnect technologies have been made (based primarily on the needs of the software and models you intend to run), a considerable amount of work must still be done to forge a working cluster.
Tying a LAN of computers together to do cooperative work isn't a new idea. Combining several thousand computers spread across the globe into a commodity service like water or electricity is. But is the "computing grid" ready for us to plug in? Here's a report.
Given the amount of press Java garners, you'd think that every programmer is busily building Java applications. The reality is that many programmers have yet to give Java a try. If you're one of those programmers, Robocode might just be the project you need to jump into Java.
Linux clusters are an inexpensive yet very effective parallel computing solution. Indeed, the low cost and ease of building Linux clusters has encouraged researchers all around the world to build larger and faster collections of machines.
In the past two columns, we've looked at the theory and basics of parallel computing. This month we get to the real world with a program that performs matrix multiplication.