Turning a Linux Cluster into a Supercomputer

While a Linux cluster offers a cost-effective alternative for numerous enterprise and technical computing applications, its price/performance for high-performance computing (HPC) applications can be substantially worse. Why? HPC applications stress the entire system in ways that business and technical applications do not.

For instance, many HPC applications are so communications-intensive that in order to run efficiently they need an interconnect system with higher performance than most commodity systems can provide. Others utilize very fine-grained parallelism and require frequent synchronization between processors, something the Linux scheduler doesn’t provide. Still others require high-speed I/O to a shared global file system with file management capabilities not available in a standard Network File System (NFS). Finally, HPC applications that are designed to stretch the boundaries of scientific exploration and take engineering analysis to new limits require scalability that is far beyond the practical limits of a commodity Linux cluster.

However, Linux clusters can be enhanced to run high-performance computing workloads efficiently by optimizing the system architecture, interconnect technology, the kernel, and the file system, and improving system scalability and manageability. With innovative designs that address these areas, the Cray XD1 and Cray XT3 systems demonstrate that Linux can handle demanding HPC applications. Optimizing Linux clusters to support HPC brings the cost benefits of commercially available processors and the portability benefits of Linux to those scientists and engineers with demanding application processing requirements.

HPC Challenges for Linux Clusters

Enterprise applications such as web serving and transaction processing, as well as some HPC applications, such as the searching and sorting programs used in life sciences, consist of a vast number of problems that can be distributed among processors with little or no interaction. Each processor solves a single problem and the results from each processor are returned to a central database. These applications are known as embarrassingly parallel, and individual processor speed is the greatest determinant of performance.
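The embarrassingly parallel pattern can be sketched with Python's standard library; the sequences and the GC-content scoring function below are hypothetical stand-ins for the life-sciences searching and sorting workloads mentioned above.

```python
from multiprocessing import Pool

def score_sequence(seq):
    """Hypothetical independent work item: score one sequence's GC content.
    No coordination with any other worker is needed."""
    return seq, sum(1 for c in seq if c in "GC") / len(seq)

if __name__ == "__main__":
    sequences = ["GATTACA", "GGCCGGCC", "ATATAT"]
    with Pool(processes=3) as pool:
        # Each worker solves one problem; results are simply collected
        # at the end, as in the central-database pattern described above.
        results = dict(pool.map(score_sequence, sequences))
    print(results["GGCCGGCC"])  # -> 1.0
```

Because the work items never interact, raw per-processor speed, not interconnect quality, dominates performance here.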

In contrast, most HPC applications solve a single problem by distributing work to multiple processors. Frequent, voluminous communication and coordination between the processors is the norm. Any delays encountered in the course of sharing information can dramatically increase the application’s runtime.

In typical Linux clusters, inefficiencies due to high-latency, low-bandwidth interconnect technology and the lack of coordination between the operating systems on individual processors tend to swell communications wait times and steal efficiency. These losses grow as the number of processors increases. Eventually, adding processors actually decreases application performance.
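The scaling behavior described above can be captured in a toy model (the work and per-node communication constants are illustrative, not measured figures): compute time shrinks as 1/n while per-barrier communication wait grows with node count, so total runtime eventually rises as processors are added.

```python
def runtime(n, work=1000.0, comm_per_node=0.05):
    """Toy model: ideal compute time shrinks as 1/n, but communication
    wait grows linearly with the processor count n."""
    return work / n + comm_per_node * n

# Find the processor count that minimizes runtime in this model.
best = min(range(1, 513), key=runtime)
assert runtime(2 * best) > runtime(best)  # doubling past the sweet spot hurts
```

In this model the optimum sits near sqrt(work / comm_per_node); beyond it, interconnect overhead dominates, which is exactly the regime a faster interconnect pushes further out.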

HPC applications also require an extremely reliable system because a single run might take several weeks to complete and the failure of any single component could cause the run to abort. Ensuring reliability at the scale of large HPC systems requires conscious systems engineering beyond connecting individual servers.

This systems engineering should also extend to system management. Managing systems consisting of individual nodes, each with its own copy of Linux and a local disk that must be updated individually, is a daunting administrative challenge. Large national facilities may have the requisite staff to meet the challenge, but smaller sites will likely find the personnel requirements onerous or impractical. Thus, efficient system management is critical to the viability of HPC. The best management environments present the entire system as a single entity and allow the administrator to work with it at this level of abstraction.

Finally, HPC applications generate and use tremendous amounts of data, and thus require a high-performance file system. Datasets are frequently generated by and shared between all the processors running an application. A high-performance, global file system that provides concurrent access to all of the nodes in the system is mandatory for many, perhaps most HPC installations.

Optimizing a Linux Cluster for HPC

Linux clusters can be optimized to deliver the high application processing performance and system efficiency necessary for HPC applications. These systems require:

*High-speed interconnect technology to increase communications performance

*An optimized operating system to eliminate delays

*System-wide management to ensure efficient operation and high availability

*A high-performance file system for data coordination

The Cray XD1 and Cray XT3 systems illustrate how systems can be optimized for HPC. Cray designers used similar approaches for the two systems, which are targeted to the differing budget and performance needs of their respective customer bases. The Cray XD1 supercomputer is designed for industry, academic, and government applications requiring exceptional performance, reliability, and usability. The Cray XT3 supercomputer is designed for large, capability-class HPC applications, and scales to support the most challenging HPC workloads.

Interprocessor Communication Speed

Multiple processors within a cluster are most commonly connected through a PCI bus, using an I/O-connected architecture. This architecture can easily scale to hundreds or thousands of processors by adding PCs, servers, or SMP systems to the external switching network. However, high-latency, low-performance I/O buses such as PCI are fundamentally limited in the performance they can achieve. This technology is sufficient for computing peripherals, but isn’t suitable for high-performance interprocessor communications. The Cray systems eliminate the PCI-bus bottleneck with a direct-connect architecture, where processors are connected directly to the interconnect.

The direct-connect architecture drove Cray's selection of the AMD Opteron processor for both the XD1 and XT3 systems. The Opteron excels at moving data from processor to memory and from processor to processor. Its strong memory performance derives from integrated dual-channel DDR memory controllers that cut memory latency by as much as half relative to competitive processors. Opteron processors also feature HyperTransport technology, which connects CPUs directly to the interconnect over a 6.4 GB/s path.

The direct-connect architecture provides the hardware basis for a fast interconnect. Fully capitalizing on the hardware requires offloading communications overhead onto a dedicated communications processor. This processor performs functions normally associated with software, such as ensuring reliable transmission, enabling one-sided memory operations in hardware, and synchronizing clocks across the nodes in the system.

A dedicated communications processor and the direct-connect architecture lay the foundation for low-latency, high-bandwidth communications, both key to running HPC applications efficiently across large numbers of processors. Without this foundation, applications execute inefficiently and cannot scale to large node counts, which in turn translates into lower-quality product designs, longer time-to-market, and a more expensive development process.

The Cray XD1 system utilizes the Cray Rapid Array Processor for communications management, while the Cray XT3 uses the Cray SeaStar routing and communications processor. These custom processors remove communications overhead from the compute processors, and speed communication between processors and between processors and memory.

FIGURE ONE: The Cray XD1 architecture with RapidArray Interconnect System

Figure One shows the Cray XD1 direct-connect architecture with the Cray Rapid Array Processor. The Cray XD1 architecture delivers 8 GB/s bandwidth with 1.7 microsecond MPI latency between nodes.
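Using the figures quoted for the Cray XD1 (1.7 microsecond MPI latency, 8 GB/s bandwidth), a standard first-order cost model estimates message transfer time as latency plus message size divided by bandwidth. The sketch below applies that model; it is an illustration, not vendor code.

```python
LATENCY_S = 1.7e-6       # MPI latency quoted for the Cray XD1
BANDWIDTH_BPS = 8e9      # 8 GB/s interconnect bandwidth

def transfer_time(message_bytes):
    """First-order cost model: fixed latency plus serialization time."""
    return LATENCY_S + message_bytes / BANDWIDTH_BPS

# Small messages are latency-bound; large ones are bandwidth-bound.
assert transfer_time(8) < 2 * LATENCY_S
assert transfer_time(8 * 2**20) > 100 * LATENCY_S
```

The model makes clear why fine-grained applications, which exchange many small messages, benefit most from the low latency of a direct-connect interconnect.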

FIGURE TWO: The Cray XT3 architecture with Cray SeaStar 3-D Interconnect

Figure Two shows the Cray XT3 direct-connect architecture with the Cray SeaStar routing and communications processor. The Cray SeaStar processor uses an embedded PowerPC 440 microprocessor with on-chip memory to set up message path routing and pass control to DMA Engine hardware. The DMA Engine moves bytes between the communication links and the memory of the Opteron. The dedicated DMA hardware permits a peak data rate of 7.6 GB/s.

Interprocessor Synchronization

One of the benefits of Linux for most applications is its ability to manage a myriad of operational tasks concurrently, such as scanning for viruses, cleaning up logs, and maintaining basic accounting records. For HPC application processing, this benefit becomes a liability.

First, processors that are engaged in housekeeping functions are not running the application, which reduces the application-processing productivity of the system. More important, HPC programs are commonly written with barriers: instructions that tell a processor to wait for results from other processors once it reaches a particular point in the computation. If processors reach a barrier and must wait for another processor that is performing housekeeping tasks, many compute cycles that could be spent on the application are wasted. This lack of synchronization between nodes is known as operating system (OS) jitter.

Benchmarks confirm that OS jitter can have even more impact on efficiency than interconnect performance. The performance impact is visible in systems with as few as eight processors, and grows more pronounced with increased system scale. The Cray XD1 addresses OS jitter with a new Linux scheduler policy known as the Linux Synchronized Scheduler (LSS). The LSS uses a system-wide clock synchronized to better than a microsecond resolution. The system uses this clock to align the beginning of time slices across the entire system, to ensure that the user’s application processes run in the same time slices on all nodes of a system, and to ensure that housekeeping is performed in the same time slices across the entire system. LSS also restricts the number of time slices available for housekeeping.
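The effect the LSS targets can be illustrated with a small simulation (the probabilities and delay costs below are invented for illustration): at a barrier every node waits for the slowest one, so an unsynchronized housekeeping delay on any single node stalls them all, whereas aligned time slices pay each delay only once.

```python
import random

def barrier_time(nodes, steps, jitter_prob, jitter, aligned, seed=0):
    """Simulate `steps` compute phases separated by barriers.
    Each phase costs 1 unit plus an occasional housekeeping delay."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        if aligned:
            # Synchronized scheduling: housekeeping hits every node in the
            # same time slice, so the barrier absorbs the delay once.
            total += 1.0 + (jitter if rng.random() < jitter_prob else 0.0)
        else:
            # Unsynchronized: the barrier waits for the slowest node, and the
            # chance that *some* node is delayed grows with the node count.
            delayed = any(rng.random() < jitter_prob for _ in range(nodes))
            total += 1.0 + (jitter if delayed else 0.0)
    return total

fast = barrier_time(nodes=64, steps=1000, jitter_prob=0.02, jitter=0.5, aligned=True)
slow = barrier_time(nodes=64, steps=1000, jitter_prob=0.02, jitter=0.5, aligned=False)
assert fast < slow  # aligned slices waste far fewer cycles at barriers
```

Raising `nodes` widens the gap, matching the observation that jitter's impact grows with system scale.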

Figure Three shows the performance of the Cray XD1 system at various processor counts, with and without LSS, on the LS-DYNA benchmark. The benchmark is a three-car collision simulation with 794,780 elements, six contact interfaces, and a simulated time of 150 milliseconds.

FIGURE THREE: Cray XD1 system performance in the LS-DYNA version 970, revision 5434a benchmark

The Cray XT3 system addresses the problem of OS jitter by dividing processor responsibilities into service I/O functions and compute functions. This differentiation is controlled by the system and doesn’t require additional management expertise. All housekeeping tasks are performed by service I/O processors running the familiar Linux environment. The compute processors run the Cray XT3 Catamount microkernel, a lightweight kernel that minimizes system overhead and allows the system to scale to thousands of processors. Catamount has limited interaction with applications, such as managing virtual memory addressing, providing memory protection, and performing basic scheduling. This proven microkernel architecture ensures reproducible run-times for MPP jobs, supports fine-grain synchronization at scale, and ensures high-performance, low-latency communication.

File System Efficiency

Many HPC applications share a dataset, consisting of one or more files, across many processes running on disparate processors. This is problematic for a Linux cluster where each node has its own file system. Which node owns which file? How do other nodes access the file? How can write access to a single file be granted to a large number of processes running on different nodes?

Some clusters attempt to use NFS to meet these sharing requirements. Unfortunately, NFS does not scale well, either in performance or reliability. NFS also depends upon TCP/IP, which introduces considerable additional performance overhead.

The Cray XD1 and XT3 systems solve this problem by using the Lustre parallel file system from Cluster File Systems, Inc. This high-performance, high-availability, object-based storage architecture was designed specifically for HPC. It provides uniform, global filenames to users and applications and can scale to thousands of nodes. Files can be striped across multiple storage devices attached to multiple nodes, so arbitrarily high bandwidth can be achieved. Lustre runs natively across the high-performance interconnect, bypassing TCP/IP overhead.
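Striping of the kind Lustre performs can be understood as round-robin arithmetic. This sketch maps a file byte offset to a storage target; it is a conceptual illustration of striping, not Lustre's actual layout code.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (storage target index, offset within that
    target's object) under simple round-robin striping."""
    stripe_index = offset // stripe_size              # which stripe overall
    target = stripe_index % stripe_count              # which storage target
    # Offset on the target: completed rounds plus the in-stripe remainder.
    target_offset = (stripe_index // stripe_count) * stripe_size \
        + offset % stripe_size
    return target, target_offset

# With 1 MiB stripes over 4 targets, byte offset 5 MiB lands on target 1,
# at offset 1 MiB within that target's object.
assert locate(5 * 2**20, 2**20, 4) == (1, 2**20)
```

Because consecutive stripes land on different targets, a large sequential read or write engages all the storage devices at once, which is how striping multiplies bandwidth.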

Management and Availability

Large clusters are notoriously difficult to manage, commonly requiring a large staff and multiple, integrated management software packages. Even so, failures caused by differing software versions or inconsistent node, file system, or user configurations are common. Cascading failures, where a failure on one node causes a chain of failures on other nodes, can be very difficult and time-consuming to diagnose. System administrators are faced with the difficult choice of ensuring a stable cluster by forsaking the benefits of advancing technology, or making upgrades and risking system failure.

The Cray XD1 and XT3 systems use integrated monitoring and management systems that include an independent management processor, management software, and an independent supervisory network to monitor, control, and manage the computer. Often, these subsystems detect and automatically resolve potential problems before they become job-threatening. Tightly integrated operating and management systems also let administrators manage hundreds or thousands of processors as a single system, eliminating the administrative effort and problems associated with loosely-coupled cluster systems. By architecting the Cray XD1 and XT3 for manageability, Cray has significantly reduced the management effort and, consequently, the total cost of ownership.

Optimized for HPC

Linux clusters offer cost-effective, high-performance processing for applications that can be distributed across multiple processors and do not require extensive interprocessor communication. For HPC applications, Linux clusters lack the necessary interprocessor communication speed, interprocessor synchronization, file system efficiency, and manageability to run large HPC applications. These issues can be successfully addressed in purpose-built systems that optimize commercially available components to enhance HPC application processing performance.

The Cray XD1 and XT3 systems are optimized for HPC application processing. These systems provide uniform performance across processors and fast, scalable, global I/O with high reliability in configurations up to thousands of processors — all in an environment that is cost-effective and easy to manage.

Optimizing commodity Linux components to run HPC applications will let more scientists and engineers push the boundaries of technology in their respective fields.

Jeff Brooks is the MPP Product Manager at Cray Inc.
