While a Linux cluster offers a cost-effective alternative for numerous enterprise and technical computing applications, its price/performance for high-performance computing (HPC) applications can be substantially worse. Why? HPC applications stress the entire system in ways that business and technical applications do not.
For instance, many HPC applications are so communications-intensive that they need a higher-performance interconnect than most commodity systems can provide in order to run efficiently. Others use very fine-grained parallelism and require frequent synchronization between processors, which the standard Linux scheduler does not coordinate across nodes. Still others require high-speed I/O to a shared global file system with file management capabilities not available in a standard Network File System (NFS). Finally, HPC applications designed to stretch the boundaries of scientific exploration and take engineering analysis to new limits require scalability far beyond the practical limits of a commodity Linux cluster.
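To make the communication pattern concrete, the following minimal MPI sketch (illustrative only, not drawn from any Cray code) shows a fine-grained iterative kernel in which every iteration ends with a global reduction. Each reduction is a full synchronization point, so interconnect latency and uncoordinated operating system activity on any one node stall all the others.

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative only: a fine-grained iterative kernel in which every
     * iteration ends with a global reduction.  Each MPI_Allreduce forces
     * all processors to wait for the slowest one before continuing. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local_residual, global_residual = 1.0;
        for (int iter = 0; iter < 1000 && global_residual > 1e-6; iter++) {
            local_residual = 1.0 / (iter + 1);   /* stand-in for real computation */
            /* Every processor must arrive here before any can continue. */
            MPI_Allreduce(&local_residual, &global_residual, 1,
                          MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("converged on %d processors, residual %g\n", size, global_residual);
        MPI_Finalize();
        return 0;
    }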
However, Linux clusters can be enhanced to run high-performance computing workloads efficiently by optimizing the system architecture, interconnect, kernel, and file system, and by improving system scalability and manageability. With innovative designs that address these areas, the Cray XD1 and Cray XT3 systems demonstrate that Linux can handle demanding HPC applications. Optimizing Linux clusters for HPC brings the cost benefits of commercially available processors and the portability benefits of Linux to scientists and engineers with demanding application processing requirements.
HPC Challenges for Linux Clusters
Enterprise applications such as web serving and transaction processing, as well as some HPC applications, such as the searching and sorting programs used in life sciences, consist of a vast number of problems that can be distributed among processors with little or no interaction. Each processor solves a single problem and the results from each processor are returned to a central database. These applications are known as embarrassingly parallel, and individual processor speed is the greatest determinant of performance.
In contrast, most HPC applications solve a single problem by distributing work to multiple processors. Frequent, voluminous communication and coordination between the processors is the norm. Any delays encountered in the course of sharing information can dramatically increase the application’s runtime.
In typical Linux clusters, inefficiencies due to high-latency, low-bandwidth interconnect technology and the lack of coordination among the operating systems on individual nodes tend to swell communications wait times and erode efficiency. These losses grow as the number of processors increases; eventually, adding processors actually decreases application performance.
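A simple, purely illustrative cost model shows the effect; the constants below are assumptions chosen only to display the shape of the curve, not measurements of any real system. Each processor's share of the computation shrinks as 1/P, while the communication term grows with P.

    #include <stdio.h>

    /* Illustrative cost model, not a benchmark: total time on P processors
     * is the computation divided by P plus a communication term that grows
     * with P.  The constants are arbitrary. */
    int main(void)
    {
        const double t_compute = 100.0;   /* seconds of work on one processor */
        const double t_message = 0.05;    /* per-processor communication cost */

        for (int p = 1; p <= 4096; p *= 2) {
            double t_total = t_compute / p + t_message * p;
            printf("P = %4d  time = %7.3f s  speedup = %6.2f\n",
                   p, t_total, t_compute / t_total);
        }
        return 0;
    }

With these arbitrary constants, the modeled speedup peaks at a few dozen processors and then declines, mirroring the behavior described above.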
HPC applications also require an extremely reliable system because a single run might take several weeks to complete and the failure of any single component could cause the run to abort. Ensuring reliability at the scale of large HPC systems requires conscious systems engineering beyond connecting individual servers.
This systems engineering should also extend to system management. Managing systems consisting of individual nodes, each with its own copy of Linux and a local disk that must be updated individually, is a daunting administrative challenge. Large national facilities may have the requisite staff to meet the challenge, but smaller sites will likely find the personnel requirements onerous or impractical. Efficient system management is therefore critical to the viability of HPC. The best management environments present the entire system as a single entity and allow the administrator to work with it at that level of abstraction.
Finally, HPC applications generate and use tremendous amounts of data, and thus require a high-performance file system. Datasets are frequently generated by and shared between all the processors running an application. A high-performance, global file system that provides concurrent access to all of the nodes in the system is mandatory for many, perhaps most HPC installations.
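As a concrete illustration, the sketch below uses MPI-IO, one common interface to such a global file system, to have every process write its own block of a single shared file. The file name is hypothetical, and the pattern assumes a file system that allows all nodes to open and write the same file concurrently.

    #include <mpi.h>

    /* Illustrative only: every process writes a disjoint block of one
     * shared file through MPI-IO.  The file name is hypothetical. */
    int main(int argc, char **argv)
    {
        int rank;
        double block[1024] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared_dataset.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes to its own region of the same file. */
        MPI_Offset offset = (MPI_Offset)rank * sizeof(block);
        MPI_File_write_at(fh, offset, block, 1024, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }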
Optimizing a Linux Cluster for HPC
Linux clusters can be optimized to deliver the high application processing performance and system efficiency necessary for HPC applications. These systems require:
*High-speed interconnect technology to increase communications performance
*An optimized operating system to eliminate delays
*System-wide management to ensure efficient operation and high availability
*A high-performance file system for data coordination
The Cray XD1 and Cray XT3 systems illustrate how systems can be optimized for HPC. Cray designers used similar approaches for the two systems, which are targeted to the differing budget and performance needs of their respective customer bases. The Cray XD1 supercomputer is designed for industry, academic, and government applications requiring exceptional performance, reliability, and usability. The Cray XT3 supercomputer is designed for large, capability-class HPC applications, and scales to support the most challenging HPC workloads.
Interprocessor Communication Speed
Multiple processors within a cluster are most commonly connected through a PCI bus, using an I/O-connected architecture. This architecture can easily scale to hundreds or thousands of processors by adding PCs, servers, or SMP systems to the external switching network. However, high-latency I/O buses such as PCI are fundamentally limited in the bandwidth they can deliver. This technology is sufficient for computing peripherals, but it isn't suitable for high-performance interprocessor communications. The Cray systems eliminate the PCI-bus bottleneck with a direct-connect architecture, in which processors are attached directly to the interconnect.
The direct connect architecture drove Cray's selection of the AMD Opteron processor for both the XD1 and XT3 systems. The Opteron excels at moving data from processor to memory and from processor to processor. Its outstanding memory performance derives from an integrated dual-channel DDR memory controller that cuts memory latency by as much as half relative to competing processors. Opteron processors also feature HyperTransport technology, which connects each CPU directly to the interconnect over a 6.4 GB/s path.
The direct connect architecture provides the hardware basis for a fast interconnect. Fully capitalizing on the hardware requires offloading communications overhead onto a dedicated communications processor. This processor performs functions normally associated with software, such as ensuring reliable transmission, enabling one-sided memory operations in hardware, and synchronizing clocks across the nodes in the system.
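At the application level, one-sided memory operations are typically expressed through interfaces such as MPI's remote memory access calls. The sketch below is a generic MPI-2 example, not Cray-specific code; whether these calls are actually offloaded to a communications processor depends on the system's implementation.

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative only: a one-sided put through MPI's remote memory access
     * interface.  Rank 0 writes directly into rank 1's window without rank 1
     * posting a matching receive.  Run with at least two processes. */
    int main(int argc, char **argv)
    {
        int rank;
        double buffer = 0.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose one double on every rank as a remotely accessible window. */
        MPI_Win_create(&buffer, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            double value = 42.0;
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        }
        MPI_Win_fence(0, win);

        if (rank == 1)
            printf("rank 1 received %g without posting a receive\n", buffer);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }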
A dedicated communications processor and the direct connect architecture lay the foundation for the low-latency, high-bandwidth communications that allow HPC applications to run efficiently across large numbers of processors. Without this foundation, applications execute inefficiently and cannot scale to large numbers of nodes, which in turn translates into lower-quality product designs, longer time-to-market, and a more expensive development process.
The Cray XD1 system utilizes the Cray Rapid Array Processor for communications management, while the Cray XT3 uses the Cray SeaStar routing and communications processor. These custom processors remove communications overhead from the compute processors, and speed communication between processors and between processors and memory.