A little over a year ago, Silicon Graphics, Inc. (SGI, http://www.sgi.com) announced a new 64-bit supercomputing platform called the Altix 3000. In a break from its tradition of building large machines with MIPS processors running the IRIX operating system, the Altix uses Intel’s Itanium 2 processor and runs — you guessed it — Linux. Unlike Beowulf-style Linux clusters, SGI’s cache-coherent, shared-memory, multi-processor system is based on NUMAflex, SGI’s third-generation, non-uniform memory access (NUMA) architecture, which has proven to be a highly-scalable, global shared memory architecture based on SGI’s Origin 3000 systems.
In fact, the Altix 3000 uses many of the same components — called bricks — that the Origin uses. These bricks mount in racks and may be used in various combinations to construct a desired system. The C-brick is the computational module housing the CPUs and memory; the M-brick is a memory expansion module; the R-brick is a NUMAflex router interconnect module; the D-brick is a disk expansion module; the IX-brick is a base system I/O module; and the PX-brick is a PCI-X expansion module. It’s the C-bricks in the Altix that are different from those in the Origin, because the ones in the Altix are based on Intel’s Itanium 2 instead of MIPS processors.
The Altix C-brick (described in Figure One) consists of two nodes, each containing two Itanium 2 processors with their own cache. The front-side buses of these processors are connected to custom ASICs referred to as SHUBs. The SHUBs interface the two processors to the memory DIMMs, to the I/O subsystem, and to other SHUBs via the NUMAflex network components. The SHUBs also interconnect the two nodes in a C-brick at the full bandwidth of the Itanium 2 front side bus (6.4 GB/sec).
|Figure One: A block diagram of the Altix C-brick|
The global shared memory architecture, implemented through SGI’s NUMAlink interconnect fabric, provides high cross-sectional bandwidth and allows performance scaling not usually obtained on commodity Beowulf clusters. While some coarse grained applications scale just fine on Beowulf clusters, others need the high bandwidth and very low latency offered by a machine like the Altix. Still other applications are best implemented as shared-memory applications using many processors.
The Altix provides a platform for all these models of parallelism on a 64-processor system running a single copy of Linux, a single system image (SSI). In addition, eight systems can be interconnected with NUMAlink in a dual “fat tree” topology, yielding a 512-processor cluster with as much as 16 TB of global shared memory. The Altix supports MPI and SGI’s Message Passing Toolkit (MPT) for distributed memory parallelism, and OpenMP, SHMEM, and POSIX threads for shared memory parallelism.
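The configuration figures quoted so far can be tied together with a little arithmetic. The sketch below is only a back-of-the-envelope check: it assumes every processor lives in a four-way C-brick and that 1 TB is 1024 GB. The variable names are illustrative and are not SGI terminology.

```python
# Figures quoted in the article.
NODES_PER_CBRICK = 2        # each C-brick holds two nodes
CPUS_PER_NODE = 2           # each node holds two Itanium 2 processors
CPUS_PER_SSI = 64           # processors under one copy of Linux
SYSTEMS_PER_CLUSTER = 8     # systems joined with NUMAlink
MAX_SHARED_MEMORY_TB = 16   # "as much as 16 TB" of global shared memory

# Derived values (assumption: all CPUs sit in C-bricks, 1 TB = 1024 GB).
cpus_per_cbrick = NODES_PER_CBRICK * CPUS_PER_NODE
cbricks_per_ssi = CPUS_PER_SSI // cpus_per_cbrick
cluster_cpus = CPUS_PER_SSI * SYSTEMS_PER_CLUSTER
memory_per_cpu_gb = MAX_SHARED_MEMORY_TB * 1024 / cluster_cpus

print(f"{cpus_per_cbrick} CPUs per C-brick")
print(f"{cbricks_per_ssi} C-bricks per 64-processor system")
print(f"{cluster_cpus} CPUs in the full cluster")
print(f"up to {memory_per_cpu_gb:.0f} GB of shared memory per CPU")
```

Under those assumptions, a 64-processor single system image corresponds to sixteen C-bricks, and the maximum memory configuration works out to roughly 32 GB per processor.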
The Altix is a promising platform for those who can afford it. Or, for those who want to start small, SGI just released the Altix 350, a department-sized version of the Altix, which scales to sixteen processors.
While the Altix is good news for SGI (since the company will benefit from moving away from completely proprietary hardware and software), it’s great news for Linux and high-performance computing. SGI has contributed much of their work on the Linux kernel back to the community. Moreover, adoption of Linux by large computer vendors like SGI helps dispel the fear, uncertainty, and doubt (FUD) spread by vendors of proprietary, closed operating systems.
The Altix system is not flawless: Oak Ridge National Laboratory (ORNL) has experienced problems with hybrid MPI/OpenMP codes, compiler bugs, optimization problems, and utilization problems requiring specific process/thread placement. Even so, the Altix runs pure MPI and pure shared memory applications requiring lots and lots of memory very well right out of the box. While the kinds of problems ORNL has encountered are expected on new architectures that haven’t fully matured, SGI’s willingness to pursue and resolve any problems is very encouraging.
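To give a feel for what "process placement" means in practice, here is a minimal sketch using the generic Linux CPU-affinity calls exposed by Python's `os` module. This is not SGI's own placement tooling, which the article does not describe; the helper name `pin_to_cpu` is hypothetical, and the calls are only available on Linux.

```python
import os

def pin_to_cpu(cpu):
    """Restrict the caller to a single CPU and return the resulting CPU set.

    Returns None on platforms that lack os.sched_setaffinity (it is
    Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})   # pid 0 refers to the caller
    return os.sched_getaffinity(0)

if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)   # CPUs the caller may currently use
    first = min(allowed)
    print("pinned to:", pin_to_cpu(first))
    os.sched_setaffinity(0, allowed)    # restore the original placement
else:
    print("CPU affinity calls are not available on this platform")
```

On a NUMA machine, keeping a process on a fixed CPU matters because memory attached to that processor's node is cheaper to reach than memory on a distant node.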
A Chat with SGI
I recently got the chance to interview Dave Parry and Rich Altmeier of SGI and discuss the Altix. Parry is Senior Vice President and General Manager of the Servers and Platforms Group, and Altmeier is Vice President of the Software and Storage Group. Their enthusiasm for their machine and for the Open Source community process was evident throughout our discussion.
FORREST HOFFMAN: What is the market for the SGI Altix?
DAVE PARRY: The markets are different for the 3000 and the 350 series. The 3000 is a high-end product by the standards of Linux, and even a higher-end product than the RISC [Reduced Instruction Set Computer] systems that compete with it. We see the largest customer adoption of the 3000 in large research institutions and national laboratories (like the Department of Energy’s Oak Ridge National Laboratory and Pacific Northwest National Laboratory, NASA’s Ames Research Center, and others) and in technical commercial organizations of the automotive, engineering, and pharmaceutical industries.
The 350 is a little different. It’s a lower end product targeted at department use. We expect to see it as a “baby brother” to the 3700. The 350 scales to sixteen processors, and we’re now partnering with Voltaire to offer clustered versions of the 350 using their InfiniBand interconnect solutions.
HOFFMAN: What about the SGI Origin, the IRIX operating system, and MIPS processors? Will SGI continue offering these products?
PARRY: We’ll continue to offer Origin, but in the very long term, our architectural and revolutionary research and development will be directed toward the Altix line. Of course the NUMAflex architecture is the same for both the Origin and Altix systems. We’re leveraging the knowledge gained from the Origin 2000 and 3000, but switching to the Itanium 2 processor — which offers higher peak and application performance than MIPS RISC — and running Linux.
[Altix and Origin] are independent, in the sense that the Origin is MIPS and IRIX only, while the Altix is Itanium 2 and Linux only. However, much of the design was carried over [from the Origin to the Altix]. On the software side, the years of development that went into IRIX are being ported over as improvements to Linux. Under Linux, that software is either being contributed to the community or, in a few cases, being put out as new products.
RICH ALTMEIER: Another advantage of Altix is the tons of other software available for Linux. SGI isn’t the only source of software, as is the case with the MIPS environment.
HOFFMAN: Will we see future MIPS and IRIX releases?
PARRY: Yes. MIPS processors will undergo evolutionary enhancements, and we will continue providing ongoing releases of IRIX. We have a sustaining engineering strategy [for MIPS and IRIX] to protect the investments of our existing customers.
ALTMEIER: Future IRIX releases will provide bug fixes and feature enhancements as we’ve done for almost six years now.
HOFFMAN: What about your high-end graphics customers? Are they shifting to Altix?
PARRY: Some of that customer base is transitioning from proprietary systems to open systems, but other transitions are occurring in high-end graphics. For instance, we are moving from large, monolithic graphics pipes to aggregation of multiple graphics pipes. Today, we have [the Onyx4] UltimateVision using 32 graphics processors on one shared memory backbone to provide the composition capabilities that you’d expect only from large graphics clusters. We are seeing customer interest in graphics on Altix and Linux, but lots of customers are still using and buying new Onyx systems.
HOFFMAN: Will the Altix evolve as Intel’s Itanium processor matures?
PARRY: Yes. We had an internal effort to develop the Altix using the original Itanium processor. We built that system on the same architecture as the Origin 3000. That system was used by Rich’s group for early development work on NUMA capability, scalability, and I/O performance. The 3700 was introduced just over a year ago with the Itanium 2 (code named “McKinley”). Then, as “Madison” [Intel’s follow-on Itanium 2 processor] became available last June or July, we began offering those as well.
HOFFMAN: What’s next for Altix? Will you continue scaling to higher processor counts or will you backfill with smaller systems like the 350?
PARRY: Yes to both. We intend to push the Altix product out in all directions. We will be building larger, more scalable versions of the 3700, as well as pushing on more optimal price-to-performance solutions and having better software and broader third party hardware support.
ALTMEIER: You haven’t seen the top end!
HOFFMAN: Your advertised configurations scale nodes to 64 processors, but I know here at Oak Ridge National Laboratory we’re running a single Linux image on a 256 processor Altix with reasonable success.
PARRY: We’ve been shipping systems with up to 64 processors in a single Linux kernel, and we’ve been consistently growing the memory size of the supercluster as it scales from 128 processors to 256, and, as of last December, to 512 processors, all in a single coherent shared memory environment with each chunk being managed by a single operating system image. At the same time we’ve had a beta program, and will soon productize the 128 processor [configuration]. The 256 processor [configuration] isn’t a product yet, but we already have a NASA customer running a single system image on 512 processors.
ALTMEIER: We persistently advance the kernel’s support for scaling. This is a “must-accomplish” kind of task for SGI.
HOFFMAN: Were dramatic modifications in the kernel required to support the Altix and NUMA architecture?
ALTMEIER: The surgery is not as radical as all that, and we tried to work very closely with the Linux community. Only a handful of technologies were required to get the kernel working on the system, including NUMA support (for discontiguous memory, some virtual memory enhancements, and local memory allocation), the “O(1)” scheduler (needed for large processor counts), kernel lock improvements (in which SGI was a participant), and a variety of bug fixes that were contributed back to the community. We’ve checked most of our work back in at kernel.org. We’ve not hacked up the kernel for our hardware and you can get the standard kernel.org kernel to run on the machine.
HOFFMAN: Is SGI responsible for adding NUMA support in the 2.6 kernel?
ALTMEIER: We participated in development of NUMA support in 2.6, but many others contributed as well, including NEC, IBM, and other community participants. We see tremendous value in these Linux community efforts. We are impressed by the broad range of testing and [the number of] people hammering on the system. Bugs are fixed very quickly.
HOFFMAN: What other contributions has SGI made to the Linux community?
ALTMEIER: Our XFS high-performance journaling filesystem recently entered the 2.4.25 maintenance stream. It has been in 2.6 longer, but it’s been five years since we first did the GPL release. It was a long road, but we have a very strong commitment to Linux and the community process. We think it facilitates the whole ecosystem [of software], resulting in stronger products with more features.
We’ve made other contributions as well. I already mentioned kernel enhancements like CPUMemSets for processor and memory placement, the Linux kernel debugger called kdb, kernprof for kernel profiling, lockmeter, and DISCONTIG. Much of our other filesystem, storage, and graphics work, a good deal of it ported from IRIX, is available as open source projects from SGI. Other products are not open source, like the CXFS cluster filesystem, which makes a SAN really useful by allowing all machines shared access to the filesystem.
PARRY: While CXFS isn’t an open source project, it’s not a closed proprietary product. It’s offered on IRIX, Solaris, Mac OS X, Linux, Windows, and AIX. Most HPC [high performance computing] customers have a rich mixture of heterogeneous systems, but want to manage all their data as one entity. CXFS enables customers to have a single solution for direct filesystem performance and get the benefits of aggregation.
HOFFMAN: Are you happy with the Intel compilers?
ALTMEIER: The Intel compilers are doing a good job. The Itanium is the fastest processor on the planet and we want the compiler to deliver that performance to scientific applications. Although we’re happy, we’re never satisfied.
HOFFMAN: I know some people have experienced problems with complex hybrid MPI/OpenMP codes on the Altix because of Intel’s implementation of OpenMP in the 7.x compilers. Is that fixed by the 8.0 compilers? Are there new problems in the 8.0 compilers?
|Figure Two: Don Maxwell (left) and Sergey Shpanskiy administer RAM, the SGI Altix 3700 supercluster at Oak Ridge National Laboratory. RAM has 256 1.5 GHz Intel Itanium 2 processors, 2 TB of global shared memory, 12 TB of disk storage, and a peak performance of 1.5 teraflops.|
ALTMEIER: We focused first on MPI and on OpenMP second, so there was some catching up to do.
PARRY: The level of maturation and improvement in OpenMP in the last nine months has been phenomenal. A year ago it was lackluster. Now with 8.0 we are seeing really solid OpenMP performance.
ALTMEIER: In the 8.0 compiler, Intel upgraded the front end, but some of the fine tuning isn’t quite there and a number of algorithms are not giving the results we want. 8.0 provided functional enhancements, but some optimizations have regressed a bit. These problems should be quickly rectified. We’re working closely with Intel to ensure that they are.
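As an aside, the peak figure in the Figure Two caption can be checked with simple arithmetic. The sketch below assumes each Itanium 2 can retire four floating-point operations per clock cycle (two fused multiply-add units, two operations each) and that 1 TB is 1024 GB. Both are assumptions added here, not statements from the article.

```python
# Figures from the Figure Two caption.
PROCESSORS = 256
CLOCK_HZ = 1.5e9          # 1.5 GHz
SHARED_MEMORY_TB = 2

# Assumption: two fused multiply-add units, each counted as 2 flops per cycle.
FLOPS_PER_CYCLE = 4

peak_tflops = PROCESSORS * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12
memory_per_cpu_gb = SHARED_MEMORY_TB * 1024 / PROCESSORS

print(f"theoretical peak: {peak_tflops:.3f} teraflops")
print(f"shared memory per processor: {memory_per_cpu_gb:.0f} GB")
```

With those assumptions the theoretical peak comes to about 1.5 teraflops, which agrees with the caption. Sustained application performance is a separate matter and depends heavily on the compiler issues discussed above.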
HOFFMAN: Some scientific applications do not scale well on typical Beowulf-style clusters. Are you finding that to be the case on the Altix?
PARRY: We were surprised, having been down this path with Origin and IRIX. We knew there would be things we had to fix, but were confident that we would be able to get to a system that scaled well for HPC workloads. In fact, we got better scalability than we expected.
ALTMEIER: My theory is that Linux design centers around short code paths, often not aimed at SMP [symmetric multi-processing] at all. There’s nothing better than a low-level thing for which we can crank operations per second. SMP-parallelized code is not the only way to get the operations per second you need. For typical kinds of intense compute workloads, people are seeing good scalability.
HOFFMAN: Any final thoughts?
ALTMEIER: We have this great ccNUMA [cache-coherent non-uniform memory access] architecture that is fundamentally a shared memory machine offering a productive programming environment. It’s a finely tuned machine from the processors to the compilers to the run-time libraries. We think it’s criminal to see someone not getting superior performance, and we work to correct that situation. The Altix augments clusters very well. It fits right in with Beowulf cluster environments that customers may already be using.
PARRY: One of the things we’re most proud of with Altix is that we made a decision early on in the product life-cycle to do things that hadn’t been done with Linux systems before.
We’re seeing, using a powerful processor and an open source environment, that we are delivering capabilities that no one else can deliver.
Our goal on the Origin and now on Altix is to provide an environment that will work well with mixed applications. Regardless of the programming style or parallel algorithm, our system competes with any other at running MPI jobs and delivers capabilities like OpenMP and “pthreads” [POSIX threads] at the same time on the same system.
People often ask if a code is a cluster application or a shared memory application. We think it’s more important to consider whether it’s an implicitly or explicitly parallel application. Depending on the answer, you usually either cast the application onto a big SMP or a large cluster with a high performance, low latency interconnect. The Altix provides both on the same architecture, all running a single system image.
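The distinction Parry draws between implicit and explicit parallelism can be illustrated with a toy example. The sketch below is only an analogy in Python: a thread pool stands in for a shared-memory runtime such as OpenMP, and threads exchanging results through a queue stand in for MPI ranks. All function names are hypothetical, and real codes on the Altix would use OpenMP, pthreads, or MPI rather than Python threads.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

DATA = list(range(1, 1001))   # toy workload: sum the integers 1..1000
WORKERS = 4

def implicit_sum(data, workers=WORKERS):
    """Implicit style: hand chunks to a runtime and let it schedule them."""
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))

def explicit_sum(data, workers=WORKERS):
    """Explicit style: each rank owns a slice and sends its partial result."""
    inbox = queue.Queue()     # stands in for the interconnect

    def rank_main(rank):
        # Each rank works out its own slice bounds from its rank number.
        lo = rank * len(data) // workers
        hi = (rank + 1) * len(data) // workers
        inbox.put((rank, sum(data[lo:hi])))   # explicit "send"

    ranks = [threading.Thread(target=rank_main, args=(r,))
             for r in range(workers)]
    for t in ranks:
        t.start()
    for t in ranks:
        t.join()
    # Explicit "receive": collect exactly one message per rank.
    return sum(inbox.get()[1] for _ in range(workers))

print(implicit_sum(DATA), explicit_sum(DATA))
```

Both functions compute the same total. The difference lies in who decides how the work and the data movement are organized: the runtime in the first case, the programmer in the second.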
Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at firstname.lastname@example.org.