
The Coming of Diskless Clusters

With advances in many facets of networking, diskless clusters are now quite practical. Better yet, nodes without local storage are cheaper to build and cheaper to maintain. Here’s a survey of the relevant technologies and techniques.
The world of clusters is fun and unique because there are so many ways to customize the design and implementation of your cluster. One design that’s becoming increasingly popular is the diskless cluster, that is, a cluster with no hard drives in the compute nodes.
Diskless clusters are now practical to build, due to the convergence of several key technologies, including affordable, high-speed interconnects, high-speed network file systems and storage network protocols, MPI implementations with MPI-IO, and high-performance processors.
Why no hard drives in compute nodes? Removing storage hardware from each machine reduces cost, cuts down on noise, lessens cooling requirements, curtails power consumption, and minimizes the number of moving parts in a cluster, thereby improving reliability. (Indeed, several studies indicate that hard drives are the least reliable component of cluster configurations.)

Where Did the Operating System Go?

One of the first questions asked about diskless clusters is, “Where is the OS for each node?”
For small clusters, you can use the Network File System (NFS) to mount a remote root (or /, the topmost directory in a Linux system) on each of the compute nodes. For larger clusters, with hundreds or even thousands of nodes, you can use a portion of each node’s RAM as a RAM disk and place a useful subset of the operating system within it. (To minimize the amount of RAM needed by the RAM disk, remove all non-essential daemons and applications from the distribution you choose to install.)
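If you go the RAM disk route, it helps to know how big the trimmed-down root tree actually is before you carve out memory for it. The short Python sketch below totals the on-disk size of a staged root directory; the path /srv/diskless-root is just a placeholder for wherever you stage your stripped distribution.

#!/usr/bin/env python
# Rough estimate of how much RAM a RAM-disk root will consume:
# walk a staged (already trimmed) root tree and total the file sizes.
# The staging path below is only an example; adjust to taste.

import os
import sys

def tree_size(root):
    """Return the total size, in bytes, of all files under root."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)      # lstat so symlinks aren't followed
            except OSError:
                continue                 # skip files that vanish mid-walk
            total += st.st_size
    return total

if __name__ == "__main__":
    staged_root = sys.argv[1] if len(sys.argv) > 1 else "/srv/diskless-root"
    size = tree_size(staged_root)
    print("%s needs roughly %.1f MB of RAM" % (staged_root, size / (1024.0 * 1024.0)))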
The next question asked is usually, “Where does the data go?” Of course, to do any useful work, all of the diskless nodes need to read data and write results.
In the past, diskless nodes have used NFS as a storage system. More recently, though, a number of arguably better choices have emerged, including high-speed parallel file systems such as Lustre, IBRIX, GPFS, Terrascale, Polyserve, and Panasas, and file systems that use network storage protocols, such as iSCSI (see http://www.linux-mag.com/2003-10/feature/iscsi.html), HyperSCSI, and ATA-over-Ethernet (AoE, see http://www.linux-mag.com/2005-06/feature/ata.html). Also, the growing number of MPI implementations that include MPI-IO allow codes to utilize alternative file systems, such as PVFS2.
Let’s delve into how some of these file systems enable diskless cluster nodes, dividing the discussion into solutions for small- and medium-sized clusters (say, up to 64 or even 128 nodes) and configurations for large clusters of hundreds, thousands, or even tens of thousands of nodes.

NFS for Small- and Medium-Sized Clusters

Small- and medium-sized clusters are by far the most prevalent kinds of clusters. (Here, a small cluster is arbitrarily defined as fewer than 16 nodes, and a medium cluster as 16 to 128 nodes.) These systems are relatively inexpensive to build and there are many file system packages to choose from, some commercial and some open source, but all quite capable of providing the storage that diskless nodes need.
For small- and medium-sized clusters, good ole NFS (http://nfs.sourceforge.net) is an acceptable solution for applications that aren’t I/O intensive. NFS allows a directory or even an entire file system to be shared across many machines.
NFS has been around for many years, and while slow, it’s well-known, widely supported, and well-understood. Moreover, for small and medium systems, NFS has been found to be very stable. (However, avoid writing data to a single file from more than one node at the same time.)
NFS uses a basic client/server model, with the server exporting a file system to various clients that mount the file system. In the case of small clusters, it’s a good idea to put the NFS server and all of the clients on the same network with a single switch to improve performance. If the client nodes have multiple network interface cards (NICs), usually Gigabit Ethernet (GigE) NICs, then it’s a good idea to put the NFS traffic on a separate GigE network. Small GigE switches are fairly inexpensive and good GigE switches for medium-sized clusters are also reasonable.
The actual storage device(s) on the NFS server can take many forms. Probably the simplest is an NFS server with internal disks. The internal disks can be configured in many ways, depending on the importance of the data, I/O requirements, and a host of other factors. To make life easier down the road, it is a good idea to configure the disks using the Logical Volume Manager (LVM). LVM allows the storage space to be expanded or contracted as needed while it’s online, avoiding the need to reformat the file system and restore the data from a backup.
Once the drives are configured and the file system is formatted on the drives, the file /etc/exports is edited to “export” the desired file systems. Details of this process can be found at the Linux NFS website. The site also has a HOWTO on configuring and tuning NFS. (You can also read the feature “Tuning the Network File System” in the August 2005 issue of Linux Magazine, available online at http://www.linux-mag.com/2005-08/feature/nfs.html).
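If your cluster has dozens of diskless clients, hand-editing /etc/exports gets tedious quickly. A short script can generate the entries instead; in the sketch below, the export paths, hostname pattern, and mount options are purely illustrative, so substitute whatever your own cluster uses.

#!/usr/bin/env python
# Generate /etc/exports entries for a batch of diskless compute nodes.
# The export paths, hostname pattern, and options are examples only.

EXPORTS = [
    ("/export/root", "ro,no_root_squash,sync"),   # shared root for NFS-root nodes
    ("/home",        "rw,sync"),                  # user home directories
]
NODES = ["node%02d" % i for i in range(1, 65)]    # node01 .. node64

for path, options in EXPORTS:
    clients = " ".join("%s(%s)" % (node, options) for node in NODES)
    print("%s %s" % (path, clients))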
For smaller clusters, the NFS server and the master node can be one and the same. For medium clusters, up to 128 nodes, a dedicated NFS server should be used. As in the small cluster, it should have a reasonable amount of storage and, if possible, a dedicated storage network.

Going, Going, Going Faster with PVFS2

The Parallel Virtual File System, version 2 (PVFS2), is a specialized file system intended to act as high-speed, parallel scratch storage. Currently, PVFS2 supports TCP, InfiniBand, and Myrinet interconnects, and it can use multiple network connections (multihoming). However, for codes to take true advantage of PVFS2, they should be ported to use MPI-IO.
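To give a flavor of what that porting involves, here is a minimal MPI-IO sketch, written with the mpi4py Python bindings for brevity (production codes typically make the analogous MPI_File_* calls from C or Fortran). Each rank writes its own slice of results to a single shared file at a non-overlapping offset, exactly the kind of access pattern a parallel file system like PVFS2 is built to serve. The mount point, file name, and array size are placeholders.

# Minimal MPI-IO sketch (mpi4py): every rank writes its slice of the results
# to one shared file on the parallel file system. File name and sizes are
# placeholders. Run with, for example: mpiexec -n 4 python write_results.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_local = 1024                            # doubles per rank (example value)
data = np.full(n_local, float(rank))      # stand-in for real results

fh = MPI.File.Open(comm, "/mnt/pvfs2/results.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)

offset = rank * n_local * data.itemsize   # byte offset of this rank's slice
fh.Write_at_all(offset, data)             # collective write, no overlap
fh.Close()

Because the write is collective and the offsets don’t overlap, the MPI-IO layer and the file system can coordinate the I/O far more efficiently than if every rank funneled its data through a single writer.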
PVFS2 is built atop an existing file system such as Ext2, Ext3, JFS, ReiserFS, or XFS. Configuring PVFS2 is very easy. Select one or more machines with disks to act as servers (a single machine can do the job), and decide whether each server should be a data server or a metadata server. A data server actually stores the file contents, while the metadata server stores information about the files and directories, such as where the data is located, who owns the files, and what permissions have been set.
PVFS2 is then installed and configured on each server, and the PVFS2 client software is installed on each of the diskless nodes. Once all of the machines are configured, you can mount the PVFS2 file system on the compute nodes.
A sample PVFS2 configuration might include a few servers, where each server has several disks. To make things easier, use md, the Linux software RAID driver, to combine the block devices on the disks into a single block device. If you think you’ll need additional storage space at some point, use LVM so that more storage can be added easily. Of course, you can place the disks in a RAID configuration, using hardware or software RAID, to gain some resilience, and run LVM on top of the RAID block devices. One other option is to use multiple networks to connect the clients to the servers. Many server boards come with multiple GigE ports, so it’s very easy to add another set of GigE switches to the network. This extra network gives you some resilience in case of a network failure.
The PVFS2 FAQ (http://www.pvfs.org/pvfs2/pvfs2-faq.html) provides good guidance on selecting the number of servers. For small clusters of less than 16 nodes or so, a recommended configuration includes at least two PVFS2 servers connected to the same network as the compute nodes, using GigE or better, or connected via a dedicated PVFS2 network. The exact configuration you need will depend on your I/O requirements and your budget.
For medium clusters of up to 128 nodes, several PVFS2 servers should be used. As in the small cluster, you can use the same GigE network as the compute nodes, but you are likely to saturate it with MPI and PVFS2 traffic. Moving the PVFS2 traffic to a dedicated GigE network improves things, and using a high-performance interconnect such as Myrinet, InfiniBand, Quadrics, or Dolphin helps even more.
For small clusters that only run one job and have only one PVFS2 server, NFS may give better results. For larger clusters, having more than one PVFS2 server is beneficial.

Commercial Offerings

There are several commercial options that provide a global file system suitable for diskless clusters. For example, one could use Lustre, IBRIX, or GPFS with various storage devices, or use the Panasas ActiveScale Storage Cluster (http://www.panasas.com).
For smaller clusters, these solutions are likely to be too expensive. For larger clusters, perhaps of 32 nodes or more, these commercial products can prove to be a price/performance winner. However, some applications are so I/O-intensive that they benefit from a high-performance file system regardless of the size of the cluster. For large, diskless clusters, a high-performance parallel file system is usually needed.
Lustre (http://www.lustre.org) is an interesting possibility for small- to medium-sized clusters, because it offers higher performance than NFS. Lustre is an object-based, highly scalable, parallel, global, distributed file system with very high I/O performance. It can potentially scale to tens of thousands of nodes and hundreds of terabytes of storage, yet it can also run on a single node.
Lustre breaks the file system into two components: metadata servers (MDSes) and data servers (or, in Lustre parlance, object storage targets, or OSTs). File system metadata can be distributed across multiple MDS machines, and the file data itself can be distributed across multiple OST machines. Clients can mount the Lustre file system without being either an MDS or an OST. On the other hand, a single machine can be an MDS, an OST, a client, or any combination of the three.
For small clusters of less than 16 nodes, you can put both the MDS and the OST on the master node. As with NFS and PVFS2, it is recommended that you use at least GigE and, if possible, a dedicated GigE network for Lustre traffic. For medium clusters of up to 128 nodes, it gets a bit more complicated. To create a Lustre file system that works well for the cluster, you need to know the I/O behavior of your applications and the overall I/O requirements for the cluster. Given that knowledge, you can plan the interconnect for Lustre, the number of data servers (OSTs) and the amount of storage in each, the type of storage in the OSTs, and so on.

Options for Large Clusters

In the world of high-performance compute clusters, size does matter: a large number of diskless nodes mandates the use of a parallel global file system.
NFS is still widely used in large clusters, although its performance is lackluster. However, what it lacks in performance is countered by its simplicity. As mentioned above, it’s well-understood, simple to administer on servers and clients, and built into virtually every operating system. That said, larger and larger NFS configurations do become unwieldy and eventually impair the potential of the cluster.
Once a cluster becomes very large, most sites look to commercial products, often because such products come with expansive technical support. Luckily, there are a number of vendors to choose from.
*Panasas. Panasas develops and sells a product called the Panasas ActiveScale Storage Cluster, a high-speed, scalable storage system. The Panasas Storage Cluster uses an object-storage architecture to create a global parallel file system called the Panasas ActiveScale File System (PanFS).
There are two primary hardware components in PanFS: StorageBlades and DirectorBlades, both of which are commodity based. The DirectorBlades virtualize the data into objects and orchestrate file system activity. The StorageBlades actually store the data. A StorageBlade consists of two Serial ATA drives, a processor, cache memory, and a NIC.
The blades are contained in a 4U chassis called a “shelf” and use an integrated GigE switch to connect to outside infrastructure. The shelf can accommodate up to 11 blades and up to 5 TB of storage. Each shelf also has redundant power supplies and redundant Ethernet switches. There is also a dual power-grid configuration and a battery for the entire shelf to minimize potential points of failure and downtime.
The shelf has sixteen internal ports and five external ports. The external GigE ports can be trunked for better throughput, but only one port is required. (Typically, four of the ports are trunked together for higher-capacity systems while the fifth port is used for management.) The external ports of each shelf are usually connected to a larger central GigE switch, so that traffic between shelves only has to cross a single backplane.
Ideally, a shelf would have a minimum of a single DirectorBlade and up to ten StorageBlades. Adding StorageBlades gives you increased bandwidth, since each one has a built-in processor and a separate network interface. The performance increases linearly as you add StorageBlades and shelves.
*Lustre. While Lustre is an open-source file system, the newest version is available only from its commercial steward, Cluster File Systems, Inc. (http://www.clusterfs.com).
Lustre uses open network protocols to allow components to communicate. Currently it can use TCP networks (Fast Ethernet, Gigabit Ethernet), Quadrics Elan, Myrinet GM, Scali SDP, and native Infiniband. Lustre also uses Remote Direct Memory Access (RDMA) and OS-bypass capabilities to improve I/O performance.
Architecting a Lustre file system and a good storage subsystem for larger clusters takes some work. Roughly, the process includes these steps:
1. First, analyze your applications and determine the I/O requirements of each one.
2. Next, calculate the aggregate I/O requirements of the entire cluster by estimating the number of applications that will run at the same time, accumulating their I/O requirements, and estimating the probability that the applications will write to the file system at the same time. (A back-of-the-envelope sketch of this arithmetic follows the list.)
3. Once you have the estimated aggregate I/O bandwidth, design the Lustre network and decide how it connects to the compute nodes.
4. Finally, choose the storage devices.
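As a rough illustration of step 2, the sketch below turns per-application numbers into an aggregate bandwidth estimate. Every figure in it is hypothetical; plug in measurements from your own codes.

# Back-of-the-envelope estimate of aggregate I/O bandwidth (step 2 above).
# Every figure here is hypothetical; substitute measured values.

# (per-job write bandwidth in MB/s, expected number of concurrent jobs,
#  probability the jobs are actually writing at the same moment)
apps = {
    "cfd_solver":      (150.0, 4, 0.50),
    "md_simulation":   ( 60.0, 8, 0.25),
    "post_processing": (300.0, 1, 0.10),
}

aggregate = 0.0
for name, (bw_mb_s, jobs, p_overlap) in apps.items():
    contribution = bw_mb_s * jobs * p_overlap
    aggregate += contribution
    print("%-16s ~%7.1f MB/s" % (name, contribution))

print("Estimated aggregate: ~%.1f MB/s" % aggregate)
# Size the Lustre network and OSTs (steps 3 and 4) with healthy headroom
# above this figure, since the concurrency probabilities are only estimates.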
As you might imagine, the architecting process takes careful planning, hard work, and patience.

The Diskless Are Among Us

There are some fairly large clusters out there, both diskful and diskless. One cluster in particular, Pink, at the Los Alamos National Laboratory (LANL), is one of the largest diskless clusters and was designed and built by Linux Networx. [The author is an employee of Linux Networx.]
Pink has a total of 1,024 nodes, all connected with Myrinet. It has one master node, 959 compute nodes (each with two Intel Xeon processors), and 64 I/O nodes. It also has a 40 TB Panasas ActiveScale Storage Cluster, composed of eight 5 TB shelves, that is accessed via the I/O nodes.
According to Ron Minnich, the leader of the cluster research team at LANL, Pink uses LinuxBIOS (http://wiki.linuxbios.org) to boot the nodes. He says that without disks, the nodes go from power-on to ready-to-use in under five seconds. The lab also boots the nodes over Myrinet, a configuration that required some very interesting work.
Minnich describes the process: “The LinuxBIOS boots, and in 3 to 5 seconds boots a Linux kernel from flash. The Linux kernel has an initial RAM disk that includes Myrinet drivers. Linux comes up, initiates the Myrinet network, contacts the master, loads a kernel from the master, and starts the new Linux. In other words, Linux boots Linux. The second Linux brings each node up as a cluster node.” Minnich added that all 1,024 nodes boot in 2.5 to 3 minutes.
Asked about his motivation to build a diskless cluster, Minnich answered, “Our goal is to put nothing critical on the node’s local disk, assuming it has one. The big goal is never to have a local disk, which is how we built Pink. Consider this: on a 1,024-node cluster, with 1,024 root file systems on local disks, there are hundreds of files that are needed to boot. If you change just one bit in any one of those files, the node is helpless. You have to walk up to it with a keyboard and monitor and resuscitate it. For example, change ‘root’ in /etc/passwd to ‘ropt’ and the node is now dead. The examples are endless. A diskless architecture removes these issues.”
When asked about the storage system for Pink, Minnich continued, “Pink was deliberately built without local disks, and from the very beginning, it was always an objective to have a high-performance and highly-available central storage system. We wanted a storage system that could deliver superior bandwidth and lower latency than local disks. Also, we needed a storage system that could scale to support the large number of nodes in the clusters.”
When asked about LANL’s experience with Pink, and in particular whether large diskless clusters are ready to solve real-world problems, Minnich said, “Absolutely. Pink is providing very good service. In fact, we have added many Pink-like clusters, ranging in size from 127 nodes to 1,700 nodes, that are setting records for uptime and availability. On average, we lose one node every seven weeks, or about six per year. The industry average when using local disks is about one node a week for similar hardware.”

Jeff Layton is a Field Sales Engineer at Linux Networx (http://www.lnxi.com). He would love for his garage diskless cluster to be included in the Top5000 someday. Jeff can be reached at jlayton@lnxi.com.
