Scalable I/O on Clusters, Part I

Linux clusters have become so successful that they've proliferated internationally through research labs, universities, and large industries that require an inexpensive source of high performance computing cycles. Developers and users have pushed the technology by scaling their applications to more and more processors so that larger problems can be solved more quickly. This has resulted in clusters where some applications can actually become I/O bound -- the input/output of data to/from a large number of processors limits the performance of the application.

Most Linux clusters use NFS (the Network File System) to share data among nodes and provide a consistent name space across all machines. As a result, parallel applications (executing simultaneously on multiple processors) typically read data from files stored on a single disk on a single server.

While NFS works well for small clusters (fewer than 32 nodes), its performance decays rapidly once more than 64 nodes start making simultaneous I/O requests. These requests “saturate” the NFS server that stores the files of interest.

What’s needed is a way of applying the Beowulf philosophy, which has worked well for computational scaling, to file systems by spreading the file serving workload across many disks, buses, and nodes. Enter PVFS, the Parallel Virtual File System (http://parlweb.parl.clemson.edu/pvfs/).

The Parallel Virtual File System (PVFS)

Developed at Clemson University and designed specifically for Linux clusters, PVFS provides network access to a “virtual” file system distributed across different disks on multiple independent servers or nodes. This is done by “striping” the data. For instance, the first 10K of a file are stored on server number 1, the next 10K on server number 2, and so on. Because all files are spread across multiple nodes (and even I/O buses and disks), I/O bottlenecks are reduced and overall I/O performance is increased.
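
The round-robin placement described above can be pictured with a little shell arithmetic. This is only an illustrative sketch, not PVFS code: the function name stripe_server is hypothetical, servers are counted from 0 here (the text counts from 1), and “10K” is taken to mean 10240 bytes.

```shell
#!/bin/sh
# Illustrative sketch only: map a byte offset to the I/O server that holds it
# under simple round-robin striping. Not taken from the PVFS source.

# stripe_server OFFSET STRIPE_SIZE SERVER_COUNT
# Prints the zero-based index of the server holding byte OFFSET.
stripe_server() {
    offset=$1
    stripe_size=$2
    server_count=$3
    # Find which stripe the offset falls in, then wrap around the server list.
    echo $(( (offset / stripe_size) % server_count ))
}

# With 10K (10240-byte) stripes on four servers:
stripe_server 0     10240 4   # first 10K -> server index 0
stripe_server 10240 10240 4   # next 10K  -> server index 1
stripe_server 40960 10240 4   # fifth 10K wraps back to server index 0
```

Because consecutive stripes land on different servers, a large sequential read touches every server in turn rather than queuing on one disk.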

PVFS was designed to be a high-performance production-quality parallel file system that supports concurrent read and write operations from multiple processes or threads to and from files. The file system is accessible via the standard Linux I/O API, a native PVFS API, and APIs built on top of these. PVFS provides a consistent name space across the cluster and allows existing binaries (and familiar commands like ls, cp, and mv) to operate on files without needing to be recompiled.

Like NFS, PVFS is based on a client/server model. But where NFS is a distributed file system, PVFS is a parallel file system: I/O operations proceed on multiple nodes in the cluster at the same time. All PVFS communication is currently done with TCP, but ports to faster protocols are being considered by the development team.

PVFS accomplishes scalable I/O by striping data across disks on cluster nodes called I/O servers. Each I/O server runs a PVFS I/O daemon that reads and writes data on its local disk. A single PVFS metadata daemon, running on its own node (called a metadata server), stores ownership, permissions, and striping metadata about each file.

When you create a file, you can specify which nodes the file’s data will be striped across, as well as the size of each stripe. This allows you to maximize I/O performance based on your analysis of how the file will be accessed by an application.

The metadata server is contacted whenever a file is opened, created, closed, or removed. But after that, applications can communicate directly with the relevant I/O servers without contacting the metadata server during read/write operations.

Figure One: A PVFS layout

Figure One shows a typical PVFS system layout consisting of a metadata server, a number of dedicated I/O servers, and a number of compute nodes. Compute nodes can be used as I/O servers and even as the metadata server, but performance will be better if nodes are dedicated to I/O and to metadata.

Applications access files under PVFS either directly (using PVFS function calls) or indirectly (by way of a kernel module using normal I/O function calls). The direct method uses a special file I/O library (libpvfs) that contains routines similar to standard Linux functions. These routines hide the details of PVFS I/O from the client applications and automatically call the standard functions when non-PVFS resources are accessed.

The system call wrappers in the client library are loaded before the standard C library and trap all the I/O system calls before they reach the kernel. If a PVFS file is being accessed, the appropriate PVFS I/O routine is called; otherwise, the normal kernel system call is made. The drawback to this approach is that applications must be ported to use the PVFS library and recompiled.
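
The decision the wrappers make for each call can be pictured as a simple path test. The sketch below is only an illustration in shell, not the library’s actual logic: the function names is_pvfs_path and route_io are hypothetical, as is the assumption that a prefix match against the mount point is enough to identify a PVFS file.

```shell
#!/bin/sh
# Illustrative sketch only: decide whether a path lies under a PVFS mount
# point. The real client library makes this decision internally; the function
# names and the prefix test here are assumptions for illustration.

# is_pvfs_path PATH MOUNT_POINT
# Succeeds (exit status 0) when PATH is MOUNT_POINT or lies beneath it.
is_pvfs_path() {
    case "$1" in
        "$2"|"$2"/*) return 0 ;;
        *)           return 1 ;;
    esac
}

# route_io PATH MOUNT_POINT
# Prints which code path a wrapper would take for PATH.
route_io() {
    if is_pvfs_path "$1" "$2"; then
        echo "pvfs"
    else
        echo "kernel"
    fi
}

route_io /mnt/pvfs/obs.bin.fullUS /mnt/pvfs   # prints "pvfs"
route_io /home/forrest/notes.txt  /mnt/pvfs   # prints "kernel"
route_io /mnt/pvfs-old/file       /mnt/pvfs   # prints "kernel"
```

The third call shows why the test matches the mount point followed by a slash rather than a bare string prefix.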

In the indirect method, when normal Linux I/O functions are used to access PVFS resources, a kernel module translates the request into the appropriate PVFS function, allowing the kernel to fulfill the I/O request. This effectively implements a virtual file system (VFS) interface to PVFS. The kernel module allows normal Linux commands and traditional applications to access PVFS files without modification or recompilation. However, I/O is slower when using this method, since the kernel must be involved in the process.

Installing and Testing PVFS

PVFS comes in two parts: the primary PVFS package (containing the daemons, utilities, and client library) and the kernel module. These can be downloaded from the PVFS homepage or from ftp://ftp.parl.clemson.edu/pub/pvfs/. Each package should be downloaded, untarred, configured, and built as shown in Figure Two. The resulting kernel module must be manually copied to the correct modules directory. Figure Two shows the correct location for a system running RedHat 7.2 with a 2.4.7-10smp kernel. Although the figure only shows the build process on a single host, it should be repeated on all systems that will be accessing the PVFS file system.

Figure Two: Building and installing PVFS

[root]# cd /usr/src
[src]# wget ftp://ftp.parl.clemson.edu/pub/pvfs/pvfs-1.5.3.tgz
[src]# wget ftp://ftp.parl.clemson.edu/pub/pvfs/pvfs-kernel-1.5.3.tgz
[src]# tar xvzf pvfs-1.5.3.tgz
[src]# tar xvzf pvfs-kernel-1.5.3.tgz
[src]# cd pvfs-1.5.3
[pvfs-1.5.3]# ./configure
[pvfs-1.5.3]# make
[pvfs-1.5.3]# make install
[pvfs-1.5.3]# cd ../pvfs-kernel-1.5.3
[pvfs-kernel-1.5.3]# ./configure
[pvfs-kernel-1.5.3]# make
[pvfs-kernel-1.5.3]# make install
[pvfs-kernel-1.5.3]# mkdir -p /lib/modules/2.4.7-10smp/kernel/fs/pvfs
[pvfs-kernel-1.5.3]# cp -p pvfs.o /lib/modules/2.4.7-10smp/kernel/fs/pvfs/pvfs.o
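
Figure Two hard-codes the kernel version in the module path. On a different kernel, the directory can be derived from uname -r instead. The sketch below assumes the same /lib/modules/<release>/kernel/fs/pvfs layout applies to your kernel; the function name pvfs_module_dir is hypothetical.

```shell
#!/bin/sh
# Sketch: build the module directory path used in Figure Two from a kernel
# release string, assuming the same directory layout applies to other kernels.

# pvfs_module_dir [RELEASE]
# Prints the directory for pvfs.o; defaults to the running kernel's release.
pvfs_module_dir() {
    release=${1:-$(uname -r)}
    echo "/lib/modules/$release/kernel/fs/pvfs"
}

pvfs_module_dir 2.4.7-10smp   # prints /lib/modules/2.4.7-10smp/kernel/fs/pvfs

# Possible use in place of the last two commands of Figure Two:
#   dir=$(pvfs_module_dir)
#   mkdir -p "$dir" && cp -p pvfs.o "$dir/pvfs.o"
```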

After installing the software, the next step is to configure a single metadata server on one node. This requires creating a directory to hold the metadata and building the setup files as shown in Figure Three. In our example, node08 is the metadata server, and two nodes, node08 and node07, are configured as I/O servers. On larger clusters, a separate dedicated metadata server may be preferred. Using /pvfs-meta to hold the metadata information is arbitrary (but typical).

Figure Three: Creating a PVFS file system

[root@node08 root]# mkdir /pvfs-meta
[root@node08 root]# cd /pvfs-meta
[root@node08 pvfs-meta]# /usr/local/bin/mkmgrconf
This script will make the .iodtab and .pvfsdir files
in the metadata directory of a PVFS file system.

Enter the root directory (metadata directory):
Enter the user id of directory:
Enter the group id of directory:
Enter the mode of the root directory:
Enter the hostname that will run the manager:
Searching for host…success
Enter the port number on the host for manager:
(Port number 3000 is the default)
Enter the I/O nodes: (can use form node1, node2, … or
node07, node08
Searching for hosts…success
I/O nodes: node07 node08
Enter the port number for the iods:
(Port number 7000 is the default)
[root@node08 pvfs-meta]#

Next, create a configuration file for the I/O servers and then create a directory or partition that will be incorporated into the PVFS file system. In our example, we’ll copy the default configuration file for an I/O server (iod.conf) into /etc on each node.

By default, the config file says that a file system mounted at /pvfs-data will be considered part of the PVFS partition, although this can be changed. Here, both our systems have a 12 GB partition mounted at /pvfs-data. Figure Four shows how to configure your I/O server, including setting correct ownership and permissions for the PVFS partition.

Figure Four: Configuring an I/O server

[root@node07]# chmod 700 /pvfs-data
[root@node07]# chown nobody.nobody /pvfs-data
[root@node07]# cp /usr/src/pvfs-1.5.3/system/iod.conf /etc/iod.conf
[root@node08]# chmod 700 /pvfs-data
[root@node08]# chown nobody.nobody /pvfs-data
[root@node08]# cp /usr/src/pvfs-1.5.3/system/iod.conf /etc/iod.conf

Finally, start the mgr daemon on the metadata server and the iod daemon on each I/O server (see Figure Five). PVFS also includes utilities to check that the daemons are running.

Figure Five: Starting the PVFS daemons

[root@node08 root]# /usr/local/sbin/mgr
[root@node08 root]# /usr/local/sbin/iod
[root@node07 root]# /usr/local/sbin/iod
[root@node08 root]# /usr/local/bin/mgr-ping -h node08
node08:3000 is responding.
[root@node08 root]# /usr/local/bin/iod-ping -h node07
node07:7000 is responding.
[root@node08 root]# /usr/local/bin/iod-ping -h node08
node08:7000 is responding.
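
With more than a couple of I/O servers, pinging each one by hand gets tedious. The sketch below wraps the iod-ping call from Figure Five in a loop. It rests on two assumptions: that iod-ping exits with a nonzero status when a daemon does not respond, and that the utility lives at the path shown in the figure. The function name check_iods and the IOD_PING variable are hypothetical conveniences, not part of PVFS.

```shell
#!/bin/sh
# Sketch: run iod-ping (as in Figure Five) against several I/O servers and
# report any that do not respond. IOD_PING is a hypothetical override for the
# utility's path; it is not a PVFS setting. The sketch assumes iod-ping exits
# with a nonzero status when a daemon is unreachable.
IOD_PING=${IOD_PING:-/usr/local/bin/iod-ping}

# check_iods HOST...
# Returns 0 if every host responds, 1 otherwise.
check_iods() {
    status=0
    for host in "$@"; do
        if ! "$IOD_PING" -h "$host" >/dev/null 2>&1; then
            echo "iod on $host is not responding" >&2
            status=1
        fi
    done
    return $status
}

# Example (on a machine with PVFS installed):
#   check_iods node07 node08 && echo "all I/O daemons are up"
```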

Now your servers are configured and ready for business. Next, configure each client to access the PVFS file system. After following the same download and installation instructions shown in Figure Two, create a mount point on each client node (node02 in our example), and create a PVFS file system table in /etc as shown in Figure Six. The mount point can be anywhere on the system, but /mnt/pvfs is typically used so that it’s obvious that files below that point are on a PVFS file system.

Figure Six: Configuring a PVFS client

[root@node02 root]# mkdir /mnt/pvfs
[root@node02 root]# echo "node08:/pvfs-meta /mnt/pvfs pvfs port=3000 0 0" > /etc/pvfstab
[root@node02 root]# chmod 644 /etc/pvfstab
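
The pvfstab entry is laid out like a line in /etc/fstab. As a rough guide to what each field means, the sketch below splits the entry from Figure Six into named parts. The field labels reflect our reading of the fstab-style layout, and the function name describe_pvfstab_line is hypothetical.

```shell
#!/bin/sh
# Sketch: split a pvfstab-style line into labeled fields. The labels are an
# interpretation of the fstab-like layout, not output from any PVFS tool.

# describe_pvfstab_line LINE
describe_pvfstab_line() {
    # Word-split the line into positional parameters: device, mount point,
    # file system type, options, and two trailing numeric fields.
    set -- $1
    server=${1%%:*}     # text before the first colon: metadata server host
    metadir=${1#*:}     # text after the first colon: metadata directory
    echo "server=$server metadir=$metadir mountpoint=$2 type=$3 options=$4"
}

describe_pvfstab_line "node08:/pvfs-meta /mnt/pvfs pvfs port=3000 0 0"
# prints: server=node08 metadir=/pvfs-meta mountpoint=/mnt/pvfs type=pvfs options=port=3000
```

The port in the options field matches the manager port chosen when mkmgrconf was run (3000 by default).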

At this point, any applications that access PVFS files directly (i.e., using the native client library) should work; however, additional steps are needed to access the file system with normal commands. Figure Seven shows how to create a special device file, load the kernel module we just built, start the client daemon, and mount the PVFS file system. A new composite file system made up of the two 12 GB partitions is now available on the client node at /mnt/pvfs. Files copied directly to this mount point will be visible to all client nodes.

Figure Seven: Starting the client PVFS daemons

[root@node02 root]# /bin/mknod /dev/pvfsd c 60 0
[root@node02 root]# /sbin/insmod pvfs
[root@node02 root]# /usr/local/sbin/pvfsd
[root@node02 root]# /sbin/mount.pvfs node08:/pvfs-meta /mnt/pvfs
[root@node02 root]# df
Filesystem 1k-blocks Used Available Use% Mounted on
/dev/sda2 4127108 1434800 2482660 37% /
/dev/sda1 31079 4748 24727 17% /boot
none 514244 0 514244 0% /dev/shm
/dev/sda3 505636 33733 445798 8% /var
node01:/home 12925832 3998940 8796608 32% /home
node08:/pvfs-meta 25268608 217856 23050752 9% /mnt/pvfs
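
If you start clients from a script, it can be handy to confirm the mount succeeded before launching jobs. One possible check, sketched below, scans df output for the mount point. The function name is_mounted_at is hypothetical, and the check only looks at the last column of each line.

```shell
#!/bin/sh
# Sketch: confirm that something is mounted at a given mount point by
# scanning df-style output. The function name is hypothetical.

# is_mounted_at MOUNT_POINT
# Reads df output on standard input; succeeds if the last column of any
# non-header line equals MOUNT_POINT.
is_mounted_at() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { found = 1 } END { exit !found }'
}

# Example:
#   df -P | is_mounted_at /mnt/pvfs && echo "PVFS is mounted"
```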

The pvstat utility will let you view the metadata information about a file on a PVFS file system (see Figure Eight).

Figure Eight: Getting metadata information

[forrest@node02 ~]$ cd /mnt/pvfs
[forrest@node02 pvfs]$ cp ~/obs.bin.fullUS .
[forrest@node02 pvfs]$ ls -Fla
total 761890
drwxrwxrwx 1 root root 0 Mar 19 16:52 ./
drwxr-xr-x 6 root root 4096 Mar 19 16:46 ../
-rw-rw-r-- 1 forrest forrest 780171000 Mar 19 16:54 obs.bin.fullUS
[forrest@node02 pvfs]$ pvstat obs.bin.fullUS
obs.bin.fullUS: base = 0, pcount = 2, ssize = 65536

In this case, the file is stored on two nodes (pcount is 2), starting with the first PVFS I/O server (base node 0) with a stripe size (ssize) of 65536 bytes.
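
To get a feel for what these numbers imply, the sketch below estimates how many bytes of the file land on each I/O server. It assumes plain round-robin striping starting at the base node and is not how pvstat or PVFS itself computes anything; the function name bytes_on_server is hypothetical.

```shell
#!/bin/sh
# Sketch: estimate how many bytes of a file land on each I/O server under
# round-robin striping. Illustration only; not derived from the PVFS source.

# bytes_on_server FILE_SIZE SSIZE PCOUNT INDEX
# INDEX is the server's position in the stripe order (0 = base node).
bytes_on_server() {
    size=$1; ssize=$2; pcount=$3; index=$4
    full_stripes=$(( size / ssize ))    # complete stripes in the file
    tail_bytes=$(( size % ssize ))      # bytes in the final partial stripe
    # Every server gets an equal share of the full stripes; the first
    # (full_stripes % pcount) servers each get one extra.
    bytes=$(( (full_stripes / pcount) * ssize ))
    extra=$(( full_stripes % pcount ))
    if [ "$index" -lt "$extra" ]; then
        bytes=$(( bytes + ssize ))
    fi
    # The partial stripe, if any, goes to the next server in the rotation.
    if [ "$tail_bytes" -gt 0 ] && [ "$index" -eq "$extra" ]; then
        bytes=$(( bytes + tail_bytes ))
    fi
    echo "$bytes"
}

# Values from Figure Eight: 780171000 bytes, ssize = 65536, pcount = 2
bytes_on_server 780171000 65536 2 0   # prints 390100728
bytes_on_server 780171000 65536 2 1   # prints 390070272
```

Under these assumptions the two servers hold nearly equal halves of the file, differing only by the final partial stripe.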

PVFS is now installed and files can be accessed with the native client library or the kernel module. In next month’s column, we’ll examine different methods of accessing files under PVFS (with sample source code). We’ll learn how to access PVFS files via the kernel module, the PVFS client library, and from parallel applications.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@esd.ornl.gov.
