Scalable I/O on Clusters, Part 2

Linux clusters are an inexpensive yet very effective parallel computing solution. Indeed, the low cost and ease of building Linux clusters has encouraged researchers all around the world to build larger and faster collections of machines.

But, as we discussed in last month’s column, the Linux I/O (input/output) system can be a cluster’s Achilles’ heel. If the individual processes of a parallel application need to read from and/or write to the same file frequently and simultaneously, the Network File System (NFS) and other distributed file systems can quickly become saturated, limiting overall system throughput. To avoid this bottleneck, the I/O system must also be scalable: if a file could be spread, or “striped,” across many file servers, I/O performance could be vastly improved.

That’s where the Parallel Virtual File System (PVFS, http://parlweb.parl.clemson.edu/pvfs/) comes in. Developed at Clemson University and designed specifically for Linux clusters, PVFS provides network access to a “virtual” file system distributed across different disks on multiple independent servers or nodes. By striping the data across separate nodes, buses, and disks, PVFS reduces I/O bottlenecks and increases overall I/O performance.
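
To see why striping helps, consider how a simple round-robin scheme might map a file offset to one of several I/O servers. The short program below is only a sketch: the stripe size, server count, and mapping function are assumptions chosen for illustration, not PVFS’s actual defaults or internals.

/* striping_demo.c: illustrative round-robin striping calculation.
 * The stripe size, server count, and mapping are assumptions for
 * this example, not the actual PVFS implementation. */
#include <stdio.h>

#define STRIPE_SIZE (64 * 1024)  /* assumed stripe size in bytes */
#define NUM_IOD 4                /* assumed number of I/O servers */

/* Map a file offset to an I/O server and an offset within that
 * server's portion of the file. */
static void map_offset(long offset, int *server, long *local)
{
    long stripe = offset / STRIPE_SIZE;
    *server = (int)(stripe % NUM_IOD);
    *local = (stripe / NUM_IOD) * STRIPE_SIZE + offset % STRIPE_SIZE;
}

int main(void)
{
    long offsets[] = { 0L, 65536L, 262144L, 100000000L };
    int i, server;
    long local;

    for (i = 0; i < 4; i++) {
        map_offset(offsets[i], &server, &local);
        printf("offset %9ld -> server %d, local offset %ld\n",
               offsets[i], server, local);
    }
    return 0;
}

Because consecutive stripes land on different servers, a single large read is serviced by several disks and network links at once rather than queuing up behind one file server.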

After installing and configuring PVFS using instructions from last month’s column, you should be ready to use your new virtual file system. There are three methods to access data via PVFS: normal Linux commands and system calls, the native PVFS library, and a parallel method that utilizes message-passing. While the first method is convenient and provides backwards compatibility, the latter two methods offer better performance because they bypass the kernel interface altogether.

Let’s see how to use each of the three methods, and investigate what kind of performance we can expect.

Accessing Files under PVFS

Listing One is a simple I/O read test program, written in C, that repeatedly seeks to a random location in a file and reads a large block of data. This kind of random file access should minimize the effectiveness of the cache (for sufficiently large files), which would otherwise tend to improve the performance of serial read operations. The code opens the file specified in PATH, seeks to a random location in the file, and reads BSIZE floating point values. The code performs the seek and read operations NTIMES times before closing the file and exiting. The code may be easily modified to change the file location, the file size, the buffer size, and the number of seek and read operations performed.

Listing One: read_test.c


#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define PATH   "/mnt/pvfs/obs.bin.fullUS"
#define FSIZE  195042750
#define BSIZE  2048000
#define NTIMES 1000

int main(int argc, char **argv)
{
    float data[BSIZE];
    int fd, i;
    off_t offset;

    if ((fd = open(PATH, O_RDONLY)) == -1)
        perror(PATH);
    srandom((unsigned int)FSIZE);

    for (i = 0; i < NTIMES; i++) {
        offset = (long)(((double)random() /
                         (double)RAND_MAX) *
                        (double)(FSIZE - BSIZE));
        printf("Seeking to position %ld\n", offset);
        if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
            perror("lseek");
        printf("Reading a block of data\n");
        if (read(fd, data, BSIZE * sizeof(float)) == -1)
            perror("read");
        printf("First, middle, last points: %g, %g, %g\n",
               data[0], data[(int)(BSIZE*0.5)],
               data[BSIZE-1]);
    }

    close(fd);
    exit(0);
}

When this code is compiled and executed (see Figure One), it performs 1000 seek and read operations against the specified file on the PVFS partition in 7 minutes, 17 seconds. If you were to run this program on a file on a local disk or even an NFS server, it would run faster than under PVFS. Local disk access is, of course, very fast, and NFS performance is better than PVFS when only a single node is accessing a file (as is the case with this program).

Figure One: Benchmark using Linux system calls


[forrest@node02 pvfs]$ cc -O -o read_test read_test.c
[forrest@node02 pvfs]$ time ./read_test > output.A.log

real 7m17.061s
user 0m0.060s
sys 0m32.040s

The problem is that NFS does not scale well when many nodes attempt to perform simultaneous I/O operations. On the other hand, PVFS can provide scalable I/O for hundreds of nodes.

The PVFS Client Library

Listing Two shows a program similar to the one in Listing One. The PVFS file, buffer size, and number of repetitions remain the same, but the code in Listing Two has been modified to use the PVFS native client library directly.

Listing Two: read_test_pvfs.c


#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <pvfs.h>

#define PATH   "/mnt/pvfs/obs.bin.fullUS"
#define FSIZE  195042750
#define BSIZE  2048000
#define NTIMES 1000

int main(int argc, char **argv)
{
    float data[BSIZE];
    int fd, i;
    off_t offset;

    if ((fd = pvfs_open(PATH, O_RDONLY, 0, NULL, NULL)) == -1)
        perror(PATH);
    srandom((unsigned int)FSIZE);

    for (i = 0; i < NTIMES; i++) {
        offset = (long)(((double)random() / (double)RAND_MAX) *
                        (double)(FSIZE - BSIZE));
        printf("Seeking to position %ld\n", offset);
        if (pvfs_lseek(fd, offset, SEEK_SET) == (off_t)-1)
            perror("lseek");
        printf("Reading a block of data\n");
        if (pvfs_read(fd, (void *)data,
                      (BSIZE * sizeof(float))) == -1)
            perror("read");
        printf("First, middle, last points: %g, %g, %g\n",
               data[0], data[(int)(BSIZE*0.5)], data[BSIZE-1]);
    }

    pvfs_close(fd);
    exit(0);
}

When compiled and executed, the same 1000 seek and read operations take only 5 minutes, 49 seconds because the kernel interface has been bypassed. Figure Two shows how the program was compiled and lists the results of the native library benchmark.

Figure Two: Benchmark using PVFS native library


[forrest@node02 pvfs]$ cc -O -o read_test_pvfs read_test_pvfs.c -lpvfs
[forrest@node02 pvfs]$ time ./read_test_pvfs > output.B.log

real 5m49.069s
user 0m12.050s
sys 1m30.690s

ROMIO: An Implementation of MPI-IO

Parallel applications typically use MPI or PVM for message passing between individual processes (see http://www.linux-mag.com/2002-04/extreme_01.html for an in-depth discussion of MPI and PVM). If you’re developing parallel programs with MPI, you can use a library called ROMIO (developed at Argonne National Laboratory, http://www.mcs.anl.gov/romio/) to access PVFS files directly. (ROMIO is an implementation of the MPI-IO application programming interface. MPI-IO allows a programmer to use various high-performance I/O subsystems without having to write specific code for each subsystem’s individual API. For more information about MPI-IO, see http://www.mpi-forum.org/docs/docs.html.)

Listing Three contains yet another version of the read test code that uses MPI-IO routines for reading data from the file. In this version, a different random number seed in each process ensures that no two processes seek to the same location at the same time. In addition, each process creates an output file that may be compared with the output generated by the serial programs described above when the same random number seed is used.
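
A minimal sketch of such a program, assuming MPICH with ROMIO and reusing the file, buffer size, and repetition count from the earlier listings, might look like the following (the per-process seed and the static buffer are assumptions of this sketch rather than details of Listing Three):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define PATH   "/mnt/pvfs/obs.bin.fullUS"
#define FSIZE  195042750
#define BSIZE  2048000
#define NTIMES 1000

static float data[BSIZE];   /* static keeps the 8 MB buffer off the stack */

int main(int argc, char **argv)
{
    int rank, i;
    MPI_File fh;
    MPI_Offset offset;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process opens the same PVFS file for reading. */
    MPI_File_open(MPI_COMM_WORLD, PATH, MPI_MODE_RDONLY,
        MPI_INFO_NULL, &fh);

    /* A different seed per process keeps the access patterns distinct. */
    srandom((unsigned int)(FSIZE + rank));

    for (i = 0; i < NTIMES; i++) {
        offset = (MPI_Offset)(((double)random() / (double)RAND_MAX) *
            (double)(FSIZE - BSIZE));
        /* Read BSIZE floats at an explicit offset; no separate seek
         * call is needed with MPI_File_read_at(). */
        MPI_File_read_at(fh, offset, data, BSIZE, MPI_FLOAT, &status);
        printf("Task %d: first, middle, last points: %g, %g, %g\n",
            rank, data[0], data[(int)(BSIZE * 0.5)], data[BSIZE - 1]);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}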

When compiled and executed on two nodes (see Figure Three), each process performs 1000 seek and read operations using the MPI_File_read_at() routine in 6 minutes, 43 seconds.

Figure Three: Benchmark using MPI-IO


[forrest@node02 pvfs]$ cc -O -o read_test_mpio read_test_mpio.c -lmpich -lpvfs
[forrest@node02 pvfs]$ time mpirun -np 2 -nolocal -machinefile machines ./read_test_mpio
Task 0 started on node02
Task 1 started on node03

real 6m43.201s
user 0m0.060s
sys 0m0.160s
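
As a rough cross-check, each run performs 1,000 reads of 2,048,000 four-byte floats, or about 8.2 GB per process. Dividing by the measured wall-clock times gives roughly 19 MB/s through the kernel interface, about 23 MB/s with the native PVFS library, and about 41 MB/s aggregate (roughly 20 MB/s per process) for the two-node MPI-IO run; these are back-of-the-envelope figures that ignore the time spent writing log output.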

These simple read tests show that PVFS scales reasonably well even on just a couple of nodes. More detailed tests performed by the PVFS developers on a large cluster at Argonne National Laboratory show that aggregate bandwidth increases with the number of I/O servers until the network is saturated. For their analysis, they varied the number of I/O nodes, the number of compute/client nodes, and the I/O buffer size. In addition, they tested both Fast Ethernet and a high-speed interconnect called Myrinet.

While PVFS is still under development, it appears to be stable and scalable. It is included in many of the clustering toolkits and suites available today. Further, it offers reasonable performance on Beowulf clusters and is well supported by the ROMIO implementation of MPI-IO.

If you’re building or maintaining a computational Linux cluster and want to enhance the performance of parallel applications doing significant I/O, PVFS outperforms NFS and is worth a look.



Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@esd.ornl.gov.
