NFS (Network File System) is truly one of the enabling technologies behind the rise of HPC, and in particular, clusters. NFS is the first and to date, the only standard file system that is usable by distributed processing systems. It allows data to be shared across distributed nodes, across different operating systems, and across different architectures.
Without NFS we would be stuck with proprietary file systems, none of which work with the others. Moreover, you would likely only be able to get support and patches from that vendor. Effectively this locks you into a vendor. So if you want to change vendors, for whatever reason, you have to copy the data from the previous storage to the new storage and dump the old storage. This can be a very daunting and time consuming task. So, developers and users alike wanted a shareable file system across a network.
NFS – And Oldy but a Goody
In 1984 Sun Microsystems developed NFS (Network File System) that allows a system to access files over a network. The basic architecture is a client-server one where an NFS server (sometimes called a filer) exports some storage to various clients. The clients mount the storage as though it were a local file system and then read and write to the space (it is possible to mount the file system as read-only so the clients can only read the data).
The very first version of NFS was NFS Version 1 (NFSv1), was an experimental version only used internally at Sun. NFS Version 2 (NFSv2) was released in approximately 1989 and originally operated entirely over UDP. The designers chose NFS to be a stateless protocol with other features such as locking to be implemented outside of the protocol. In this context stateless means that each request is independent of all others and that the NFS server is not required to maintain any information about a client, a file, or request stream between requests. The reason the server was chosen to be stateless was to make the implementation much easier.
In the late 1980′s the University of Berkeley developed a freely distributable but compatible version of NFSv2. This lead to many OS’s becoming NFS compatible so different machines could share a common storage pool. There were some incremental developments of NFSv2 to improve performance. One of these was to add support for TCP to NFSv2.
The Version 3 specification (NFSv3) was released around 1995. It added some new features to keep up with modern hardware, but still did not address the stateless feature nor address some of the security issues that had developed over time.
However, NFSv3 added support for 64-bit file sizes and offsets to handle files larger than 4 gigabytes (GB), added support for asynchronous writes on the server to improve write performance, and added TCP as a transport layer. These are three major features that users of NFS had been asking to be added for some time. NFSv3 was quickly adopted and put into production and is probably the most popular version of NFS in use today.
Using NFS, regardless of the version, you can have various clients read and/or write to the file space at the same time. This capability is fundamental to clusters. In addition, you can have clients read or write to the same file at the same time, but you have to be very careful about how you do it (if the writes don’t overlap or share pages then it can be possible but you have to be very careful of caching).
The general form of NFS forces a particular file space to on a single server. A single server can export multiple file spaces, but a particular file space can only be on a single server. So it’s easy to see how you can overload an NFS server if too many clients access the file system at the same time.
However, believe it or not, NFS is the most popular file system for clusters at this time. There are many reasons why, including that’s it’s really easy to use, easy to configure, and is well understood. Plus you don’t have to modify your codes to use it. In addition, there is another reason, many codes work well with NFS.
NFS is Just Great for Many Codes
HPC codes have come a long way in a few years (relatively speaking). There are now codes that scale to ten of thousands of processors and there are more ISV (Independent Software Vendor) codes than ever before. It’s pretty easy to say that within the HPC community there has been a steady development of better codes, more scalable codes, and better designed codes. However, one aspect that has been ignored in a general sense, is the I/O part of the codes.
Distributed parallel codes have the problem of coordinating the I/O among the processes. This is especially true for applications that need to perform I/O to a single file. In this case, each of the N processes will be performing I/O to the same set of files, most likely at the same time. For this particular scenario you have to be careful that each process does not over write a region of the file(s) that another process is supposed to be using.
To get around this problem many code developers use just B of the processes to perform B I/O. In the case of reading data, this process reads the data and then sends the data to the other processes. In the case of writing data, all of the other processes send their data to the specific process which then does all of the writing. As you can imagine, such an arrangement can easily create bottlenecks and can also cause the single I/O process to use extra memory to hold the data for the other processes.
The most common model for I/O for parallel codes is to have all of the I/O performed by the rank 0 process. The various processes that make up an MPI run are numbered, starting with 0 and going to N-1 where N is the number of processes (can you tell a computer scientist did the numbering?).
The rank 0 process is the first process. So the rank 0 process reads all of the needed data and sends the appropriate data to the other processes. It also has all of the other processes send data to it and then writes the data to the file system.
Code developers could get away with doing I/O in this manner because of several additional reasons: (1) some codes don’t do a lot of I/O compared to their total run time so they don’t need much I/O performance, (2) some of the early parallel file systems were proprietary, so creating code to use them would have rendered the code non-portable (not a good idea considering how fast systems change in the HPC world).
There are many example of codes that don’t do much I/O compared to computation. One example I like to use is CFD codes (Computational Fluid Dynamics). When CFD codes are run for “steady” problems (i.e. those problems where the fluid flow is not dependent upon time), then the code only needs to read data the input data and write out the final solution (basically at the beginning and end of a run). The rest of the time, the code is computing.
The amount of time a code runs varies depending upon a number of factors: how many processors (cores) are being used, the size of the problem, the network, the details of the model and the code, etc. But in the examples I have run in the past a complete CFD solution (a good solution) used over 95% of the total runtime to compute the solution.
Assuming that the code uses over 95% of it’s time for computation, even if we were able to make the I/O portion of the run time infinitely fast, we would only improve the run time by 5%. Consequently, modifying the code so that each of the N processes does it’s own I/O to a parallel file system may not be worth the effort. In this case, NFS works just great. But there are cases when we need more performance from NFS.
Clusters are always getting bigger (I don’t think I’ve ever seen one that gets smaller). This means that larger problems can be tackled and more problems can be run at the same time. Consequently there may be a number of processes performing I/O at the same time and the amount of I/O is increasing. While NFS over Gigabit Ethernet may work just fine for a single job and smaller amounts of I/O, something faster may be needed for more than one simultaneous job.
There are several options for more I/O performance above what NFS provides. For example, you could throw out the current NFS hardware and bring in a larger parallel file system and hardware. This is a good overall solution, but it can get expensive and may not improve the overall performance of the runs as much as we would like. A simpler option is to just run NFS over a faster network.
Enter NFS-RDMA: Stage Right
Just in case you haven’t been paying attention, Infiniband is fast becoming a very popular interconnect for large, and even small, systems (see http://tinyurl.com/57qm93). At the same time, the bandwidth of Infiniband has been increasing rapidly.
For Single Data Rate (SDR) Infiniband the signal rate is 10 Gb/s with a data rate of about 8 Gb/s. Double Data Rate (DDR) Infiniband has a signal rate of 20 Gb/s with a data rate of about 16 Gb/s. By the end of 2008 you should see Quad Data Rate (QDR) Infiniband with a signal rate of 40 Gb/s and a data rate of about 32 GB/s. At the same time, the latency has been steadily dropping — getting closer to 1 microsecond.
But many codes don’t need all of the bandwidth that Infiniband provides. So you have lots of left over bandwidth that you aren’t using. What do you do with it? One answer has been to use the extra bandwidth for storage traffic. The combination of the extra bandwidth capacity of Infiniband coupled with the fact that many codes don’t need anything better than NFS, points to something called NFS/RDMA or, simply, NFS over Infiniband. This is also abbreviated as NFSoRDMA (kind of like IPoIB).
With the basic concept of running NFS over Infiniband a working group of the IETF (Internet Engineering Task Force) was created to define NFSoRDMA. To create NFS/RDMA, the group added a new transport protocol to RPC (Remote Procedure Call) to allow the call/reply to be transferred directly between the NFS server and the client memory.
Remember that NFS is the the only standard “parallel” or distributed file system. Notice that there have been no changes to the NFS layer. This means that it will work with NFSv2, NFSv3, and NFSv4, with some small exceptions for the specific protocol version.
In defining NFS/RDMA in this manner the benefits are:
Using RDMA can reduce the load on the CPU(s) in the NFS server
Reduces client overhead
Avoids data copy
The I/O is done in userspace (OS Bypass)
Increased throughput (ops/sec)
Mellanox and others have been testing NFSoRDMA for some time and are getting some great results. For example, even with a single DDR IB port, Mellanox is getting very good performance. In one test, they used a simple head node with 32, 15k rpm 36GB SAS drives configured as two RAID-0 volumes, one of which is a two-disk volume for the OS, and the remaining 30 drives are a RAID-0 volume for the NFS export.
Four clients were used: (2) dual-core, dual-socket 3.46 GHz Intel Xeons, (1) dual-core 32-bit, single socket 3.4GHz Intel Xeon processor, and (1) single core, dual-socket 1.8 GHZ AMD Opteron node. Everything was connected via DDR Infiniband. They used I/Ozone to test NFSoRDMA with all 4 clients accessing the file system. They also varied the file size to show the impact of caching.
When the data still fits in the NFS cache (basically the memory of the NFS server node), the read performance for 4 nodes was very close to wire speed – about 1.3 GB/s. Once the data exceeded the cache, the performance was about 200-250 MB/s. The single client performance in cache reached a maximum of a bit over 1.05 GB/s. Out of cache, the single client performance was about 250 MB/s.
The write performance for the 4 clients reached almost 600 MB/s while in cache and about 250 Mb/s for 4 nodes once the file size exceeded the memory of the NFS server. The single client write performance in cache was reached at about 250 MB/s. Once out of cache, the single client write performance was about 80 MB/s.
Test Drive It
A number of people have been working on getting NFS/RDMA into the Linux kernel. The 2.6.24 kernel had the client patches to run RPC traffic over RDMA. The patches for the server side were included in the pre-releases of the 2.6.25 kernel.
Don’t be afraid of building a new kernel. There are lots of articles around the Web about how to built a kernel and how to use grub to boot with that kernel. You can even install multiple kernels and just select which one you want at boot time (I always liked that feature). So grab the latest kernel, buy some inexpensive SDR Infiniband hardware and connect your NFS node to the IB fabric. Then it’s fairly easy to export data over IB and mount file systems over IB. My point is that it’s fairly simple to build your own NFS/RDMA setup and many codes don’t need lots of I/O performance and work well with just NFS particularly NFS/RDMA.
Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/linux-mag.com/site/www/htdocs/wp-content/themes/linuxmag/single.php on line 62