A little bit of testing and tuning can yield big results
The Network File System is still commonly used to connect desktops to file shares and to interconnect compute nodes in high-performance clusters. By tuning NFS — using just a handful of parameters and measurements — you can help clients and servers run at peak performance. Here’s how.
Mark F. Komarinski
Much like gourmet cooking, performance tuning is a little bit of science and a little bit of art.
When cooking, you have to take account of the quality of your ingredients, the thermal characteristics of your oven or stovetop, and even things like the weather. Overcook and you get charcoal; undercook and nothing tastes quite right.
Similarly, when tuning performance, you have to account for the quality of the components, the characteristics of your NICs, switches, and disk drives, and even your TCP/IP implementation. Don’t tune enough and your hardware can’t realize its full potential; tune too much and you may realize good performance in some situations but suffer terrible performance for others.
In cooking and in tuning, somewhere in between is what you’re looking for.
If you run a cluster, chances are that your users are a demanding bunch. And more often than not, many of those demands are contradictory. The life sciences team wants to use BLAST (http://www.ncbi.nlm.nih.gov/BLAST/), whose databases often exceed one gigabyte. (You can think of BLAST as grep on steroids: given a genetic sequence, BLAST compares it against a known database, perhaps that of humans or E. coli.) Computing Services has a web-based front end that uses thousands of small files. And other researchers want summarized results from life sciences, which come in a mix of file sizes that depends on the results from BLAST.
So what’s a cluster administrator to do? Compromise. Provide the best performance for the greatest number of applications.
Many clusters use the Network File System (NFS) to share files between nodes. (NFS is also widely used to share files between servers and desktops and to provide filesystems to otherwise diskless workstations.) Indeed, NFS is a good general solution because it’s easy to configure, is built into Unix, Linux, and Mac OS X systems, and is available from third parties for Windows. But NFS can be a blessing or a curse: configured improperly, it can reduce performance to 1% of total network capacity (as benchmarked on the author’s cluster). Configured optimally, with hardware and software problems ferreted out, it can run at close to 100% of network speed.
Let’s look at how to tweak your Network File System configuration and measure the effects.
NFS Server Considerations
Because NFS mounts one machine’s filesystems to another machine, the performance of NFS can be affected by all “points” — software and hardware — that connect the two endpoints. Here are some considerations:
Track Your Tweaks
When tuning any system, record each configuration that you try and its results. If much of the tuning can be captured numerically, you can visualize the results in something as simple as a spreadsheet (as shown in Figure One).
Once you’re happy with a configuration, take a number of measurements to establish a baseline before your NFS server goes into production. Great variances from your previously observed “expected” performance may very well point to problems.
*Ethernet speed. Gigabit Ethernet is the fastest TCP/IP connection most people are going to see. Ten Gigabit Ethernet is on its way, but is still rather expensive.
*Ethernet chipsets. Gigabit Ethernet cards do some TCP/IP offloading, but don’t expect a speed boost from a full TCP/IP offloading card. Most NFS connections run over UDP, and because there isn’t a whole lot to a UDP packet, offloading UDP packet building doesn’t gain you much.
*Ethernet tuning. Use an application like the Network Protocol Independent Performance Evaluator (NetPIPE, http://www.scl.ameslab.gov/netpipe/) to measure network speeds before doing any tuning. If a client and server can talk at 500 Mbps but writes to an NFS export run at only 10 Mbps, you can be pretty sure your network is not the bottleneck.
*Kernel tuning. Your Linux kernel should be tuned for best performance for your network. (See the sidebar “Gigabit Speeds” for more information.) In short, a stock 2.4 kernel is tuned for a 100 megabit network.
On a standard 2.4 kernel and using NetPIPE to measure network speeds between two nodes, you may find throughput on Gigabit Ethernet to be closer to 300-500 Mbps, less than half the capacity.
You can boost your throughput by tuning kernel network parameters in /etc/sysctl.conf and running sysctl -p to load them. You may be able to get close to 925 Mbps in the right conditions; the author’s results were usually above 500 Mbps.
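As an illustration of the kind of entries involved, here is a sketch of socket buffer settings that are commonly raised for Gigabit links. The parameter names are standard Linux sysctl keys, but the values are assumptions chosen for illustration and are not the settings tested for this article. Measure with NetPIPE before and after any change.

```
# /etc/sysctl.conf: illustrative values only, not the author's tested settings
# Maximum socket receive and send buffer sizes, in bytes
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# Default socket buffer sizes, in bytes (these apply to UDP sockets as well)
net.core.rmem_default = 262144
net.core.wmem_default = 262144
# TCP buffer limits: minimum, default, maximum (bytes)
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
```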
(These same options are also available in the 2.6 kernel, but settings and the effects of those settings had not been tested by press time, so recommendations for 2.6 cannot be made.)
In addition to settable kernel parameters, the drivers for your network interface cards may have their own options for performance tuning. For example, Intel provides additional tuning instructions for the drivers for its e1000 controllers. Check the documentation of your Ethernet driver for tuning tips and options specific to Gigabit.
*Switching hardware. Not all switching gear is alike: there are reasons why companies like Cisco get a lot more than $100 for a switch (though all you’re really looking for here is throughput). Switches that support Ethernet bonding can also be a benefit, as you can bond multiple connections to one NFS server, allowing the server to appear to have a multi-gigabit connection to the switch and NFS clients. Managed switches also let you see utilization in real time, so you can monitor the network as your tests run.
*RAID level for disk storage. RAID 5 provides good read performance and a high amount of usable storage. RAID 10 (or 1+0, a RAID 0 stripe across RAID 1 mirrored pairs) means you lose 50% of your disk space to mirroring, but write performance is far better than with RAID 5. If you’re looking to maximize your write performance, use RAID 10. Otherwise, use RAID 5 and stretch your budget and disk size.
*SCSI versus ATA. ATA drives are very inexpensive when compared to SCSI, but storage vendors usually recommend ATA for archival or near-line storage. Using ATA as your primary storage can cause performance problems over the long term, especially when there’s a large number of I/O operations per second. (ATA drives run at 5,400 or 7,200 RPM, while SCSI drives are available at 10,000 and 15,000 RPM.) In addition, the SCSI protocol is designed more for heavy use than ATA is. Recent advances in Serial ATA (SATA) protocols are narrowing this stability and performance difference, so check with your storage vendor to see what they recommend.
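To put numbers on the RAID trade-off in the list above, the following sketch computes usable capacity for a set of identical disks. The function name raid_usable_gb and the example array of eight 250 GB disks are hypothetical. The arithmetic assumes the usual layouts: RAID 5 gives up one disk's worth of space to parity, and RAID 10 gives up half to mirroring.

```shell
#!/bin/sh
# Usable capacity, in GB, for n identical disks of size_gb each.
# RAID 5 loses one disk's worth of space to parity;
# RAID 10 loses half of the raw space to mirroring.
raid_usable_gb() {
    level=$1; n=$2; size_gb=$3
    case $level in
        5)  echo $(( (n - 1) * size_gb )) ;;
        10) echo $(( n / 2 * size_gb )) ;;
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

raid_usable_gb 5 8 250    # prints 1750
raid_usable_gb 10 8 250   # prints 1000
```

With these hypothetical disks, RAID 5 yields 750 GB more usable space than RAID 10, which is the budget side of the write-performance trade-off.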
If you’re building a cluster from scratch, a little research and proper planning ahead of time can avoid a number of potential bottlenecks.
NFS Client Options
With the underlying technology and a suitable kernel in place, you can turn your attention to some of the more common NFS options you’ll want to tweak to see how each affects performance. In general, each of the options will affect all of your clients uniformly, but you can run simple tests on different NFS clients to make sure that’s the case.
One of the lesser-known features of most Unix filesystems is the access time, or atime, which records the last time a file was accessed for read or write. (Run ls -lu to see the last access time.) atime isn’t really used all that much, but it translates to extra overhead nonetheless. (The NFS server tells the filesystem to tell the disk to update the file’s last access time.) If you’re using Linux as an NFS server, you may want to consider using the mount option noatime to disable atime updates. (NFS appliance servers, on the other hand, may already be optimized for this or may not support the feature at all. Even so, it’s worthwhile to test with noatime.)
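On a Linux server, one way to apply noatime is where the exported filesystem itself is mounted. The entry below is a sketch only: the device, mount point, and filesystem type are hypothetical placeholders for your own.

```
# /etc/fstab on the NFS server (hypothetical device, mount point, and filesystem)
/dev/sdb1   /export   ext3   defaults,noatime   1 2
```

An already-mounted filesystem can pick up the option without a reboot with mount -o remount,noatime /export.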
On most Linux filesystems, the default block size is 4 K, and it’s usually best to keep block sizes in 4 K increments. On other operating systems (Solaris, for example), the default block size is 8 K. However, some NFS appliances can benefit from a 32 K read and write block size. You can use the rsize and wsize mount options to tune the read and write block size, respectively. Some testing performed by the author indicated that the read/write block size and the disk block size should be the same. Try different read and write sizes and measure the performance.
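To try a range of sizes methodically, it can help to generate the mount commands rather than type them. The sketch below only prints each command; it does not run it. The export server:/export, the mount point /mnt/nfs, and the helper name nfs_mount_cmd are hypothetical. rsize and wsize are given in bytes.

```shell
#!/bin/sh
# Print (not run) an NFS mount command for a given block size in KB.
# server:/export and /mnt/nfs are hypothetical names.
nfs_mount_cmd() {
    size_kb=$1
    bytes=$(( size_kb * 1024 ))
    echo "mount -t nfs -o rsize=$bytes,wsize=$bytes server:/export /mnt/nfs"
}

# One candidate command per block size to test
for kb in 4 8 16 32; do
    nfs_mount_cmd "$kb"
done
```

Unmount the filesystem between runs so that each test starts with the new sizes in effect.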
When running in sync mode (the default), the NFS server makes sure that each block is written to disk before telling the client that the block’s been written. This can result in a performance hit, especially if the disk is experiencing a high amount of I/O. The performance of async can be better, but there is a significant caveat: async is not recommended because there’s no guarantee that data is actually written to disk on the NFS server, potentially resulting in data loss.
(NFS appliance servers and some RAID controllers get around the sync latency by using a battery-backed cache. Should the appliance lose power, the cache either flushes to disk or is able to flush to disk once power is restored.)
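On a Linux NFS server, sync and async are set per export in /etc/exports. The fragment below is a sketch; the exported path and the client network are hypothetical.

```
# /etc/exports on a Linux NFS server (hypothetical path and client network)
# sync: acknowledge a write only after it reaches stable storage
/export   192.168.1.0/24(rw,sync)
# async: acknowledge before the data is on disk; faster, but risks data loss
# /export   192.168.1.0/24(rw,async)
```

After editing the file, exportfs -ra re-exports the filesystems with the new options.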
The default protocol for NFS is UDP. UDP is connectionless and has less protocol overhead. With UDP, if the NFS server is rebooted or becomes disconnected, clients cannot read or write until the server is available again, but they pick up where they left off when the server returns (the connection appears never to have been severed). Any remaining read or write operations resume once the connection is restored, and no data loss occurs if the filesystem is mounted in sync mode (the default). However, UDP doesn’t guarantee that a packet arrives, so if your network is running at close to its maximum throughput, lost UDP packets can mean more retransmits of data. TCP copes better with a heavily loaded network, but its extra protocol overhead may reduce total performance.
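On a Linux client, the transport is chosen with the udp or tcp mount option. The fstab entries below are a sketch with a hypothetical server and mount point; use one line or the other, and test both.

```
# /etc/fstab on an NFS client (hypothetical server and mount point)
# UDP transport
server:/export   /mnt/nfs   nfs   rw,udp   0 0
# TCP transport
# server:/export   /mnt/nfs   nfs   rw,tcp   0 0
```

On the client, nfsstat -c reports RPC calls and retransmissions (the retrans column), which gives a rough view of how much packet loss is costing under UDP.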
Now that you know what each setting affects, you can make suitable changes.
The next step is to measure the effectiveness of each change or suite of changes. Before you start testing, you’ll want some method of monitoring your network throughput and some way to measure CPU utilization on the client and on the server. Watching these in real time during the test can give you useful information about where the bottlenecks are.
Gauging NFS performance is more than just doing something simple like timing dd if=/dev/zero of=somefile bs=1024 count=1024k, since you need to check across a variety of file block sizes. What gives good performance for large files? What’s best for small files? What gives best performance for all file sizes?
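Even a dd test says more when it sweeps several block sizes. The following sketch writes the same amount of data with three block sizes and reports the elapsed time for each. The helper name dd_sweep, the TESTDIR variable, and the tiny 4 MB total are assumptions made so the sketch finishes quickly; a meaningful run needs far more data than the client has RAM, and it remains much cruder than the dedicated benchmarks described next.

```shell
#!/bin/sh
# Write the same amount of data with several block sizes and time each run.
# Point TESTDIR at the NFS mount under test; a temporary directory is the default.
TESTDIR=${TESTDIR:-$(mktemp -d)}
TOTAL_KB=4096   # small total so the sketch finishes quickly

dd_sweep() {
    for bs_kb in 1 4 32; do
        count=$(( TOTAL_KB / bs_kb ))
        start=$(date +%s)
        dd if=/dev/zero of="$TESTDIR/ddtest" bs=$(( bs_kb * 1024 )) count="$count" 2>/dev/null
        end=$(date +%s)
        # One result line per block size
        echo "bs=${bs_kb}k count=$count seconds=$(( end - start ))"
        rm -f "$TESTDIR/ddtest"
    done
}

dd_sweep
```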
There are two applications you can use to test NFS performance and each provides its own useful information. You can use either or both as you choose.
My bonnie++ Lies Over the Ocean
The first application to try is bonnie++. It’s standard in some distributions or can be easily compiled from source available at http://www.coker.com.au/bonnie++/. The latest version is 1.03, and the author, Russell Coker, is now working on 2.0. (The original version of bonnie++ was named bonnie and was written by Tim Bray.)
bonnie++ is a short test, running about 30 minutes on a single client. The information it yields can be used to compare differences between different mount options, but its real value is in deriving CPU usage during the various tests it performs. A test that returns a high CPU utilization indicates that your bottleneck is the CPU itself and that a faster processor will improve performance. A low CPU utilization indicates that the problem is elsewhere.
There are two parts to the bonnie++ “benchmark.” The first part works with large files and the second part tests with small files. As the bonnie++ man page states, the first part simulates what would happen to a filesystem when being used as a database server, while the second part is good for things like a mail cache or for a cluster node hosting a large number of small output files that need processing.
Table One presents a list of options for bonnie++.
Table One: A list of options for the NFS test bonnie++
-d  Directory to create files in
-n  Number of files (times 1,024) to create in the second part of the test. Default is 30.
-m  The name of the machine, used when outputting results
-r  Size of RAM (in MB). The tests you run should be at least twice the size of physical RAM so that you can be sure that the filesystem buffers don’t skew your results.
-u  The user ID (UID) to use when running tests. bonnie++ doesn’t like running as root, so specify some non-root user that has permissions to access the directory specified earlier with -d.
A standard run on a 4 GB machine is something like this:
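The sketch below prints, rather than runs, a command line of this kind. The mount point /mnt/nfs, the user nobody, and the helper name bonnie_cmd are hypothetical, and it assumes that 4 GB of RAM corresponds to -r 4096. Any extra arguments are appended to the -m label.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a bonnie++ command line.
# Arguments: test directory, RAM in MB, user, host label, then any
# option names to append to the -m label.
bonnie_cmd() {
    dir=$1; ram_mb=$2; user=$3; host=$4; shift 4
    label=$host
    for opt in "$@"; do
        label="$label-$opt"
    done
    echo "bonnie++ -d $dir -r $ram_mb -u $user -m $label"
}

bonnie_cmd /mnt/nfs 4096 nobody myhost
# prints: bonnie++ -d /mnt/nfs -r 4096 -u nobody -m myhost
bonnie_cmd /mnt/nfs 4096 nobody myhost noatime 32k
# prints: bonnie++ -d /mnt/nfs -r 4096 -u nobody -m myhost-noatime-32k
```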
Each time you run bonnie++, change the -m option to reflect what options you’ve enabled or changed between each test. For example, a subsequent run that enables noatime would use -m myhost-noatime and another test with 32 K block sizes and noatime is -m myhost-noatime-32k. The naming convention is up to you.
The results of bonnie++ include a human-readable summary that lists the statistic and the results and a comma-delimited line that can easily be imported into a spreadsheet or graphing application.
Once all of your bonnie++ tests are complete, you can start analyzing the results. Look for the highest numbers, regardless of CPU utilization. If two numbers are the same, take the one with the lower CPU utilization as the better of the two. Now compare this to what you’ll be using the fileserver for. If you have larger files, put more importance on the results of the first half of the test. For lots of smaller files, pay attention to the second half of the tests.
In the I/O Zone
Another useful NFS test is iozone (http://www.iozone.org/), which has a number of features not found in bonnie++. It’s a much more comprehensive test, allowing you to test different block size writes and different size files. In addition, iozone runs a wider range of standard tests on each file. It even exports to Excel-compatible files that can be read by OpenOffice.org and can be used to generate graphs.
iozone is a bit more complicated to run, can take longer to run than bonnie++, and produces a lot of results to scan through. (You can use bonnie++ for doing quick and dirty testing and for beating up a server, then use iozone for more extensive testing.) Table Two has a list of common options for iozone.
Table Two: Some common options for iozone
Run an automatic test.
Output in space and tab-delimited format.
Produce output in Microsoft Excel .xls format (binary)
Maximum file size. Should be two times the size of RAM.
Unmount and mount between tests, which ensures that buffers are cleared. Make sure that the mount is listed in /etc/fstab before you start the test. Use the -f filename option to list where you want to run the test (the default is the current working directory), as you can’t umount the current working directory while the shell is still using it.
List the types of tests to run. You can use multiple -i options to specify multiple tests to run. (You can get a full list of available tests in the iozone manual.)
Minimum file size (in KB) for running in auto mode.
Maximum file size (in KB) for running in auto mode. You can also specify -q n m to use megabytes and -q n g to use gigabytes.
The stdout of iozone emits some useful information, but to really visualize the results, take advantage of the Excel spreadsheet function of iozone. The resulting spreadsheet loads nicely in OpenOffice.org’s oocalc.
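As a sketch of how such a run might be put together, the helper below prints (but does not run) an iozone command line for an automatic test with Excel output. It assumes the iozone options -a (auto mode), -R (Excel report), -b (spreadsheet file), -g (maximum file size, with an m suffix for megabytes), and -f (test file). The helper name iozone_cmd, the test file path, and the spreadsheet name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) an iozone command line.
# Arguments: client RAM in MB, test file on the NFS mount, spreadsheet name.
iozone_cmd() {
    ram_mb=$1; testfile=$2; xls=$3
    max_mb=$(( ram_mb * 2 ))   # maximum file size: twice physical RAM
    echo "iozone -a -R -b $xls -g ${max_mb}m -f $testfile"
}

iozone_cmd 4096 /mnt/nfs/iozone.tmp nfs-results.xls
# prints: iozone -a -R -b nfs-results.xls -g 8192m -f /mnt/nfs/iozone.tmp
```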
FIGURE ONE: Sample results derived by tuning various parameters in NFS
The first cell (“A1”) contains the command line used to generate the spreadsheet, and the following cells contain the raw results. If you don’t know how to use oocalc, you can easily generate a graph by selecting the results, including the record and file sizes, as shown in Figure Two. With these cells selected, click on Insert, then Graph. The cells you selected should be entered as the range. Click “First row as label” and “First column as label” and select a sheet where you want the graph to go. The last part of creating the graph is to choose legends and axes: the X axis is the record size, the Y axis is speed in KB per second, and the Z axis is file size. You can see the results in Figure One.
Figure Two: Screenshot of oocalc selecting cells
Now that you’ve run the tests from one client, you can test from multiple clients simultaneously and check the performance. The tests may show a drop in performance versus one client, but the results of many clients running simultaneously are closer to what will happen once your fileserver goes into production.
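One way to start the same test on several clients at once is to launch each run in the background and wait for all of them to finish. The sketch below assumes password-less ssh access to the clients; the helper name run_on_clients, the RUNNER variable, and the node names are hypothetical.

```shell
#!/bin/sh
# Start the same command on several clients at once and wait for all of them.
# RUNNER defaults to ssh; override it (for example with echo) for a dry run.
RUNNER=${RUNNER:-ssh}

run_on_clients() {
    cmd=$1; shift
    for host in "$@"; do
        # Each client's run starts in the background
        $RUNNER "$host" "$cmd" &
    done
    # Block until every background run has finished
    wait
}

# Example (not executed here; node names are hypothetical):
#   run_on_clients "bonnie++ -d /mnt/nfs -u nobody -m \$(hostname)" node01 node02 node03
```

Because every client starts at nearly the same moment, the server sees the overlapping load that a single-client run cannot produce.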
iozone and bonnie++ are not the only applications you can use — the important thing is that you have some sort of methodology to test performance under your typical conditions and have a way of finding bottlenecks.
And once you know where your bottlenecks are, you can take steps to speed traffic through.
For example, faster RPM disks, faster RAID or SCSI controllers, better networking gear, and more RAM can all improve performance. Knowing what will improve your performance will make your upgrade dollars go farther.
Mark Komarinski is a Senior Research Systems Architect at Harvard Medical School. He can be reached at firstname.lastname@example.org.