Jeff Layton talks with Theodore Ts'o about getting the best performance out of your file system, painless migration and the work still to do.
JL What options do you recommend to get the best performance from ext3 and ext4? (understanding that there are some articles around the web that discuss performance tuning).
TT The big one that I always recommend is the noatime mount option. It disables POSIX-required functionality, but it makes a huge difference on many workloads, especially desktop workloads.
If you compare the number of megabytes written for the “make” and “make clean” steps with and without noatime, you’ll find that “make” requires 10% fewer disk writes (with all of the attendant seeks to the inode table) and “make clean” requires 50% fewer disk writes.
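To confirm whether a file system is actually mounted with noatime, you can inspect the mount table. The following is a minimal sketch, assuming the Linux /proc/mounts layout (device, mount point, type, comma-separated options); the helper name mount_options is made up for illustration, and it ignores escaped characters in mount points.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of mount options for mount_point, or None if not found.

    mounts_text is the content of a /proc/mounts-style table. This helper is
    illustrative only; it does not handle escaped spaces in mount points.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last match: later entries shadow earlier ones.
            found = set(fields[3].split(","))
    return found


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        opts = mount_options(f.read(), "/")
    print("noatime is set" if opts and "noatime" in opts else "noatime is not set")
```

The same helper can be used to check for other options discussed here, since they appear in the same comma-separated field.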
Other than that, it really depends on the workload. We try to make the defaults work well for most users. Speaking generally, ext4 will perform better if you create a fresh ext4 file system image compared to converting an existing ext3 file system. (On the other hand, if you have a very large pre-existing ext3 file system, you may not have the space, or be able to afford the downtime, to do a backup/reformat/restore operation.)
If you don’t need the reliability guarantees of what happens on a crash, you can run without a journal, and disable barriers (ext4 enables barriers by default, for safety; for historical reasons, ext3 does not enable barriers by default) via the mount option “barrier=0”. If you don’t need the security guarantees of what happens after a crash that are provided by “data=ordered”, try using the “data=writeback” mount option. “data=ordered” prevents files which were created right before a crash from containing blocks of uninitialized data, which might reveal private information from another user’s mail or p0rn directory, for example. This is much more important on timesharing systems than it is on single-user systems. “data=ordered” also has some data safety implications for badly written applications that don’t bother to call fsync(), which has been the subject of recent controversy.
Most of the ways to get the best performance out of the file system aren’t in the tuning, but rather in making minor changes to your application programs.
If you are using ext4 and you are writing a large file, particularly in a random order (for example, as a BitTorrent client might do, or an HPC program which is filling in a results file in random order), preallocate the output file to its expected final size using fallocate() or posix_fallocate(). Using fallocate() is also a good idea if the file will take a long time to write out. For example, if you are writing out a large video file in real time, as you might in a DVR, and you know that you are recording a one-hour show at a compression/quality rate that requires 1GB/hour, then fallocate()ing the 1GB in advance will allow the file to be allocated contiguously on disk.
[ n.b. The fallocate() system call is not in most glibc's as of this writing, but posix_fallocate() is; the problem with posix_fallocate is that if you use it on ext3, it will attempt to emulate fallocate() by writing all zeros to the file. This emulation step can be very slow, and may come as a surprise to the application that was expecting posix_fallocate() to be quick; the fallocate() system call has the advantage that if it is not present, it will fail, and the application can then decide on its own what it wants to do. ]
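As a rough illustration of preallocation, here is a minimal sketch in Python, which exposes posix_fallocate() as os.posix_fallocate(). Note that this is the posix_fallocate() path, not the raw fallocate() system call, so the caveat above about slow emulation on file systems without native support still applies. The helper name preallocate is made up for this example.

```python
import os


def preallocate(path, size):
    """Create path and try to reserve `size` bytes of disk space for it.

    Returns True if the reservation succeeded, False if the file system
    refused it, so the caller can decide what to do instead.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Reserve blocks for bytes [0, size) before any data is written.
            os.posix_fallocate(fd, 0, size)
            return True
        except OSError:
            return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical file name; 1 GiB matches the one-hour DVR example.
    if not preallocate("recording.ts", 1 << 30):
        print("preallocation not available; writing without it")
```

After a successful call the file already has its final size, so the application can then fill it in, in any order, with ordinary positioned writes.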
For both ext3 and ext4, if you are using readdir() and then accessing all of the files in a directory, it is a very good idea to sort the files returned by readdir() in inode order. For why, see here and here.
(Ext4 has an inode table readahead algorithm that helps avoid this problem somewhat, but it’s still a good idea to do sort-after-readdir.)
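Sort-after-readdir can be sketched in a few lines. The example below uses Python's os.scandir(), which returns the inode number along with each directory entry, so no extra stat() call is needed before sorting. The function name entries_in_inode_order is made up for illustration.

```python
import os


def entries_in_inode_order(directory):
    """Return the paths in `directory`, sorted by inode number."""
    with os.scandir(directory) as it:
        # DirEntry.inode() comes from the directory listing itself.
        entries = [(entry.inode(), entry.path) for entry in it]
    entries.sort()
    return [path for _inode, path in entries]


if __name__ == "__main__":
    # Visit every file in the current directory in inode order.
    for path in entries_in_inode_order("."):
        info = os.stat(path)
        print(info.st_ino, path)
```

The point of the design is that the directory is read completely first, and only then are the files touched, in an order that follows the inode numbers rather than the order readdir() happened to return.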
For both ext3 and ext4, try to avoid small writes; large writes which are block aligned will always be faster. If the application must do many small writes, it may be worthwhile to use mmap(); however, if the application is only going to be making a single sequential read or write pass over the file, mmap() is unlikely to be helpful.
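One simple way to avoid small writes is to coalesce them in memory and issue a few large ones. The following is a minimal sketch, assuming a 1 MiB chunk size (a multiple of common file system block sizes; the right value depends on your system). The function name write_records is made up for this example.

```python
import os


def write_records(path, records, chunk_size=1 << 20):
    """Write an iterable of small byte strings using few, large write() calls.

    Returns the number of write() calls issued. Writes start at offset 0 and
    are chunk_size long, so they stay block aligned as long as chunk_size is
    a multiple of the block size and the kernel does not return short writes.
    """
    buf = bytearray()
    calls = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for record in records:
            buf += record
            while len(buf) >= chunk_size:
                written = os.write(fd, buf[:chunk_size])
                del buf[:written]
                calls += 1
        # Flush whatever is left; this final write may be a partial chunk.
        while buf:
            written = os.write(fd, buf)
            del buf[:written]
            calls += 1
    finally:
        os.close(fd)
    return calls


if __name__ == "__main__":
    # 50,000 records of 100 bytes each become a handful of large writes.
    n = write_records("results.bin", (b"x" * 100 for _ in range(50000)))
    print("write() calls:", n)
```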
Hope this helps!
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).