In 2.6.34, a new parallel distributed file system named Ceph was added to the kernel. Ceph is object-based, distributes its metadata, decouples metadata from data, and promises great performance as well as great scalability.
In the 2.6.37 kernel, the Ceph developers introduced a new block device called the RADOS block device (RBD). RBD allows you to create a block device that is backed by object storage in Ceph. Each device consists of a single metadata object, while the data is striped over many objects. With careful selection and layout, this means that you can lose a storage node without losing the file system on an RBD device (i.e., it keeps functioning). Other network-based block devices such as iSCSI and AoE don’t allow this. If you lose a node using these protocols, then you lose the file system. Keep an eye on RBD.
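To make the striping idea concrete, here is a minimal sketch of how a byte offset in the block device could map onto one of the many backing objects. It assumes fixed-size objects and uses 4 MiB purely as an illustrative object size. The function names are made up for this example and are not part of any Ceph tool.

```shell
#!/bin/sh
# Sketch only: simple fixed-size striping of a block device over objects.
# ASSUMPTION: 4 MiB objects; the value is illustrative, not from the article.
OBJECT_SIZE=$((4 * 1024 * 1024))

# rbd_object_index OFFSET_BYTES -> zero-based number of the object holding it
rbd_object_index() {
    echo $(( $1 / OBJECT_SIZE ))
}

# rbd_object_offset OFFSET_BYTES -> byte position inside that object
rbd_object_offset() {
    echo $(( $1 % OBJECT_SIZE ))
}

# Usage:
#   rbd_object_index 10485760    # byte 10 MiB of the device
#   rbd_object_offset 10485760
```

Because consecutive chunks of the device land in different objects, the objects can be placed (and replicated) on different storage nodes, which is what lets the device survive the loss of a node.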
File System Potpourri
As with all Linux kernels, there are updates to other storage aspects in addition to file systems. In this section developments for other file systems and other aspects of storage are summarized starting with everyone’s favorite standard shared file system, NFS.
NFS
NFS has several updates in the 2.6.37 kernel. The first is a patch that lets NFSv4 mimic the readdirplus operation from NFSv3. Readdirplus returns file attributes along with the directory entries, which is a significant improvement when “stat-ing” files in a file system. The commit for this patch gives a simple example of running “ls -lU --color=none” on a system. Without the readdirplus patch there were 16x more RPC calls and the system was 4-5 times slower on large directories. So the readdirplus patch is a big performance winner on stat operations for NFSv4.
There was also a patch in 2.6.37 that introduced a new idmapper system that allows more than one request to be run at the same time (the old idmapper restricted things to a single request at a time). Again this improves performance and scalability.
A third patch added to NFS in 2.6.37 introduced a new mount option, “local_lock” (passed as “-o local_lock=...”). A problem was reported that some Windows clients accessing a Samba share backed by an NFS mount were having lock problems. This was due to some Windows applications using two types of locks on the same file (who knows what crazy things those silly Windows applications actually do). As a result, this option was added so that selected lock types can be handled locally on the NFS client and these applications don’t get confused when accessing the Samba share.
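As a rough illustration of how the option might be used, here is a small wrapper around mount. It assumes the mechanisms documented in the nfs(5) man page (all, flock, posix, none). The function name, server name, export path, and mount point are all hypothetical.

```shell
#!/bin/sh
# Sketch only: mount an NFS export with a chosen local_lock mechanism.
# ASSUMPTION: valid mechanisms are all, flock, posix, none (per nfs(5)).

# mount_nfs_local_lock MECHANISM SERVER:EXPORT MOUNTPOINT
mount_nfs_local_lock() {
    case "$1" in
        all|flock|posix|none) ;;
        *) echo "unknown local_lock mechanism: $1" >&2; return 1 ;;
    esac
    mount -t nfs -o "local_lock=$1" "$2" "$3"
}

# Usage (as root; names are placeholders):
#   mount_nfs_local_lock flock nfsserver:/export /mnt/export
```

With “flock”, only flock-style locks are treated as local to the client, while POSIX (fcntl) locks still go to the server, which is the combination the Samba case calls for.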
OCFS2
OCFS2 is a clustered file system developed by Oracle. The original file system, OCFS, was a shared file system for use by databases. With the addition of POSIX compliance it became OCFS2, which is an underappreciated shared (clustered) file system. In 2.6.37 a number of new features were added to OCFS2.
The first patch is actually fairly significant. OCFS2 always had the capability of addressing file systems larger than 16 TiB, but a few sanity checks still needed to be completed for this to work. This patch in 2.6.37 now allows OCFS2 to mount file systems larger than 16 TiB (i.e., the sanity checks were added).
A second patch created a new mount option “coherency=*” that handles cluster coherency for O_DIRECT writes. The option takes one of two values, and choosing between them depends upon your specific use case. Please see the link for more details.
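For illustration, here is a small wrapper showing how the option might be passed at mount time. It assumes the two values listed in the kernel's OCFS2 documentation, “full” (the default, full cluster coherency for O_DIRECT writes) and “buffered” (concurrent O_DIRECT writes without the cluster-wide exclusive lock). The function name, device, and mount point are hypothetical.

```shell
#!/bin/sh
# Sketch only: mount an OCFS2 volume with a chosen O_DIRECT coherency mode.
# ASSUMPTION: valid values are "full" and "buffered" (kernel OCFS2 docs).

# mount_ocfs2_coherency MODE DEVICE MOUNTPOINT
mount_ocfs2_coherency() {
    case "$1" in
        full|buffered) ;;
        *) echo "unknown coherency mode: $1" >&2; return 1 ;;
    esac
    mount -t ocfs2 -o "coherency=$1" "$2" "$3"
}

# Usage (as root; device and mount point are placeholders):
#   mount_ocfs2_coherency buffered /dev/sdb1 /mnt/ocfs2
```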
CFQ Changes
Recall that the kernel has several I/O schedulers, of which the most popular is called CFQ (Completely Fair Queuing). CFQ is the typical default I/O scheduler in almost all Linux distributions.
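If you want to check which scheduler a device is using, the kernel lists the available schedulers in sysfs and marks the active one with square brackets. Here is a minimal sketch that extracts the active name. The helper name and the device “sda” are placeholders.

```shell
#!/bin/sh
# Sketch only: pick the active I/O scheduler out of the sysfs listing,
# which looks like "noop deadline [cfq]".

# active_scheduler LINE -> prints the name shown in [brackets]
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage (sda is a placeholder device):
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
# Switching schedulers, as root:
#   echo cfq > /sys/block/sda/queue/scheduler
```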
In the 2.6.37 kernel, a patch was added that improved fsync performance for small files. This may seem somewhat trivial but this can affect journal commits for file systems, so it’s actually a fairly significant patch.
In case you didn’t know, the networking stack within Linux has the ability to throttle networking. This is typically referred to as traffic shaping with the general idea of delaying packets before transmitting them so that a desired output rate (bandwidth) is met.
For some time people have wanted the same concept applied to I/O. Specifically, they want the ability to throttle or control the amount of I/O through the kernel. In the 2.6.37 kernel, that ability was added. This new I/O throttling capability allows you to set upper limits on reads and/or writes for a group of processes.
The kernel newbies link has a simple example from the patch documentation that I’ve reproduced here.
Step 1: Mount the cgroup blkio controller:
# mount -t cgroup -o blkio none /cgroup/blkio
Step 2: Specify a bandwidth rate on a particular device for the root group. The format for the policy is “major:minor bytes_per_second”:
# echo "8:16 1048576" > /cgroup/blkio/blkio.throttle.read_bps_device
The result is that these commands impose a limit of 1 MB/s on reads for the root group on the device with major/minor number 8:16.
Rather than specify the data throughput rate in MB/s, the rate can also be set using IO operations per second (limits IOPS performance using blkio.throttle.read_iops_device). You can also do the same thing for writes (blkio.throttle.write_bps_device and blkio.throttle.write_iops_device). For more information you can read the document in the git commit. I expect to see more articles on this capability in the future since this has great potential.
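The four throttle files all take the same kind of rule, so a small helper can cover them. This is a sketch, assuming the rule format “major:minor limit” from the kernel's blkio controller documentation and the /cgroup/blkio mount point used above. The function names and the BLKIO_ROOT variable are made up for this example.

```shell
#!/bin/sh
# Sketch only: write throttle rules into the blkio cgroup files.
# ASSUMPTION: the controller is mounted at /cgroup/blkio and each rule
# has the form "major:minor limit".
BLKIO_ROOT=${BLKIO_ROOT:-/cgroup/blkio}

# mib_to_bytes N -> N MiB expressed in bytes
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# throttle_rule MAJOR MINOR LIMIT -> prints "MAJOR:MINOR LIMIT"
throttle_rule() {
    printf '%s:%s %s\n' "$1" "$2" "$3"
}

# set_throttle FILE MAJOR MINOR LIMIT -> writes the rule into FILE
set_throttle() {
    throttle_rule "$2" "$3" "$4" > "$BLKIO_ROOT/$1"
}

# Usage (as root; 8:16 is the device from the example above):
#   set_throttle blkio.throttle.write_bps_device 8 16 "$(mib_to_bytes 2)"
#   set_throttle blkio.throttle.read_iops_device 8 16 100
```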
The 2.6.37 kernel had several nice surprises in it. Some file systems got an upgrade with an eye toward performance, scalability, and larger capacities. There were also some changes to the CFQ I/O scheduler. But potentially, the biggest change is the inclusion of I/O Throttling.
I/O Throttling is a very powerful mechanism for controlling I/O resources within a system. For example, you could limit the amount of I/O resources (read and/or write) that a given user can access. You could also impose the same limitation on a device for all users.
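As a rough sketch of the per-group case, you could create a child cgroup, give it its own read limit, and then move processes into it. This assumes a cgroup v1 style blkio hierarchy mounted at /cgroup/blkio, where writing a PID into a group's “tasks” file moves that process into the group. The function names and the group name “backup” are hypothetical.

```shell
#!/bin/sh
# Sketch only: throttle reads for a named group of processes.
# ASSUMPTION: blkio controller mounted at /cgroup/blkio (cgroup v1 layout).
BLKIO_ROOT=${BLKIO_ROOT:-/cgroup/blkio}

# limit_group NAME MAJOR MINOR READ_BPS -> create the group, set its limit
limit_group() {
    mkdir -p "$BLKIO_ROOT/$1" &&
    printf '%s:%s %s\n' "$2" "$3" "$4" \
        > "$BLKIO_ROOT/$1/blkio.throttle.read_bps_device"
}

# add_to_group NAME PID -> move a process into the group
add_to_group() {
    echo "$2" > "$BLKIO_ROOT/$1/tasks"
}

# Usage (as root): cap the current shell at 1 MB/s of reads from 8:16
#   limit_group backup 8 16 1048576
#   add_to_group backup $$
```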
These types of limitations are also important for storage servers. Imagine being able to limit the amount of I/O from an NFS or CIFS gateway so that the server is still responsive. This gives administrators time to log into the server, determine the cause of the high load, and correct the problem. Without I/O throttling the server could potentially just lock up without giving any indication of the problem, forcing the admin to play detective when the system returns to normal. Like I said, the potential is enormous.