If you blinked you might have missed the announcement of the new 2.6.34 kernel. Things have been happening very quickly around file systems and storage in the recent kernels so it's probably a good idea to review the kernels from 2.6.30 to 2.6.34 and see what developments have transpired.
The 2.6.30 kernel was an amazing one for the Linux storage world. The 2.6.31 kernel eased off that pace but still had some interesting storage developments. The most noteworthy is preliminary support for an NFS v4.1 client. Recall that NFS has two parts, a server and a client; 2.6.31 added early client-side support for v4.1 to the kernel. The NFS v4.1 client is still under heavy development, but now that it is in the kernel it can get much more testing than before.
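If you want to experiment with the v4.1 client, the mount is the familiar NFS mount plus a minor-version option. A quick sketch (the server name and export path are hypothetical, and you need a v4.1-capable server on the other end):

```shell
# Mount an NFS v4.1 export; minorversion=1 selects the v4.1 protocol.
# server.example.com and /export are placeholders for your own setup.
mount -t nfs4 -o minorversion=1 server.example.com:/export /mnt/nfs
```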
Of course there were also improvements to many file systems in general. There was also the inclusion of osdblk, a new block capability that allows you to export a single SCSI OSD object as a Linux block device. Again, we’re starting to see object-based storage device support develop in the kernel.
The discussion of storage changes in the 2.6.31 kernel may be short, but the changes made were important. However, the excitement of 2.6.30 reappeared in the 2.6.32 kernel.
2.6.32 – Excitement Returns
While the 2.6.31 kernel made some steady progress on the storage aspects of the kernel, the 2.6.32 kernel added some new features important for storage.
Probably the biggest feature was a performance enhancement that improved IO performance by an appreciable amount. Jens Axboe, a well-known kernel developer, added a set of patches called “per-backing-device based writeback”. While this may sound like a mouthful, what it does is improve performance when large chunks of data need to be written to disk, particularly when there are multiple storage devices.
As an example, XFS performance increased by about 40% when streaming a write to a 32GB file across five SATA drives configured as an LVM stripe set. Btrfs performance for the same scenario increased by about 26%, and performance on SSDs also improved. (yeah – performance!)
There were also some significant updates to btrfs. Two noteworthy ones are: (1) changes to the snapshot capability so that it is much faster, and snapshots and subvolumes can now be renamed, and (2) performance improvements that reduce CPU usage for streaming writes. Btrfs also added the ability to issue the SSD TRIM command, even though general TRIM support wasn’t yet in the kernel.
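As a sketch of what the snapshot changes enable, here is what creating and renaming a snapshot looks like with the btrfs-progs user-space tools (this assumes later btrfs-progs command syntax – the very early tool was btrfsctl – and that /mnt is a mounted btrfs volume):

```shell
# Take a writable snapshot of the file system mounted at /mnt:
btrfs subvolume snapshot /mnt /mnt/snap-before-upgrade
# Snapshots appear as directories, so renaming one is just a mv:
mv /mnt/snap-before-upgrade /mnt/snap-20100601
```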
Perhaps the most important change in the 2.6.32 kernel was to the CFQ (Completely Fair Queuing) IO scheduler, the default IO scheduler in most distributions and kernels. Recall that IO schedulers are a very important part of the kernel, defining how and when data is actually sent to the storage devices. From the kernelnewbies summary of 2.6.32:
In this release, the CFQ IO scheduler (the one used by default) gets a new feature that greatly helps to reduce the impact that a writer can have on the system interactiveness. The end result is that the desktop experience should be less impacted by background IO activity, but it can cause noticeable performance issues, so people who only cares about throughput (ie, servers) can try to turn it off echoing 0 to /sys/class/block/<disk>/queue/iosched/low_latency. It’s worth mentioning that the “low_latency” setting defaults to on.
The important take-away from this quote is that changes have been made to the CFQ IO scheduler that favor desktops and laptops, and these changes are “on” by default. However, they can reduce IO throughput performance on servers. The author of CFQ, Jens Axboe, has a good discussion of the changes in an article on LWN. This change could have a significant impact on performance for some workloads, so be sure to test both with and without the new “low_latency” option to understand the impact on your workloads.
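For example, checking and turning off the option on a throughput-oriented server looks like this (run as root; /dev/sda is just an example device, and the sysfs path assumes CFQ is the active scheduler for it):

```shell
# Check the current low_latency setting (1 = on, the default):
cat /sys/block/sda/queue/iosched/low_latency
# Turn it off for throughput-oriented workloads:
echo 0 > /sys/block/sda/queue/iosched/low_latency
```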
Another significant update in the 2.6.32 kernel is a set of changes to the block layer that borrow some “NAPI”-like features. NAPI, or “New API”, has been used in the networking layer for some time. The basic concepts from NAPI were applied to the block layer by Jens Axboe (I swear he must not sleep). The big result of these patches is improved block-layer performance for some workloads. An article on LWN explains the patches in more detail.
A couple of other changes to the kernel that happened in 2.6.32 are around the md/dm devices. These are the devices within the kernel that deal with multiple devices for RAID or related behavior. There are really two significant developments:
The md/dm devices gained the capability of distributing RAID processing over multiple cores (very cool, since single-core systems are getting hard to find outside of Atom-class processors and the like).
Asynchronous raid6 operations were added.
Both developments are welcome to the kernel and appreciated by those who need them.
Finally, for those into off-the-beaten-path capabilities, the 2.6.32 kernel added fscache support for Plan9.
2.6.33 – Small But Important Changes
The 2.6.33 kernel added some underappreciated features and also removed one capability. This kernel doesn’t really “feature” big changes in storage, but there are some significant features that may have gone unnoticed.
The “feature” or capability that was removed from the kernel was the anticipatory IO scheduler. The anticipatory scheduler does what its name describes – after servicing a read it briefly waits, anticipating that a nearby request will follow, so it can service that request without an extra seek. Putting on your storage expert hat, one can see that the anticipatory scheduler works really well for certain workloads. For example, it has been observed that the Apache web server may achieve up to 71% more throughput using the anticipatory IO scheduler. On the other hand, the anticipatory scheduler has been observed to cause up to a 15% slowdown on a database run. So its performance is something of a mixed bag.
By version 2.6.33 of the kernel, the kernel developers felt that the CFQ (Completely Fair Queuing) scheduler had reached a point where it was as good as or better than the anticipatory IO scheduler on the workloads where the anticipatory scheduler did well. So they removed it. Goodbye anticipatory IO scheduler – we’ll miss you.
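If you had been selecting the anticipatory scheduler explicitly, you can check and change the scheduler per device at runtime. A quick sketch (as root, with /dev/sda as an example device):

```shell
# List the available schedulers; the active one is shown in brackets:
cat /sys/block/sda/queue/scheduler
# Switch the device to CFQ (or noop/deadline) on the fly:
echo cfq > /sys/block/sda/queue/scheduler
```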
However, a potential performance-enhancing feature was added to the 2.6.33 kernel: a Block IO Controller (BIC) that allows you to create proportional-weight divisions of disk time using the CFQ IO scheduler. While a mouthful, it gives you better control over how much IO bandwidth (performance) various groups of applications get from the CFQ scheduler. Getting practical use out of this feature will take some experimentation. A short but reasonable introduction to it is here and the commit for the patch is listed here.
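A minimal sketch of how the proportional weights are used, assuming the kernel was built with the blkio cgroup controller and CFQ group scheduling (the group names and weight values below are illustrative; run as root):

```shell
# Mount the blkio cgroup controller:
mount -t cgroup -o blkio none /cgroup/blkio
# Create two groups with a 10:1 disk-time weighting
# (blkio.weight accepts values from 100 to 1000):
mkdir /cgroup/blkio/batch /cgroup/blkio/interactive
echo 100  > /cgroup/blkio/batch/blkio.weight
echo 1000 > /cgroup/blkio/interactive/blkio.weight
# Place the current shell (and its children) in the interactive group:
echo $$ > /cgroup/blkio/interactive/tasks
```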
There was also a nice surprise in the 2.6.33 kernel in that DRBD (Distributed Replicated Block Device) was included in the kernel. This kernel capability allows two separate nodes to duplicate block storage over a network. So basically you have a “primary” node with a block device that is mirrored over a network to a block device on a “secondary” node. If you like, you can consider DRBD to be RAID-1 over the network. The intent of DRBD is to help HA (High Availability) clusters by synchronously mirroring a block device from one node to another. LWN has a nice article that discusses the details of DRBD.
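To give a feel for the setup, here is a minimal sketch of a DRBD resource definition in /etc/drbd.conf (DRBD 8.x-style syntax; the hostnames, disks, and addresses are hypothetical):

```
resource r0 {
  protocol C;                  # synchronous replication, the HA choice
  on alpha {                   # the "primary" node
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on beta {                    # the "secondary" node
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

Writes to /dev/drbd0 on the primary are then mirrored to the secondary over the network before completing.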
For many years the Linux kernel has had something that the developers refer to as the “Big Kernel Lock” (BKL). It was put into Linux to make multiprocessor systems possible. However it has proven to be a big stumbling block to the evolution of the kernel, particularly since even current desktop systems have so many cores. So there has been an effort for some time to remove the BKL from the kernel by replacing it with much finer grain locks allowing the kernel to scale better.
In 2.6.33, the reiserfs team rolled in some patches to remove the BKL from reiserfs. While this may not sound like a big deal, it actually was a remarkable update to a file system that is still in use in a number of places. This update doesn’t have any impact on performance or add any increased resiliency, but it is very important in the development of the kernel.
Perhaps one of the most important inclusions in the 2.6.33 kernel is that all of the md devices (personalities) support barrier requests. This is important because the kernel now has end-to-end support for barriers (from the file systems to LVM to md). This is a big deal and was highly anticipated by many people (myself included).
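In practice this means a journaling file system can keep its barriers all the way down through an md device. For example (ext4 enables barriers by default; the option is shown explicitly here, and /dev/md0 is a placeholder for your array):

```shell
# Mount ext4 on top of an md RAID device with write barriers enabled:
mount -t ext4 -o barrier=1 /dev/md0 /mnt/data
```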
As usual there were updates to file systems, and the 2.6.33 kernel was no exception. Many of these are slight performance improvements for particular workloads or fixes for corner cases. Btrfs was updated with a few changes, including one that made the metadata chunks smaller. This change allows you to use more of the hard drive space for storing data. (yeah – more stuff!)
Another file system update was to exofs. If you recall, this is an object based file system that recently made it into the kernel. In the 2.6.33 kernel it gained the capability of multi-device mirror support. So now you can mirror a particular device with another from within exofs. Keep your eye on exofs – it is progressing quite rapidly.
And finally, one big update that many people missed is that the 2.6.33 kernel added support for the SCSI WRITE SAME command. You may not be excited by this command until you realize that it is used to pass TRIM-style discard requests down to the drive (now you can get excited). TRIM is a great way for SSDs to maintain good write performance: when the file system deletes data, it tells the drive which pages no longer hold valid data, so the drive can erase them in the background rather than having to perform an expensive erase cycle in the middle of a later write. I think we can all cheer for TRIM support being added to the kernel because it helps sustain write performance on SSDs. (yeah – performance!)
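To take advantage of it, the file system has to actually issue discard requests. A sketch using ext4’s discard mount option (assuming both the kernel and the SSD support discard; /dev/sda1 is an example device):

```shell
# Issue TRIM/discard requests to the SSD as files are deleted:
mount -t ext4 -o discard /dev/sda1 /mnt/ssd
```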
2.6.34 – More File Systems
The 2.6.34 kernel returned to the days of 2.6.30 and added two new file systems to the fun. One of them, Ceph, was recently covered here. Ceph is a distributed parallel object-based file system that also uses replication to improve resiliency. Another key aspect of Ceph is that the metadata is distributed. That is, the metadata is spread throughout the file system so that the loss of a single node will not result in the loss of the file system.
One of the core assumptions in Ceph is that file systems are constantly changing. Additional storage is being added, nodes may die or go off-line, or you may want to take part of a file system off-line for maintenance. Ceph assumes that this happens all of the time so it was designed for distributed metadata and data while also being able to adapt to storage devices being added or removed.
In the 2.6.34 kernel the client portion of Ceph was added. It is still marked as experimental partially because it is still being developed and partially because it depends upon btrfs, which has not been marked stable yet. However, there is a great deal of work going on with Ceph, so keep an eye on it.
The second file system added to the 2.6.34 kernel is called LogFS. It is a log-structured file system, like NILFS2, and is designed for flash devices. It is primarily focused on the embedded world but can be used elsewhere.
As with every kernel there are updates to just about everything including file systems. In the 2.6.34 kernel there were quite a number of updates to btrfs that bear mentioning. The major changes that users will notice are:
Btrfs gained the ability to change which subvolume or snapshot is mounted by default. One use case for this update is snapshot-assisted distribution upgrades. The idea is that you can take a snapshot of your distribution, update it to something else (perhaps a beta version), and then revert the default root back to the old tree if you want to go back to the previous version (for example, if the beta version had problems).
The defragmentation code has added the ability to compress a single file on demand and/or defragment only a range of bytes in the file. This gives you finer grain control over the file system.
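A sketch of both updates using btrfs-progs command syntax from somewhat later tools (subvolume ID 256 and the file name are illustrative; /mnt is assumed to be the top of a btrfs volume):

```shell
# List subvolumes/snapshots to find the ID you want:
btrfs subvolume list /mnt
# Make subvolume 256 the default for future mounts:
btrfs subvolume set-default 256 /mnt
# Defragment only the first 128MB of a file, compressing it as it goes:
btrfs filesystem defragment -s 0 -l 128M -c /mnt/bigfile
```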
There are other updates to btrfs including some updates to the user-space tools and some updates to btrfs itself to get ready for future updates.
Once again there were updates to various file systems, with a notable one being an update to exofs. For the 2.6.34 kernel, exofs gained the ability to do RAID-0 (yeah – performance!). Exofs also added support for device groups.
Another notable update to a file system is to squashfs. Remember that squashfs can take a tree from a file system and create a compressed image to save space. This is great for data that isn’t changing (you would be surprised how much data is “cold” in user accounts). This update to squashfs adds lzma and lzo capability. This means that you can get even more data compression from squashfs than before (yeah – more capacity!).
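For example, building an LZO-compressed image with the squashfs-tools looks like this (the -comp option requires a squashfs-tools release that supports it; the paths are illustrative):

```shell
# Build a compressed, read-only image of a directory tree:
mksquashfs /home/archive archive.squashfs -comp lzo
# Mount it via loopback to browse the contents:
mount -o loop -t squashfs archive.squashfs /mnt/archive
```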
From the length of this article you can tell that the kernel has been very busy over the last year or so. There are new file systems, new features, better performance, more capacity, and more options than we’ve ever had before. With these changes comes the responsibility of understanding the new kernels and how they impact your systems and your workloads.
It’s a great time for storage in the Linux kernel.