Linux 2.6.37: Scalability Improvements Abound

While 2.6.37 might be considered a quiet release, there are some very nice scalability improvements for file systems and one cool new feature that warrant a review.

This year’s holiday kernel was 2.6.37, which was actually released on 4 January 2011 (so perhaps it’s a New Year’s kernel). At first glance it looks like a quiet release: there were no flaming articles on the web and no seriously flawed benchmarks being posted. However, some great things happened in 2.6.37 around file systems, and there is one really cool feature that I’ll talk about at the end of the article.

Improvements for ext4

Ext4 is the proverbial little engine that could. The file system has proven to have remarkably good performance and it solves most of the issues with ext3. However, it is still effectively limited to 16 TB because the user space tools have not been updated yet (a good project if anyone is interested). In 2.6.37, several really cool features were added to ext4, primarily around scalability.

Systems are getting more cores faster than we may realize. A four-socket AMD system with 12 cores per socket, for a total of 48 cores in a single system, is a fairly affordable server. In the 2.6.37 kernel, scalability improvement patches were added to ext4. In particular, ext4 now uses the “bio” layer directly rather than going through the intermediate “buffer” layer, because the buffer layer has a number of performance and SMP scalability issues. The bio layer (bio = Block I/O) is the part of the kernel that sends requests to the I/O scheduler, so submitting I/O to it directly improves both performance and scalability.

One example of the scalability improvement: an ffsb benchmark on a 48-core AMD system using a 24-disk hardware RAID array, running 192 simultaneous ffsb threads, showed a 300% performance improvement (400% with journaling disabled) compared to the kernel before this patch was applied. Moreover, CPU usage in the benchmark was reduced by a factor of 3-4.

In addition to the scalability patches for ext4, 2.6.37 added a couple of cool new features. The first one is that mke2fs, the command that creates an ext-based file system, can now leave the inode table uninitialized. This means that creating an ext4 file system can now happen very quickly, whereas before the whole inode table had to be written out first, which took some time. However, the inode table still has to be initialized as quickly as possible for the file system to be useful, so on the first mount of the file system the kernel starts a kernel thread that initializes the table.

The second patch added batched discard support to ext4. This may sound uneventful, but it is a big improvement in one area: SSDs. Recall that the TRIM command on SSDs can result in much better overall performance because unused blocks are marked for erasing, which is then done only when needed. In Linux, the basic concept of TRIM is called discard. This patch adds the ability to do batched discards (multiple blocks at once), allowing the entire file system to be “trimmed” if needed. Ext4 is the first file system in Linux to support batched discards.
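To make the batched discard idea a bit more concrete, here is a minimal sketch of how a user space program could ask the kernel to trim a mounted file system. It assumes the request goes through the FITRIM ioctl from <linux/fs.h>, which takes a struct fstrim_range (start, len, minlen), and it assumes the ioctl number encoding used on x86 and ARM; other architectures encode ioctl numbers differently. The function names are my own, the call needs root, and it only works on a file system and device that support discard.

```python
import fcntl
import os
import struct

# struct fstrim_range from <linux/fs.h>: three unsigned 64-bit fields
# (start, len, minlen), all in bytes.
FSTRIM_RANGE = struct.Struct("=QQQ")

# FITRIM is _IOWR('X', 121, struct fstrim_range).
# Assumption: the x86/ARM ioctl encoding (direction in bits 30-31,
# argument size in bits 16-29, type in bits 8-15, number in bits 0-7).
_IOC_READ_WRITE = 3
FITRIM = (_IOC_READ_WRITE << 30) | (FSTRIM_RANGE.size << 16) | (ord("X") << 8) | 121


def build_trim_request(start=0, length=2**64 - 1, minlen=0):
    """Pack a trim request; the defaults cover the whole file system."""
    return FSTRIM_RANGE.pack(start, length, minlen)


def trim_filesystem(mountpoint, start=0, length=2**64 - 1, minlen=0):
    """Ask the kernel to discard free space in one batched request.

    Returns the value the kernel leaves in the len field, which is
    expected to be the number of bytes discarded.
    """
    buf = bytearray(build_trim_request(start, length, minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FITRIM, buf)  # buf is updated in place
    finally:
        os.close(fd)
    _, trimmed, _ = FSTRIM_RANGE.unpack(buf)
    return trimmed
```

Calling trim_filesystem("/mnt/ssd") as root (the mount point is hypothetical) would issue one request covering all the free space, which is the "batched" part: one call for the whole file system rather than a discard for every freed block.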

In addition to these two major new features in ext4, there was a somewhat minor one that is useful nonetheless. In 2.6.37 a patch was added that allows ext4 to list or “advertise” the features of the particular version of ext4 in sysfs. More specifically, there is a “features” directory in /sys/fs/ext4 that advertises what features are available in ext4 in the running kernel. That can be very useful for people wanting to know, or needing to know, exactly what features are available in the particular version of ext4.
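As a quick illustration, a script can check what the running kernel advertises by listing that directory. This is a minimal sketch that assumes each feature shows up as one entry under /sys/fs/ext4/features; the function name is mine, not something from the kernel.

```python
import os


def list_ext4_features(features_dir="/sys/fs/ext4/features"):
    """Return the sorted names of the ext4 features the kernel advertises.

    An empty list means the directory could not be read, for example
    because ext4 is not loaded or the kernel predates this patch.
    """
    try:
        return sorted(os.listdir(features_dir))
    except OSError:
        return []


if __name__ == "__main__":
    for name in list_ext4_features():
        print(name)
```

A deployment script could use the returned list to decide whether it is safe to rely on a particular ext4 feature before creating or mounting a file system.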

Scalability improvements in xfs

Xfs is one of the highest performing file systems in Linux for certain workloads. It is very popular in the HPC (High Performance Computing) crowd because of the excellent file performance, particularly for larger files. However, it has the reputation of not having very good metadata performance. It is still under heavy development and many of the recent patches have been targeting metadata performance.

In the 2.6.37 kernel release, xfs gained some scalability improvements, in particular for metadata workloads. For example, on an 8-way system running the fs_mark benchmark with 50 million files, performance improved by over 15%, and the performance of removing those files improved by over 100%.

Of course, other improvements and features were added to xfs in 2.6.37. In a previous article I mentioned a new logging option (delayed logging), added in the 2.6.35 kernel, that can reduce the I/O bandwidth the log needs by several orders of magnitude. This can greatly improve metadata performance for really heavy metadata workloads. In 2.6.37, a patch removed the “experimental” label from delayed logging, making it production ready.

Other improvements/changes added to xfs in 2.6.37 are:

  • Support for 32-bit project ids was added to project quotas
  • XFS_IOC_ZERO_RANGE was introduced, an ioctl that quickly zeroes a range of a file without changing the layout of the file in any way
  • The buffer cache hash was converted to use rbtrees. This was done because the buffer cache hash was showing scalability problems. Switching to rbtrees should greatly improve performance and scalability, particularly for systems doing a great deal of I/O.

Btrfs improvements

Everyone’s favorite file-system-in-development, btrfs, had some interesting patches added in 2.6.37. Overall, if you watch the btrfs mailing list, you will see lots of active testing of btrfs. This has resulted in a number of good patches even if they aren’t adding significant new “features”. Several of the patches can be considered “major” while there are also some very good “minor” patches as well.

Probably the most significant feature added to btrfs is caching the free space information on disk. It sounds kind of confusing, so let me explain. Before this patch, if btrfs had to allocate from a block group that was not previously cached, it had to scan the entire extent tree, which took a great deal of time and resources. After this patch, every time a transaction that dirtied a block group is committed, the free space for that block group is dumped to the on-disk free space cache. Finding free space in a block group then becomes a simple lookup, greatly improving performance in this situation.
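If that is still a bit abstract, here is a toy model of the idea in Python. It is only a sketch of the concept, not how btrfs implements it: the extent tree is reduced to a list of (start, length) tuples, the on-disk cache is a plain dictionary, and every name in it is made up for illustration.

```python
def scan_extent_tree(extents, bg_start, bg_len):
    """Slow path: derive the free ranges of one block group by walking
    every allocated extent in the (toy) extent tree."""
    free = []
    cursor = bg_start
    end = bg_start + bg_len
    for start, length in sorted(extents):
        if start + length <= bg_start or start >= end:
            continue  # this extent lies outside the block group
        if start > cursor:
            free.append((cursor, start - cursor))  # gap before the extent
        cursor = max(cursor, start + length)
    if cursor < end:
        free.append((cursor, end - cursor))  # tail of the block group
    return free


class ToyFreeSpaceCache:
    """Stands in for the on-disk free space cache."""

    def __init__(self):
        self.on_disk = {}  # block group start -> list of free ranges
        self.scans = 0     # how often the slow path was needed

    def load(self, extents, bg_start, bg_len):
        """Return the free ranges, using the cache when it has an entry."""
        if bg_start in self.on_disk:
            return list(self.on_disk[bg_start])  # fast path: simple lookup
        self.scans += 1
        return scan_extent_tree(extents, bg_start, bg_len)

    def commit(self, bg_start, free_ranges):
        """At transaction commit, dump a dirtied block group's free space."""
        self.on_disk[bg_start] = list(free_ranges)
```

In this model the first load() of a block group pays for a full scan, but once commit() has stored its free ranges, later loads skip the scan entirely. That is the trade the patch makes: a little extra writing at commit time in exchange for not walking the extent tree later.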

This patch results in a disk format change for btrfs. Recall that btrfs is still in development, so don’t be surprised by disk format changes. However, you can still mount existing btrfs file systems without using this option. In fact, currently you have to enable it explicitly using the “-o space_cache” mount option.

Another major feature that was added to btrfs in 2.6.37 was asynchronous snapshot creation. The benefit of this feature is that you don’t have to wait for a new snapshot to be committed to the disk. You can use this feature by adding “async” to the “btrfs subvolume snapshot” command.

Believe it or not, the asynchronous snapshot creation capability was added primarily with Ceph in mind. Remember that Ceph was added a few kernel versions ago and is a distributed parallel file system that is still under heavy development. Ceph uses btrfs as the underlying file system (Ceph can arguably be called a meta file system since it is a file system on top of a file system). There is more on Ceph itself later in this article.

A somewhat minor feature that was added to btrfs in the 2.6.37 kernel is the ability for unprivileged users to delete sub-volumes. However, a user can only delete a sub-volume if they have “write” and “execute” permission on the sub-volume root inode. Otherwise they don’t have permission to delete it. This behavior is enabled by mounting btrfs with the “-o user_subvol_rm_allowed” option.
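A script can approximate that permission rule from user space before it attempts a delete. The sketch below only mirrors the rule as described above: it asks whether the calling user has write and execute permission on the sub-volume root directory. It does not check whether the file system was mounted with the option, and the function name is hypothetical.

```python
import os


def may_delete_subvolume(subvol_root):
    """Return True if the caller has both write and execute permission
    on the sub-volume root, which is the condition described above."""
    return os.access(subvol_root, os.W_OK | os.X_OK)
```

The kernel still makes the final decision when the delete is actually attempted, so treat this as a convenience check rather than a guarantee.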

An additional minor feature was added that switched from extent buffer rbtrees to a radix tree. This switch should reduce CPU time spent in the extent buffer search and improve performance for some operations (see the commit link for more details).

The last btrfs feature that I want to mention concerns chunk allocation tuning. This particular patch allows data and metadata block groups to be mixed. According to the Kernel Newbies article on 2.6.37, this should be useful for small storage devices.
