Kernels 2.6.35 and 2.6.36: Storage Updates

Two kernel releases have gone by since we last checked in with the check-ins. While the storage-related changes are seemingly minimal, it's always good to review what changed; you might be surprised.

The 2.6.36 kernel was released on Oct. 20 of 2010 and contained some interesting changes.

The most interesting change, and it may have some consequences for Windows users, is that FS-Cache capability was added to CIFS (Common Internet File System). This allows the CIFS clients to cache I/O operations on the client, possibly improving performance.

As always there are other additions to the kernel that help Linux storage. Focusing on the file systems first, there were a number of changes to various key file systems. Let’s begin with squashfs (one of my personal favorites).

For a long time, the patches that let squashfs use lzo compression were maintained outside the kernel because lzo support was not available inside the kernel. In the 2.6.34 kernel, the capability to support lzo and lzma compression was added to squashfs. In 2.6.35, lzo compression itself was finally added to squashfs (not just a “capability”). The addition of lzo compression means that you get faster compression (in some cases) and even more data compression (in some cases), saving space, relative to the previous compression capabilities in squashfs.

Ext4 added some features to enhance debugging and manageability. In particular:

  • The ability for ext4 to store the mount options in the superblock was added to the kernel. This allows you to read the superblock to discover the mount options – useful for debugging and other scenarios.
  • A patch was added so that once a day, ext4 uses a printk function (used to print kernel messages to the console, dmesg, and the syslog daemon) to send file system error information to dmesg (again useful for debugging).
  • Another very useful debugging change saves the number of file system errors, along with the time, function name, line number, block number, and inode number of the first and most recent errors, into the superblock. This information can be very helpful for debugging problems.
  • The last major change made to ext4 in 2.6.36 was the addition of a discard request (TRIM) in the function ext4_free_blocks when ext4 has no journal and is mounted with the discard option. This change was made to benefit Google, which uses ext4 without journaling. This change can help free up space more rapidly within ext4 (great for situations where the file system is under pressure because it is getting closer to being full).
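The first-and-most-recent error bookkeeping from the list above can be pictured with a small model. What follows is a toy sketch in Python, not the kernel's C code and not the on-disk layout: the class names `ErrorRecord` and `ErrorLog` are invented for illustration, and the timestamps, line numbers, block numbers, and inode numbers are made-up values. Only the idea comes from the patch description: keep a running count, fill in the first error once, and overwrite the most recent error every time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorRecord:
    """One saved error (hypothetical layout, for illustration only)."""
    time: int        # seconds since the epoch
    function: str    # name of the function that reported the error
    line: int        # source line number
    block: int       # block number involved
    inode: int       # inode number involved


class ErrorLog:
    """Toy stand-in for the error fields kept in the superblock."""

    def __init__(self) -> None:
        self.count = 0
        self.first: Optional[ErrorRecord] = None
        self.last: Optional[ErrorRecord] = None

    def record(self, err: ErrorRecord) -> None:
        self.count += 1
        if self.first is None:   # the first error is stored once and kept
            self.first = err
        self.last = err          # the most recent error always overwrites


# Made-up sample values, just to exercise the model.
log = ErrorLog()
log.record(ErrorRecord(1287500000, "ext4_free_blocks", 100, 4096, 12))
log.record(ErrorRecord(1287500600, "ext4_lookup", 200, 8192, 34))
```

Keeping both ends of the history is what makes this useful: the first error often points at the root cause, while the most recent one shows whether the problem is still happening.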

As with the 2.6.35 kernel, we can see efforts within ext4 to add more debugging information to help capture as much information as possible. These types of patches can be associated with a maturing file system and one that is being more widely used.

The next file system that had some patches applied in 2.6.36 is nilfs. The changes aren’t too dramatic but they are still important to review.

  • An explicit option for mounting nilfs with write barriers was added. Write barriers were already the default before 2.6.36, but this change allows a file system that was mounted without them to be remounted with write barriers in case of problems.
  • A new patch was included allowing nilfs to be mounted without discards (TRIM) if desired.
  • Three new fields were added to the superblock to get ready for future disk format modifications:
    • compatible feature set
    • incompatible feature set
    • read-only-compatible feature set
  • The last new patch allows nilfs to check the feature flag to reject a file system with unknown features when mounting or remounting a file system.
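A mount-time feature check of this kind might look roughly like the following Python sketch. It assumes the convention that ext2-style file systems use for the same three sets: unknown compatible features are ignored, unknown incompatible features block every mount, and unknown read-only-compatible features block only writable mounts. Whether nilfs follows exactly these rules is an assumption here, and the bit values and the name `check_features` are invented for illustration.

```python
# Feature bits this hypothetical driver understands (values are made up).
SUPPORTED_RO_COMPAT = 0b0011
SUPPORTED_INCOMPAT = 0b0001


def check_features(compat: int, ro_compat: int, incompat: int,
                   read_only: bool) -> bool:
    """Return True if a mount with these feature sets may proceed."""
    # Unknown bits in `compat` are harmless by definition, so the
    # compatible set is never a reason to refuse the mount.

    # Unknown incompatible features: refuse every kind of mount.
    if incompat & ~SUPPORTED_INCOMPAT:
        return False

    # Unknown read-only-compatible features: refuse only writable mounts.
    if (ro_compat & ~SUPPORTED_RO_COMPAT) and not read_only:
        return False

    return True
```

Under these assumptions, a file system carrying an unknown read-only-compatible bit would still mount with `read_only=True` but be rejected for a read-write mount or remount.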

It looks like nilfs is getting ready for a possible future format change.

XFS also had some developments in 2.6.36 that should help users. Patches were added to speed up and simplify direct I/O completions (everyone likes speed, don’t they?). The second major xfs patch removed an obsolete mount option – osyncisosync.

Even the venerable ext3 wasn’t without a small change. In 2.6.36 ext3 was changed to use ordered mode as the default. “Ordered mode” means that all data is written directly to the main file system before the corresponding metadata is committed to the journal, whose blocks are logically grouped into transactions to decrease I/O. This patch was authored by Dave Chinner from Red Hat. His comments in the patch are worth reading for the explanation:


data=writeback mode is dangerous as it leads to higher data loss and stale data exposure when systems crash. It should not be the default, especially when all major distros ensure their ext3 filesystems default to ordered mode. Change the default mode to the safer data=ordered mode, because we should be caring far more about avoiding stale data exposure than performance.

The experimental file system, Ceph, also had some changes in the 2.6.36 kernel. In particular, it added support for advisory file locking (flock/fcntl). Remember that Ceph is a distributed file system including distributed metadata. These file locking patches ensure that all file locking is synchronous with the metadata server (MDS).
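For readers who have not used advisory locks, here is a minimal Python sketch of the flock call from the application side. It runs against an ordinary temporary file on Linux, not against Ceph, and the helper name `try_lock` is invented for this example. Two independent opens of the same file stand in for two clients.

```python
import fcntl
import os
import tempfile


def try_lock(fd: int) -> bool:
    """Try to take an exclusive advisory lock without blocking."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


# Two separate opens of one file play the role of two lock holders.
tmp = tempfile.NamedTemporaryFile(delete=False)
first = os.open(tmp.name, os.O_RDWR)
second = os.open(tmp.name, os.O_RDWR)

got_first = try_lock(first)         # the lock is free, so this succeeds
got_second = try_lock(second)       # held by the first opener, so this fails
fcntl.flock(first, fcntl.LOCK_UN)   # the first opener releases the lock
got_retry = try_lock(second)        # now the second opener can take it

os.close(first)
os.close(second)
tmp.close()
os.unlink(tmp.name)
```

The lock is advisory: nothing stops a program that never calls flock from reading or writing the file, so it only coordinates programs that agree to ask first.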

The other significant change made to Ceph was the addition of a “lazy” IO control (ioctl) that marks a file descriptor for lazy file consistency semantics. This in turn allows buffered reads and writes when multiple clients are accessing the same file (very important in HPC and other fields).

There were some important changes in the 2.6.36 kernel for the device mapper (dm) as well. In this kernel release the delay, linear, mpath (multipath), and stripe targets in the device mapper all got support for block discard (TRIM). These patches are all very important because they mean that the various targets in DM (typically used by LVM) will pass the TRIM command to underlying SSDs! The TRIM command can greatly help performance as well as wear leveling for SSDs. Thank you Mike Snitzer of Red Hat!
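One way to see whether a block device, including a device mapper device, advertises discard support is to look at its queue limits in sysfs. The Python sketch below reads the `discard_max_bytes` attribute and treats a nonzero value as "discard supported". The function name `supports_discard` and the device names are invented for illustration, and the demo runs against a fake sysfs tree in a temporary directory so that it does not depend on the hardware at hand. Whether a given kernel exposes this attribute for a given device is assumed here, not checked.

```python
import tempfile
from pathlib import Path


def supports_discard(device: str, sysfs_root: str = "/sys/block") -> bool:
    """Return True if the device advertises a nonzero discard limit."""
    attr = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(attr.read_text().strip()) > 0
    except (OSError, ValueError):
        # Missing attribute or unparsable value: treat as "no discard".
        return False


# Demo against a fake sysfs tree; device names and limits are made up.
with tempfile.TemporaryDirectory() as root:
    for name, limit in (("dm-0", "2147450880"), ("sda", "0")):
        queue = Path(root) / name / "queue"
        queue.mkdir(parents=True)
        (queue / "discard_max_bytes").write_text(limit + "\n")
    results = {name: supports_discard(name, root)
               for name in ("dm-0", "sda", "missing")}
```

On a real system the call would be `supports_discard("dm-0")` with the default root. A stacked device can only pass TRIM down if every layer beneath it supports discards, which is why per-target support in the device mapper matters.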

There was one additional change in the block layer of the kernel in 2.6.36. A number of people worked on a “secure discard” patch that is the same as “discard” (TRIM) but forces all copies of the discarded sectors to be erased. This prevents the case where copies of a given sector might hang around even if the original sector was erased. Again, this is important for the case of SSDs as well as data security.

Summary

The two kernels, 2.6.35 and 2.6.36, may not have had the most exciting changes for Linux storage but there are some important changes nonetheless, resulting in what I see as three general trends.

Overall we’re seeing recently added file systems starting to mature while also addressing performance problems that have appeared in some cases. You can primarily see this in ext4 but to some degree in btrfs as well (even though it’s still experimental). Btrfs is still missing many of the features that are ultimately to be added but perhaps the developers are solidifying the code at this point before adding new features.

A second trend is the addition of features and techniques to help debug problems. Debugging storage problems can be very, very difficult while being very, very important. In the kernel patches in 2.6.35 and 2.6.36 we can see developers adding new statistics and tracepoints in various file systems and other parts of the kernel to assist in tracking down problems. These are also very useful while debugging new code for file systems.

The third general trend is that we’re seeing support for the TRIM command (discards) being implemented in various parts of the kernel. In particular, the device mapper (dm), which is used in LVM, has gained TRIM (discard) support for various targets. Now we can begin to use LVM in conjunction with SSDs and be sure the important TRIM command is propagated to the SSD, improving performance and wear leveling.
