
Churning Butter(FS): An Interview with Chris Mason

The founder of btrfs talks about features, terabyte raid arrays and comparisons with ZFS.

Following up our introductory article on “Butter FS” (See Linux Don’t Need No Stinkin’ ZFS: BTRFS Intro & Benchmarks), Jeff Layton talked with Chris Mason, Director of Linux Kernel Engineering at Oracle and the founder and lead developer of Btrfs.

Jeff Layton What was the motivation behind the development of btrfs? Was it in response to Sun’s ZFS?

Chris Mason Starting Btrfs was less about ZFS than it was about making sure that Linux is able to keep up with the massive storage devices that are coming out in the next 10 years.

The main goal is flexible management of storage, and making sure that all of the normal administrative tasks can be done online.

JL What are the key features in btrfs that people are looking for?

CM One important part about Btrfs development is that we wanted to focus on features and not strictly performance. It is important that we perform well, but we wanted to make sure Btrfs had features that other Linux filesystems could not easily provide.

The biggest single feature is the copy on write snapshotting, which is actually the basis of most of the other advanced features in the FS. Btrfs snapshots are writable and can be snapshotted again.

Data and metadata checksums are also a key part of making sure we can administer the storage over time. We need to be able to detect when the disk is giving us the wrong data, and try to correct it by grabbing data from another mirror or sending a command down to the disk array to ask for another copy. This command doesn’t exist yet, but we do plan on adding it to the Linux software raid stack at least.

Managing multiple devices inside the filesystem is what gives Btrfs very flexible storage management. Devices can be mixed in size and speed, and over the long term Btrfs will do the right thing to optimize access. Raid levels can also be mixed, using different stripe sizes for data and metadata etc.

JL While you may not follow ZFS, I was wondering if you have any comparisons between btrfs and ZFS?

CM I haven’t yet done any benchmarking comparisons. While we do share some features with ZFS, the overall design of the two is very different. I haven’t spent any time in the ZFS code, but I hope to have the chance to do more design level comparisons with ZFS in the coming months.

JL One of the interesting features in btrfs is the way it handles RAID so you don’t have to necessarily use md. Can you talk about RAID and btrfs?

CM Currently, Btrfs has the ability to RAID both the metadata and the data. Right now it’s limited to RAID-0, RAID-1, and RAID-10. At this time, to get other RAID levels you need to use the MD layer that is in Linux, but RAID-5 and RAID-6 are in Btrfs’s future.

Btrfs on top of MD works just like it does on any other storage device. By default it will mirror the metadata as though it were on a single spindle, and Btrfs will maintain only a single copy of the data. Since Btrfs doesn’t yet provide RAID5 or RAID6, people may want to test with MD to get those features.
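As a rough sketch of that kind of test setup, assuming three spare drives at placeholder names /dev/sdb through /dev/sdd, one could build a RAID-5 array with MD and then format it with Btrfs:

mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
mkfs.btrfs /dev/md0
mount /dev/md0 /mnt/btrfs

Btrfs then sees /dev/md0 as a single spindle, so the RAID-5 protection comes entirely from the MD layer underneath it.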

The main difference between Btrfs raid and MD raid is that Btrfs can detect incorrect data returned by the device with checksums. Even if we don’t get an IO error, Btrfs will know if a block is correct.

During a RAID rebuild, Btrfs is able to rebuild only the blocks that are actually used by the filesystem. This allows much better rebuild performance. (Jeff’s Note – this is a huge development that people should not underestimate. With disks getting larger, the possibility of hitting a bad block increases dramatically. Decreasing rebuild time decreases the risk of hitting another bad block during the rebuild, which would force a restore from backup.)

From a management point of view, Btrfs allocates space from the drives in large chunks, and then uses those chunks to build specific RAID levels. This means that your RAID level isn’t tied to the number of devices used by the filesystem, and it makes it much easier to add, remove or restripe storage over time. (Jeff’s Note – again this is a huge advantage for Btrfs over other file systems. The ability to easily add or remove devices is covered in the Btrfs wiki.)
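The wiki examples boil down to a few commands. As a sketch using current btrfs-progs command names (the exact tool names have shifted between releases, so treat these as illustrative), adding a drive and restriping onto it might look like the following, with /mnt/btrfs and the device names as placeholders:

btrfs device add /dev/sde /mnt/btrfs     # grow the pool while the filesystem stays mounted
btrfs filesystem balance /mnt/btrfs      # restripe existing chunks across all the drives
btrfs device delete /dev/sdb /mnt/btrfs  # migrate chunks off a drive and remove it

Because allocation happens in chunks, each of these operations can run online while the filesystem is in use.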

JL With the current version, is it possible to do something like the following?

mkfs.btrfs /dev/sda /dev/sdb /dev/sdc /dev/sdd -R raid10

CM Today you can do this:
mkfs.btrfs -m raid1 -d raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd

And you’ll get metadata on raid1 and data on raid10. The raid10 will use all four drives and the raid1 will use two drives at a time. Yes, btrfs allows you to pick different values for data or metadata.

The idea is to have the raid5 and raid6 support work the same way.
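One practical note when experimenting with a multi-device filesystem like the one above: the kernel needs to discover all of the member devices before the mount will succeed. A sketch with current btrfs-progs (older releases used btrfsctl -a for the scan step):

btrfs device scan
mount /dev/sda /mnt/btrfs

Any member device can be named in the mount command once the scan has registered the full set.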

JL Are you thinking about “coupling” btrfs and md directly so you can do some easy builds without having to use md first and then build the file system using the md device?

CM Christoph Hellwig is working on abstracting out parts of MD, but he’s starting with integrating that into the DM modules. Long term I want btrfs to use as much of the existing MD code as possible, but the place it makes the most sense is for the xor calculations.

JL What’s the largest btrfs configuration that you know has been tested to date?

CM HP has tested on a 40TB RAID array; I think that is the biggest so far. IBM has a large test rig as well, but they have limited the size of the FS during testing so that all the Linux filesystems can easily be compared.

JL Since Btrfs is now in the kernel, people can test it. What kind of testing would you like to see and how best should people report problems or results?

CM I’d love to see Btrfs tested in any configuration people are actively using. Right now we do have some problems in database workloads, but we hope to fix this starting with 2.6.31.

JL What recommendations for tuning Btrfs for performance would you make, recognizing that it’s still in the experimental stage?

CM Btrfs uses the crc32c algorithm for checksumming data because some CPUs can compute it directly in hardware. The kernel already supports this on Nehalem CPUs, as long as the Intel crc32c module (crc32c-intel) is compiled in. It makes a big difference.
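A quick way to confirm that the hardware-assisted checksumming is actually in play on a given machine (the module and file names below are the usual ones, but kernel builds differ):

modprobe crc32c-intel
grep -A3 'crc32c' /proc/crypto

If the Intel driver registered, /proc/crypto will list a crc32c implementation provided by crc32c-intel with a higher priority than the generic software version.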

For anyone testing on an SSD, please use “mount -o ssd”. This makes writing much faster, especially on SSDs with a higher random write penalty.
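For instance, with a hypothetical SSD partition at /dev/sdb1:

mount -o ssd /dev/sdb1 /mnt/btrfs

The same option can go in the options column of /etc/fstab for a persistent mount.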
