
Churning Butter(FS): An Interview with Chris Mason

The founder of btrfs talks about features, terabyte raid arrays and comparisons with ZFS.

Following up our introductory article on “Butter FS” (See Linux Don’t Need No Stinkin’ ZFS: BTRFS Intro & Benchmarks), Jeff Layton talked with Chris Mason, Director of Linux Kernel Engineering at Oracle and the founder and lead developer of Btrfs.

Jeff Layton What was the motivation behind the development of btrfs? Was it in response to Sun’s ZFS?

Chris Mason Starting Btrfs was less about ZFS than it was about making sure that Linux is able to keep up with the massive storage devices that are coming out in the next 10 years.

The main goal is flexible management of storage, and making sure that all of the normal administrative tasks can be done online.

JL What are the key features in btrfs that people are looking for?

CM One important part about Btrfs development is that we wanted to focus on features and not strictly performance. It is important that we perform well, but we wanted to make sure Btrfs had features that other Linux filesystems could not easily provide.

The biggest single feature is the copy on write snapshotting, which is actually the basis of most of the other advanced features in the FS. Btrfs snapshots are writable and can be snapshotted again.
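As a rough illustration of writable copy-on-write snapshots, here is a minimal Python sketch. It is a toy model, not Btrfs code: the `Volume` class and its flat block map are assumptions made for the example, whereas the real filesystem shares tree structures on disk.

```python
# Toy model only: the Volume class and its flat block map are hypothetical.
class Volume:
    def __init__(self, blocks=None):
        # Maps block number -> immutable bytes object. A snapshot shares the
        # same bytes objects with its parent until one side overwrites a block.
        self.blocks = dict(blocks or {})

    def snapshot(self):
        # Copies only the block map (references), never the block contents.
        return Volume(self.blocks)

    def write(self, block_no, data):
        # Shared data is never modified in place; this volume's map is simply
        # pointed at a new bytes object.
        self.blocks[block_no] = bytes(data)

    def read(self, block_no):
        return self.blocks[block_no]


base = Volume()
base.write(0, b"original")

snap = base.snapshot()      # writable snapshot of base
snap.write(0, b"changed")   # diverges from base without touching it

snap2 = snap.snapshot()     # a snapshot of a snapshot
snap2.write(1, b"extra")
```

Taking a snapshot costs one copy of the map, and each volume only pays for the blocks it later changes.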

Data and metadata checksums are also a key part of making sure we can administer the storage over time. We need to be able to detect when the disk is giving us the wrong data, and try to correct it by grabbing data from another mirror or sending a command down to the disk array to ask for another copy. This command doesn’t exist yet, but we do plan on adding it to the Linux software raid stack at least.

Managing multiple devices inside the filesystem is what gives Btrfs very flexible storage management. Devices can be mixed in size and speed, and over the long term Btrfs will do the right thing to optimize access. Raid levels can also be mixed, using different stripe sizes for data and metadata etc.

JL While you may not follow ZFS, I was wondering whether you have any comparisons between btrfs and ZFS?

CM I haven’t yet done any benchmarking comparisons. While we do share some features with ZFS, the overall design of the two is very different. I haven’t spent any time in the ZFS code, but I hope to have the chance to do more design level comparisons with ZFS in the coming months.

JL One of the interesting features in btrfs is the way it handles RAID so you don’t have to necessarily use md. Can you talk about RAID and btrfs?

CM Currently, Btrfs has the ability to RAID the metadata and the data itself. Right now it’s limited to RAID-0, RAID-1, and RAID-10. At this time, to get other RAID levels you need to use the MD layer in Linux. But RAID-5 and RAID-6 are in Btrfs’s future.

Btrfs on top of MD would work just like any other storage device. By default it will mirror metadata as though it were on a single spindle, and Btrfs would only maintain a single copy of the data. Since Btrfs doesn’t yet provide RAID5 or RAID6, people may want to test with MD to get those features.

The main difference between Btrfs raid and MD raid is that Btrfs can detect incorrect data returned by the device with checksums. Even if we don’t get an IO error, Btrfs will know if a block is correct.
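The idea of catching bad data without an I/O error can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Btrfs read path: `read_verified` and the dict-based mirrors are hypothetical, and `zlib.crc32` (plain CRC-32) stands in for the crc32c checksum Btrfs uses.

```python
import zlib


def checksum(data: bytes) -> int:
    # Stand-in only: zlib.crc32 is plain CRC-32, not the crc32c used by Btrfs.
    return zlib.crc32(data)


def read_verified(mirrors, block_no, expected_sum):
    """Return (data, mirror_index) for the first copy whose checksum matches.

    mirrors is a list of dicts (block number -> bytes), one dict per mirror.
    """
    for index, mirror in enumerate(mirrors):
        data = mirror[block_no]
        if checksum(data) == expected_sum:
            return data, index
    raise IOError(f"block {block_no}: no mirror holds a valid copy")


good = b"important data"
stored_sum = checksum(good)              # recorded when the block was written

mirror_a = {7: b"important dat\x00"}     # silent corruption, no I/O error raised
mirror_b = {7: good}

data, source = read_verified([mirror_a, mirror_b], 7, stored_sum)
```

Here the first mirror returns data without complaint, but the checksum mismatch sends the read to the second mirror.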

During a RAID rebuild, Btrfs is able to only rebuild blocks that are used by the filesystem. This allows much better rebuild performance. (Jeff’s Note – this is a huge development that people should not underestimate. With disks getting larger the possibility of getting a bad block increases dramatically. Decreasing rebuild time decreases the risk of hitting another bad block during the rebuild which means you have to restore from a backup.)
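The saving is simple arithmetic, shown here as a small Python sketch. The device size and the 25% fill level are made-up numbers, and `blocks_to_copy` is a hypothetical helper, not part of any real tool.

```python
def blocks_to_copy(total_blocks, used_blocks, filesystem_aware):
    # A block-level RAID layer has no allocation map, so it copies every block.
    # A filesystem-aware rebuild can restrict itself to allocated blocks.
    return len(used_blocks) if filesystem_aware else total_blocks


total = 1_000_000               # blocks on the replacement device (assumed)
used = set(range(250_000))      # blocks the filesystem has allocated (assumed)

block_level = blocks_to_copy(total, used, filesystem_aware=False)
fs_aware = blocks_to_copy(total, used, filesystem_aware=True)
```

With these assumed numbers, the filesystem-aware rebuild copies a quarter of what the block-level rebuild does.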

From a management point of view, Btrfs allocates space from the drives in large chunks, and then uses those chunks to build specific RAID levels. This means that your RAID level isn’t tied to the number of devices used by the filesystem, and it makes it much easier to add, remove or restripe storage over time. (Jeff’s Note – again this is a huge advantage for Btrfs over other file systems. The ability to easily add or remove devices is covered in the Btrfs wiki.)
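A small Python sketch may help show why allocating in chunks frees the RAID level from the device count. Everything here is an assumption for illustration: the 1 GiB chunk size, the `allocate_raid1_chunk` helper, and the "most free space first" policy are not taken from the interview.

```python
CHUNK = 1  # chunk size in GiB; an assumed value for this example


def allocate_raid1_chunk(free):
    """Reserve one mirrored chunk on the two devices with the most free space.

    free maps device name -> free GiB and is updated in place.
    Returns the pair of devices used, or None if fewer than two have room.
    """
    candidates = sorted((dev for dev in free if free[dev] >= CHUNK),
                        key=lambda dev: (-free[dev], dev))
    if len(candidates) < 2:
        return None
    pair = tuple(candidates[:2])
    for dev in pair:
        free[dev] -= CHUNK
    return pair


# Three devices of mixed size: mirroring still works, two devices per chunk.
free = {"sda": 4, "sdb": 2, "sdc": 2}
chunks = []
while (pair := allocate_raid1_chunk(free)) is not None:
    chunks.append(pair)
```

Because each chunk picks its own pair of devices, an odd number of unequal drives can still be mirrored, and adding a drive just enlarges the pool that later chunks draw from.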

JL With the current version, is it possible to do something like the following?

mkfs.btrfs /dev/sda /dev/sdb /dev/sdc /dev/sdd -R raid10

CM Today you can do this:

mkfs.btrfs -m raid1 -d raid10 /dev/sda /dev/sdb /dev/sdc /dev/sdd

And you’ll get metadata on raid1 and data on raid10. The raid10 will use all four drives and the raid1 will use two drives at a time. Yes, btrfs allows you to pick different values for data or metadata.

The idea is to have the raid5 and raid6 support work the same way.

JL Are you thinking about “coupling” btrfs and md directly so you can do some easy builds without having to use md first and then build the file system using the md device?

CM Christoph Hellwig is working on abstracting out parts of MD, but he’s starting with integrating that into the DM modules. Long term I want btrfs to use as much of the existing MD code as possible, but the place it makes the most sense is for the xor calculations.

JL What’s the largest btrfs configuration that you know has been tested to date?

CM HP has tested on a 40TB raid array; I think that is the biggest so far. IBM has a large test rig as well, but they have limited the size of the FS during testing so that all the Linux filesystems can easily be compared.

JL Since Btrfs is now in the kernel, people can test it. What kind of testing would you like to see and how best should people report problems or results?

CM I’d love to see Btrfs tested in any configuration people are actively using. Right now we do have some problems in database workloads, but we hope to fix this starting with 2.6.31.

JL What recommendations for tuning Btrfs for performance would you make recognizing that it’s still in the experimental stage?

CM Btrfs uses the crc32c algorithm for checksumming data because some CPUs can compute it directly in hardware. The kernel already supports this on the Nehalem CPUs, as long as the intel crc module is compiled in. It makes a big difference.
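For readers who want to see what crc32c computes, here is a bit-at-a-time reference version in Python. It implements the standard CRC-32C (Castagnoli) definition and is far slower than the hardware instruction mentioned above. How Btrfs seeds the checksum and stores it on disk is not covered by this sketch.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), using the reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF                      # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial when that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF               # standard final inversion


check = crc32c(b"123456789")              # the conventional CRC test input
```

The standard check value for that test input is 0xE3069283, which is a handy way to confirm any implementation, hardware or software, agrees with the definition.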

For anyone testing on an SSD, please use “mount -o ssd”. This makes writing much faster, especially on SSDs with a higher random write penalty.

Comments on "Churning Butter(FS): An Interview with Chris Mason"

ttsiodras

I appreciate the efforts involved in creating BTRFS – Linux *needs* a copy-on-write FS.

In the company I work for, we are already using ZFS (via OpenSolaris) to create practically unlimited daily backups of our virtual machines (huge VMWARE server .vmdk files that only differ daily in less than 1% of their data – only a copy-on-write fs could handle this well). The only problem I see with BTRFS is that it will take quite some time before we will be able to trust it as much as ZFS… Filesystems need a lot of time to iron out obscure race-conditions and rare usage patterns… I hope BTRFS will catch up quickly… and I feel good about Oracle owning both of them – no danger of patent wars on BTRFS!

dog

How long did you wait to trust ZFS? It’s not that old. And Oracle owns btrfs?

neondiet

Is it possible with a raid10 filesystem to control which devices contain which halves of the mirror? I’ll give you an example of why I’m asking.

I’ve previously built raid10 volumes on HP-UX. HP’s implementation of LVM includes a feature called Physical Volume Groups. Disks can be included in an LVM VG and then bunched into PVGs so that when creating a mirrored LV the mirror is split across PVGs. On systems with dual raid cards this has allowed me to put all the disks attached to one raid card in one PVG and disks attached to the other in a second PVG. The result is that I/O to a mirrored LV gets split evenly between both raid cards, doubling up the available bandwidth. In addition, it protects my LVs from a complete single raid controller failure.

To achieve the same result on Linux today I must first use md to mirror devices across raid cards before adding the md devices to an LVM VG. Then I create LVs as normal and leave the mirroring down to md to sort out. It’s a perfectly workable solution, but given that the ultimate goal of btrfs is to make md redundant, will we be able to achieve the same result in btrfs somehow? The wiki (linked in the article) doesn’t hint at this. Thanks.

bugmenot

ZFS has been in Solaris for almost 3 years now, and it was in OpenSolaris before that.

liotier

“Devices can be mixed in size and speed, and over the long term Btrfs will do the right thing to optimize access”

You mean that the user can throw in a motley mix of whatever he has, including devices of wildly different performance profiles such as hard disks and SSDs, and that Btrfs will allocate data to the right device according to file size, block size and whatever other parameters may be relevant to it? Would that be a sort of integrated hierarchical file system with data moving according to usage patterns? Or simpler heuristics such as storing small files on low latency / low throughput devices such as SSDs and large files on high latency / high throughput devices such as hard disks? I find the “do the right thing” quote intriguing.

brmiller0423

I have been trying to educate myself about RAID but have never actually set up a RAID configuration.

When you speak of RAID 10, are you referring to “traditional” RAID 1+0, or to Linux MD RAID 10? Wikipedia offers a useful explanation: . The kernel.org wiki which you refer to includes the sentence that “Raid10 requires at least 4 devices.” which implies that you mean RAID 1+0.

Would it not be more beneficial to the community to implement Linux MD RAID 10 before implementing RAID 5 and 6?

laytonjb

From what I’ve been reading, ZFS development began roughly 6-8 years ago. As ttsiodras pointed out, it takes a long time for people to trust data to new file systems. This is also true for ZFS. It’s taken 6-8 years to get people to trust ZFS enough to start using it.

I think the same will be true for btrfs. It’s only been in development 1-2 years so it will take some more time before it becomes accepted for critical data. I’m hopeful, however, because not only do we have Oracle behind it but also a great deal of the Linux community including many of the “heavy hitters”. I may even go out on a limb and say that within 2 years btrfs will become much more accepted on Linux systems (but I will say that my track record on bets such as these isn’t the best).

But overall it’s not a race.

laytonjb

I’m not sure, to be honest. I think you can do this by using LVM first to create the two PVGs and then building btrfs on top of that.

This might be a good question for the btrfs mailing list. In fact it’s early enough that you could influence features. :)

