Jeff Layton talks to Valerie Aurora, file system developer and open source evangelist, about a wide range of subjects including her background in file systems, ChunkFS, the Union file system and how the developer ecosystem can chip in.
JL There was a recent article on LWN that talked about ChunkFS. For readers here, can you do a quick recap of ChunkFS? Also, has there been much effort to incorporate the ideas of ChunkFS into other file systems?
VA I noticed two things: (1) file systems always get corrupted somehow, and (2) the time it takes to check and repair an “average” file system keeps growing, due to the uneven evolution of hardware. Chunkfs is more of a useful technique than a complete file system architecture; the idea is that a file system should be divided up into many independent pieces which can be checked and repaired with very little reference to other parts of the file system. It turns fsck from a very long, painful process that requires the whole file system to be off-line for hours into something that can be done incrementally, on-line, and in the background. We prototyped it and concluded that (a) the idea works, and (b) the division of the file system needs to be part of the design of the file system from the beginning to be practical.
Not many file systems have been designed since chunkfs, primarily btrfs and Tux3 (and HAMMER, but I don’t think they noticed). I’m not sure how much chunkfs influenced other file systems, but I know that the back pointers in btrfs – e.g., each data block has a pointer back to the metadata that references it – were inspired by chunkfs.
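The back-pointer idea can be sketched in a few lines of Python (an illustration only, not chunkfs or btrfs code; the class and function names are invented for this sketch). Because every data block records which inode owns it, a checker can validate one chunk in isolation, without scanning the rest of the file system:

```python
# Sketch: per-block back pointers make a chunk self-checking.

class Block:
    def __init__(self, data, back_inode):
        self.data = data
        self.back_inode = back_inode  # back pointer: owning inode number

class Inode:
    def __init__(self, number, blocks):
        self.number = number
        self.blocks = blocks  # list of Block objects this inode references

def check_chunk(inodes):
    """Verify every block's back pointer agrees with the inode that
    claims it. No reference to any other chunk is needed, so this
    check can run incrementally while the rest stays mounted."""
    errors = []
    for inode in inodes:
        for blk in inode.blocks:
            if blk.back_inode != inode.number:
                errors.append((inode.number, blk.back_inode))
    return errors
```

The point of the sketch is the locality: cross-checking forward pointers against back pointers is a per-chunk operation, which is what makes incremental, on-line checking feasible.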
JL I’ve read your UnionFS articles on LWN and really enjoyed them. Can you talk about UnionFS concepts and how useful (or not) they are and in what situations they work well?
VA The concept of unioning the namespaces of two file systems has been around for more than a decade. Plan 9 used it, among other things, to eliminate the need for all the various *_PATH environment variables listing all the directories that might contain a desired binary, library, man page, etc. – all the possible directories were unioned together into one directory. The Plan 9 approach was limited and didn’t handle certain cases – it operated on single directories only, not whole file systems; there was no way to use different search orderings for different applications; duplicate entries weren’t eliminated; etc. – but it gave a feeling for the power of unioning namespaces.
What I find is that when you have a way of unioning namespaces, it very quickly becomes a hammer and many problems suddenly start to look like nails (even if they are actually screws, or wine glasses). The concept is extremely powerful and you could use it to replace very large swathes of UNIX kernel and user utilities (see the Plan 9 search path case above), so the question becomes not “Can I solve this problem with unions?” but instead “Are unions the best way to solve this problem, or is there a superior solution?” One case in which unions appear to be the best solution is in maintaining a writable overlay on top of a read-only base file system. There are two competing solutions to this problem: one is to run around bind-mounting writable files and directories as needed, or replacing them with symlinks to a read-write file system mounted elsewhere in the namespace (I’ve done it, it’s even worse than it sounds); the other is to use a copy-on-write block device. The COW block device is a reasonable solution for things like root file systems for many instances of virtual machines running on one host, but it breaks down with long-running clients or thin clients accessing their read-only base file system across a network.
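The writable-overlay semantics can be captured in a minimal sketch (invented class name, in-memory dicts standing in for file systems): reads fall through to the read-only lower layer, while writes land only in the upper layer, so the base image is never touched.

```python
class WritableOverlay:
    """Minimal model of a writable overlay on a read-only base:
    reads fall through to the lower layer; writes go to the upper
    layer only, leaving the base untouched."""

    def __init__(self, lower):
        self.lower = lower  # read-only base: dict of path -> bytes
        self.upper = {}     # writable layer, starts empty

    def read(self, path):
        if path in self.upper:       # upper layer shadows lower
            return self.upper[path]
        return self.lower[path]

    def write(self, path, data):
        self.upper[path] = data      # lower layer is never modified
```

This is exactly the arrangement that works well for many clients sharing one read-only base across a network: each client keeps only its own (usually small) upper layer.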
One case where I don’t think unions are the optimal solution is the desire to quickly assemble a file system image with a particular set of software and configuration files installed. It’s a pain to do this by hand; you have to run the installer, pick the set of packages you want, then hand-edit all your configuration files for your local network, users, etc. The unioning approach involves creating a set of building block file systems which can be unioned together to produce file system images containing different subsets of software. For example, you have a base server file system image, and then an overlay file system that includes all the files needed to run an SMTP server, and then another overlay file system that includes all the files needed for Apache, etc., and then you union together the set of file systems with the desired software and slap a read-write layer on top. Hurray, all shared now, right?
One obvious gotcha goes straight to the heart of why this is the wrong solution: The RPM/apt/whatever database file is going to be different for each set of installed software, and unioning is only going to give you the database file for the topmost layer, which will only contain the package information for the files it knew about when that layer was created. So if you install postfix on one layer, and Apache on another, and union them together, your package management system will only know about postfix or Apache, not both. You are trying to solve a package management and configuration problem with a file system solution – not going to work. What you should be using instead is something like Puppet, which not only automates installation and configuration of systems so you can quickly generate a specific disk image; it also gives you all sorts of other system management and software configuration goodies, like checking that
/etc/passwd hasn’t been modified by a rootkit, or that your new sysadmin trainee hasn’t reconfigured the mail server to accept all incoming connections.
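The package-database gotcha follows directly from whole-file union semantics, which a few lines of Python make concrete (a toy model with invented names; dicts stand in for layers). The topmost layer that contains a path wins, so lower layers’ copies of the same database file are simply invisible:

```python
def union_read(path, layers):
    """Whole-file union semantics: the topmost layer (first in the
    list) that contains `path` wins; lower copies are shadowed."""
    for layer in layers:  # layers ordered top to bottom
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

# Hypothetical overlay layers, each carrying its own package database.
postfix_layer = {"/var/lib/pkgdb": {"postfix"}}
apache_layer = {"/var/lib/pkgdb": {"apache"}}

# Union with the Apache layer on top: the postfix entry is lost,
# because the union returns whole files, it never merges contents.
merged_db = union_read("/var/lib/pkgdb", [apache_layer, postfix_layer])
```

No amount of layer ordering fixes this: merging the two databases requires understanding their contents, which is a package-management problem, not a file system one.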
If you look at the history of unioning file systems, you’ll see that people once thought it was a great way to implement source control. Just install a kernel tree, mount a read-write layer on it, and do all your development there. When you are done, diff it against the original. In the era of git, Mercurial, SVN, and all the many, many full-featured source control systems we have now, this seems utterly ridiculous. Back when all we had was CVS, SCCS, and RCS, unioning file systems seemed a tempting way to implement source control because the competition was so bad. I think that software and configuration management is at a similar place right now, transitioning from the old world of crappy tools that caused more headaches than they solved to useful new tools that are clearly superior to any file system-based hack.
JL Any comments on distributed and/or parallel file systems?
VA I have a well-rehearsed reply to almost any distributed file system question: You can’t optimize a distributed file system for every use case, so find a distributed file system that is optimized for something like your workload – and use it only for that workload. This advice applies to distributed file system developers too: Don’t try to make your file system good at everything, pick a workload and optimize it for that. So if you want to run an Oracle database on a shared-disk cluster, use OCFS2. If you want to run a MapReduce style workload, run Hadoop. If you want massively parallel IO to file data, use Lustre. If you want good performance for single-writer, multiple-reader, use NFS. Etc.
I think of this reply as a total cop-out – I’m just not that interested in distributed file systems – but upon reflection, it might qualify as deep wisdom. At least it gets me out of long boring distributed file system discussions quickly.
JL From your vantage point what do you see for the future of file systems for Linux beyond perhaps btrfs and ext4?
VA Wow! I hadn’t even thought about beyond the upcoming generation of file systems – I’m just glad there is one at all, since it wasn’t at all obvious there would be back in 2006. With regard to local file systems, I think btrfs is flexible enough to handle any projected hardware changes for the next decade, both in performance and capacity – in other words, SSDs and truly enormous quantities of storage. I also think we may see more flash-based devices exporting access to the hardware directly as the SSD market commoditizes, in which case LogFS and NILFS become even more interesting, perhaps as part of an open source wear-leveling layer that can integrate tightly with other Linux file systems. On the parallel, distributed file system front, I’m keeping an eye on Ceph. The architect, Sage Weil, really knows his stuff and spent several years getting the design right. On the single-writer/many-reader optimization side of distributed file systems, CRFS is super awesome if you can get over the requirement for btrfs as the local file system (and you should).
JL Lastly, what can the Linux community do to help maintain file system development and evolution? Also, any last parting comments?
VA I’d say, don’t get complacent. A particular file system will usually do well for a decade, okay for another decade, and degrade to unusability in the third decade. Ideally, you should start working on the next file system in the middle of the second decade. But if you look at it from the point of view of your population of developers, by that time you’ve lost most of your original file system developers, and the new developers don’t have much experience working on file systems because the current one just works. Management has also lost institutional memory and has a hard time believing that developers have to start working on a new file system, now, even though it won’t be ready for 5 years and the current one will keep working fine for another 10 years. One of the nice things about Linux is that we have so many file systems in active development that it’s easier to keep that institutional memory going, but nonetheless, Linux has been late to the file systems party several times now.
I suppose it’s more interesting to talk about how not to become complacent… Keep talking to other file system developers face to face, keep experimenting with new file systems, keep talking to people in research and academia, keep paying attention to hardware trends. The way to avoid the “file systems are a solved problem” echo chamber is to stay in touch with both each other and the outside world.
My parting comment is, wow, interviews are way more fun to write than articles. :) Thanks for the opportunity.