I Like My File Systems Chunky: UnionFS and ChunkFS

Diving deeper into UnionFS: a walk through creating and managing large file systems using the principles of ChunkFS.

Given the size of today’s hard drives, a question often asked is how to create and manage large file systems. Many times this question comes up around ext3, which is fairly limited in maximum size. In this article, an approach to creating file systems using the concepts of ChunkFS is presented. In particular, UnionFS is used to assemble a large file system from “chunks,” which at the same time helps with check and repair times.

I Want a Bigger File System but I Need to Manage It

Everyone has seen the reports about the explosion of data. While this article isn’t intended to present the numbers behind that explosion, it is important to understand that data storage is growing at an enormous rate. In July 2006 IBM Global Technology Services published a seminal paper entitled “The Toxic Terabyte,” which discussed the rise in the amount of data being produced by companies. From the report:


It is projected that just four years from now, the world’s information base will be doubling in size every 11 hours. So rapid is the growth in the global stock of digital data that the very vocabulary used to indicate quantities has had to expand to keep pace. A decade or two ago, professional computer users and managers worked in kilobytes and megabytes. Now schoolchildren have access to laptops with tens of gigabytes of storage, and network managers have to think in terms of the terabyte (1,000 gigabytes) and the petabyte (1,000 terabytes). Beyond those lie the exabyte, zettabyte and yottabyte, each a thousand times bigger than the last.

An easy way to grasp the explosion of data is to look at your own desktop or laptop. Think back four years to the size of the drive you had then. Now think of the size of the drive in your current system. Even better, look at the ads in the Sunday paper and compare the sizes of drives and their costs.

Recent file systems such as ext4 (production-ready), btrfs and nilfs (both experimental), and existing file systems such as XFS and JFS are capable of very large file systems. At the same time, there are potential problems with large file systems: they have to be planned carefully and tested on many different levels to ensure that performance is maintained. Moreover, Henry Newman pointed out in this article that one of the problems with large file systems is the amount of time it can take to fsck them in the event of corruption.

Fsck-ing

Everyone reading this article, raise your hand if you’ve had to run fsck on a file system. Now put your hand down if it took under 10 minutes. There are a number of people with their hands still up, and it is almost certain they have horror stories of an fsck lasting a very long time. The author remembers an fsck on a 1 TB file system around 2002: it took almost two days to complete and brought all work to a halt. It’s pretty evident that performing an fsck on a large file system can take quite a bit of time.

There may be arguments that certain file systems don’t have or need an fsck. However, there is a difference between replaying a journal to regain consistency and actually checking the entire file system. Journals solved a problem of the previous generation of file systems, which had to do a full scan to repair any errors after an unclean unmount (e.g., partially finished writes). Journaled file systems keep a journal of write operations on disk, so after an unclean unmount the journal is simply replayed on the next mount, bringing the file system back to a consistent state.

However, journals don’t help with the problem of a corrupt file system. Corruption can come from a number of sources such as disk errors (hardware problems), file system bugs (nah, those never happen), and administrator error (admins are always right – just ask them). In the event of one of these problems or others, an fsck is needed that scans the file system and fixes problems. Despite all kinds of protection techniques, file system corruption can and does happen, so an fsck is required. A very good explanation of fsck and its development and use over time is in this article.

In looking at file systems and fsck, Valerie Aurora was intrigued by the enormous disparity between the seek time and the capacity of disks. She took some typical values for disks from a 2006 talk and projected the increase in fsck times out to 2013. Her conclusion was the following:


  • From 2006 to 2013 capacity will increase 16x
  • From 2006 to 2013 bandwidth increases 5x
  • From 2006 to 2013 seek time increases 1.2x
  • fsck time increases 10x!

This means it will take 10 times longer to fsck a file system in 2013 than it did in 2006. So if an fsck takes 2 hours in 2006, it will take 20 hours in just a few years.
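As a back-of-the-envelope check of the 10x figure, the growth factors above bound the problem from two sides. The split between streaming reads and seeks is an assumption for illustration, not from Valerie's analysis:

```shell
# fsck time grows with capacity but is limited by how fast the disk can
# deliver data: sequential reads improve with bandwidth, seeks barely improve.
awk 'BEGIN { printf "sequential-read bound: %.1fx\n", 16 / 5 }'
awk 'BEGIN { printf "seek-dominated bound:  %.1fx\n", 16 / 1.2 }'
```

The projected 10x increase falls between these two bounds, which is consistent with fsck being a mix of large streaming reads and random seeks.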

ChunkFS

From this analysis Valerie concluded that file systems needed to be designed from the start with a good fsck capability. She discussed some concepts in a paper; one of the concepts explored was called ChunkFS.

ChunkFS is an architecture built on the assumption that at some point every file system will have to run fsck, which means it must be designed for fast and reliable check and repair. To achieve this, ChunkFS breaks a file system into pieces (chunks), where each chunk can be checked and repaired independently of the others. The chunks are then assembled into a single coherent file system.

The challenge comes with the details of the design (it’s always in the details). In particular, files should be able to span chunks while preserving the ability to check and repair each chunk independently.

In Valerie’s recent article she discussed ChunkFS in some detail. In particular, she said that three prototype implementations were developed and tested over time. From these implementations she drew the following conclusions (taken from the article):


The three chunkfs prototypes and our estimates of cross-chunk references using real-world file systems showed that the chunkfs architecture works as advertised. The prototypes also convinced us that it would be difficult to retrofit existing journaling file systems to the chunkfs architecture. Features that make file system check and repair fast are best designed into the architecture of the file system from the beginning. Btrfs is an example of a file system designed from the ground up with the goal of fast, reliable check and repair.

So ChunkFS may not appear in the kernel in the near future, but as Valerie points out, it has influenced the design of the next generation of file systems. In addition, the concepts embedded within ChunkFS can be exploited to construct large file systems with a reasonably fast check and repair capability.

Using UnionFS to Link the Chunks

The ChunkFS concept of “stitching” together chunks into a single file system can actually be accomplished using UnionFS. In a previous article, UnionFS was introduced in conjunction with SquashFS. UnionFS is a stackable unification file system that merges the contents of several file systems (called branches) into a single coherent view. The branches are overlaid on top of each other and can be mixed in read-only and/or read-write modes.
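As a minimal sketch of what this looks like in practice, the Unionfs 1.x mount syntax lists the branches in a dirs= option (this assumes the unionfs kernel module is installed; the devices and mount points are illustrative):

```shell
# Mount two ext3 "chunks", then merge them into one view at /mnt/union.
# Each branch is marked read-write; the leftmost branch has highest priority.
mkdir -p /mnt/chunk1 /mnt/chunk2 /mnt/union
mount -t ext3 /dev/sdb1 /mnt/chunk1
mount -t ext3 /dev/sdb2 /mnt/chunk2
mount -t unionfs -o dirs=/mnt/chunk1=rw:/mnt/chunk2=rw unionfs /mnt/union
```

Users see a single tree under /mnt/union, while each chunk remains an ordinary ext3 file system underneath.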

There are some limitations to using UnionFS to create a single large file system. Unlike ChunkFS, a single file cannot be larger than the chunk it resides on. This is an important limitation to understand before creating the union. An additional limitation is that if one of the chunks fills up, you cannot write any more to that chunk, although you can continue writing to the other chunks.

While there are limitations, there is one big benefit: the ability to create what users see as a single large file system. There is also a benefit that people may not notice: the ability to fsck each chunk independently of the others. As pointed out earlier, as file systems get larger, the time it takes to check and repair them grows dramatically. The concept that ChunkFS put forth, breaking a file system into independent pieces, applies equally to a large file system built with UnionFS. Each chunk in the union can be checked and repaired independently of the others. You can even check and repair the chunks in parallel, allowing a much faster fsck of the entire file system. This is a HUGE benefit to administrators, and while users may not realize it, it’s a big boost for them as well because the file system will be off-line for a much shorter time.
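Checking the chunks in parallel is then just a matter of backgrounding one fsck per chunk (a sketch, using the device names from the example later in the article; the union must be unmounted first):

```shell
# Run an fsck on every chunk simultaneously; "wait" blocks until all finish.
# Each fsck touches an independent ext3 file system, so total wall-clock
# time is roughly that of checking a single chunk.
# -f forces a full check, -p repairs automatically where it is safe to do so.
for dev in /dev/sdb1 /dev/sdb2 /dev/sdc1 /dev/sdc2; do
    fsck.ext3 -f -p "$dev" &
done
wait
```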

Many people use ext3 as their primary file system. But ext3 has limitations, in particular a 16 TiB (basically 16 TB) volume limit. Given today’s 2 TB drives, it’s fairly easy to build a 16 TB file system, even in a home system. However, people still want to use ext3 for larger file systems. How do you get around this limit? The answer is to use a union, keeping in mind the limitations previously mentioned.
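The 16 TiB figure follows directly from ext3’s on-disk format: with the common 4 KiB block size, block numbers are 32 bits, so the arithmetic works out as:

```shell
# 2^32 addressable blocks * 4096 bytes per block = 2^44 bytes = 16 TiB
echo $(( 4096 * 2**32 ))           # total bytes
echo $(( 4096 * 2**32 / 2**40 ))   # expressed in TiB
```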

Simple Example – /home

To better understand how UnionFS can work, a simple four-chunk UnionFS file system is created using ext3 on two 500 GB drives (two chunks per drive). Assuming the process starts from scratch, the first step is to create two partitions on each drive (/dev/sdb and /dev/sdc). After that, an ext3 file system is created on each partition. As an example, here is the output for /dev/sdb1.

[root@test64 laytonjb]# /sbin/mke2fs -t ext3 /dev/sdb1
mke2fs 1.41.7 (29-June-2009)
warning: 208 blocks unused.

Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
15291504 inodes, 61046784 blocks
3052349 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
1863 block groups
32768 blocks per group, 32768 fragments per group
8208 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 26 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

This output is very common (nothing new to see – move along, move along) but it is included for completeness.
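The same command must be run on the remaining partitions; a short loop handles all four chunks in one go (a minimal sketch of the invocation above; run as root):

```shell
# Create an ext3 file system on each of the four "chunk" partitions.
for part in /dev/sdb1 /dev/sdb2 /dev/sdc1 /dev/sdc2; do
    /sbin/mke2fs -t ext3 "$part"
done
```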

This process is repeated for all four partitions: /dev/sdb1, /dev/sdb2, /dev/sdc1, and /dev/sdc2. There is one important consideration to note at this point: look at the last two lines of the output from making the ext3 file system. By default, an ext3 file system is fsck-ed every 26 mounts or every 180 days, whichever comes first. To illustrate this a bit more, one can gather information about the file system using tune2fs.

[root@test64 laytonjb]# /sbin/tune2fs -l /dev/sdb1 | more
tune2fs 1.41.7 (29-June-2009)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          f35218ca-981c-4208-bca6-3b61e000a7dc
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              15291504
Block count:              61046784
Reserved block count:     3052349
Free blocks:              60039157
Free inodes:              15291493
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1009
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8208
Inode blocks per group:   513
Filesystem created:       Sun Jul 19 08:25:09 2009
Last mount time:          n/a
Last write time:          Sun Jul 19 08:26:10 2009
Mount count:              0
Maximum mount count:      26
Last checked:             Sun Jul 19 08:25:09 2009
Check interval:           15552000 (6 months)
Next check after:         Fri Jan 15 07:25:09 2010
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      e80032b8-2549-4c17-8460-a5fe02c3bb26
Journal backup:           inode blocks

There are two lines in the tune2fs output that are important here.

Maximum mount count:      26
Check interval:           15552000 (6 months)
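On a large union this means each chunk will periodically be fsck-ed at mount time whether you want it or not. If you prefer to schedule checks yourself, the automatic triggers can be disabled with tune2fs, as the mke2fs output suggested (a sketch; apply it to each chunk device):

```shell
# Disable the mount-count (-c 0) and time-based (-i 0) automatic checks
# so a full fsck runs only when invoked explicitly.
/sbin/tune2fs -c 0 -i 0 /dev/sdb1
```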

Comments on "I Like My File Systems Chunky: UnionFS and ChunkFS"

typhoidmary

Why would you not just use device mapper with LVM or EVM?

laytonjb

There are two reasons you don't want to use LVM.

1. File systems such as ext3 have limited sizes, but there are people who want to use ext3 for larger file systems. LVM doesn't help in this context.

2. LVM doesn't help with the fsck times. Breaking up the file system into chunks can greatly reduce fsck time.

However, I am a proponent of using LVM if the file system is capable of growing to use added space. Emotionally I like the concept of shrinking a file system to gain back some space that I can then allocate somewhere else. But I have yet to do this myself and no one I've spoken with has done it yet (I'm sure there are people who would like to – if so, let us know).

If you are referring to the "homework" of using LVM, then I think that is a good solution (another person has emailed me about that as well).

BTW – my email address in the original article is incorrect. It should be jlayton _at_ linux-mag.com. I fixed the article but the cache may trip up people.

Thanks for the post!

Jeff

caletronics

For a long time I've been using bind mounts (see below). I get the manageability of individual disks and one coherent /home, but also (unlike ChunkFS?) the ability to restrict the "view" for different NFS clients. I also share the pitfall of filling one disk while another may have lots of free space.

My question: what does chunkFS get me compared to bind mounts?

Thanks,
Chris D

For clarity I'm just showing excerpts. It's worth pointing out that, except for serval and ocelot, other clients are unable to see my home directory and therefore the music directory inside it. But using bind mounts I can also mount the music disk where all clients can see it.
/etc/fstab:

/dev/k01/01.3 /disk/01.3/ xfs rw 0 0
/dev/k02/02.3 /disk/02.3/ xfs rw 0 0
/dev/k03/03.1 /disk/03.1/ xfs rw 0 0
/disk/02.3/home/chrisd /home/chrisd none rw,bind 0 0
/disk/01.3/mythtv /home/mythtv none rw,bind 0 0
/disk/03.1/music /home/chrisd/music none rw,bind 0 0
/disk/03.1/music /home/mythtv/music none ro,bind 0 0

/etc/exports:
/home *(ro,fsid=0,no_root_squash,no_subtree_check,insecure)
/home/chrisd serval.zoo(rw,nohide,no_root_squash,no_subtree_check) \
ocelot.zoo(ro,nohide,no_root_squash,no_subtree_check)
/home/chrisd/music serval.zoo(rw,nohide,no_root_squash,no_subtree_check) \
ocelot.zoo(ro,nohide,no_root_squash,no_subtree_check)
/home/mythtv *(ro,nohide,no_root_squash,no_subtree_check,insecure) \
serval.zoo(rw,nohide,no_root_squash,no_subtree_check)
/home/mythtv/music *(ro,nohide,no_root_squash,no_subtree_check,insecure)

drogo

I've shrunk an LVM device before.

I wanted to backup a smallish RAID-5 array (3x200G drives) and came across the snapshot ability. Since I had originally used all the extents when I first created the array, I had to shrink the filesystem, then free up a few extents for the snapshot.

I was successful, but I did have a fresh backup sitting right next to the system. Heck, the backup was probably the voodoo I needed to ensure success. :D

typhoidmary

I think my point about LVM was missed. The idea of a chunky FS is that you manage the fact that ext3 becomes less and less practical the bigger the span it has to cover. So a chunky FS system is really several smaller ext3 file systems working "seamlessly" together. This is one of the things LVM does. While LVM is designed to grow and shrink and also span disks, there is nothing to stop it from spanning volumes on a disk.

So take that 1 TB drive, partition it into 10 GB sections (to take a size at random), and combine these sections as one logical volume. ext3 then takes care of an FS section closer to its "comfort" level, while LVM handles the issue of files spanning partitions.

The question remaining is whether or not fsck can run on the individual partitions, or if it must run on the whole logical volume. If it can't handle just a partition, then this is a great feature request for the LVM project.

laytonjb

@typhoidmary
I don't think I missed your point, but maybe you don't see the difference between the two concepts. With your concept you combine partitions using LVM into a logical volume on which you then create ext3. So, for example, you could take five 1 TB drives and make a single 5 TB LV. But when you run an fsck you are still running it across a single file system.

Using the principles of ChunkFS you can combine separate file systems into a single logical file system using UnionFS. In this approach, for example, you would create an ext3 file system on each of the 5 drives, then combine them using UnionFS into a seemingly single file system. If you need to run an fsck you can run it on one of 5 pieces without having to run it on all 5.

Note that you can still use LVM to create the LVs for each of the "chunks" and combine them with UnionFS.

So the big difference between your approach and the approach in the article is that yours creates a single ext3 file system, while the article creates multiple ext3 file systems and combines them using UnionFS. Your approach allows files as large as the entire file system, but the fsck is slow. With the approach in the article you can fill a chunk without filling the entire union, possibly causing problems, but the fsck is much faster.

Does this make sense or did I make it worse?

Jeff

Reply
