Diving deeper into UnionFS: walking through how to create and manage large file systems using the principles of ChunkFS and UnionFS.
Given the size of today's hard drives, a question often asked is how to create and manage large file systems. Many times this question is asked around ext3, which can be fairly limiting in size. The corollary is how one manages such large file systems once they exist. In this article, an approach to creating file systems using the concepts of ChunkFS is presented. In particular, UnionFS is used to build a large file system from "chunks", which at the same time helps keep check and repair times manageable.
I Want a Bigger File System but I Need to Manage It
Everyone has seen the reports about the huge explosion of data. While this article isn’t intended to present the data around this explosion, it is important to understand that data storage is growing at a huge rate. In July 2006 IBM Global Technology Services published a seminal paper entitled “The Toxic Terabyte”. The paper discussed the rise in the amount of data being produced by companies. From the report,
It is projected that just four years from now, the world’s information base will be doubling in size every 11 hours. So rapid is the growth in the global stock of digital data that the very vocabulary used to indicate quantities has had to expand to keep pace. A decade or two ago, professional computer users and managers worked in kilobytes and megabytes. Now schoolchildren have access to laptops with tens of gigabytes of storage, and network managers have to think in terms of the terabyte (1,000 gigabytes) and the petabyte (1,000 terabytes). Beyond those lie the exabyte, zettabyte and yottabyte, each a thousand times bigger than the last.
An easy way to think about the explosion of data is to look at your own desktop or laptop. Think back four years to the size of the drive you had then, and compare it to the size of the drive in your current system. Even better, look at the ads in the Sunday paper and note the sizes and prices of drives.
Recent file systems such as ext4 (production ready), btrfs and nilfs (both experimental), and existing file systems such as XFS and JFS are capable of very large file systems. At the same time, there are potential problems with large file systems. They have to be planned carefully and tested on many different levels to ensure that performance is maintained. Moreover, Henry Newman pointed out in this article that one of the problems with large file systems is the amount of time it can take to fsck them in the event of corruption.
Everyone reading this article, raise your hand if you've had to run fsck on a file system. Now put your hand down if it took under 10 minutes. There are a number of people with their hands still up, and it is almost certain they have horror stories of an fsck lasting a very long time. The author remembers an fsck on a 1TB file system around 2002; it took almost 2 days to complete and brought all work to a halt. It's pretty evident that performing an fsck on a large file system can take quite a bit of time.
There may be some arguments that certain file systems don't have or need an fsck. However, there is a difference between replaying a journal to regain consistency and actually having to check the entire file system. Journals solved the problem of the previous generation of file systems, which had to do a full scan to repair any errors after an unclean unmount (e.g. partially finished writes). Journaled file systems keep a journal of write operations on disk, so after an unclean unmount the journal is simply replayed on the next mount, completing the logged writes and restoring consistency.
However, journals don't help with the problem of a corrupt file system. Corruption can come from a number of sources, such as disk errors (hardware problems), file system bugs (nah, those never happen), and administrator error (admins are always right – just ask them). In the event of one of these problems or others, an fsck is needed that scans the file system and fixes problems. Despite all kinds of protection techniques, file system corruption can and does still happen, so an fsck is required. A very good explanation of fsck and its development and use over time is in this article.
In looking at file systems and fsck, Valerie Aurora was intrigued by the enormous disparity between the seek time and capacity of disks. She took some typical values for disks from a talk in 2006 and projected the increase in fsck times to 2013. Her conclusions were the following:
- From 2006 to 2013 capacity will increase 16x
- From 2006 to 2013 bandwidth increases 5x
- From 2006 to 2013 seek time increases 1.2x
- fsck time increases 10x!
This means that it will take 10 times longer to fsck a file system in 2013 than it did in 2006. So if an fsck takes 2 hours in 2006, it will take 20 hours in just a few years.
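The 10x figure can be sanity-checked from the numbers above: an fsck is a mix of bandwidth-bound sequential reads and seek-bound metadata accesses, so its slowdown falls between two bounds. A back-of-the-envelope sketch (not from the original analysis, just the arithmetic on the projections listed above):

```shell
# Back-of-the-envelope bounds using the 2006-to-2013 projections:
# capacity grows 16x, bandwidth 5x, seek time only 1.2x
awk 'BEGIN { printf "bandwidth-bound slowdown: %.1fx\n", 16/5 }'
awk 'BEGIN { printf "seek-bound slowdown:      %.1fx\n", 16/1.2 }'
# A real fsck lands between these two bounds (3.2x and 13.3x),
# so a 10x increase is well within range.
```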
From her analysis, Valerie concluded that file systems need to be designed from the start with a good fsck capability. She discussed some concepts in a paper; one of the concepts explored was called ChunkFS.
ChunkFS is an architecture built on the assumption that at some point a file system will have to run fsck, which means it has to be designed for fast and reliable file system check and repair. To achieve this, ChunkFS breaks a file system into pieces (chunks), where each chunk can be checked and repaired independently of the others. These chunks are then assembled into a coherent file system.
The challenge comes with the details of the design (it's always in the details). In particular, files should be able to span chunks while each chunk retains the ability to be checked and repaired independently.
In a recent article, Valerie discussed ChunkFS in some detail. In particular, she noted that three implementations were developed and tested over time. From these implementations she drew the following conclusions (taken from the article):
The three chunkfs prototypes and our estimates of cross-chunk references using real-world file systems showed that the chunkfs architecture works as advertised. The prototypes also convinced us that it would be difficult to retrofit existing journaling file systems to the chunkfs architecture. Features that make file system check and repair are best when designed into the architecture of the file system from the beginning. Btrfs is an example of a file system designed from the ground up with the goal of fast, reliable check and repair.
So ChunkFS may not appear in the kernel in the near future, but as is pointed out, it has influenced the design of the next generation of file systems. In addition, the concepts embedded within ChunkFS can be exploited to construct large file systems that have a reasonably fast file system check and repair capability.
Using UnionFS to Link the Chunks
The ChunkFS concept of “stitching” together chunks into a single file system can actually be accomplished using UnionFS. In a previous article UnionFS was introduced in conjunction with SquashFS. UnionFS is a stackable unification file system that merges the contents of several file systems (called branches) into a single coherent file system view. The various branches are overlaid on top of each other and they can be mixed in a read-only and/or a read-write mode.
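As a sketch, assembling four chunks into one view might look like the following. The mount point and branch paths are hypothetical, and this assumes the UnionFS kernel module with its dirs= mount option is available (the device names match the example later in this article):

```shell
# Mount the four chunks individually (hypothetical mount points)
mkdir -p /mnt/chunk1 /mnt/chunk2 /mnt/chunk3 /mnt/chunk4 /mnt/big
mount /dev/sdb1 /mnt/chunk1
mount /dev/sdb2 /mnt/chunk2
mount /dev/sdc1 /mnt/chunk3
mount /dev/sdc2 /mnt/chunk4

# Overlay all four branches read-write into a single coherent view
mount -t unionfs \
      -o dirs=/mnt/chunk1=rw:/mnt/chunk2=rw:/mnt/chunk3=rw:/mnt/chunk4=rw \
      unionfs /mnt/big
```

Users then see one large file system at /mnt/big, while each branch remains an ordinary ext3 file system underneath.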
There are some limitations with using UnionFS to create a single large file system. Unlike ChunkFS, you can't have a file larger than the chunk it resides on. This is an important limitation that you must understand before creating the union. An additional limitation is that if one of the chunks fills up, you cannot write any more to that chunk, although you can continue writing to the other chunks.
While there are limitations, there is one big benefit: the ability to create what users see as a single large file system. At the same time, there is an additional benefit that people may not notice – the ability to fsck each chunk independently of the others. As pointed out earlier, as file systems get larger and larger, the amount of time it takes to check and repair them grows dramatically. The concept that ChunkFS put forth, breaking a file system into independent pieces, applies equally to a large file system built with UnionFS. Each chunk in the union can be checked and repaired independently of the others. You can even check and repair the chunks in parallel, allowing a much faster fsck of the entire file system. This is a HUGE benefit to administrators, and while users may not realize it, it's a big boost for them as well because the file system is off-line for a much shorter time.
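Since each chunk is an ordinary file system, the parallel check is just a matter of backgrounding the individual fsck runs. A minimal sketch, assuming the union and all branches are unmounted first, using the device names from the example later in this article:

```shell
# Check all four chunks at the same time; -p (preen) fixes
# minor problems automatically without prompting
fsck -p /dev/sdb1 &
fsck -p /dev/sdb2 &
fsck -p /dev/sdc1 &
fsck -p /dev/sdc2 &
wait   # remount the union once every chunk has been checked
```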
There are many people who use ext3 as their primary file system. But ext3 has limitations, in particular a 16 TiB (basically 16 TB) volume limit. Given today's 2 TB drives, it's fairly easy to build a 16TB file system, even in home systems. However, people still want to use ext3 for larger file systems. How do you get around this limit? The answer is to use a union, while keeping in mind the limitations previously mentioned.
Simple Example – /home
To better understand how UnionFS can work, a simple 4-chunk UnionFS file system is created using ext3 from two 500 GB drives (two chunks per drive). Assuming that this process starts from scratch, the first step is to create two partitions on each drive (/dev/sdb and /dev/sdc). After that, an ext3 file system is created on each partition. As an example, here is the output for /dev/sdb1:
[root@test64 laytonjb]# /sbin/mke2fs -t ext3 /dev/sdb1
mke2fs 1.41.7 (29-June-2009)
warning: 208 blocks unused.
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
15291504 inodes, 61046784 blocks
3052349 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
1863 block groups
32768 blocks per group, 32768 fragments per group
8208 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 26 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
This output is very common (nothing new to see – move along, move along) but it is included for completeness.
This process is repeated for the remaining partitions, ending with /dev/sdc2. There is one important consideration to note at this point: look at the last two lines of the output when making the ext3 file system. The default for ext3 is that every 26 mounts or every 180 days the file system is fsck-ed. One can gather information about the file system using tune2fs to illustrate this a bit more.
[root@test64 laytonjb]# /sbin/tune2fs -l /dev/sdb1 | more
tune2fs 1.41.7 (29-June-2009)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: f35218ca-981c-4208-bca6-3b61e000a7dc
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype sparse_super large_file
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 15291504
Block count: 61046784
Reserved block count: 3052349
Free blocks: 60039157
Free inodes: 15291493
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1009
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8208
Inode blocks per group: 513
Filesystem created: Sun Jul 19 08:25:09 2009
Last mount time: n/a
Last write time: Sun Jul 19 08:26:10 2009
Mount count: 0
Maximum mount count: 26
Last checked: Sun Jul 19 08:25:09 2009
Check interval: 15552000 (6 months)
Next check after: Fri Jan 15 07:25:09 2010
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: e80032b8-2549-4c17-8460-a5fe02c3bb26
Journal backup: inode blocks
There are two lines in the tune2fs output that are important:
Maximum mount count: 26
Check interval: 15552000 (6 months)
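One consequence of these defaults for a union of chunks: four file systems created at the same time will all hit their forced check on the same mount. A possible way to stagger the checks with tune2fs -c is sketched below (the exact counts are arbitrary; device names are from this article's example):

```shell
# Give each chunk a different maximum mount count so the forced
# fscks don't all land on the same mount (the default is 26 for all)
tune2fs -c 20 /dev/sdb1
tune2fs -c 22 /dev/sdb2
tune2fs -c 24 /dev/sdc1
tune2fs -c 26 /dev/sdc2
```

The -i option can be staggered in the same way for the time-based check interval.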