I Like My File Systems Chunky: UnionFS and ChunkFS

Diving deeper into UnionFS: a walk-through of creating and managing large file systems using the principles of ChunkFS and UnionFS.

For a file system built from four chunks with identical intervals between mandatory fscks, all four chunks will be checked at the same time. One of the key principles behind ChunkFS is to improve check and repair times, so having all of the chunks fsck-ed at once is something of an antithesis. How is that problem fixed?

There is a simple fix: tune2fs can change both the maximum number of mounts and the time interval between forced checks. Both values are staggered across the four chunks.

[root@test64 laytonjb]# /sbin/tune2fs -c 10 /dev/sdb1
tune2fs 1.41.7 (29-June-2009)
Setting maximal mount count to 10
[root@test64 laytonjb]# /sbin/tune2fs -i 60d /dev/sdb1
tune2fs 1.41.7 (29-June-2009)
Setting interval between checks to 5184000 seconds
[root@test64 laytonjb]# /sbin/tune2fs -c 12 /dev/sdb2
tune2fs 1.41.7 (29-June-2009)
Setting maximal mount count to 12
[root@test64 laytonjb]# /sbin/tune2fs -i 90d /dev/sdb2
tune2fs 1.41.7 (29-June-2009)
Setting interval between checks to 7776000 seconds
[root@test64 laytonjb]# /sbin/tune2fs -c 14 /dev/sdc1
tune2fs 1.41.7 (29-June-2009)
Setting maximal mount count to 14
[root@test64 laytonjb]# /sbin/tune2fs -i 120d /dev/sdc1
tune2fs 1.41.7 (29-June-2009)
Setting interval between checks to 10368000 seconds
[root@test64 laytonjb]# /sbin/tune2fs -c 16 /dev/sdc2
tune2fs 1.41.7 (29-June-2009)
Setting maximal mount count to 16
[root@test64 laytonjb]# /sbin/tune2fs -i 150d /dev/sdc2
tune2fs 1.41.7 (29-June-2009)
Setting interval between checks to 12960000 seconds

The maximum mount counts are staggered by two mounts per partition (10, 12, 14, 16), and the check intervals are staggered by 30 days (60, 90, 120, 150). This way, during a reboot or a remount, only one partition at a time will be subject to a forced fsck. If admins don’t like this forced fsck process, it can easily be disabled, but then it’s entirely up to the administrator to periodically perform an fsck if so desired.
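
If you do decide to disable the automatic checks, a quick example (not part of the configuration used in this article) is to set both the maximal mount count and the check interval to zero with tune2fs, repeating the command for each chunk:

[root@test64 laytonjb]# /sbin/tune2fs -c 0 -i 0 /dev/sdb1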

For this example, the union is intended to serve as /home on the system. On this test system /home is already in use, so the union mount point will be /BIGhome instead. However, before creating the union mount, the four chunks must be mounted. First, the mount points for each of the four chunks are created.

[root@test64 laytonjb]# mkdir /mnt/home1
[root@test64 laytonjb]# mkdir /mnt/home2
[root@test64 laytonjb]# mkdir /mnt/home3
[root@test64 laytonjb]# mkdir /mnt/home4

Then /etc/fstab is modified as shown below.

LABEL=/                 /                       ext3    defaults        1 1
LABEL=/home             /home                   ext3    defaults        1 2
LABEL=/boot1            /boot                   ext2    defaults        1 2
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=SWAP-hda2         swap                    swap    defaults        0 0
/dev/sdb1               /mnt/home1              ext3    defaults,data=ordered   0 0
/dev/sdb2               /mnt/home2              ext3    defaults,data=ordered   0 0
/dev/sdc1               /mnt/home3              ext3    defaults,data=ordered   0 0
/dev/sdc2               /mnt/home4              ext3    defaults,data=ordered   0 0

Note that the mounts use the data=ordered option, on the recommendation of Valerie Aurora for the new 2.6.30 kernel. Then the chunks are mounted.

[root@test64 laytonjb]# mount -a
[root@test64 laytonjb]# mount
/dev/hda3 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /home type ext3 (rw)
/dev/hda1 on /boot type ext2 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sdb1 on /mnt/home1 type ext3 (rw,data=ordered)
/dev/sdb2 on /mnt/home2 type ext3 (rw,data=ordered)
/dev/sdc1 on /mnt/home3 type ext3 (rw,data=ordered)
/dev/sdc2 on /mnt/home4 type ext3 (rw,data=ordered)

Finally, the union mount is added to /etc/fstab as shown below.

LABEL=/                 /                       ext3    defaults        1 1
LABEL=/home             /home                   ext3    defaults        1 2
LABEL=/boot1            /boot                   ext2    defaults        1 2
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=SWAP-hda2         swap                    swap    defaults        0 0
/dev/sdb1               /mnt/home1              ext3    defaults,data=ordered   0 0
/dev/sdb2               /mnt/home2              ext3    defaults,data=ordered   0 0
/dev/sdc1               /mnt/home3              ext3    defaults,data=ordered   0 0
/dev/sdc2               /mnt/home4              ext3    defaults,data=ordered   0 0
unionfs                 /BIGhome                unionfs dirs=/mnt/home1=rw:/mnt/home2=rw:/mnt/home3=rw:/mnt/home4=rw  0 0

Notice that all four chunks are added to the union read-write (rw) since they are intended for /home. The order of the chunks (/mnt/home1, /mnt/home2, /mnt/home3, /mnt/home4) is completely arbitrary. Finally, the union is mounted.

[root@test64 laytonjb]# mount -a
[root@test64 laytonjb]# mount
/dev/hda3 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /home type ext3 (rw)
/dev/hda1 on /boot type ext2 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sdb1 on /mnt/home1 type ext3 (rw,data=ordered)
/dev/sdb2 on /mnt/home2 type ext3 (rw,data=ordered)
/dev/sdc1 on /mnt/home3 type ext3 (rw,data=ordered)
/dev/sdc2 on /mnt/home4 type ext3 (rw,data=ordered)
unionfs on /BIGhome type unionfs (rw,dirs=/mnt/home1=rw:/mnt/home2=rw:/mnt/home3=rw:/mnt/home4=rw)

One More Step – Adding Users

Now that the “chunky” file system has been created, there remains the small step of adding users. So, being the good admins that we are, we use the useradd command to add a user. For this example, user2 is added to the system.

[root@test64 laytonjb]# /usr/sbin/useradd -d /BIGhome/user2 -m -s /bin/bash user2
[root@test64 laytonjb]# su user2
[user2@test64 laytonjb]$ cd
[user2@test64 ~]$ ls -s
total 0
[user2@test64 ~]$ pwd
/BIGhome/user2
[user2@test64 ~]$

Notice that the home directory for the user is specified as /BIGhome/user2, and switching to that user and changing to the home directory confirms that it is indeed /BIGhome/user2. But which partition is actually used to store the files?

[root@test64 laytonjb]# ls -s /mnt/home1
total 20
16 lost+found   4 user2
[root@test64 laytonjb]# ls -s /mnt/home2
total 16
16 lost+found
[root@test64 laytonjb]# ls -s /mnt/home3
total 16
16 lost+found
[root@test64 laytonjb]# ls -s /mnt/home4
total 16
16 lost+found

The home directory for user2 actually lives in /mnt/home1, the first partition. If you added a second user, they too would go into the first directory (/mnt/home1) listed in the union mount, and a third user would do the same. This would continue until the first directory is filled, at which point writes would move to the second directory (/mnt/home2). In this example, however, we would much rather spread the users across the four directories. So how can this be done?

A better approach may be to specify the user’s home directory as one of the underlying directories in the union. Since the first user ended up in the first directory (/mnt/home1), it seems logical to put the next user in one of the other directories (e.g., /mnt/home3).

[root@test64 laytonjb]# /usr/sbin/useradd -d /mnt/home3/user3 -m -s /bin/bash user3
[root@test64 laytonjb]# su user3
[user3@test64 laytonjb]$ cd
[user3@test64 ~]$ pwd
/mnt/home3/user3
[root@test64 laytonjb]# ls -s /BIGhome/
total 24
16 lost+found   4 user2   4 user3

What is interesting is that the home directory is now /mnt/home3/user3 rather than /BIGhome/user3, yet the listing of /BIGhome shows that user3 is still visible through the union.
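
As a side note, if you would prefer the account’s /etc/passwd entry to read /BIGhome/user3 while the files continue to live in the /mnt/home3 chunk, one option (not tested for this article) is to change only the home directory field afterward with usermod. Because /BIGhome/user3 and /mnt/home3/user3 show the same contents through the union, the account should behave the same either way; the examples below keep /mnt/home3/user3 as the recorded home directory.

[root@test64 laytonjb]# /usr/sbin/usermod -d /BIGhome/user3 user3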

To better understand how files are created for user3 within the “chunky” union, let’s create a couple of simple test files. First, let’s cd to the user’s home directory and create a simple file, junk.

[user3@test64 ~]$ pwd
/mnt/home3/user3
[user3@test64 ~]$ vi junk
[user3@test64 ~]$ ls -s
total 4
4 junk

Notice that pwd reports /mnt/home3/user3. Then a second file, junk2, is created, but this time after changing the directory to /BIGhome/user3.

[user3@test64 ~]$ cd /BIGhome/user3
[user3@test64 user3]$ pwd
/BIGhome/user3
[user3@test64 user3]$ vi junk2
[user3@test64 user3]$ ls -s
total 8
4 junk  4 junk2

So with the union mount, the user can write to the directory where their home directory physically lives (in this case, /mnt/home3/user3) and also to the union directory, /BIGhome/user3; both paths present the same files.
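
As a quick check that is not part of the run above, you can list the branches directly to see where the files physically landed. With this configuration one would expect both junk and junk2 to end up under /mnt/home3/user3, since that is the only branch containing a user3 directory, although the exact branch-selection policy can vary between UnionFS versions.

[root@test64 laytonjb]# ls -s /mnt/home1 /mnt/home2 /mnt/home3 /mnt/home4
[root@test64 laytonjb]# ls -s /mnt/home3/user3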

Final Comments

As capacities increase, the prediction is that the time to perform an fsck will increase at a dramatic rate. The concepts behind ChunkFS were developed in direct response to this growth in check and repair times. Fundamentally, ChunkFS breaks up the file system into “chunks” that can be checked and repaired independently while still allowing data files to extend across the chunks.
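
To make the independent check-and-repair idea concrete for the example layout above, a single chunk can be taken offline and checked on its own, roughly as sketched below (output omitted). The conservative sequence unmounts the union first so that the branch is not busy; whether that step is strictly required depends on the UnionFS version.

[root@test64 laytonjb]# umount /BIGhome
[root@test64 laytonjb]# umount /mnt/home2
[root@test64 laytonjb]# /sbin/fsck.ext3 -f /dev/sdb2
[root@test64 laytonjb]# mount /mnt/home2
[root@test64 laytonjb]# mount /BIGhome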

This article takes some of the principles of ChunkFS and uses UnionFS to create “large” file systems that are easier to fsck than a typical single file system. While ChunkFS allows data to extend across chunks as needed, using UnionFS restricts the data to the chunk where it’s located. If you can accept that restriction, along with a little extra planning of where data is located, then this approach can be used to great advantage.
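
Much of that planning amounts to keeping an eye on per-chunk usage; a simple check, not shown in the runs above, is to run df against each branch rather than against the union:

[root@test64 laytonjb]# df -h /mnt/home1 /mnt/home2 /mnt/home3 /mnt/home4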

As homework for this article, think about the following problem: Using this example, how could you adapt the layout if a user exceeded the space in their chunk? I have my own solution(s) to this problem but I’m interested in your solutions. Please submit them to me at jlayton _at_ linux-mag.com and I will publish them next week. Until then, just remember, file systems are good when they are chunky.

Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).

Comments on "I Like My File Systems Chunky: UnionFS and ChunkFS"

typhoidmary

Why would you not just use device mapper with LVM or EVM?

laytonjb

There are two reasons you don’t want to use LVM.

1. File systems such as ext3 have limited sizes but there are people who want to use ext3 for larger file systems. LVM doesn’t help in this context.

2. LVM doesn’t help with the fsck times. Breaking up the file system into chunks can greatly reduce fsck time.

However, I am a proponent of using LVM if the file system is capable of growing to use added space. Emotionally I like the concept of shrinking a file system to gain back some space that I can then allocate somewhere else. But I have yet to do this myself, and no one I’ve spoken with has done it yet (I’m sure there are people who would like to – if so, let us know).

If you are referring to the “homework” of using LVM, then I think that is a good solution (another person has emailed me about that as well).

BTW – my email address in the original article is incorrect. It should be jlayton _at_ linux-mag.com. I fixed the article but the cache may trip up people.

Thanks for the post!

Jeff

caletronics

For a long time I’ve been using bind mounts (see below). I get the manageability of individual disks, one coherent /home, but also (unlike chunkFS?) the ability to restrict the “view” to different NFS clients. I also share the pitfall of filling one disk while another may have lots of free space.

My question: what does chunkFS get me compared to bind mounts?

Thanks,
Chris D

For clarity I’m just showing excerpts. It’s worth pointing out that, except for serval and ocelot, other clients are unable to see my home directory and therefore the music directory inside it. But using bind I can also mount the music disk where all clients can see it.
/etc/fstab:

/dev/k01/01.3 /disk/01.3/ xfs rw 0 0
/dev/k02/02.3 /disk/02.3/ xfs rw 0 0
/dev/k03/03.1 /disk/03.1/ xfs rw 0 0
/disk/02.3/home/chrisd /home/chrisd none rw,bind 0 0
/disk/01.3/mythtv /home/mythtv none rw,bind 0 0
/disk/03.1/music /home/chrisd/music none rw,bind 0 0
/disk/03.1/music /home/mythtv/music none ro,bind 0 0

/etc/exports:
/home *(ro,fsid=0,no_root_squash,no_subtree_check,insecure)
/home/chrisd serval.zoo(rw,nohide,no_root_squash,no_subtree_check) \
ocelot.zoo(ro,nohide,no_root_squash,no_subtree_check)
/home/chrisd/music serval.zoo(rw,nohide,no_root_squash,no_subtree_check) \
ocelot.zoo(ro,nohide,no_root_squash,no_subtree_check)
/home/mythtv *(ro,nohide,no_root_squash,no_subtree_check,insecure) \
serval.zoo(rw,nohide,no_root_squash,no_subtree_check)
/home/mythtv/music *(ro,nohide,no_root_squash,no_subtree_check,insecure)

drogo

I’ve shrunken an LVM device before.

I wanted to backup a smallish RAID-5 array (3x200G drives) and came across the snapshot ability. Since I had originally used all the extents when I first created the array, I had to shrink the filesystem, then free up a few extents for the snapshot.

I was successful, but I did have a fresh backup sitting right next to the system. Heck, the backup was probably the voodoo I needed to ensure success. :D
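
For anyone trying the same thing, the general shape of the procedure (generic names and sizes, so double-check the man pages and have a backup first) is to shrink the file system below the target size, reduce the LV, let the file system grow back to fill it, and then carve out the snapshot:

umount /dev/vg0/data
e2fsck -f /dev/vg0/data
resize2fs /dev/vg0/data 180G
lvreduce -L 190G /dev/vg0/data
resize2fs /dev/vg0/data
mount /dev/vg0/data /mnt/data
lvcreate -s -L 5G -n datasnap /dev/vg0/data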

typhoidmary

I think my point about LVM was missed. The idea of a chunky FS is that you manage the fact that ext3 becomes less and less practical the bigger the span it has to cover. So a chunky FS system is really several smaller ext3 FS working “seamlessly” together. This is one of the things LVM does. While LVM is designed to grow and shrink and also span disks, there is nothing to stop it from spanning volumes on a disk.

So take that 1 TB drive, partition it into 10 GB sections (to take a size at random), and combine these sections as one logical volume. ext3 then takes care of an FS section closer to its “comfort” level, while LVM handles the issue of files spanning partitions.

The question remaining is whether or not fsck can run on the individual partitions, or if it must run on the logical volume. If it can’t handle just the partition, then this is a great feature request for the LVM project.

laytonjb

@typhoidmary
I don’t think I missed your point, but maybe you don’t see the difference between the two concepts. With your concept, you combine partitions using LVM into a logical volume on which you then use ext3. So, for example, you could combine five 1TB drives into a single 5TB LV. But when you run an fsck on the file system, you are still running it across a single file system.

Using the principles of ChunkFS, you can combine separate file systems into a single logical file system using UnionFS. In this approach, for example, you would create an ext3 file system on each of the 5 drives, then combine them using UnionFS into a seemingly single file system. If you need to run an fsck, you can run it on any one of the 5 pieces without having to run it on all 5.

Note that you can still use LVM to create the LVs for each of the “chunks” and combine them with UnionFS, roughly as sketched below.
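
A rough sketch of that hybrid, using made-up device and volume names, would be to create one LV per chunk, put ext3 on each, mount them, and then union the mount points just as in the article:

pvcreate /dev/sdd /dev/sde
vgcreate vg_home /dev/sdd /dev/sde
lvcreate -L 500G -n chunk1 vg_home
lvcreate -L 500G -n chunk2 vg_home
mkfs.ext3 /dev/vg_home/chunk1
mkfs.ext3 /dev/vg_home/chunk2
mkdir -p /mnt/chunk1 /mnt/chunk2
mount /dev/vg_home/chunk1 /mnt/chunk1
mount /dev/vg_home/chunk2 /mnt/chunk2
mount -t unionfs -o dirs=/mnt/chunk1=rw:/mnt/chunk2=rw unionfs /BIGhome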

So the big difference between your approach and the approach in the article is that your approach creates a single ext3 file system, while the article creates multiple ext3 file systems and combines them with UnionFS. Your approach allows files to fill up the entire file system, but the fsck is slow. With the approach in the article, a chunk can fill up before the rest of the union does, which can cause problems, but the fsck is much faster than with your approach.

Does this make sense or did I make it worse?

Jeff
