Improving Metadata Performance of the Ext4 Journaling Device

In the never-ending quest for more performance, we examine three different journaling device options for ext4 with an eye toward improving metadata performance. Who doesn't like speed?

There is a relentless pursuit of more performance from our storage systems: more performance from the hardware (faster disks, SSDs), the network (bigger pipes, larger MTUs), the operating system (caching, IO schedulers), and the file system itself. There are many levers that can be pulled to improve performance, but this article looks at one particular piece – the file system journal device. In particular, the metadata performance of ext4 is examined as the journal is moved to different devices.

Journaling for File Systems

Sometimes bad things such as power failures happen to systems. A power interruption or failure can corrupt a file system very quickly because an in-flight IO operation is interrupted and never completed. Consequently, the entire file system has to be checked (walked) with fsck to find and correct any problems. As file systems have grown, the time it takes to walk them has grown as well. For example, the author remembers performing an fsck on a 1TB file system in 2002-2003 that took several days. Having a system down for that long is very painful.

One way to help improve fsck times is to use a journaled file system. Rather than IO operations happening directly to the file system, each operation is first added to the journal (typically a log) in the order it is supposed to happen. The file system then takes the operation from the head of the journal and completes it, erasing it from the journal only after the operation is finished and the file system is satisfied that it is complete.

If power is lost during an operation on a journaled file system, the journal is simply “replayed” when the system comes back up, i.e. the operations in the journal are performed one at a time starting at the beginning. This means that the entire file system doesn’t necessarily have to be checked (walked). This works because an operation is not removed from the journal until it has completed: even if the operation hadn’t finished on the file system, replaying it ensures that it actually occurs. If the interruption happened while an operation was being deleted from the journal, the file system can assume the operation completed and simply deletes the “corrupted” entry from the head of the journal. As a result, instead of spending a couple of days waiting for an fsck of the entire file system, a very fast replay of the journal is performed, taking just minutes.

The journal can theoretically reside anywhere within the system on any block device: on the drive containing the file system, on a partition of another drive, or on any other block device you have lying around. But choosing the “best device” matters. The journal is critical to the integrity of the file system, so placing it on a device with some resiliency (the ability to tolerate errors or problems) is very important. At the same time, everyone loves more performance (no one has ever said, “you know, I want my storage to go slower”). Since the performance of the journal can be key to the performance of the file system, improving the performance of the journal device and the journal itself may help overall file system performance.

Testing the Metadata Performance

In this article three options for the journal device will be tested to determine the impact of journal device location on the metadata performance of ext4. The three device options are:

  • Journal on the same disk as the file system
  • Journal on a different disk from the file system
  • Journal on a ramdisk

The last option, using a ramdisk for the journal, is designed to measure the pinnacle of performance. It is not likely to be the most resilient solution (better would be a battery-backed ramdisk with the ability to dump its contents to a drive or SSD), but it is included as an “upper bound” on performance.

One of the ways that journal performance can impact overall file system performance is metadata performance. This article focuses on metadata performance as measured by fdtree, a benchmark that has been used before to examine the metadata performance of various Linux file systems. To read about fdtree and how it was used for benchmarking, please read the original article.

As a quick recap, the benchmark, fdtree, is a simple bash script that performs four different metadata tests:

  • Directory creation
  • File creation
  • File removal
  • Directory removal

It creates a specified number of files of a given size (in blocks) in a top-level directory. It then creates a specified number of sub-directories, which in turn are recursively populated with sub-directories and files down to a specified number of levels.

Fdtree was used in four different configurations to stress the metadata capability:

  • Small files (4 KiB)
    • Shallow directory structure
    • Deep directory structure

  • Medium files (4 MiB)
    • Shallow directory structure
    • Deep directory structure

The two file sizes, 4 KiB (1 block) and 4 MiB (1,000 blocks), were used to get a feel for the range of performance as a function of the amount of data. The two directory structures stress the metadata in different ways, to discover whether the structure has any impact on metadata performance. The shallow directory structure has many directories per level but only a few levels; the deep directory structure has only a few directories per level but many levels.

The command lines for the four combinations are:

Small Files – Shallow Directory Structure

./fdtree.bash -d 20 -f 40 -s 1 -l 3

This command creates 20 sub-directories from each upper-level directory at each level (“-d 20”), and there are 3 levels (“-l 3”). It’s a basic tree structure, for a total of 8,421 directories. In each directory there are 40 files (“-f 40”), each sized at 1 block (4 KiB), denoted by “-s 1”. This is a total of 336,840 files and 1,347,360 KiB of data.

Small Files – Deep Directory Structure

./fdtree.bash -d 3 -f 4 -s 1 -l 10

This command creates 3 sub-directories from each upper-level directory at each level (“-d 3”), and there are 10 levels (“-l 10”). This is a total of 88,573 directories. In each directory there are 4 files, each sized at 1 block (4 KiB). This is a total of 354,292 files and 1,417,168 KiB of data.

Medium Files – Shallow Directory Structure

./fdtree.bash -d 17 -f 10 -s 1000 -l 2

This command creates 17 sub-directories from each upper-level directory at each level (“-d 17”), and there are 2 levels (“-l 2”). This is a total of 307 directories. In each directory there are 10 files, each sized at 1,000 blocks (4 MiB). This is a total of 3,070 files and 12,280,000 KiB of data.

Medium Files – Deep Directory Structure

./fdtree.bash -d 2 -f 2 -s 1000 -l 10

This command creates 2 sub-directories from each upper-level directory at each level (“-d 2”), and there are 10 levels (“-l 10”). This is a total of 2,047 directories. In each directory there are 2 files, each sized at 1,000 blocks (4 MiB). This is a total of 4,094 files and 16,376,000 KiB of data.
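The directory and file totals quoted above are simple geometric sums. As a quick sanity check, a small helper script (not part of fdtree – it merely reproduces the arithmetic) can compute the totals for any combination of options:

#!/bin/bash
# Reproduce the fdtree directory/file totals for a -d/-f/-s/-l combination.
# The values below are for the Medium Files - Deep Directory Structure test.
d=2      # sub-directories created per directory at each level (-d)
f=2      # files per directory (-f)
s=1000   # file size in blocks (-s); each block is 4 KiB
l=10     # number of levels (-l)

dirs=1   # the top-level directory counts toward the total
count=1  # number of directories at the current level
for ((i = 1; i <= l; i++)); do
    count=$((count * d))
    dirs=$((dirs + count))
done
files=$((dirs * f))
kib=$((files * s * 4))
echo "$dirs directories, $files files, $kib KiB"
# Prints: 2047 directories, 4094 files, 16376000 KiB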

Each test was run 10 times for each of the four combinations and each of the three journal devices. The test system ran a stock CentOS 5.3 distribution, but with a 2.6.30 kernel and with e2fsprogs upgraded to 1.41.9. The tests were run on the following system:

  • GigaByte MAA78GM-US2H motherboard
  • An AMD Phenom II X4 920 CPU
  • 8GB of memory
  • Linux 2.6.30 kernel
  • The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
  • /home is on a Seagate ST1360827AS
  • There are two drives for testing, both Seagate ST3500641AS-RK drives with a 16MB cache each. These are /dev/sdb and /dev/sdc.

The first Seagate drive, /dev/sdb, hosted the file system for all of the tests. The second drive, /dev/sdc, was used only in the second option, where the journal was placed on a separate drive.
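The exact test harness isn’t shown in the article, but each combination was presumably driven by a simple loop along the following lines (a hypothetical sketch – the mount point and output paths are made up):

#!/bin/bash
# Hypothetical driver: run one fdtree combination 10 times against the
# ext4 file system mounted at /mnt/test, saving the output of each run.
# fdtree removes everything it creates, so each run starts clean.
cd /mnt/test || exit 1
mkdir -p /root/results
for run in $(seq 1 10); do
    /root/fdtree.bash -d 20 -f 40 -s 1 -l 3 > /root/results/small_shallow.$run
done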

Journaling Device Details

All three journal device options used the same journal size, 16MB. This size was chosen because CentOS boots with a number of ramdisks already created, and these devices are limited to 16MB. To keep the comparisons fair, the journal size was held constant for all three cases.
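As a quick sanity check (not from the original article), the size of one of the pre-created ramdisk nodes can be queried with blockdev:

# blockdev --getsize64 /dev/ram0
16777216

The result, 16,777,216 bytes, is the 16MB limit mentioned above.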

The first journal device option keeps the journal on the same disk as the file system. The drive was partitioned so that the first partition (/dev/sdb1) holds the file system itself and the remaining approximately 16MB at the end of the drive (/dev/sdb2) holds the journal.
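The partitioning step itself isn’t shown; one hypothetical way to create such a layout is with parted (the boundaries are illustrative, not necessarily the values actually used):

# parted -- /dev/sdb mklabel msdos
# parted -- /dev/sdb mkpart primary ext2 1MB -16MB
# parted -- /dev/sdb mkpart primary ext2 -16MB 100%

With the two partitions in place, the first step is to build the file system on /dev/sdb1.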

# mke2fs -t ext4 /dev/sdb1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
29548544 inodes, 118180156 blocks
5909007 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3607 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

The second step is to prepare the journal partition for journaling. Recall that the second partition on the drive (/dev/sdb2) is used for this.

# mke2fs -O journal_dev /dev/sdb2
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 6024 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done

The third step is to tell the file system that it no longer has a journal in the file system (this is a precursor to telling it that the journal is located somewhere else).

# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          99486587-5d38-4896-bf0a-ec79f9ac1d88
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              29548544
Block count:              118180156
Reserved block count:     5909007
Free blocks:              116307702
Free inodes:              29548533
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      995
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Mon Dec  7 11:07:20 2009
Last mount time:          n/a
Last write time:          Mon Dec  7 11:10:12 2009
Mount count:              0
Maximum mount count:      36
Last checked:             Mon Dec  7 11:07:20 2009
Check interval:           15552000 (6 months)
Next check after:         Sat Jun  5 12:07:20 2010
Lifetime writes:          7350 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      ed707821-9ec0-44c7-9c4a-15812b753939
Journal backup:           inode blocks

Notice that the line “Filesystem features” does not have the entry “has_journal” indicating that the file system no longer has a journal. The last step is to tell the file system that it has a journal and it is on the second partition of the drive.

# tune2fs -o journal_data -j -J device=/dev/sdb2 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/sdb2: done
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          99486587-5d38-4896-bf0a-ec79f9ac1d88
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    journal_data
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              29548544
Block count:              118180156
Reserved block count:     5909007
Free blocks:              116307702
Free inodes:              29548533
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      995
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Mon Dec  7 11:07:20 2009
Last mount time:          n/a
Last write time:          Mon Dec  7 11:11:12 2009
Mount count:              0
Maximum mount count:      36
Last checked:             Mon Dec  7 11:07:20 2009
Check interval:           15552000 (6 months)
Next check after:         Sat Jun  5 12:07:20 2010
Lifetime writes:          7350 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal UUID:             b71b315f-40e8-4e93-b868-7ad19f7fee8b
Journal device:           0x0812
Default directory hash:   half_md4
Directory Hash Seed:      ed707821-9ec0-44c7-9c4a-15812b753939
Journal backup:           inode blocks

Notice that the line “Filesystem features” now includes “has_journal” and that the line “Journal device:” has the value 0x0812, which points to the second partition on the drive.
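The journal device number is just the major/minor pair in hexadecimal: 0x08 is major 8 (SCSI disk) and 0x12 is minor 18, which is /dev/sdb2. This can be confirmed by listing the device node (listing abbreviated):

# ls -l /dev/sdb2
brw-rw---- 1 root disk 8, 18 ... /dev/sdb2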

The second journal device option where the journal is placed on a second hard drive is created using several steps. The first step is to create the file system on /dev/sdb1.

# mke2fs -t ext4 /dev/sdb1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
30531584 inodes, 122096000 blocks
6104800 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3727 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

The second step is to create a journal on the second drive, /dev/sdc1. This partition was created to be 16MB in size.

# mke2fs -O journal_dev /dev/sdc1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 6016 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done

The third step is to use tune2fs to tell the file system on /dev/sdb1 that it no longer has a journal.

# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)

# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          14a11690-76a6-4a3d-997a-abf85bd4d4ad
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              30531584
Block count:              122096000
Reserved block count:     6104800
Free blocks:              120161866
Free inodes:              30531573
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      994
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Sun Dec  6 07:22:57 2009
Last mount time:          n/a
Last write time:          Sun Dec  6 07:26:36 2009
Mount count:              0
Maximum mount count:      28
Last checked:             Sun Dec  6 07:22:57 2009
Check interval:           15552000 (6 months)
Next check after:         Fri Jun  4 08:22:57 2010
Lifetime writes:          7590 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      7d24bc9d-db4a-4c0c-b15d-f0959af6edde
Journal backup:           inode blocks

Notice that on the line “Filesystem features” the feature “has_journal” is not listed. This indicates that the journal has been “removed” from the file system. The final step is to tell the file system that it has a journal on a specific device – in this case /dev/sdc1.

# tune2fs -o journal_data -j -J device=/dev/sdc1 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/sdc1: done
This filesystem will be automatically checked every 28 mounts or

# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          14a11690-76a6-4a3d-997a-abf85bd4d4ad
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    journal_data
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              30531584
Block count:              122096000
Reserved block count:     6104800
Free blocks:              120161866
Free inodes:              30531573
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      994
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Sun Dec  6 07:22:57 2009
Last mount time:          n/a
Last write time:          Sun Dec  6 07:27:20 2009
Mount count:              0
Maximum mount count:      28
Last checked:             Sun Dec  6 07:22:57 2009
Check interval:           15552000 (6 months)
Next check after:         Fri Jun  4 08:22:57 2010
Lifetime writes:          7590 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal UUID:             c3d3c7e7-f465-41c7-a556-80a9cdc865c3
Journal device:           0x0821
Default directory hash:   half_md4
Directory Hash Seed:      7d24bc9d-db4a-4c0c-b15d-f0959af6edde
Journal backup:           inode blocks

Looking through the listing you can see that the file system has a journal again (“has_journal” on the “Filesystem features” line) and that the journal device is listed as 0x0821 near the bottom of the listing – major 8, minor 33, which is /dev/sdc1.
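Once the external journal is attached, the file system mounts as it normally would – the kernel locates the journal from the device number recorded in the superblock (the mount point here is hypothetical):

# mount -t ext4 /dev/sdb1 /mnt/test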

The third journal device option places the journal on a ramdisk. This is done in a similar fashion to the previous option, where the journal was put on a second drive, but recall that the external journal has to live on a block device. The technique used to get a ramdisk block device is fairly simple and is based on this article. Despite that article being written for a 2.4 kernel, the techniques are the same.

The first step is to examine which ramdisks have already been created.
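Based on the steps used for the second drive, the rest of the procedure amounts to formatting one of the ramdisks as a journal device and attaching it to the file system – roughly the following sketch (the choice of /dev/ram0 is an assumption):

# ls -l /dev/ram*
# mke2fs -O journal_dev /dev/ram0
# tune2fs -O ^has_journal /dev/sdb1
# tune2fs -o journal_data -j -J device=/dev/ram0 /dev/sdb1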

Comments on "Improving Metadata Performance of the Ext4 Journaling Device"

mark_w

This is interesting and, in some ways, surprising.

I would have expected the ramdisk journal to always be faster, or at the very least as fast, and this is not always the case; on some workloads it is substantially slower.

The one question that did occur to me was that it was possible that the metadata would all fit into cache on the second hard drive. This might increase performance of the 'metadata on a separate hard drive' solution more than anticipated, but it still wouldn't explain why the ramdisk solution was worse.

I was wondering which IO scheduler you had used. Going out of your way to provide elevator seeks wouldn't help a ramdisk and NOOP would probably be a better strategy for the ramdisk. As far as I know, though, you can only set one scheduler on the system (as opposed to one scheduler per device) so you might lose overall by setting noop for the system.

In these days in which an SSD is also a plausible choice, and can be used to good effect in, eg, ZFS in enhancing overall array performance, I wonder whether this is optimal, although it probably only matters in odd corner cases.

timsmall

Can someone please start making cheap pci-e battery-backed ram disks, so that we can finally throw all those flakey proprietary hardware RAID cards in the bin, and just use software RAID?

Chicken and egg.

laytonjb

@mark_w: I\’m surprised as well. I\’m still trying to determine where the performance difference is coming from. Granted the differences are somewhat small, they are still noticeable.

BTW – you can change the IO scheduler on a per device basis. If you look back at some of the IO scheduler articles you will see how to do it.

@timsmall: Amen! The only option I know of is the ACARD ANS-9010 (http://www.acard.com/english/fb01-product.jsp?prod_no=ANS-9010&type1_title=%20Solid%20State%20Drive&idno_no=270). It's a little below $400 without the memory or flash card. It's not a bad solution but it's limited by the SATA connection (I'm hoping they come out with a 6Gbps SAS or SATA connection – even better would be a PCIe connection!)

dgc

Jeff,

I think the reason you are not seeing the external journal improve performance is that fdtree.bash is single threaded and CPU bound rather than being IO bound. An external journal only improves performance when the workload is IO bound.

I ran fdtree.bash on XFS to see how it compared to your ext4 numbers, but I got the same numbers on XFS. That raised a red flag – ext4 should wipe the floor with XFS in these tests. Notably, though, my own test scripts that do very similar operations give an order of magnitude better performance on XFS than fdtree.bash.

Just to check, I spent an hour and rewrote fdtree.bash in C and this version produces numbers that match my own test scripts. e.g. directory creates measuring about 5,000/s instead of 250/s that fdtree.bash was measuring….

IOWs, it appears that the benchmark is the problem, not the filesystem you are testing or your methodology. This is one of the reasons I always monitor system performance (CPU, memory usage, etc) while benchmarks are running so that I can tell when performance is not what it should be… ;)

Cheers,

Dave.

laytonjb

@dgc,

Wow! You did a great deal of work! I was looking at the benchmark a bit more and one thing I've noticed is that when I run it, all of the cores are busy. So if it's single threaded I should only see a single core loaded. The script uses recursion so I'm wondering if bash spawns threads (or forks) when it does recursion. Any ideas?

I\’m looking for better metadata codes (I really like the word benchmark but that\’s the concept). Do you have any recommendations? Metarates has been recommended to me – any experience with this one?

BTW – I ran the same basic tests with a wider range of options for the journal partition (disk and ramdisk). I also ran IOZone for the same configurations. Look for an upcoming article series on the results (I'm still post-processing them).

Dave – thanks again for your comments and insight. Greatly appreciated.

Jeff

craigecowen

Why not use tmpfs to create a larger ramdisk?

ssbrshei

Jeff,

1) It\’s not clear where you are benchmarking on a 32-bit or 64-bit kernel. Could you please repeat on the one that\’s missing?
1) If you created the partition sequentially, your journal partition, dev/sdb2, will reside on the inner tracks of the HD where it\’s much less efficient. If this is the case, can you rerun the benchmarks by moving the journal partition to the beginning of the HD?

Thanks


Given what I know of journalling filesystems, the issues of battery backed write caches and the ridiculous cost of battery-backed RAM, I was very interested in this article.

However, like mark_w, I find it amazing that there is not a clear performance benefit using ramfs. I also note that there is not a single mention of barriers (and particularly whether they were enabled or disabled at mount time) in the whole article!


The two disk-based journal sizes were 32768 x 4k blocks. That is 128M, while the ramdisk is 16M.
