Improving MetaData Performance of the Ext4 Journaling Device

In the never-ending quest for more performance, we examine three different journaling device options for ext4 with an eye toward improving metadata performance. Who doesn't like speed?

The pursuit of more performance from our storage systems is relentless. This includes more performance from hardware (faster disks, SSDs), networks (bigger pipes, larger MTUs), operating systems (caching, IO schedulers), and file systems. There are many levers that can be moved to improve performance, but this article looks at one particular piece: the file system journal device. In particular, it considers the metadata performance of ext4 as the journal is moved to different devices.

Journaling for File Systems

Sometimes bad things such as power failures happen to systems. A power interruption can corrupt a file system very quickly because an IO operation is interrupted before it completes. Consequently, the file system has to be checked with fsck, which walks the entire file system to find and correct any problems. As file systems grew, the time needed to walk them increased greatly. For example, the author remembers performing an fsck on a 1TB file system in 2002-2003 that took several days. Having the system down for that long is very painful.

One way to help improve fsck times is to use a journaled file system. Rather than IO operations happening directly to the file system, the operations are added to the journal (typically a log) in the order they are supposed to happen. Then the file system grabs the operation from the head of the journal and completes it, erasing the operation from the journal only after the operation is finished and the file system is satisfied that the operation is complete.

If the power is lost during the operation on a journaled file system, when the system comes back up, the journal is just “replayed,” i.e. the operations in the journal are performed one at a time starting at the beginning. This means that the entire file system doesn’t necessarily have to be checked (walked). The primary reason this can be done is that the interruption happens before the operation is removed from the journal. Even if the operation wasn’t completed on the file system, replaying the operation ensures that the IO operation actually occurs. If the interruption happened while the operation was being deleted from the journal, the file system can assume that the operation happened and it just deletes the “corrupted” operation from the head of the journal. As a result, you should not have to walk the entire file system to repair problems. Only the journal needs to be replayed. This means that instead of spending a couple of days waiting for an fsck to finish, a very fast replay of the journal is performed taking just minutes.
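The replay idea can be illustrated with a toy model. This is only a sketch under loud assumptions: the "file system" is a plain directory, the journal is a text file with one pending operation per line in a made-up name=contents format, and the function names apply_op and replay_journal are hypothetical. Real ext4 journaling works on blocks and transactions, not on lines of text.

```shell
#!/bin/sh
# Toy model only: the "file system" is a directory and the journal is a text
# file holding one pending operation per line in the form name=contents.

# Apply one journaled operation to the file system directory.
# Writing the same contents twice is harmless, so replaying is safe to repeat.
apply_op() (
    fs_dir=$1
    op=$2
    # Text before the first "=" is the file name, text after it is the contents.
    printf '%s' "${op#*=}" > "$fs_dir/${op%%=*}"
)

# Replay every operation still in the journal, oldest first, then empty it.
replay_journal() (
    fs_dir=$1
    journal=$2
    while IFS= read -r op || [ -n "$op" ]; do
        if [ -n "$op" ]; then
            apply_op "$fs_dir" "$op"
        fi
    done < "$journal"
    # Only after all operations are applied is the journal cleared.
    : > "$journal"
)
```

To mimic a crash, write two operations to the journal, apply only the first, and then call replay_journal: the first operation is harmlessly repeated, the second is completed, and nothing outside the journal had to be scanned.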

The journal can theoretically reside anywhere within the system on any device. It can be on the drive containing the file system, on a partition on another drive, or on any other block device you have lying around. But choosing the “best device” matters. The journal is critical to the integrity of the file system, so it should sit on a device with some resiliency (resiliency in this case means the ability to tolerate errors or problems). At the same time, everyone loves more performance (there is likely no one who has said, “you know, I want my storage to go slower.”). Since the performance of the journal can be key to the performance of the file system, perhaps improving the performance of the journaling device and the journal itself can help overall file system performance.

Testing the Metadata Performance

In this article three options for the journal device will be tested to determine the impact of journal device location on the metadata performance of ext4. The three device options are:

  • Journal on the same disk as the file system
  • Journal on a different disk from the file system
  • Journal on a ram disk

The last option, using a ramdisk for the journal, is designed to measure the pinnacle of performance. But it is not likely to be the most resilient solution (it would be better to use a battery backup of the ram disk with the ability to dump it to a storage device, drive or SSD). However, it is included as an “upper bound” on performance.

One of the ways that journal performance can impact overall file system performance is in metadata performance. This article will focus on metadata performance as measured by fdtree. This benchmark has been used before to examine the metadata performance of various Linux file systems. To read about fdtree and how it was used for benchmarking, please see the original article.

As a quick recap, the benchmark, fdtree, is a simple bash script that performs four different metadata tests:

  • Directory creation
  • File creation
  • File removal
  • Directory Removal

It creates a specified number of files of a given size (in blocks) in a top-level directory. It then creates a specified number of sub-directories, recursing until a specified number of levels is reached, and populates each sub-directory with files.

fdtree was run in four configurations to stress the metadata capability:

  • Small files (4 KiB)
    • Shallow directory structure
    • Deep directory structure

  • Larger files (4 MiB)
    • Shallow directory structure
    • Deep directory structure

The two file sizes, 4 KiB (1 block) and roughly 4 MiB (1,000 blocks of 4 KiB, or 4,000 KiB), were used to get some feel for a range of performance as a function of the amount of data. The two directory structures were used to stress the metadata in different ways to discover whether the structure has any impact on metadata performance. The shallow directory structure has many directories at each level but few levels. The deep directory structure has few directories at each level but many levels.

The command lines for the four combinations are:

Small Files – Shallow Directory Structure

./fdtree.bash -d 20 -f 40 -s 1 -l 3

This command creates 20 sub-directories from each upper level directory at each level (“-d 20”) and there are 3 levels (“-l 3”). It’s a basic tree structure. This is a total of 8,421 directories. In each directory there are 40 files (“-f 40”) each sized at 1 block (4 KiB) denoted by “-s 1”. This is a total of 336,840 files and 1,347,360 KiB total data.

Small Files – Deep Directory Structure

./fdtree.bash -d 3 -f 4 -s 1 -l 10

This command creates 3 sub-directories from each upper level directory at each level (“-d 3”) and there are 10 levels (“-l 10”). This is a total of 88,573 directories. In each directory there are 4 files each sized at 1 block (4 KiB). This is a total of 354,292 files and 1,417,168 KiB total data.

Medium Files – Shallow Directory Structure

./fdtree.bash -d 17 -f 10 -s 1000 -l 2

This command creates 17 sub-directories from each upper level directory at each level (“-d 17”) and there are 2 levels (“-l 2”). This is a total of 307 directories. In each directory there are 10 files each sized at 1,000 blocks (4,000 KiB, roughly 4 MiB). This is a total of 3,070 files and 12,280,000 KiB total data.

Medium Files – Deep Directory Structure

./fdtree.bash -d 2 -f 2 -s 1000 -l 10

This command creates 2 sub-directories from each upper level directory at each level (“-d 2”) and there are 10 levels (“-l 10”). This is a total of 2,047 directories. In each directory there are 2 files each sized at 1,000 blocks (4,000 KiB, roughly 4 MiB). This is a total of 4,094 files and 16,376,000 KiB total data.
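The directory, file, and data totals quoted for the four command lines follow from simple arithmetic: the top-level directory counts as level 0, each level multiplies the directory count by the “-d” value, and every directory holds “-f” files of “-s” blocks at 4 KiB per block. The helper below is a sketch of that arithmetic only; the function name fdtree_totals is made up for illustration and is not part of fdtree.

```shell
#!/bin/sh
# Usage: fdtree_totals <dirs-per-level> <files-per-dir> <blocks-per-file> <levels>
# Prints: total directories, total files, total data in KiB.
fdtree_totals() (
    dirs_per_level=$1
    files_per_dir=$2
    blocks_per_file=$3
    levels=$4

    total_dirs=0
    dirs_at_level=1      # level 0 is the single top-level directory
    level=0
    while [ "$level" -le "$levels" ]; do
        total_dirs=$((total_dirs + dirs_at_level))
        dirs_at_level=$((dirs_at_level * dirs_per_level))
        level=$((level + 1))
    done

    total_files=$((total_dirs * files_per_dir))
    total_kib=$((total_files * blocks_per_file * 4))   # 4 KiB per block
    echo "$total_dirs $total_files $total_kib"
)
```

For example, `fdtree_totals 20 40 1 3` reproduces the small-file, shallow-structure figures of 8,421 directories, 336,840 files, and 1,347,360 KiB.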

Each of the four combinations was run 10 times for each of the three journal devices. The test system was a stock CentOS 5.3 distribution with the kernel upgraded to 2.6.30 and e2fsprogs upgraded to 1.41.9. The tests were run on the following system:

  • GigaByte MAA78GM-US2H motherboard
  • An AMD Phenom II X4 920 CPU
  • 8GB of memory
  • Linux 2.6.30 kernel
  • The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
  • /home is on a Seagate ST1360827AS
  • There are two drives for testing. They are Seagate ST3500641AS-RK with 16 MB cache each. These are /dev/sdb and /dev/sdc.

The first Seagate drive, /dev/sdb, held the file system for all of the tests. The second drive, /dev/sdc, was used only for the second option, where the journal was placed on a second drive.
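A test matrix like the one described (four fdtree combinations, ten runs each) could be driven by a small loop. The sketch below is an assumption about how such a driver might look, not the script actually used for this article: the function name fdtree_matrix and the RUNS, FDTREE, and DRY_RUN variables are invented here, and it says nothing about how the file system was prepared between runs. By default it only prints the command lines.

```shell
#!/bin/sh
# Hypothetical driver: prints (or runs) each fdtree combination RUNS times.
RUNS=${RUNS:-10}
FDTREE=${FDTREE:-./fdtree.bash}

fdtree_matrix() (
    for params in \
        "-d 20 -f 40 -s 1 -l 3" \
        "-d 3 -f 4 -s 1 -l 10" \
        "-d 17 -f 10 -s 1000 -l 2" \
        "-d 2 -f 2 -s 1000 -l 10"
    do
        run=1
        while [ "$run" -le "$RUNS" ]; do
            # $params is left unquoted on purpose so it splits into arguments.
            if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "$FDTREE" $params
            else
                "$FDTREE" $params
            fi
            run=$((run + 1))
        done
    done
)
```

Calling fdtree_matrix with the defaults lists 40 command lines; setting DRY_RUN=0 would execute them from the current directory instead.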

Journaling Device Details

All three journal device options used the same journal size, 16MB. This size was chosen because CentOS boots with a number of ramdisks already created, and these devices are limited to 16MB in size. To make the comparisons fair, the size of the journal was kept constant for all three cases.

The first journal device option was to keep the journal on the same disk as the file system. The drive was partitioned so that the first partition was used for the file system itself (/dev/sdb1) and the remaining approximately 16MB of the drive was used for the journal (/dev/sdb2). The first step was to build the file system on /dev/sdb1.

# mke2fs -t ext4 /dev/sdb1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
29548544 inodes, 118180156 blocks
5909007 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3607 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

The second step is to prepare the journal partition for journaling. Recall that the second partition on the drive (/dev/sdb2) is used for this.

# mke2fs -O journal_dev /dev/sdb2
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 6024 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done

The third step is to tell the file system that it no longer has a journal in the file system (this is a precursor to telling it that the journal is located somewhere else).

# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          99486587-5d38-4896-bf0a-ec79f9ac1d88
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              29548544
Block count:              118180156
Reserved block count:     5909007
Free blocks:              116307702
Free inodes:              29548533
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      995
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Mon Dec  7 11:07:20 2009
Last mount time:          n/a
Last write time:          Mon Dec  7 11:10:12 2009
Mount count:              0
Maximum mount count:      36
Last checked:             Mon Dec  7 11:07:20 2009
Check interval:           15552000 (6 months)
Next check after:         Sat Jun  5 12:07:20 2010
Lifetime writes:          7350 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      ed707821-9ec0-44c7-9c4a-15812b753939
Journal backup:           inode blocks

Notice that the line “Filesystem features” does not have the entry “has_journal” indicating that the file system no longer has a journal. The last step is to tell the file system that it has a journal and it is on the second partition of the drive.

# tune2fs -o journal_data -j -J device=/dev/sdb2 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/sdb2: done
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          99486587-5d38-4896-bf0a-ec79f9ac1d88
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    journal_data
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              29548544
Block count:              118180156
Reserved block count:     5909007
Free blocks:              116307702
Free inodes:              29548533
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      995
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Mon Dec  7 11:07:20 2009
Last mount time:          n/a
Last write time:          Mon Dec  7 11:11:12 2009
Mount count:              0
Maximum mount count:      36
Last checked:             Mon Dec  7 11:07:20 2009
Check interval:           15552000 (6 months)
Next check after:         Sat Jun  5 12:07:20 2010
Lifetime writes:          7350 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal UUID:             b71b315f-40e8-4e93-b868-7ad19f7fee8b
Journal device:           0x0812
Default directory hash:   half_md4
Directory Hash Seed:      ed707821-9ec0-44c7-9c4a-15812b753939
Journal backup:           inode blocks

Notice that the line “Filesystem features” has the value “has_journal” and that the line “Journal device:” has a value 0x0812 that is pointing to the second partition on the drive.
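Both checks can be scripted rather than read by eye. The two helpers below are a sketch with invented names (has_journal_feature and decode_journal_dev). The first reads tune2fs -l output on standard input and succeeds only if “has_journal” appears on the “Filesystem features” line. The second splits the “Journal device” value into major and minor numbers, assuming the classic layout of an 8-bit major above an 8-bit minor, which holds for the small values seen here. SCSI/SATA disks use major 8 with 16 minors per disk, so minor 18 corresponds to /dev/sdb2.

```shell
#!/bin/sh
# Succeeds if the "Filesystem features" line read from stdin lists has_journal.
has_journal_feature() {
    grep '^Filesystem features:' | grep -qw 'has_journal'
}

# Split a "Journal device" value such as 0x0812 into major and minor numbers.
# Assumes an 8-bit major number stored above an 8-bit minor number.
decode_journal_dev() (
    dev=$(($1))
    echo "major=$(( (dev >> 8) & 255 )) minor=$(( dev & 255 ))"
)
```

A typical use would be `tune2fs -l /dev/sdb1 | has_journal_feature && echo "journal present"`, followed by `decode_journal_dev 0x0812` to confirm which partition holds the journal.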

The second journal device option where the journal is placed on a second hard drive is created using several steps. The first step is to create the file system on /dev/sdb1.

# mke2fs -t ext4 /dev/sdb1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
30531584 inodes, 122096000 blocks
6104800 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3727 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

The second step is to create a journal on the second drive, /dev/sdc1. This partition was created to be 16MB in size.

# mke2fs -O journal_dev /dev/sdc1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 6016 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done

The third step is to then use tune2fs to tell the file system that it doesn’t have a journal on /dev/sdb1.

# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)

# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          14a11690-76a6-4a3d-997a-abf85bd4d4ad
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              30531584
Block count:              122096000
Reserved block count:     6104800
Free blocks:              120161866
Free inodes:              30531573
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      994
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Sun Dec  6 07:22:57 2009
Last mount time:          n/a
Last write time:          Sun Dec  6 07:26:36 2009
Mount count:              0
Maximum mount count:      28
Last checked:             Sun Dec  6 07:22:57 2009
Check interval:           15552000 (6 months)
Next check after:         Fri Jun  4 08:22:57 2010
Lifetime writes:          7590 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      7d24bc9d-db4a-4c0c-b15d-f0959af6edde
Journal backup:           inode blocks

Notice on the line “Filesystem features” that the feature “has_journal” is not listed. This indicates that the journal has been “removed” from the file system. The final step is to tell the file system that it has a journal that is on a specific device, in this case /dev/sdc1.

# tune2fs -o journal_data -j -J device=/dev/sdc1 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/sdc1: done
This filesystem will be automatically checked every 28 mounts or

# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          14a11690-76a6-4a3d-997a-abf85bd4d4ad
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    journal_data
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              30531584
Block count:              122096000
Reserved block count:     6104800
Free blocks:              120161866
Free inodes:              30531573
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      994
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Sun Dec  6 07:22:57 2009
Last mount time:          n/a
Last write time:          Sun Dec  6 07:27:20 2009
Mount count:              0
Maximum mount count:      28
Last checked:             Sun Dec  6 07:22:57 2009
Check interval:           15552000 (6 months)
Next check after:         Fri Jun  4 08:22:57 2010
Lifetime writes:          7590 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal UUID:             c3d3c7e7-f465-41c7-a556-80a9cdc865c3
Journal device:           0x0821
Default directory hash:   half_md4
Directory Hash Seed:      7d24bc9d-db4a-4c0c-b15d-f0959af6edde
Journal backup:           inode blocks

Looking through the listing you can see that the file system has a journal again (“has_journal” on the line “Filesystem features”) and that the journal device is listed as “0x0821” near the bottom of the listing.
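The same-disk and second-disk options follow the same four steps and differ only in the journal partition. As a compact recap, the sketch below prints those four commands for a given pair of devices. The function name external_journal_commands is made up for this illustration, and nothing is executed: mke2fs destroys existing data, so the output should be reviewed before any of it is run as root.

```shell
#!/bin/sh
# Usage: external_journal_commands <file-system-partition> <journal-partition>
# Prints the four commands from the walkthrough; it does not run them.
external_journal_commands() (
    fs_dev=$1
    journal_dev=$2
    echo "mke2fs -t ext4 $fs_dev"                 # 1. build the file system
    echo "mke2fs -O journal_dev $journal_dev"     # 2. prepare the journal device
    echo "tune2fs -O ^has_journal $fs_dev"        # 3. drop the internal journal
    echo "tune2fs -o journal_data -j -J device=$journal_dev $fs_dev"  # 4. attach
)
```

For the second option, `external_journal_commands /dev/sdb1 /dev/sdc1` lists the exact sequence shown above.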

The third journal device option is to place it on a ram drive. It is done in a similar fashion to the previous option where the journal was put on a second drive. But recall that the external journal has to be a block device. The technique used for a ramdisk block device is fairly simple and is based on this article. Despite the article being based on a 2.4 kernel, the techniques are the same.

The first step is to examine which ramdisks have already been created.
