Metadata Performance of Four Linux File Systems

Using the principles of good benchmarking, we explore the metadata performance of four Linux file systems using a simple benchmark, fdtree.

Each test was run 10 times for each of the four combinations (small or medium files, with a shallow or deep directory structure) and each of the four file systems (ext3, ext4, btrfs, nilfs2). The test system ran a stock CentOS 5.3 distribution, but with a 2.6.30 kernel, and e2fsprogs upgraded to the latest version as of this writing, 1.41.9. The tests were run on the following system:


  • Gigabyte GA-MA78GM-US2H motherboard
  • An AMD Phenom II X4 920 CPU
  • 8GB of memory
  • Linux 2.6.30 kernel
  • The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
  • /home is on a Seagate ST1360827AS
  • There are two drives for testing. They are Seagate ST3500641AS-RK with 16 MB cache each. These are /dev/sdb and /dev/sdc.

Only the first Seagate drive, /dev/sdb, was used for all of the tests.

For all four file systems, the defaults were used when building the file systems. For btrfs, btrfs-progs v0.18 was used; for nilfs2, nilfs-utils 2.0.14. Both ext3 and ext4 were mounted with “data=ordered” since this is recommended practice to prevent data loss.
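To make the setup concrete, here is a minimal sketch of how each file system might be built and mounted, followed by a representative fdtree run. The device name, mount point, and fdtree parameter values are illustrative assumptions; the article does not list the exact values used.

    # Hypothetical setup -- /dev/sdb1 and /mnt/test are placeholders.
    # Build each file system with its defaults:
    mkfs.ext3 /dev/sdb1        # likewise mkfs.ext4, mkfs.btrfs, mkfs.nilfs2

    # ext3 and ext4 were mounted with data=ordered:
    mount -t ext3 -o data=ordered /dev/sdb1 /mnt/test

    # One of the 10 runs of one test combination. fdtree sizes files in
    # 4 KiB blocks, so -s 1 gives 4 KiB files and -s 1024 gives 4 MiB
    # files; -l, -d, and -f control the depth and width of the tree.
    cd /mnt/test
    /path/to/fdtree.bash -l 2 -d 10 -f 100 -s 1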

Benchmark Results

This section presents the results of the testing (exploration). The results are presented in tables listing, for each test, the average value with the standard deviation in parentheses next to it. This is done for each of the four combinations and each of the four file systems.
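For reference, each table entry can be computed from the 10 run times with a one-liner like the sketch below (the file name is an assumption, and the population standard deviation is used):

    # times.txt: one elapsed time (seconds) per line, one line per run.
    awk '{ s += $1; ss += $1 * $1; n++ }
         END { m = s / n; printf "mean=%.2f sd=%.2f\n", m, sqrt(ss / n - m * m) }' times.txt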

The first combination tested was small files (4 KiB) with a shallow directory structure. Table 1 below lists the results; each entry is the average value with the standard deviation in parentheses.

Table 1 – Benchmark Times Small Files (4 KiB) – Shallow Directory Structure

File System | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.)
ext3        | 13.00 (3.61)             | 342.90 (42.69)      | 69.40 (6.92)        | 1.30 (0.46)
ext4        | 10.60 (0.92)             | 327.20 (4.89)       | 58.10 (1.87)        | 1.40 (0.92)
btrfs       |  8.80 (0.40)             | 335.00 (1.00)       | 65.30 (0.78)        | 1.40 (0.66)
nilfs2      |  9.10 (0.30)             | 345.70 (8.14)       | 51.60 (0.92)        | 1.20 (0.40)

The first test, directory creates, had an average run time of about 13 seconds or less for all four file systems, so the results may not be that meaningful. In addition, the directory remove test ran in about 1.5 seconds or less. Consequently, these two tests may not have much value.

Table 2 below lists the performance results; each entry is the average value with the standard deviation in parentheses.

Table 2 – Performance Results of Small Files (4 KiB) – Shallow Directory Structure

File System | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec)
ext3        | 695.20 (177.37)             | 993.70 (94.66)          | 3,975.90 (378.91)     | 4,900.30 (473.28)       | 7,578.80 (1,684.40)
ext4        | 800.00 (69.88)              | 1,029.10 (15.21)        | 4,118.30 (60.59)      | 5,803.40 (111.90)       | 7,368.30 (2,157.37)
btrfs       | 958.40 (46.80)              | 1,005.00 (3.00)         | 4,021.70 (12.01)      | 5,167.10 (78.22)        | 7,017.40 (2,174.42)
nilfs2      | 925.70 (27.90)              | 974.70 (21.67)          | 3,889.20 (88.54)      | 6,529.20 (112.75)       | 7,578.80 (1,684.40)

The second combination tested was small files (4 KiB) with a deep directory structure. Table 3 below lists the benchmark times; each entry is the average value with the standard deviation in parentheses.

Table 3 – Benchmark Times Small Files (4 KiB) – Deep Directory Structure

File System | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.)
ext3        | 46.20 (26.97)            | 182.40 (72.55)      | 53.70 (24.78)       | 14.60 (7.55)
ext4        | 187.00 (11.22)           | 443.20 (7.69)       | 192.50 (12.51)      | 73.30 (42.09)
btrfs       | 102.40 (0.66)            | 398.60 (1.91)       | 132.50 (0.67)       | 38.10 (0.70)
nilfs2      | 108.20 (2.68)            | 417.30 (6.48)       | 122.10 (3.39)       | 37.20 (0.60)

For these tests, the first test, directory creates, took about 46 seconds for ext3 (the fastest). This time is fairly small and, consequently, the results may not be as applicable as the other tests, which had much longer run times. The last test, directory removes, took about 15 seconds for ext3 (the fastest). Again, this time is fairly quick, so the results may not be as useful because the time is so short.

Table 4 below lists the performance results; each entry is the average value with the standard deviation in parentheses.

Table 4 – Performance Results of Small Files (4 KiB) – Deep Directory Structure

File System | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec)
ext3        | 783.90 (39.08)              | 927.90 (16.58)          | 3,713.00 (65.88)      | 3,180.70 (209.90)       | 2,452.40 (207.90)
ext4        | 475.00 (29.05)              | 799.10 (13.45)          | 3,198.00 (53.73)      | 1,848.00 (124.31)       | 1,539.60 (201.76)
btrfs       | 864.30 (5.76)               | 888.10 (4.23)           | 3,554.80 (16.92)      | 2,673.60 (13.87)        | 2,324.90 (42.57)
nilfs2      | 818.60 (19.11)              | 848.50 (12.71)          | 3,396.40 (51.73)      | 2,903.60 (75.52)        | 2,380.80 (36.60)

The third combination tested was medium files (4 MiB) with a shallow directory structure. Table 5 below lists the benchmark times; each entry is the average value with the standard deviation in parentheses.

Table 5 – Benchmark Times Medium Files (4 MiB) – Shallow Directory Structure

File System | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.)
ext3        | 0.30 (0.46)              | 174.90 (17.46)      | 17.40 (3.47)        | 0.00 (0.00)
ext4        | 0.20 (0.40)              | 156.80 (4.75)       | 11.80 (2.99)        | 0.20 (0.40)
btrfs       | 0.50 (0.50)              | 114.40 (1.11)       | 15.60 (0.49)        | 0.10 (0.30)
nilfs2      | 0.70 (0.78)              | 196.30 (3.07)       |  7.50 (2.87)        | 0.20 (0.40)

For these tests, the first test, directory creates, took less than 1 second. This time is very small and, consequently, the results are not as applicable as some of the other tests. The file remove test took about 8-17 seconds. Again, this is a very short time, and the results may not be as applicable. The last test, directory removes, took 0.2 seconds or less. This time, too, is very short.

Table 6 below lists the performance results; each entry is the average value with the standard deviation in parentheses.

Table 6 – Performance Results of Medium Files (4 MiB) – Shallow Directory Structure

File System | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec)
ext3        | 92.10 (140.69)              | 17.30 (1.90)            | 70,889.80 (6,798.06)  | 182.30 (32.53)          | 0.00 (0.00)
ext4        | 61.40 (122.80)              | 18.90 (0.54)            | 78,393.20 (2,252.90)  | 278.30 (75.69)          | 61.40 (122.80)
btrfs       | 153.50 (153.50)             | 26.20 (0.60)            | 107,342.50 (1,063.70) | 196.20 (6.37)           | 30.70 (92.10)
nilfs2      | 122.70 (133.80)             | 15.00 (0.00)            | 62,572.00 (968.91)    | 442.50 (90.62)          | 61.40 (122.80)

The fourth and final combination tested was medium files (4 MiB) with a deep directory structure. Table 7 below lists the benchmark times; each entry is the average value with the standard deviation in parentheses.

Table 7 – Benchmark Times Medium Files (4 MiB) – Deep Directory Structure

File System | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.)
ext3        | 2.70 (0.78)              | 248.30 (9.99)       | 18.80 (4.07)        | 1.80 (1.08)
ext4        | 3.20 (0.75)              | 219.50 (1.12)       | 13.40 (4.72)        | 1.20 (0.40)
btrfs       | 2.40 (0.49)              | 159.30 (1.42)       | 16.20 (1.17)        | 1.10 (0.30)
nilfs2      | 2.50 (0.50)              | 287.70 (10.67)      | 11.50 (0.50)        | 1.40 (0.49)

The first test, directory creates, took 2-3 seconds, which is very short. The time for the third test, file removes, was also fairly short at 11-19 seconds. The last test, directory removes, was extremely fast at less than 2 seconds. These three results are somewhat suspect because of the short run times.

Table 8 below lists the performance results; each entry is the average value with the standard deviation in parentheses.

Table 8 – Performance Results of Medium Files (4 MiB) – Deep Directory Structure

File System | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec)
ext3        | 818.30 (213.10)             | 16.20 (0.60)            | 66,053.10 (2,515.72)  | 225.60 (35.42)          | 1,518.00 (658.48)
ext4        | 671.70 (147.98)             | 18.10 (0.30)            | 74,607.50 (380.54)    | 331.50 (112.06)         | 1,842.20 (409.60)
btrfs       | 886.60 (167.06)             | 25.20 (0.40)            | 102,807.40 (917.56)   | 253.20 (17.72)          | 1,944.60 (307.20)
nilfs2      | 852.50 (170.50)             | 13.70 (0.64)            | 56,998.60 (2,122.26)  | 356.50 (15.50)          | 1,637.40 (501.66)

Discussion of Results

Four different combinations were tested for each of the four file systems. Comparing the file systems is interesting, but comparing the same file system across the different tests is interesting as well.

First, let's examine the shallow directory structure results (Tables 2 and 6). For small files (4 KiB), the four file systems performed about the same on the file create and file remove tests (the directory create and remove tests ran too quickly to be really useful). All four file systems achieved about 1,000 file creates per second, or about 4,000 KiB per second (see Table 2). But medium files (4 MiB) produced very different results. For this case, btrfs was almost twice as fast as nilfs2 and 50% faster than ext3 or ext4 with respect to file creates per second and throughput (KiB/s) (see Table 6).

However, the small file case produced 109X more files and 27X more directories, but only about 1/10th the total amount of data. This points out the extreme pressure that small files put on the metadata performance of file systems.
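As a rough sanity check, the file counts can be estimated from the ext3 file-create rates and elapsed times in Tables 2 and 6 (rate times time); the small differences from the quoted figures come from rounding and per-file-system variation:

    # Estimated total files created = (files/sec) x (elapsed secs):
    echo "993.7 * 342.9" | bc   # small files (Table 2):  ~340,700 files
    echo "17.3 * 174.9"  | bc   # medium files (Table 6): ~3,000 files
    # ~340,700 x 4 KiB ~= 1.3 GiB written, versus ~3,000 x 4 MiB ~= 11.8 GiB:
    # roughly 100x the files but only about one-tenth the data.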

Second, we can perform the same comparison for the deep directory structure (Tables 4 and 8). Small files (4 KiB) put extreme pressure on file system metadata performance, just as with the shallow directory structure.

Examining the results, the following observations are made:


  • Small files put extreme pressure on metadata performance regardless of file system

  • For small files, a shallow or deep directory structure did not appreciably impact metadata performance

  • For larger files, a shallow or deep directory structure also did not appreciably impact metadata performance

  • For small files, btrfs has good file creation performance but file removal performance is not as good as ext3 and ext4 at this time

  • For larger files, btrfs has both excellent file creation and removal performance relative to the other 3 file systems

  • Log-based file systems such as nilfs2 should do well on metadata tests, but the developers are still evolving the garbage collection (GC) algorithm, which should improve performance.

  • The standard deviation, or the spread in the data, was much greater for ext3 than for the other three file systems. The reason(s) for this are not known.

This is the first attempt at useful benchmarks and analysis for Linux file systems using the approach outlined in a previous article on benchmarking. There will be future articles that use the same basic tenets. But be ready, because this approach introduces a great deal of data into the article. Overall, though, it gives much more information than a quick table and a conclusion that “file system X is better” (usually followed by a run for cover). Please let me know in the forums if this type of article is useful.

Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Fry's enjoying the coffee and waiting for sales (but never during working hours).

Comments on "Metadata Performance of Four Linux File Systems"

typhoidmary

I thought this was interesting and useful. I would like to know how much hardware makes a difference: SATA vs. PATA vs. SCSI. Do different chipsets perform differently? And of course different drives with different cache sizes, RPMs, etc.

mbainter

Definitely interesting, though I would've liked to have seen xfs and the reiser3/reiser4 filesystems compared as well. There are some significant differences there that are worth considering.

You should also include large files. Particularly with the advent of media servers and the like, being able to perform efficiently with large files is important, and that's not covered here.

Last but not least, I'd like to see some comparison of storage efficiency for these different types of files. If you can move them fast, that's great, but if you can't store a particular type of file efficiently and I'm going to lose, say, 20% of my storage because of it, that's important to consider when making the choice.

laytonjb

In general I agree with both your comments (typhoidmary and mbainter). But let me comment really quickly on the details.

@typhoidmary:
I would love to test different chipsets and different drives. I just need the money to buy it :)

BTW – thanks for all of your comments. I've noticed you read my articles and post comments. That's always appreciated.

@mbainter:
I wanted to test xfs and the reiser file systems, but I ran out of time and the article was getting a little long. I will try to do a follow-up at some point with those numbers (maybe next week).

I also didn't do large files (400 MiB+?) because of time, but I do want to do those runs.

For both of you – thanks for the comments.

Jeff

chrisjoelly

Thanks for that comparison.

Is it possible to include some other, not so often used, filesystems as well? E.g., GFS or GFS2 with various storage systems below, like DRBD? And tuning opportunities for filesystems in typical scenarios would be a great article too :-)

Chris

mdavid

hi Jeff
I have read the article carefully, and have also read the review article about 9 years of FS and Storage benchmarking.

Let me make some remarks which I hope are constructive criticism.
I have downloaded the fdtree source.
First off, it's a single-threaded benchmark, and your machine has 8 GB of RAM.

In my opinion, and from experience (I have also done some FS benchmarking), tests with a total file size <= 1.5x the amount of RAM can go to caches first. Your tests with the small sizes amount to around 1.3-1.4 GB, while the tests with 4 MB files total around 12-16 GB.

I think the results of "creation" are not so bound by caches, while removal can be cached, and that's why I think removal of files and dirs has timings which are quite small, to the extent of not being able to draw conclusions in certain cases.

For metadata, and AFAIK, each file or dir has 4 KB for the inode (at least for ext3; I don't know for the others). One could imagine testing "pure" metadata with a "touch" *nix command, instead of dd, being careful to make a total of 12 GB / 4 KB = 3 million files+dirs, for example.
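A minimal sketch of such a touch-based test (the mount point, layout, and counts here are only illustrative):

    # Create ~3 million empty files: pure metadata traffic, no file data.
    cd /mnt/test
    for d in $(seq 1 3000); do
        mkdir dir.$d
        # 1,000 empty files per directory -> 3,000,000 inodes in total
        ( cd dir.$d && touch file.{1..1000} )
    done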

Furthermore, you mention in the beginning why you are benchmarking metadata, but fdtree completely misses one very important operation, "stat", or a read of the inode; there are some workloads where you write once and read many (even if it's small files).

This leads to some suggestions:
I have recently used bonnie++ 1.03e, which also does metadata benchmarking, including the "stat" operation.
The iozone tarball includes an executable called fileop (though I have never tried it).

Though I imagine that you don't have a long time to do the testing (as some of us do), just to give you an example: in one of my recent tests with the above bonnie++ version, each run could take between 2 and 4 hours depending on the filesystem, and I also ran it 10 times.

Finally, if in your future tests you include the read/stat operation, just try mounting the FS with and without atime,diratime.

OK, that's it. Sorry if I was too obvious in some things, or too strong; as I said, I tried to be constructive. Continue your good work.

I read your other article at a time when I was starting some benchmarks, and I stopped to read it first.

regards

Mario David

laytonjb

@mdavid,

I think you have some interesting points but let me explain a few things.

fdtree, while a simple bash benchmark, also uses all of the cores on my test box. While I didn't show the image, I have a picture of gkrellm while the benchmark is running; all 4 cores are being used. I'm not entirely sure how this works, but I think it's because of the recursion in the script. But this shows how little I know about bash.

Second, fdtree is not an all-encompassing benchmark. It only tests file and directory creates and removes in a specific order. I'm hoping to test another benchmark named mdtree, which also stresses other aspects of metadata performance.

Third, to be honest, I'm not sure about the caching aspect of fdtree. Linux might cache the file operations, but since there are so many, I'm not sure if it does or doesn't. Perhaps the recursion affects the caching. Something to look into (thanks for pointing that out).
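For what it's worth, one common way to take the page cache out of the picture between runs (as root) is:

    # Flush dirty data, then drop the page cache, dentries, and inodes:
    sync
    echo 3 > /proc/sys/vm/drop_caches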

One thing I didn't do and should have done was to watch the CPU load during the runs. I sort of watched it using gkrellm, but I didn't gather any statistics.

But you do correctly point out that almost any benchmark doesn't stress all of the aspects you are interested in; fdtree, as you note, doesn't stress stat. Other benchmarks will stress the file systems in a different manner. For example, as you mention, Bonnie++ does stress metadata operations and is perhaps a reasonable benchmark to test.

Thanks for your comments. They are really appreciated. Don't hesitate to post.

Thanks!

Jeff
