Turning from metadata performance to throughput performance, we examine the impact of journal size on ext4 when the journal is disk-based. Dig into the numbers and see what you can do to improve throughput performance.
Metadata performance is one of the most overlooked aspects of file system performance. It can have a tremendous impact on how the performance of a file system “feels” and, perhaps more importantly, can affect the execution time of applications that perform a great many metadata operations (and, believe me, there are some). However, there are also applications that stream a great deal of data to the file system. Consequently, they are driven by throughput performance, not metadata performance. Besides, most people quote the performance of systems in terms of throughput, so it’s always good to present results that people understand intuitively.
In this article we use IOzone to measure throughput performance for various journal sizes when the journal is located on a separate disk. The block size is also varied to understand the impact of block size on throughput performance. The results can be a bit more complex to interpret because we now have two variables: journal size and block size. In addition, IOzone runs 13 different I/O tests as configured for this article.
It’s been a while since IOzone has been used in an article here, so let’s begin with a quick review.
IOzone is one of the more popular throughput benchmarks. It’s open source and is written in very plain ANSI C. It’s capable of single thread, multi-threaded, and multi-client testing. The basic concept of IOzone is to break up a file of a given size into records. Records are written or read in some fashion until the file size is reached. Using this concept, IOzone has a number of tests that can be performed:
- Write
This is a fairly simple test that simulates writing to a new file. Because of the need to create new metadata for the file, many times the writing of a new file can be slower than rewriting to an existing file. The file is written using records of a specific length (either specified by the user or chosen automatically by IOzone) until the total file length has been reached.
- Re-write
This test is similar to the write test but measures the performance of writing to a file that already exists. Since the file already exists and the metadata is present, it is commonly expected for the re-write performance to be greater than the write performance. This particular test opens the file, puts the file pointer at the beginning of the file, and then writes to the open file descriptor using records of a specified length until the total file size is reached. Then it closes the file which updates the metadata.
- Read
This test reads an existing file. It reads the entire file, one record at a time.
- Re-read
This test reads a file that was recently read. This test is useful because operating systems and file systems will maintain parts of a recently read file in cache. Consequently, re-read performance should be better than read performance because of the cache effects. However, sometimes the cache effect can be mitigated by making the file much larger than the amount of memory in the system.
- Random Read
This test reads a file with the accesses being made to random locations within the file. The reads are done in record units until the total reads are the size of the file. The performance of this test is impacted by many factors including the OS cache(s), the number of disks and their configuration, disk seek latency, and disk cache among others.
- Random Write
The random write test measures the performance when writing a file with the accesses being made to random locations within the file. The file is opened to the total file size and then the data is written in record sizes to random locations within the file.
- Backwards Read
This is a unique file system test that reads a file backwards. There are several applications, notably, MSC Nastran, that read files backwards. There are some file systems and even OS’s that can detect this type of access pattern and enhance the performance of the access. In this test a file is opened and the file pointer is moved 1 record forward and then the file is read backward one record. Then the file pointer is moved 2 records forward in the file, and the process continues.
- Record Rewrite
This test measures the performance when writing and re-writing a particular spot within a file. The test is interesting because it can highlight “hot spot” capabilities within a file system and/or an OS. If the spot is small enough to fit into the various caches (CPU data cache, TLB, OS cache, file system cache, etc.), then the performance will be very good.
- Strided Read
This test reads a file in what is called a strided manner. For example, you could read at a file offset of zero for a length of 4 Kbytes, then seek 200 Kbytes forward, then read for 4 Kbytes, then seek 200 Kbytes, and so on. The constant pattern is important and the “distance” between the reads is called the stride (in this case it is 200 Kbytes). This access pattern is used by many applications that are reading certain data structures. This test can highlight interesting issues in file systems and storage because the stride could cause the data to miss any striping in a RAID configuration, resulting in poor performance.
- fwrite
This test measures the performance of writing a file using the library function “fwrite()”. It is a binary stream function (examine the man pages on your system to learn more). Equally important, the routine performs a buffered write operation. This buffer is in user space (i.e. not part of the system caches). This test is performed with a record-length buffer being created in user space and then written to the file. This is repeated until the entire file is created. This test is similar to the “write” test in that it creates a new file, possibly stressing the metadata performance.
- Re-fwrite
This test is similar to the “rewrite” test but uses the fwrite() library function. Ideally the performance should be better than “Fwrite” because it uses an existing file so the metadata performance is not stressed in this case.
- fread
This test uses the fread() library function to read a file. It opens a file and reads it in record lengths into a buffer that is in user space. This continues until the entire file is read.
- Re-fread
This test is similar to the “reread” test but uses the “fread()” library function. It reads a recently read file which may allow file system or OS cache buffers to be used, improving performance.
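To make the access patterns in the list above more concrete, here is a minimal Python sketch that only generates the file offsets each pattern would touch. It is not IOzone's code: the function names are made up for illustration, the random pattern assumes every record is visited exactly once, and the backward pattern is simplified to visiting records from last to first.

```python
import random

def sequential_offsets(file_size, record_size):
    """Offsets for the plain write/read pattern: one record after another."""
    return list(range(0, file_size, record_size))

def random_offsets(file_size, record_size, seed=0):
    """Assumption: every record is visited exactly once, in shuffled order."""
    offsets = sequential_offsets(file_size, record_size)
    random.Random(seed).shuffle(offsets)
    return offsets

def backward_offsets(file_size, record_size):
    """Simplified backward read: last record first, first record last."""
    return sequential_offsets(file_size, record_size)[::-1]

def strided_offsets(file_size, record_size, gap):
    """Read one record, seek `gap` bytes forward, read again, and so on."""
    step = record_size + gap
    return list(range(0, file_size - record_size + 1, step))

if __name__ == "__main__":
    kb = 1024
    print(sequential_offsets(64 * kb, 16 * kb))
    print(backward_offsets(64 * kb, 16 * kb))
    # The 4 Kbyte read / 200 Kbyte seek example from the strided read test
    print(strided_offsets(1024 * kb, 4 * kb, 200 * kb))
```

In a real test each offset would be followed by a seek and a read or write of one record; only the ordering of the offsets differs between the patterns.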
There are other options that can be tested, but for this exploration only the previously mentioned tests will be examined. However, even this list of tests is fairly extensive and covers a large number of application access patterns that you are likely to see (but not all of them).
There are a large number of command line options available for IOzone, far more than will be covered here. The next section will present the test system as well as the specific IOzone commands used.
The tests were run on the same system as the metadata tests. The highlights of the system are:
- GigaByte MAA78GM-US2H motherboard
- An AMD Phenom II X4 920 CPU
- 8GB of memory (DDR2-800)
- Linux 2.6.30 kernel (with reiser4 patches only)
- The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
- /home is on a Seagate ST1360827AS
- There are two drives for testing. They are Seagate ST3500641AS-RK with 16 MB cache each. These are
Only the first Seagate drive, /dev/sdb, was used for the file system. The second hard drive, /dev/sdc, was used for the journal portion of the file system. It was partitioned to the correct size and only that partition was used for the journal (
For IOzone the system specifications are fairly important. In particular, the amount of system memory is important because it can have a large impact on caching effects. If the problem sizes are small enough to fit into the system or file system cache (even partially), the results can be skewed. Comparing the results from a system where cache effects are large to those from a system where they are small is comparing the proverbial apples to oranges. For example, if you run the same problem size on a system with 1GB of memory versus a system with 8GB, you will get very different results.
With that in mind, the next section presents the actual IOzone commands used in the testing.
IOzone Command Parameters
As mentioned previously, there are a huge number of options available with IOzone (that is one reason it is so popular and powerful). For this exploration, the basic tests run are: write, re-write, read, re-read, random read, random write, backwards read, record re-write, strided read, fwrite, refwrite, fread, and refread.
One of the most important considerations for this test is whether cache effects should be included in the results. Including cache effects can be very useful because it can point out certain aspects of the OS and file system cache sizes and how the caches function. On the other hand, including cache effects limits the usefulness of the data in comparison to other results.
For this article, cache effects will be limited as much as possible so that the impact of the file system designs on performance can be better observed. Cache effects can’t be eliminated entirely without running extremely large problems and forcing the OS to eliminate all caches. However, it is almost impossible to eliminate the hardware caches such as those in the CPU, so trying to eliminate all cache effects is virtually impossible (but never say never). But, one way to minimize the cache effects is to make the file size much bigger than the main memory. For this article, the file size is chosen to be 16GB which is twice the size of main memory. This is chosen arbitrarily based on experience and some urban legends floating around the Internet.
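The sizing rule described here (a file twice the size of main memory) can be sketched in a few lines of Python. The helper names are hypothetical, the memory query assumes a Linux-like system where `os.sysconf` exposes `SC_PHYS_PAGES` and `SC_PAGE_SIZE`, and rounding up to a whole gigabyte is an added convenience rather than something the article specifies.

```python
import os

GIB = 1024 ** 3

def benchmark_file_size(mem_bytes, factor=2):
    """File size in bytes: `factor` times memory, rounded up to a whole GiB."""
    whole_gib = -(-mem_bytes * factor // GIB)  # ceiling division
    return whole_gib * GIB

def physical_memory_bytes():
    """Installed memory as reported by the OS (assumes a Linux-like sysconf)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")

if __name__ == "__main__":
    # The test system has 8GB of memory, which gives the 16GB file used here.
    print(benchmark_file_size(8 * GIB) // GIB, "GB")
```

Note that the OS usually reports slightly less than the installed memory, so passing the result of `physical_memory_bytes()` may round to a different size than the nominal value.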
Recall that most of the IOzone tests break up a file into records of a specific length. For example, a 1GB file can be broken into 1MB records so there are a total of 1,024 records in the file. IOzone can either run an automatic sweep of record sizes or the user can fix the record size. If done automatically IOzone starts at 1KB (1,024 bytes) and then doubles the record size until it reaches a maximum of 16 MB (16,777,216 bytes). Optionally, the user can specify the lower record size and the upper record size and IOzone will vary the record sizes in between.
For this article, a 16GB file with a 1KB record size would mean 16,777,216 records for each of the 13 tests, and the run times would be very large. Following good benchmarking practice, where each test is run at least 10 times, the total run time would be so long that perhaps only 1 benchmark every 2-4 weeks could be published. Consequently, to meet editorial deadlines (and you don’t want to be late for the editor), larger record sizes are used. Only four record sizes are tested: (1) 1MB, (2) 4MB, (3) 8MB, and (4) 16MB. For a file size of 16GB that is (1) 16,384 records, (2) 4,096 records, (3) 2,048 records, and (4) 1,024 records. These record sizes and record counts correspond to a number of applications, so they produce relevant results.
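The record counts follow directly from dividing the file size by the record size. The short Python sketch below works through the arithmetic, assuming binary units (1GB = 1,024MB, 1MB = 1,024KB); the function name is only illustrative.

```python
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3

def record_count(file_size, record_size):
    """Number of whole records needed to cover the file."""
    return file_size // record_size

if __name__ == "__main__":
    sizes = [("1KB", KIB), ("1MB", MIB), ("4MB", 4 * MIB),
             ("8MB", 8 * MIB), ("16MB", 16 * MIB)]
    for label, record_size in sizes:
        print(f"{label:>4} records in a 16GB file: "
              f"{record_count(16 * GIB, record_size):,}")
```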
The command line for the first record size (1MB) is,
./IOzone -Rb spreadsheet_output_1M.wks -s 16G -r 1M > output_1M.txt
The command line for the second record size (4MB) is,
./IOzone -Rb spreadsheet_output_4M.wks -s 16G -r 4M > output_4M.txt
The command line for the third record size (8MB) is,
./IOzone -Rb spreadsheet_output_8M.wks -s 16G -r 8M > output_8M.txt
The command line for the fourth record size (16MB) is,
./IOzone -Rb spreadsheet_output_16M.wks -s 16G -r 16M > output_16M.txt
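Since the four command lines differ only in the record size, they can be generated rather than typed by hand. The following Python sketch builds the same argument lists using only the options shown above (-R, -b, -s, -r); the helper name is hypothetical, and the sketch prints the commands instead of running them.

```python
import shlex

RECORD_SIZES = ["1M", "4M", "8M", "16M"]

def iozone_command(record_size, file_size="16G", binary="./IOzone"):
    """Argument list for one run; each record size carries an explicit M suffix."""
    return [binary, "-Rb", f"spreadsheet_output_{record_size}.wks",
            "-s", file_size, "-r", record_size]

if __name__ == "__main__":
    for record_size in RECORD_SIZES:
        # Printed for inspection; the redirect target mirrors the commands above.
        print(shlex.join(iozone_command(record_size)),
              ">", f"output_{record_size}.txt")
```

To actually launch the runs, each list could be passed to `subprocess.run` with `stdout` pointed at the matching output file.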
This article will consider 4 journal sizes:
- 16MB (0.0032% of file system size)
- 64MB (0.0128% of file system size)
- 256MB (0.0512% of file system size)
- 1GB (0.2% of file system size)
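The percentages in the list above can be reproduced with a short calculation. This sketch assumes a 500GB file system and decimal units (1GB = 1,000MB), which is what the listed percentages imply; both are assumptions rather than figures stated in the text.

```python
FS_SIZE_MB = 500_000  # assumption: 500GB file system, decimal units

JOURNAL_SIZES_MB = {"16MB": 16, "64MB": 64, "256MB": 256, "1GB": 1000}

def journal_percent(journal_mb, fs_mb=FS_SIZE_MB):
    """Journal size as a percentage of the file system size."""
    return 100.0 * journal_mb / fs_mb

if __name__ == "__main__":
    for label, size_mb in JOURNAL_SIZES_MB.items():
        print(f"{label:>5}: {journal_percent(size_mb):.4f}% of file system size")
```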
A separate hard drive partition on an identical hard drive is created and used for the journal for the ext4 file system. The details of creating a file system with a journal located on a different device are covered in a previous article.
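The exact commands from that earlier article are not reproduced here. As a rough reminder of the usual two-step approach with mke2fs (first format the journal device, then create the file system pointing at it), the Python sketch below only builds and prints the command lines. The partition names /dev/sdb1 and /dev/sdc1 and the 4096-byte block size are assumptions for illustration, not values taken from the tests.

```python
import shlex

def external_journal_commands(fs_dev, journal_dev, block_size=4096):
    """Return the two mke2fs invocations for an ext4 file system with an
    external journal. The journal device and the file system must use the
    same block size."""
    make_journal = ["mke2fs", "-O", "journal_dev",
                    "-b", str(block_size), journal_dev]
    make_fs = ["mke2fs", "-t", "ext4", "-b", str(block_size),
               "-J", f"device={journal_dev}", fs_dev]
    return [make_journal, make_fs]

if __name__ == "__main__":
    # Hypothetical partition names; these commands destroy existing data,
    # so they are printed here rather than executed.
    for cmd in external_journal_commands("/dev/sdb1", "/dev/sdc1"):
        print(shlex.join(cmd))
```

With an external journal, the journal size is set by the size of the journal partition, which is why the partition is sized to match each journal size under test.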
The results are plotted using a bar chart to make them easier to compare. However, carefully examine the y-axis since the major and minor divisions are not the same for every graph.
The plots show the average values, with error bars representing the standard deviation. Each plot has four groups of four bars, and each bar within a group has a different color. Each bar is a different block size (the IOzone record size), and the legend tells you which color corresponds to which block size. Each group of bars represents a specific journal size (16MB, 64MB, 256MB, 1GB). Finally, each chart represents one of the 13 tests. The first 6 charts are for the write tests and the last 7 charts are for the read tests.
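For readers who want to produce the same kind of summary from their own runs, here is a minimal Python sketch of the bar height and error bar for a single block size and journal size combination. The throughput values are hypothetical placeholders, not measurements from this article, and the use of the sample standard deviation is an assumption.

```python
import statistics

def summarize(runs_kb_per_s):
    """Return (mean, sample standard deviation) for repeated runs."""
    mean = statistics.mean(runs_kb_per_s)
    std = statistics.stdev(runs_kb_per_s)  # assumption: sample std deviation
    return mean, std

if __name__ == "__main__":
    # Hypothetical throughput values in KB per second, for illustration only.
    samples = [100.0, 102.0, 98.0, 101.0, 99.0]
    mean, std = summarize(samples)
    print(f"bar height = {mean:.1f} KB/s, error bar = +/- {std:.2f} KB/s")
```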
Figure 1 below is the write test for the four block sizes and the four journal sizes.
Figure 1: Average Write Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
Notice that there is little variation in performance with changes in either block size or journal size.
Figure 2 below is the re-write test for the four block sizes and the four journal sizes.
Figure 2: Average Re-Write Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
There is a slight decrease in the average re-write throughput as the block size increases (somewhat expected since there are fewer blocks). Otherwise there is little variation in performance, and there is little difference between the four journal sizes.
Figure 3 below is the random write test for the four block sizes and the four journal sizes.
Figure 3: Average Random Write Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
The test shows a fair amount of improvement in performance as the block size increases. This is to be expected since the number of blocks decreases as the block size increases. The throughput increases by about 24% from a 1MB block size to a 16MB block size (16MB journal size). But, as you can see, there is little change in throughput performance for changes in journal size.
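Percentages such as the 24% quoted above are relative changes between two average throughputs. A small Python sketch of that calculation follows; the two throughput values are hypothetical placeholders chosen to give a round number, not the measured averages.

```python
def percent_change(baseline, new):
    """Relative change from `baseline` to `new`, in percent."""
    return 100.0 * (new - baseline) / baseline

if __name__ == "__main__":
    # Hypothetical averages in KB per second, for illustration only.
    throughput_1mb = 100.0
    throughput_16mb = 124.0
    print(f"{percent_change(throughput_1mb, throughput_16mb):.0f}% increase")
```

The same function yields a negative value when throughput drops as the block size grows.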
Figure 4 below is the record rewrite test for the four block sizes and the four journal sizes.
Figure 4: Average Record Rewrite Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
The test shows quite a bit of throughput change as a function of the block size. The smallest block size tested, 1MB, has the best throughput, about 97% higher than the 16MB block size. This is possibly due to caching effects. As with the previous three charts, there is little variation in performance with an increase in journal size.
Figure 5 below is the fwrite test for the four block sizes and the four journal sizes.
Figure 5: Average fwrite() Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
The results indicate little variation in performance as a function of block size or journal size for the parameters tested.
Figure 6 below is the re-fwrite test for the four block sizes and the four journal sizes.
Figure 6: Average re-fwrite() Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
The results are interesting in that there are a few anomalies. The standard deviations for a block size of 4MB with a journal size of 64MB, and for a block size of 16MB with a journal size of 1GB, are much larger than any of the others.
In addition, there is little variation in performance except for these cases:
- Block Size = 4MB, Journal Size = 64MB
- Block Size = 1MB, Journal Size = 1GB
- Block Size = 16MB, Journal Size = 1GB
The reasons for the performance differences in these cases are unknown at this time. However, additional testing could indicate that there were just a few runs with below-normal performance.
Figure 7 below is the read test for the four block sizes and the four journal sizes.
Figure 7: Average Read Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
There is little variation in performance as a function of block size or journal size.
Figure 8 below is the re-read test for the four block sizes and the four journal sizes.
Figure 8: Average Re-Read Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
There is little variation in performance as a function of block size or journal size.
Figure 9 below is the random read test for the four block sizes and the four journal sizes.
Figure 9: Average Random Read Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
There is considerable variation in the throughput as a function of the block size. For the 16MB journal size case, the throughput increases 94% from the 1MB block size to the 16MB block size. However, there is little variation in performance as a function of journal size.
Figure 10 below is the backwards read test for the four block sizes and the four journal sizes.
Figure 10: Average Backwards Read Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
For this test there is considerable variation in performance due to block size but very little change in performance due to journal size. For the 16MB journal size case, increasing the block size from 1MB to 16MB improved performance by 53%.
Figure 11 below is the strided read test for the four block sizes and the four journal sizes.
Figure 11: Average Strided Read Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
For this test there is considerable variation in performance due to block size but very little change in performance due to journal size. For the 16MB journal size case, increasing the block size from 1MB to 16MB improved performance by 114%.
Figure 12 below is the fread test for the four block sizes and the four journal sizes.
Figure 12: Average fread() Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
There is little variation in throughput as a function of the block size or the journal size. There is only a very small decrease in performance as the block size is increased.
Figure 13 below is the re-fread test for the four block sizes and the four journal sizes.
Figure 13: Average Re-fread() Throughput (KB per second) for the Four Block Sizes and the Four Journal Sizes
There is little variation in throughput as a function of the block size or the journal size.
There are some interesting observations one can derive from the charts.
- There is very little change in throughput performance as the journal size is changed for the parameters tested.
- Changing the block size has the biggest impact on performance, sometimes increasing throughput as the block size increases and sometimes decreasing it.