Size Can Matter: Would You Prefer the Hard Drive or the Ramdisk this Evening? Part 3

Over the past couple of weeks we ran the numbers on metadata performance for ramdisk-based and hard drive-based journals for ext4. Now let's compare/contrast the two journal devices and see what trends emerge.

In part 1 of this series we looked at the metadata performance when your journal was on a separate disk. Part 2 explored the ramdisk option. Now comes the favorite part of every high school English class: Compare and contrast.

These tests are not intended as “benchmarks” per se. They are intended more as experiments or explorations to determine how we can influence storage performance by changing options in the file system. This means we are not looking for a “winner” in the comparison. Rather, we are looking for differences or the lack of differences in metadata performance as a function of journal size and journal device to perhaps tell us something about how we can improve performance.

Testing Review

Recall that four journal sizes were tested to understand the impact of journal size on metadata performance. The four journal sizes are:

  • 16MB (0.0032% of file system size)
  • 64MB (0.0128% of file system size)
  • 256MB (0.0512% of file system size)
  • 1GB (0.2% of file system size)

Both a separate hard drive partition and a ramdisk of the appropriate size were created and then utilized for the journal for an ext4 file system.
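As a quick illustration, a ramdisk for the journal can be set up in a couple of ways. The sketch below is only a hypothetical example using the brd ramdisk driver; the actual ramdisk setup (and its caveats) was covered in part 2, and the 256MB size and device names shown here are assumptions.

    # Load the ramdisk driver with a single 256MB ramdisk (rd_size is in KiB).
    # If the ramdisk driver is built into the kernel, the ramdisk_size= boot
    # parameter is used instead of modprobe.
    modprobe brd rd_nr=1 rd_size=262144

    # Verify that the ramdisk shows up with the expected size.
    blockdev --getsize64 /dev/ram0

    # For the hard drive based journal, a partition of the same size
    # (e.g., /dev/sdc1) is created on the second test drive with fdisk or parted.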

To understand the impact of both journal size and the type of device (disk or ramdisk), the fdtree benchmark was used to test metadata performance. This benchmark has been used in a number of previous articles to measure metadata performance because it is simple to use and offers a number of scenarios that can be matched to usage cases. For this examination, fdtree was used in four different scenarios to stress the metadata capability:

  • Small files (4 KiB)
    • Shallow directory structure
    • Deep directory structure
  • Medium files (4 MiB)
    • Shallow directory structure
    • Deep directory structure

The two file sizes, 4 KiB (1 block) and 4 MiB (1,000 blocks) were used to get some feel for a range of performance as a function of the amount of data. The two directory structures were used to stress the metadata in different ways to discover if there is any impact on the metadata performance. The shallow directory structure means that there are many directories but not very many levels down. The deep directory structure means that there are not many directories at a particular level but that there are many levels. Further details of the metadata testing can be found in the first article.
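For readers who want to reproduce the runs, fdtree is a simple bash script driven by a handful of command-line options. The sketch below only illustrates the flavor of the four scenarios; the option values are assumptions for illustration, not the exact settings used in the tests (those are listed in the first article).

    # Typical fdtree options: -d directories created per level, -l levels of
    # recursion, -f files per directory, -s file size in 4 KiB blocks,
    # -o starting directory.

    # Small files (4 KiB = 1 block), shallow structure: many directories, few levels.
    ./fdtree.bash -d 20 -l 2 -f 20 -s 1 -o /mnt/ext4/test

    # Small files, deep structure: few directories per level, many levels.
    ./fdtree.bash -d 2 -l 10 -f 20 -s 1 -o /mnt/ext4/test

    # The medium-file scenarios use -s 1000 (4 MiB) with the same two structures.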

Each test was run 10 times for the four journal sizes and for the two journal devices (hard disk and ramdisk). The test system used for these tests was a stock CentOS 5.3 distribution but with a 2.6.30 kernel. In addition, e2fsprogs was upgraded to 1.41.9. The tests were run on the following system:

  • GigaByte MAA78GM-US2H motherboard
  • An AMD Phenom II X4 920 CPU
  • 8GB of memory (DDR2-800)
  • Linux 2.6.30 kernel
  • The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
  • /home is on a Seagate ST1360827AS drive
  • There are two drives for testing. They are both Seagate ST3500641AS-RK drives with a 16 MB cache each. These drives show up as devices, /dev/sdb and /dev/sdc.

The first Seagate drive, /dev/sdb, was used exclusively for the file system in these tests. The second drive, /dev/sdc, was used for the journal in the hard drive based tests.

The details of creating an ext4 file system with a journal on a separate device are contained in a previous article. The basic steps are to first create the file system assuming the journal is located with the file system on the drive. Second, a new journal is created on the specific device (/dev/sdc1 or /dev/ram0). Finally, the file system is told that it no longer has a journal and then it is told that its journal is on the specific device (the hard drive or the ramdisk).
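As a rough sketch, that command sequence looks something like the following, assuming /dev/sdb1 holds the file system, /dev/sdc1 (or /dev/ram0) holds the journal, and a 4 KiB block size; the exact commands and options used for the tests are in the earlier article.

    # 1. Create the ext4 file system with its journal initially on the data drive.
    mkfs.ext4 /dev/sdb1

    # 2. Create an external journal on the dedicated device (the journal device
    #    must use the same block size as the file system).
    mke2fs -O journal_dev -b 4096 /dev/sdc1    # or /dev/ram0 for the ramdisk tests

    # 3. Remove the internal journal from the file system...
    tune2fs -O ^has_journal /dev/sdb1

    # 4. ...then tell the file system its journal lives on the external device.
    tune2fs -j -J device=/dev/sdc1 /dev/sdb1

    # Mount as usual; the journal is now on the external device.
    mount /dev/sdb1 /mnt/ext4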

Benchmark Results

This section presents the comparison of the results for the four scenarios for both devices. The hard drive and ramdisk results are plotted side by side for the same journal size along with the error bars to allow easy comparison. The full results are available in tabular form in the previous two articles.

The first test is for the “small file, shallow structure” scenario for the four journal sizes. Figure 1 below plots the average file create performance in KiB per second for the four journal sizes for both the hard drive device and the ramdisk device. Also note that error bars representing the standard deviation are shown.
Figure 1: Average File Create Performance (KiB per second) for the Small File, Shallow Structure Scenario for the Four Journal Sizes for the Hard Drive based Journal and the Ramdisk based Journal

Figure 2 below plots the average “File Remove” results in “File Removes per second” for the four journal sizes for the small file, shallow structure scenario for both devices. Again, there are error bars representing the standard deviation in the plot as well.
Figure 2: Average File Remove Performance (File Removes per second) for the Small File, Shallow Structure Test for the Four Journal Sizes for the Hard Drive based Journal and the Ramdisk based Journal

The next scenario uses small files but with a deep directory structure. For this scenario all four tests had run times long enough for consideration. Figure 3 below plots the average “Directory Create” results in “creates per second” for both journal devices for the four journal sizes. Again, there are error bars representing the standard deviation in the plot as well.
Figure 3: Average Directory Create Performance (creates per second) for the Small File, Deep Structure Test for the Four Journal Sizes for the Hard Drive based Journal and the Ramdisk based Journal

Figure 4 below plots the average “File Create” results in KiB per second for the four journal sizes for the small file, deep structure scenario for both journal devices. Again, there are error bars representing the standard deviation in the plot as well.

Figure 4: Average File Create Performance (KiB per second) for the Small File, Deep Structure Test for the Four Journal Sizes for the Hard Drive based Journal and the Ramdisk based Journal

Figure 5 below plots the average “File Remove” results in removes per second for the four journal sizes for the small file, deep structure test for both journal device types.

Figure 5: Average File Remove Performance (removes per second) for the Small File, Deep Structure Test for the Four Journal Sizes for the Hard Drive based Journal and the Ramdisk based Journal

Figure 6 below plots the average “Directory Remove” results in removes per second for the four journal sizes for the small file, deep structure test for both journal device types.

Figure 6: Average Directory Remove Performance (removes per second) for the Small File, Deep Structure Test for the Four Journal Sizes for the Hard Drive based Journal and the Ramdisk based Journal

The next test was the medium files, shallow directory structure scenario, where only the file create test had a meaningful run time. Figure 7 below plots the file create performance in KiB per second for the four journal sizes for both journal device types. Note that the error bars are plotted as well.

Figure 7: Average File Create Performance (KiB per second) for the Medium File, Shallow Structure Test for the Four Journal Sizes for the Hard Drive based Journal and the Ramdisk based Journal

The final test was the medium files, deep directory structure scenario. The only result that had meaningful times was the file create test. Figure 8 below plots the file create performance in KiB per second for the four journal sizes for both journal device types. Note that the error bars are plotted as well.

Figure 8: Average File Create Performance (KiB per second) for the Medium File, Deep Structure Test for the Four Journal Sizes for the Hard Drive based Journal and the Ramdisk based Journal

Observations (Compare/Contrast)

The benchmark results are very interesting since we actually see some variation in the results, whereas in the first article we did not see much variation. A quick summary of the results is given below.

  • Small files, shallow directory structure:
    • From Figure 1, the average file create performance for the ramdisk journal is slightly lower than that of the hard disk based journal for the 16MB journal size. However, from 64MB on, the performance of the hard drive based journal is approximately the same as that of the ramdisk based journal. The average file create performance increased approximately 6% going from the 16MB journal size to the 256MB size. From 256MB to 1GB the performance didn’t increase appreciably.
    • From Figure 2, the average file remove performance for both the hard drive based journal and the ramdisk based journal are approximately the same for each journal size.
    • The average file remove performance increased by about 25% from the 16MB journal size to the 256MB journal size. However, increasing the journal size to 1GB didn’t increase performance any appreciable amount.
  • Small files, deep directory structure:
    • At the 16MB journal size, the hard drive and ramdisk journal device average directory create performance is about the same (see Figure 3). However, at 64MB, the ramdisk has 38.2% better performance than the hard drive. At 256MB the average ramdisk performance is 47.3% better, and at 1GB the average ramdisk performance is 29.8% better than the hard drive.
    • For the ramdisk journal device, increasing the journal size from 16MB to 1GB increased the average directory create performance by 163%. For the hard drive journal device the same increase in journal size increased the average directory create performance by 115%.
    • The average file creation performance as seen in Figure 4 is also interesting. For a 16MB journal size the performance of both devices is about the same. But from 64MB to 1GB the performance of the ramdisk is much greater than the hard drive. With a journal size of 64MB, the ramdisk journal is 40.2% faster, for a 256MB journal size the ramdisk is 22.5% faster, and for a journal size of 1GB the ramdisk is 17.4% faster.
    • For both the ramdisk journal and the hard drive journal, the average file creation performance increased as the journal size increased for this scenario. The average file creation performance increased by 58% for the ramdisk based journal as the journal size was increased from 16MB to 1GB. For the hard drive based journal the performance increased by 56%.
    • The average file removal performance for both the ramdisk and the hard drive journals increased as the journal size increased (see Figure 5). The ramdisk performance increased by 144% and the hard drive performance increased by 126%.
    • In general the average file removal performance of the two devices was about the same for a given journal size. At 64MB, the hard drive based journal was slightly faster and at 1GB, the ramdisk based journal was slightly faster.
    • Figure 6 compared the average directory removal performance for both devices for the four journal sizes. The performance of the hard drive was better than the ramdisk when the journal size was 64MB and 1GB (although the standard deviation at 1GB is very large and the differences are well within the standard deviation). So overall, with the exception of the 64MB journal case, the performance of the two devices was about the same.
    • However, the improvement in the average directory remove performance for both devices as the size of the journal is increased is very dramatic. The average directory removal performance of the ramdisk based journal increased by 410% in going from a journal of 16MB to a journal of 1GB. For the hard drive based journal the performance increased by 367% for the same change in journal sizes.
  • Medium files, shallow directory structure:
    • Comparing the ramdisk based journal to the disk based journal is more difficult for the average file create performance (Figure 7) because the average performance numbers for the various journal sizes and device options are all within the standard deviation of the tests. This means that it is difficult to determine if one case has better performance than the others. Even throwing good statistics out the window, sometimes the ramdisk journal is slightly faster and sometimes the disk journal is faster. In addition, the performance doesn’t vary much as a function of the size of the journal.
  • Medium files, deep directory structure:
    • The average file creation performance was about the same for the ramdisk journal and the hard drive journal (see Figure 8). The performance for all journal sizes was within the standard deviation of the other results, making it difficult to observe any statistical difference between journal sizes or devices. But once again, even if we toss good statistics aside, there still isn’t much of a trend in the results. Sometimes the ramdisk journal is faster than the hard drive journal and sometimes not. Also, there doesn’t seem to be much variation in the performance for either journal device as a function of the journal size.

Conclusion

One would have expected the ramdisk to be the runaway favorite for the best metadata performance because one would assume it has the best IOPS and throughput performance of the two devices. However, the comparison in this article showed that in several scenarios and several performance measures, the hard drive based journal had about the same performance as, or even better performance than, the ramdisk based journal. At the same time, there are also scenarios and tests where a ramdisk based journal clearly had better performance than a disk based journal.

Knowing that there is a performance difference between the ramdisk journal and the disk journal is good information, but it does not go deep enough to allow us to truly understand what is driving the metadata performance. It is fairly safe to assume, even without testing, that the ramdisk journal has better IOPS, throughput, and latency than the hard drive. However, it is unclear which aspect of device performance, or which combination of aspects, is driving the metadata performance differences. But, in my opinion, there is enough of a performance difference between the ramdisk based journal and the hard drive based journal to warrant an examination of using SSDs for file system journals.

Why SSDs and not ramdisks? Arguably ramdisks will give you better performance than SSDs (for the most part), but using a ramdisk carved out of system memory has some issues that must be addressed (see the previous article). SSDs are gaining in performance compared to their first incarnations and have very good IOPS performance. More importantly, they can survive a reboot of the system (ramdisks cannot). Consequently, it is worthwhile to test an SSD as an external journal device for ext4.

“Batman, now that we’ve drawn to the end of comparing metadata performance for journal sizes and devices for ext4, what does the future hold?”

“Actually it holds more testing Robin. Commissioner Gordon has asked us to examine the impact of journal size on a separate hard drive on throughput performance measured by IOZone. So hike up your leotard Robin and get ready for our next round of tests – to the Batcave!”

And… scene! Apologies for slipping into character for a moment. I’ve been watching reruns of “The Big Bang Theory” and I’m waiting for Sheldon to appear in a Batman costume and I find I’m starting to identify with him a bit more than I should.

Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).

Comments on "Size Can Matter: Would You Prefer the Hard Drive or the Ramdisk this Evening? Part 3"

genghiskhat

In your summary charts, shouldn't the 6th bar be 256 MB, not 64?

ironarmadillo

Don\’t you think you should go ahead and add SSDs to this comparison? After all, they are growing in storage and performance. And you mentioned them as a future player in this article. Just because you didn\’t start out with an SSD in these articles doesn\’t mean you can\’t add it now. I know when I\’ve read these articles I was wondering where an SSD fit into these comparisons.

ctryon

Isn\’t putting the journal on something like a ramdisk sort of antithetical to the whole idea of a journaling file system, where one of the main purposes is to make the file system more robust in the case of a crash? If you\’ve gone down hard for any reason, there is no way to recover and replay the journal to ensure that, at least the file system is intact, even if you might possibly lose some a few of the last file writes before the crash. Putting a journal on a different disk might still carry some of those risks, but it seems like you\’re going to be a lot better off.

Not much sense writing the data Really Really Fast, if you end up losing the entire file system and all the data when someone trips over the power cord…

jmoondoggie

I am not surprised by the results. I would expect smaller block, deep file system results to be not as good as disk because of the amount of metadata created from the write process.

The manipulation of this data adds to the latency of the SSD. The larger the block size, the better the write latency results, because the control information is relatively low.

Reads, of course, remain almost instantaneous.

Since SSD\’s are still very expensive, a customer needs to correctly characterize his transaction processing to make sure solid state storage will pay for itself in the long run.

Some popular SSD manufacturers don't tell you up front that for every 40 Gig of SSD storage, you need to allocate 4 Gig of system ram to process the metadata overhead of small block writes.

On top of that, systems are tuned to allow for the latency of mechanical disk drives. So, routines for buffering need to be identified and turned off for SSD's. My question is: Was that done for these tests?

jmoondoggie

My mistake, SSD\’s were not used, but the principals are still similar. In addition, I agree with ctryon about the security of putting transaction data in ram.

laytonjb

If you go back and read the original chain of articles the intent of using the ramdisk is to "bound" the performance. Ramdisks have theoretically better performance than anything else and are used to bound the upper end of journal device performance. So at the "high-end" you have ramdisks and at the "low-end" you have a plain disk. SSD's should be somewhere in between.

I\’m not advocating using ramdisks for the journal. You can do it if you like and there are things you need to do ensure data integrity if you do it. But it is possible. There are some DRAM devices you can use for this approach: ACARD has a cool box, Texas Memory, Violin Memory, etc. All of them should be coupled with a UPS and a mechanism to flush the journal completely, unmounting the file system, and then dumping the contents of the DRAM device to permanent media all in the event of a power failure. Reversing the process on start up involves bringing up the DRAM device, restoring the journal contents from permanent media to the DRAM device, bringing up the file system and mounting it. I\’ve tried this process one as an experiment and ext4 didn\’t mind so I\’m assuming everything went correctly (didn\’t do an fsck but I should have).

But again, it is possible to use a DRAM device. As with anything though, there are tradeoffs: you can potentially get better performance but it's much more of an administrative task.

The reason I haven\’t tested an SSD is, well, I don\’t have one. I have looked at buying one but my price range is fairly low right now and I didn\’t want to test a substandard SSD (then we get the ensuing argument about \”… that SSD is a piece of garbage and doesn\’t reflect what a REAL SSD can do.. blah, blah\”.

Some specific answers:

@genghiskhat – you are correct. I'm surprised that slipped by – my bad. I will get those fixed.

@jmoondoggie – I don't know of any file systems where the buffering is tuned for mechanical drives. I don't think they get that specific, although I can ask Eric Sandeen or Theodore Ts'o. Just remember that there are a number of layers and buffers between the application and the actual drive. Flipping these buffers on/off is not always trivial and does not always produce the desired effects (unless you are a kernel hacker and can read the code – Larry McVoy is quite good at doing this). There are buffers in the file system, there are buffers in the actual VFS as controlled by the kernel, there are buffers in the IO scheduler, there can potentially be buffers in the driver layer, and there are buffers (cache) in the drives themselves. Determining how they all interact is, well, difficult.

Also, I wanted to examine the impact on performance using just the default options. Trying to determine the impact of "tuning" is load dependent and difficult to analyze a priori.

@jmoondoggie: I\’m not aware of any SSD drives using system RAM for buffering. I\’m not exactly sure how they do that because it would have to be in the driver and I haven\’t seen any drivers with buffering but I could definitely be wrong. Can you point to some examples of this behavior?

Jeff

jmoondoggie

As far as LINUX system tuning for SSD's it is good to turn on O_Direct, which bypasses the page/buffer cache. Normally, this would decrease performance on a relatively slow mechanical drive, but in the case of NAND Flash based SSD's it significantly INCREASES performance, on reads and writes.

Fusion-io is such a product. But the problem with NAND Flash is that it inherently has latency on writes due to the way it commits the write internally. So, the smaller the write block, the more control overhead to handle, thus the need to overflow into system resources because it can't handle it on such a small card. Again, reads are not a problem. It is the write process inherent in NAND Flash.

One area Fusion-io is different, is that they don't use standard disk drive protocol like SCSI or SATA. They pass data directly from the "drive" to the PCIe bus. So there is no latency injected from the standard disk channel protocol. To compensate for the increased speed through PCIe, it is necessary to set O_Direct.

This is a double edged sword, because although the performance is fast, the drive is not SNIA-compliant or SMART compliant, and a disk controller card such as Promise can't manage it, or use it for hardware based RAID. It also means it's not bootable (it can't be used to load an OS).

markseger

I fear you\’ve made a common basic mistake many people seem to make and that is evaluating a benchmark on its runtime alone. While those numbers certainly provide a good first level approximation, they only tell part of the story. For example, were there any unexpected spikes in CPU during the test? Maybe there were unexpected stalls in the disk I/O during the run? Or maybe something else that was unexpected.

Whenever I run an I/O benchmark I always run collectl in parallel, measuring a wide variety of metrics every 10 seconds, and graph the output. Often times something jumps out to indicate an invalid test that might reveal a kernel bug or mistuned system. If you can't get a relatively smooth I/O rate, you're not reporting valid numbers in your result OR simply identifying a system limitation that itself can be important to note.

-mark

jmoondoggie

Mark,
Thanks for the collectl tip. I immediately downloaded it and ran it while running a fio benchmark in another terminal. Watching them side by side was very enlightening. Fio is about the most flexible open source benchmarking tool I've seen. With collectl I can see a lot more metrics at play. I can't wait to set up samba and start watching some differences in network load.
But I do agree that runtime alone doesn\’t tell the whole story. Thanks again.

