Bcache Testing: Throughput

Get your wetsuit on, we're going data diving. Throughput benchmarks using IOzone on a common SATA disk, an Intel X25-E SSD, and Bcache, using the SSD to cache a single drive.

Introduction

In a previous article I presented two new patch concepts (bcache and flashcache) for improving performance by using SSDs as a caching mechanism for hard drives. Strictly speaking, the caching is achieved by using one block device to cache another block device, but in practice it means using an SSD to cache hard drives. I’ve been waiting for some time for a patch that increases performance by using a cache that is larger than the disk’s own cache and faster than the disk itself. SSDs fit that bill pretty well, especially given their fantastic read performance compared to disks.

This article series tests these patches against some common benchmarks to discover if they can actually improve performance. But before you get your hopes up too high, these patches are still in early development so performance could be all over the map. Just like any patch in the kernel, it takes time to test them, tune them, and integrate them into the kernel. But we can help the developers by testing, which is precisely what this article does.

In particular, this article tests bcache (pulled from the developer’s git repository on July 15 of this year) for throughput, IOPS, and metadata performance on ext4 using four storage configurations:

  1. Single SATA II disk (7,200 rpm 500GB with 16MB cache)
  2. Single Intel X25-E SLC disk (64GB)
  3. Bcache combination that uses the Intel X25-E as a cache for the SATA drive and uses the CFQ (Completely Fair Queuing) IO Scheduler that is the default for most distributions
  4. Bcache combination that is the same as the previous but uses the NOOP IO Scheduler for the SSD, which many people think could improve SSD performance.

The details of the configurations are below, as are the details of the benchmarks and tests run. It’s been a while since I’ve presented benchmarks here, so let’s be sure to use good benchmarking techniques.

In particular, I’m going to use IOzone for testing throughput as well as IOPS. Plus I’m going to use metarates to test metadata performance. I will be using ext4 as the file system but I will tweak some parameters for the SSD recommended by Theodore Ts’o. But this article will only examine the results for throughput testing using IOzone. Future articles will cover IOPS and metadata testing.

It’s been a while since IOzone has been used in an article here, so let’s begin with a quick review of IOzone.

IOzone

IOzone is one of the most popular throughput benchmarks. It’s open source and written in very plain ANSI C (not an insult but a compliment). It is capable of single-threaded, multi-threaded, and multi-client testing. The basic concept of IOzone is to break up a file of a given size into records. Records are written or read in some fashion until the file size is reached. Using this concept, IOzone has a number of tests that can be performed:

  • Write
    This is a fairly simple test that simulates writing to a new file. Because of the need to create new metadata for the file, many times the writing of a new file can be slower than rewriting to an existing file. The file is written using records of a specific length (either specified by the user or chosen automatically by IOzone) until the total file length has been reached.
  • Re-write
    This test is similar to the write test but measures the performance of writing to a file that already exists. Since the file already exists and the metadata is present, it is commonly expected for the re-write performance to be greater than the write performance. This particular test opens the file, puts the file pointer at the beginning of the file, and then writes to the open file descriptor using records of a specified length until the total file size is reached. Then it closes the file which updates the metadata.
  • Read
    This test reads an existing file. It reads the entire file, one record at a time.
  • Re-read
    This test reads a file that was recently read. This test is useful because operating systems and file systems will maintain parts of a recently read file in cache. Consequently, re-read performance should be better than read performance because of the cache effects. However, sometimes the cache effect can be mitigated by making the file much larger than the amount of memory in the system.
  • Random Read
    This test reads a file with the accesses being made to random locations within the file. The reads are done in record units until the total reads are the size of the file. The performance of this test is impacted by many factors including the OS cache(s), the number of disks and their configuration, disk seek latency, and disk cache among others.
  • Random Write
    The random write test measures the performance when writing a file with the accesses being made to random locations within the file. The file is opened, and then data is written in record-sized units to random locations within the file until the total file size has been written.
  • Backwards Read
    This is a unique file system test that reads a file backwards. There are several applications, notably, MSC Nastran, that read files backwards. There are some file systems and even OS’s that can detect this type of access pattern and enhance the performance of the access. In this test a file is opened and the file pointer is moved 1 record forward and then the file is read backward one record. Then the file pointer is moved 2 records forward in the file, and the process continues.
  • Record Rewrite
    This test measures the performance when writing and re-writing a particular spot within a file. The test is interesting because it can highlight “hot spot” capabilities within a file system and/or an OS. If the spot is small enough to fit into the various caches (CPU data cache, TLB, OS cache, file system cache, etc.), then the performance will be very good.
  • Strided Read
    This test reads a file in what is called a strided manner. For example, you could read at a file offset of zero for a length of 4 Kbytes, then seek 200 Kbytes forward, then read for 4 Kbytes, then seek 200 Kbytes, and so on. The constant pattern is important and the “distance” between the reads is called the stride (in this case it is 200 Kbytes). This access pattern is used by many applications that are reading certain data structures. This test can highlight interesting issues in file systems and storage because the stride could cause the data to miss any striping in a RAID configuration, resulting in poor performance.
  • Fwrite
    This test measures the performance of writing a file using a library function “fwrite()”. It is a binary stream function (examine the man pages on your system to learn more). Equally important, the routine performs a buffered write operation. This buffer is in user space (i.e. not part of the system caches). This test is performed with a record length buffer being created in a user-space buffer and then written to the file. This is repeated until the entire file is created. This test is similar to the “write” test in that it creates a new file, possibly stressing the metadata performance.
  • Refwrite
    This test is similar to the “rewrite” test but using the fwrite() library function. Ideally the performance should be better than “Fwrite” because it uses an existing file so the metadata performance is not stressed in this case.
  • Fread
    This is a test that uses the fread() library function to read a file. It opens a file and reads it in record lengths into a buffer that is in user space. This continues until the entire file is read.
  • Refread
    This test is similar to the “reread” test but uses the “fread()” library function. It reads a recently read file which may allow file system or OS cache buffers to be used, improving performance.
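Several of these patterns (sequential, backwards, strided) are just different ways of walking record-sized offsets through a file. The sketch below is purely illustrative (IOzone itself is written in C and performs real I/O); it only shows the byte-offset sequences each pattern visits:

```python
# Illustrative offset generators for three of the access patterns above.
# This is not IOzone code; it only shows the byte offsets each pattern visits.

def sequential_offsets(file_size, record):
    """Write/read one record after another until the file size is reached."""
    return list(range(0, file_size, record))

def backwards_offsets(file_size, record):
    """Backwards read: start at the last record and step toward offset 0."""
    return list(range(file_size - record, -1, -record))

def strided_offsets(file_size, record, stride):
    """Strided read: one record every `stride` bytes (e.g. 4KB every 200KB)."""
    return list(range(0, file_size - record + 1, stride))

KB = 1024
# A tiny 1MB "file" keeps the lists readable.
seq = sequential_offsets(1024 * KB, 4 * KB)
back = backwards_offsets(1024 * KB, 4 * KB)
stride = strided_offsets(1024 * KB, 4 * KB, 200 * KB)

print(len(seq))      # 256 records of 4KB in a 1MB file
print(back[:2])      # [1044480, 1040384] - last record first
print(stride[:3])    # [0, 204800, 409600] - 200KB stride
```

Note how the strided pattern touches only a handful of locations per megabyte; that is exactly why it can slip between the stripes of a RAID configuration and produce poor performance.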

There are other tests that can be run, but for this exploration only the tests above will be examined. Even this list is fairly extensive and covers a large number of the application access patterns you are likely to see (though not all of them).

There are a large number of command line options available for IOzone, far more than will be covered here. The next section will present the test system as well as the specific IOzone commands used.

Test System

The tests were run on the same system as previous tests. The highlights of the system are:

  • GigaByte MAA78GM-US2H motherboard
  • An AMD Phenom II X4 920 CPU
  • 8GB of memory (DDR2-800)
  • Linux 2.6.34 kernel (with bcache patches only)
  • The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
  • /home is on a Seagate ST1360827AS
  • There are two drives for testing. They are Seagate ST3500641AS-RK with 16 MB cache each. These are /dev/sdb and /dev/sdc.

Only the second Seagate drive, /dev/sdc, was used for the file system. Since the version of bcache I used could not yet cache a partition, I used the whole device (/dev/sdc) for the file system.

For IOzone the system specifications are fairly important. In particular, the amount of system memory matters because it can have a large impact on caching effects. If the problem sizes are small enough to fit (at least partially) into the system or file system cache, the results can be skewed. Comparing the results of a system where cache effects are fairly large to those of a system where cache effects are not large is comparing the proverbial apples to oranges. For example, running the same problem size on a system with 1GB of memory versus a system with 8GB will produce much different results.

With that in mind, the next section presents the actual IOzone commands used in the testing.

IOzone Command Parameters

As mentioned previously, there are a huge number of options available with IOzone (that is one reason it is so popular and powerful). For this exploration, the basic tests run are: write, re-write, read, re-read, random read, random write, backwards read, record re-write, strided read, fwrite, frewrite, fread, and refread.

One of the most important considerations for this test is whether cache effects should be included in the results or not. Including cache effects in the results can be very useful because it can point out certain aspects of the OS and file system cache sizes and how the caches function. On the other hand, including cache effects limits the usefulness of the data in comparison to other results.

For this article, cache effects will be limited as much as possible so that the impact of the file system designs on performance can be better observed. Cache effects can’t be eliminated entirely: even running extremely large problems only forces out the OS and file system caches, and hardware caches such as those in the CPU are virtually impossible to avoid (but never say never). One of the best ways to minimize cache effects, however, is to make the file size much bigger than main memory. For this article, the file size is chosen to be 16GB, twice the size of main memory. This value is chosen somewhat arbitrarily, based on experience and some urban legends floating around the Internet.

Recall that most of the IOzone tests break up a file into records of a specific length. For example, a 1GB file can be broken into 1MB records, so there are a total of 1,024 records in the file. IOzone can either run an automatic sweep of record sizes or the user can fix the record size. If done automatically, IOzone starts at 1KB (1,024 bytes) and doubles the record size until it reaches a maximum of 16MB (16,777,216 bytes). Optionally, the user can specify the lower and upper record sizes and IOzone will vary the record sizes in between.
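The automatic sweep is easy to verify with a couple of lines of Python (my own arithmetic, not IOzone code):

```python
# Record sizes IOzone's automatic mode sweeps: 1KB, doubling up to 16MB.
sizes = []
r = 1024                       # 1KB in bytes
while r <= 16 * 1024 * 1024:   # stop after 16MB
    sizes.append(r)
    r *= 2

print(len(sizes))              # 15 record sizes in the sweep
print(sizes[0], sizes[-1])     # 1024 16777216
```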

For this article, a 16GB file with 1KB records would mean 16,777,216 records for each of the 13 tests. The run times for such a sweep are very large. Using our good benchmarking skills, where each test is run at least 10 times, the total run time would be so large that, perhaps, only one benchmark every 2-4 weeks could be published. Consequently, to meet editorial deadlines (and you don’t want to be late for the editor), larger record sizes are used. For this article, only four record sizes are tested: (1) 1MB, (2) 4MB, (3) 8MB, and (4) 16MB. For a file size of 16GB that is (1) 16,384 records, (2) 4,096 records, (3) 2,048 records, and (4) 1,024 records. These record sizes and record counts do correspond to those of a number of applications, so they produce relevant results.
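The record-count arithmetic is quick to check (binary units, so the exact counts are powers of two):

```python
# Records per 16GB file for each record size used in this article.
GB, MB = 2**30, 2**20
file_size = 16 * GB

counts = {record: file_size // (record * MB) for record in (1, 4, 8, 16)}
print(counts)   # {1: 16384, 4: 4096, 8: 2048, 16: 1024}
```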

The command line for the first record size (1MB) is,

./IOzone -Rb spreadsheet_output_1M.wks -s 16G -r 1M > output_1M.txt

The command line for the second record size (4MB) is,

./IOzone -Rb spreadsheet_output_4M.wks -s 16G -r 4M > output_4M.txt

The command line for the third record size (8MB) is,

./IOzone -Rb spreadsheet_output_8M.wks -s 16G -r 8M > output_8M.txt

The command line for the fourth record size (16MB) is,

./IOzone -Rb spreadsheet_output_16M.wks -s 16G -r 16M > output_16M.txt
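Since the four invocations differ only in the record size, they can be generated in a loop. The sketch below builds the argument lists (assuming the same `./IOzone` binary as above; `-Rb file` is simply `-R` plus `-b file`); actually launching each run and redirecting its stdout is left to something like `subprocess.run`:

```python
# Build the four IOzone throughput command lines used in this article.
# Assumes an IOzone binary at ./IOzone, as in the commands above.

def iozone_cmd(record_size):
    """argv for one run: Excel report (-R) to a .wks file (-b), 16GB file (-s)."""
    return ["./IOzone",
            "-R", "-b", f"spreadsheet_output_{record_size}.wks",
            "-s", "16G",
            "-r", record_size]

cmds = [iozone_cmd(r) for r in ("1M", "4M", "8M", "16M")]
for c in cmds:
    print(" ".join(c))
# Each run's stdout would then be redirected, e.g.:
#   subprocess.run(c, stdout=open(f"output_{c[-1]}.txt", "w"))
```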

Building ext4

For the tests in this article I used CentOS 5.4, but with my own kernel. For all four configurations I used the 2.6.34 kernel with the bcache patches applied; I will refer to this kernel as 2.6.34+. One good thing about this kernel is that it supports the TRIM function that most modern SSDs have built in, and ext4 and btrfs already have extensions to take advantage of the TRIM command.

Since the new wrinkle in this article is the testing of the SSD, I did some research on options for building ext4 on an SSD. So I went to the source: Theodore Ts’o’s blog, which discusses how he did it for an Intel SSD. The first thing I did was partition the SSD to align partitions on 128KB boundaries (following Theodore’s advice). This is accomplished with the common fdisk command:

[root@test64 ~]# fdisk -H 224 -S 56 /dev/sdd

where the -H option is the number of “heads” and the -S option is the number of sectors per track. Don’t forget that fdisk still thinks of everything as a spinning disk, so while these options perhaps don’t make much sense for an SSD, aligning the partitions on 128KB boundaries is important for best performance. Then I used the following command to build the file system, as recommended by Theodore:

[root@test64 ~]# mke2fs -t ext4 -E stripe-width=32,resize=500G /dev/sdd1

The first option, “stripe-width=32”, was recommended as a way to improve performance. The second option, “resize=500G”, reserves only enough metadata space to allow the file system to grow to 500GB, reducing wasted space.

Notice that I let ext4 select the journal size it wanted, and that the journal was also placed on the SSD.
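As an aside, the odd-looking fdisk geometry above (-H 224 -S 56) works because a “cylinder” of 224 heads × 56 sectors × 512 bytes is an exact multiple of 128KB, and fdisk starts partitions on cylinder boundaries. A quick check (my own arithmetic, not from Theodore’s post):

```python
# Why fdisk -H 224 -S 56 yields 128KB-aligned partitions: a "cylinder"
# is heads * sectors-per-track * 512 bytes, and partitions start on
# cylinder boundaries.
heads, sectors, sector_bytes = 224, 56, 512
cylinder = heads * sectors * sector_bytes
print(cylinder)                      # 6422528 bytes per cylinder
print(cylinder % (128 * 1024) == 0)  # True: exactly 49 x 128KB
```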

Building an ext4 file system for a single disk is fairly easy. I didn’t do anything fancy except for one option. The particular command I used was:

[root@test64 ~]# mke2fs -t ext4 -E resize=500G /dev/sdc

I chose the option “-E resize=500G” because I used it when testing the SSD and I wanted the comparison to be as “apples to apples” as possible.

Building ext4 with bcache is a little more involved but not too bad. If you follow the instructions on the wiki, things go fairly smoothly. The first step is to create a fresh file system on the hard drive, /dev/sdc (not strictly necessary, but I like to start with a clean file system when testing).

[root@test64 ~]# mke2fs -t ext4 -E resize=500G /dev/sdc

Then I mounted the file system. The next step is to create the bcache cache device using bcache-tools.

[root@test64 ~]# ./make-bcache -b128k /dev/sdd1
device is 125038536 sectors
block_size:             8
bucket_size:            256
journal_start:          12
first_bucket:           12
nbuckets:               488419
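A quick back-of-the-envelope check (my own arithmetic, not part of bcache-tools) shows how those numbers hang together: buckets are 256 sectors (128KB with 512-byte sectors), and nbuckets is the total bucket count minus the metadata buckets before first_bucket:

```python
# Sanity-check the make-bcache output above.
device_sectors = 125038536   # "device is 125038536 sectors"
bucket_sectors = 256         # bucket_size from the output
first_bucket = 12            # metadata buckets at the front of the device

total_buckets = device_sectors // bucket_sectors
print(total_buckets)                  # 488431
print(total_buckets - first_bucket)   # 488419, matching nbuckets
print(bucket_sectors * 512 // 1024)   # 128 (KB per bucket)
```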

The next step is to tell bcache which device you are using as the cache.

[root@test64 ~]# echo "/dev/sdd1" > /sys/kernel/bcache/register_cache
bcache: Loaded cache device /dev/sdd1

The final step is to associate the bcache with a file system,

echo "`blkid /dev/sdc -s UUID -o value` /dev/sdc" > /sys/kernel/bcache/register_dev

Now we are ready to use bcache, with /dev/sdd1 caching /dev/sdc.

On to the results!


Comments on "Bcache Testing: Throughput"

paul.dorman

Hi Jeff, thanks for the thorough tests. I’ve been seriously considering a 50-70GB SSD as a second drive in my laptop to reap the potential benefits of BCache and similar projects.

As a BTRFS user I am very interested in the recent BTRFS patches for hot data relocation (http://lwn.net/Articles/400029/). It will certainly be interesting to see how effective these are at increasing performance for a relatively low outlay.

Irrespective of which approach gives the best result, I think that tiered storage using complementary SSDs will be a big win for everyone – particularly Linux users, and especially those with high-performance laptops that suffer most significantly from the disk I/O bottleneck.

Regards,
Paul

laytonjb

@Paul,

I agree with you. While bcache and flashcache are still in their infancy, the concept of using SSDs for caching is a good one. I can imagine taking several inexpensive SSDs, creating a RAID-0, and using that to cache a RAID-6 or something similar.

I too am watching the btrfs patches. Neat concept. In fact I think it’s good enough that the authors could make it generic and put it in the VFS (at least I hope they do).

Thanks for the post!

Jeff

lkliu

I believe the record size is too big for such a test. When the record size is >1MB, the disk gets a lot of the benefit of sequential reads or writes. In short, I think this test underestimates the benefits of using an SSD and bcache to improve performance.

bugmenot3

In this blog post:
http://virtualgeek.typepad.com/virtual_geek/2010/05/emc-unified-storage-next-generation-efficiency-details.html
the benefits of the “FAST Cache” feature of EMC disk arrays are shown.
“FAST Cache” uses flash as a second-level cache for traditional disks, and it seems to give good improvements.

dbbd

The results presented in the article are very poor. I thought bcache makes a lot of sense, but the results demonstrate it is far from ready for prime time.


laytonjb

@dbbd,

You are correct – bcache isn’t ready for production. Even the author admits this. But I think this is a time when we can influence bcache and ask for changes. For example, Kent, the author, knows that bcache doesn’t have a writethrough capability now and that impacts performance, as I presented.

Great time to influence possible kernel patches.

Jeff

homarne

Jeff,

I’ve read through the article a couple of times and can’t figure one thing out.

If the working set of the test you ran was 16GB, and the SSD was 64GB, then the entire working set should have fit in the SSD cache, and the cache performance should have been very close to the SSD performance (unless bcache is VERY inefficient).

Was the SSD cache size in fact set to a smaller size than the working set? If so, I can’t figure out where you mention this in the article.

Thanks – Tom

ihatesigningup

Jeff,

Please consider benchmarking bcache and the other one whose name I can’t remember with an SQL benchmark, or some sort of “real-world” benchmark.

The reason I mention this is that if you ran a busy Web site like Facebook, you might use bcache to speed up your database. bcache was written to speed up this scenario, and I think it would really shine in this workload, because on a busy Web site there would be many more reads than writes, and many of those reads could stay in the SSD cache.

Thanks,
Dave.



