On-the-fly Data Compression for SSDs

The key to good SSD performance is the controller. One SSD controller that has received good reviews is the SandForce SF-1200. However, a recent test of an SF-1200 SSD reveals some interesting things about what this controller does and just how it does it. Depending upon your point of view and, more importantly, your data, performance can be amazing.

Remember that SSDs have really great read performance because of their very low latency. In the case of the SF-1200 controller, the data has to be read (which is quite fast) and then expanded (uncompressed). The time it takes to uncompress the data and perform the other associated operations reduces performance, but the amount of data that has to be read from the flash is potentially smaller than the true (uncompressed) amount, which helps offset that cost. Overall, the impact of incompressible data relative to compressible data is only about a 30% reduction in read throughput performance.

The ultimate driver of write throughput performance is the compressibility of the data. If the data is very compressible, then only a small part of it actually has to be written to the storage media and the resulting throughput can be quite high. If the data is incompressible, then virtually all of it has to be written to the storage media within the drive, and the write throughput suffers because of the time spent examining and compressing the data with little or no compression to show for it. In the case of the SandForce SF-1200 controller, if the data is almost incompressible, the write performance is greatly reduced (from about 260 MB/s to 65 MB/s – a factor of 4 reduction).

Data Capacity
An additional aspect of data compression to be considered is the space or data capacity of a SandForce based SSD. One would think that you could use data compression to improve the data capacity of the device. However, remember that the amount of space saved depends entirely upon the compressibility of the data, which the drive cannot know ahead of time.

When you run “ls -lsa” on a file on your gleaming Linux box, you want it to report the correct information about the file, including the number of bytes and/or the number of blocks. This is true regardless of the storage media, including SSD devices using the SF-1200 controller. Therefore, SF-1200 based devices can’t really use data compression to report a much larger capacity than they actually have.
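As a quick, hypothetical illustration (the file name, sizes, and date below are made up), both the allocated block count and the byte count that “ls” reports describe the logical, uncompressed data:

laytonjb@laytonjb-laptop:~$ ls -lsa report.dat
1024 -rw-r--r-- 1 laytonjb laytonjb 1048576 Jun 14 09:12 report.dat

The drive may store far fewer bits internally, but it still has to present all 1,048,576 logical bytes (1,024 1KB blocks here) to the filesystem, so any internal compression savings never show up as extra capacity.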

Life of the SSD
Recall that the chips used in SSDs have a limited number of erase cycles, so you can only write to them a certain number of times before they can no longer hold data correctly. There are a number of techniques that can be used to reduce the overall number of erase cycles (e.g. reducing write amplification) and reduce the chances of a “hot spot” developing somewhere in the chip. SandForce SSD controllers are likely to use all of these techniques, but the data compression aspect gives them a leg up on the competition.

Since the data is compressed, you can actually fit more data onto a given block, possibly reducing the number of times the block has to be erased. For example, let’s assume we have a 64GB SandForce SSD and a 64GB SSD using a different controller. If we write the exact same amount of data to both drives, the SandForce based SSD will use fewer blocks than the conventional SSD. Effectively, it increases the size of the block pool, which means that each block will be written to less often compared to the conventional SSD. A really interesting (and cool) thing about SandForce SSDs is that how much longer the drive will last relative to a conventional one depends entirely upon your data. If your data is highly compressible, then you get better write performance (and to some degree better read performance) and the drive will last longer since you are actually storing less data.

This feature also means that SandForce based SSDs can use denser MLC chips, resulting in less expensive drives on a $/GB basis, while still providing a drive with a potentially longer life than conventional SSDs.

Let’s take a closer look at compressing data and how you can examine your data to estimate your data compressibility.

Data Compressibility and Its Impact on SF-1200 Devices

The performance and longevity of SF-1200 based SSDs really depends upon your data. If your data is very compressible you will get good write performance (actually really good write performance). If your data isn’t compressible then you won’t get good write performance. So the fundamental question you need to answer is, “how compressible is my data?” Great question and I’m glad you asked.

There is an entire field of information theory and computer science devoted to compression techniques for data (and its partner data deduplication). You can simply Google “data compression” and you will see a massive number of hits (11,900,000 when I ran the search). But just a quick examination of the top hits indicates one thing – the amount of compression depends upon your specific data and the algorithm. There you go – the dreaded “it depends” answer.

As an example, a brief survey article describes several data compression methods as well as some compression ratios. For the sake of brevity here are some results:


  • Using the “compact” technique:

    • Text – 38% reduction in file size
    • Pascal source code – 43% reduction in file size
    • C source – 36% reduction in file size
    • Binary – 19% reduction in file size

  • Using “compress”:

    • Text and English – 50-60% reduction in file size

  • Arithmetic encoding:

    • 12.1% to 73.5% reduction in file size

  • Huffman encoding:

    • 42.1% reduction in file size for a large student record database

  • Data compression routines for specific applications:

    • Reductions of up to 98% in file size have been reported


But, in addition to compressing the file, each compression algorithm requires a certain amount of computational work. In general, the more a file is compressed, the more computational work is required (this isn’t always true because it depends upon the details, but it is directionally correct).
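A quick (and admittedly crude) way to see this trade-off for yourself is to time gzip at its fastest and its most aggressive settings on the same sample file (here “file” is just a placeholder for one of your representative files):

laytonjb@laytonjb-laptop:~$ time gzip -1 -c file > /dev/null
laytonjb@laytonjb-laptop:~$ time gzip -9 -c file > /dev/null

For most data, “-9” squeezes out somewhat more compression than “-1” but takes noticeably more CPU time, which is exactly the trade-off a controller that compresses in the write path has to manage.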

In the case of the SandForce SF-1200 controller it is going to need a compression algorithm that is fast but also takes a fairly fixed amount of time so that the write performance does not drop through the floor in the event of incompressible data. But at the same time, the more compression that can be achieved, the better the throughput performance, particularly the write throughput.

How can you tell the compressibility of your data? The answer is that it is not easy. One thing you can do is select representative files for the common data types you work with. Then run “gzip” on the files at various levels of compression and observe how much you can compress the data. For example, you could try

laytonjb@laytonjb-laptop:~$ gzip -6 file


where “-6” is the compression level, with 6 being the default. The maximum compression is “-9”. Try several levels of compression to see how much compression you can achieve with those files.
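If you don’t want to compress the file repeatedly by hand, a simple loop (a rough sketch; “file” is again a placeholder) can report the compressed size at several levels without touching the original file:

laytonjb@laytonjb-laptop:~$ for level in 1 6 9; do echo -n "gzip -$level: "; gzip -c -$level file | wc -c; done

Comparing each byte count against the original size from “ls -l file” gives you the compression ratio at that level.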

But gzip examines all of the data in the file, while the SF-1200 controller will only be able to examine much smaller chunks of data. So a second thing you can do is examine your most common applications with strace to find out the average size of their write() system calls. Then you can use “dd” to extract portions of your sample data files corresponding to that average write size and run various levels of gzip compression on them. For example, to extract a portion of your file corresponding to 4KB (4,096 bytes) you can use,

laytonjb@laytonjb-laptop:~$ dd if=file of=data_example_1 bs=4096 skip=13 count=1


This command extracts a single 4KB block after skipping the first thirteen 4KB blocks (skip=13), so it assumes that your file is at least 14*4KB in size. It’s fairly simple to write a bash script that takes a file, breaks it into n chunks of a specific block size, runs various levels of gzip on each block, and records the amount of compression for each block and compression level. Then you can take an average of the compression ratios over all blocks to get an idea of the compressibility of your data files.
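Here is a rough sketch of what such a script might look like. It assumes GNU stat plus the usual dd, gzip, and bc tools; the script name, variable names, and the default block size are all just for illustration:

#!/bin/bash
# chunk_compress.sh -- estimate per-chunk compressibility of a file (rough sketch)
# usage: ./chunk_compress.sh <file> [block_size_in_bytes]
FILE=$1
BS=${2:-4096}                  # block size, e.g. the write size you saw with strace
SIZE=$(stat -c %s "$FILE")     # file size in bytes (GNU stat)
NBLOCKS=$((SIZE / BS))         # ignore any partial trailing block

for LEVEL in 1 6 9; do
    TOTAL=0
    for ((i = 0; i < NBLOCKS; i++)); do
        # pull out block i and measure its compressed size at this gzip level
        CSIZE=$(dd if="$FILE" bs=$BS skip=$i count=1 2>/dev/null | gzip -c -$LEVEL | wc -c)
        TOTAL=$((TOTAL + CSIZE))
    done
    # average ratio = total compressed bytes / total original bytes (smaller is better)
    echo "gzip -$LEVEL: average compression ratio = $(echo "scale=3; $TOTAL / ($NBLOCKS * $BS)" | bc)"
done

Note that gzip adds a small header and trailer to every chunk, so the per-block ratios will look a little worse than what a controller operating on raw blocks could achieve; treat the output as a rough estimate rather than an exact prediction.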

It seems like a lot of work, and it can be, but it will definitely help you determine the compressibility of your data. If it looks like your data is very compressible, then an SF-1200 based SSD might be a really great solution for you. If your data isn’t that compressible, then an SF-1200 based SSD might not be the best solution for you in its present form. Let’s define the file compression ratio as (file size after compression) / (file size before compression). A compression ratio of 1 gives you the minimal performance (i.e. the data isn’t compressible), while a very small compression ratio will get you near the top in terms of performance (about 260 MB/s for read and write throughput). You can use this measure of compression ratio to determine the “compressibility” of your data.
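If you just want a single number for a whole file, the ratio can be computed in one line (again, “file” is a placeholder, and this uses GNU stat):

laytonjb@laytonjb-laptop:~$ echo "scale=3; $(gzip -c -6 file | wc -c) / $(stat -c %s file)" | bc

A result close to 1 means the data is essentially incompressible and you should expect something near the 65 MB/s end of the write performance range; a result well below 1 means the data compresses nicely and the drive should be closer to its peak.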

Alternatively, you could also just buy a SandForce SF-1200 based device and run various tests against the drive using your data sets.

Summary

Joe Landman made a very nice discovery about SF-1200 based SSD devices – they use on-the-fly compression. This article talked about Joe’s blog post and offered some conjecture about what SandForce is doing with the controller. It also talked about the implications of the design for data capacity and performance. The really, really cool result is that the compressibility of your data drives both the performance and the “life” of the SSD.

That’s a pretty interesting concept if you ask me. Having an SSD whose performance and longevity depend on your specific data is pretty unique. If you have data that is very difficult to compress, then you get some baseline level of performance and longevity from your drive. But if your data is very compressible, then you get a boost in both performance and longevity. This is an interesting design that could point to some future SSD controller designs. Imagine taking Intel’s or Pliant’s SSD controller and combining it with SandForce’s compression technology – you could get some very good performance for data that is incompressible but also have the possibility of going well above that performance level if the data is compressible. I guess you could think of it as analogous to Intel’s and AMD’s “Turbo mode” for CPUs, which allows cores to increase their clock speed if there is thermal headroom left in the chip. Or you could draw an analogy with This is Spinal Tap, where the SF-1200 based drive goes to 11 instead of the usual 10 for compressible data.

Somewhat recently, SandForce announced the SF-2000 family, a new generation of controllers that promises even better performance than the SF-1200. SandForce is stating that you can reach up to 500 MB/s for sequential read and write performance and up to 60,000 IOPS for 4KB random reads and writes. These new controllers have a 6Gbps interface to help performance, but I can’t help but assume that the controller itself has been improved as well.
