SSDs (Solid-State Drives) are a hot topic right now for a number of reasons, not the least of which is their power-to-performance ratio. But to better understand SSDs, you should first get a grip on how they are constructed and on the features and limitations of these drives.
There definitely is enough promise in NAND Flash chips to justify their development, but there are limitations or challenges that need to be addressed to make the resulting drives compelling. Companies have been working on techniques for overcoming these limitations and this section will describe a few of them. The two biggest problems that manufacturers have been addressing are: the erase/program cycle limitation, and the performance problem when overwriting old data (overcoming the slow read/erase/program cycle).
Erasing/Writing Data Scenario
Before jumping into the techniques that are being used to improve SSDs, let’s examine a simple scenario that illustrates a fairly severe problem. Let’s assume we have an application that wants to write some data to an SSD drive. The drive controller puts the data on some unused pages (i.e., pages that have never had data written to them, so this is done with a write operation with no need to do an erase first). This data is much smaller than a block (less than 512KB). Then some additional data is written to the SSD, and this too is written to unused pages in the same block.
Then the application decides to erase the first piece of data. Recall that the SSD can’t erase individual pages but only blocks. In an effort to save time and improve performance, the drive controller just marks those pages as unused but does not erase them.
Next, another application writes data to the drive that will use the remaining pages in the block, including the pages marked as unused. Because those pages still hold the old data, the controller has to erase them before they can be reused, and since erasure works only on whole blocks, the entire block has to be erased. The basic process that the controller goes through is something like the following:
- Copy the entire contents of the block to a temporary location (likely cache)
- Remove the unused data from the cache (this is the erased data from the first write)
- Add the new data to the block in cache
- Erase the targeted block on the SSD drive
- Copy the entire block from the cache to the recently erased block
- Empty the cache
As you can see, the process can be very involved. Just because the first few pages of data were erased by the application, the entire block had to be erased just to reuse those pages. Recall that this can be a very expensive operation in terms of time because of the copying of data and the erasing of the block (recall that erasure is 2-3 orders of magnitude slower than a read). This also kills write throughput. Many of the techniques discussed below are used to help overcome this performance problem.
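To make the cost of that cycle concrete, here is a minimal Python sketch of the steps listed above. Everything in it is an illustrative assumption rather than real controller firmware: the `Block` class, the `rewrite_block` helper, and the 128 pages per block (512KB blocks divided into 4KB pages).

```python
PAGES_PER_BLOCK = 128  # assumption: 512KB block / 4KB page


class Block:
    """One erase block; None means an erased (writable) page."""

    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.stale = set()      # indexes marked unused but not yet erased
        self.erase_count = 0

    def erase(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.stale.clear()
        self.erase_count += 1


def rewrite_block(block, new_pages):
    """Run the copy/merge/erase/write-back cycle; return pages programmed."""
    cache = list(block.pages)                    # copy the block to cache
    for index in block.stale:                    # drop the deleted data
        cache[index] = None
    free = [i for i, page in enumerate(cache) if page is None]
    if len(new_pages) > len(free):
        raise ValueError("new data does not fit in this block")
    for index, page in zip(free, new_pages):     # merge new data in cache
        cache[index] = page
    block.erase()                                # erase the whole block
    programmed = 0
    for index, page in enumerate(cache):         # write the block back
        if page is not None:
            block.pages[index] = page
            programmed += 1
    cache.clear()                                # empty the cache
    return programmed


block = Block()
block.pages[0:4] = ["A"] * 4        # first write: 4 pages
block.pages[4:64] = ["B"] * 60      # second write: 60 pages
block.stale.update(range(4))        # the application deletes the first write
programmed = rewrite_block(block, ["C"] * 68)   # third write fills the block
print(programmed, block.erase_count)            # 128 1
```

In this toy run the third write carries only 68 new pages, yet the drive erases the block once and programs all 128 pages, because the 60 pages from the second write have to be written back as well.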
Wear Leveling
One of the biggest challenges in using NAND Flash chips is the limited number of erase/program cycles, and it was one of the first to be addressed by manufacturers. Many of the controllers in SSD drives keep track of the number of erase/program cycles used by the cells in the drive. The controller then tries to place data so as to avoid “hot spots,” where certain cells have far fewer erase/program cycles remaining than the rest. This is commonly referred to as “wear leveling.” This approach has been fairly successful in avoiding hot spots within the drives, but it does require a fair amount of work by the SSD drive controller.
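As a rough illustration of the idea, the sketch below compares a policy that always reuses the same block with one that always picks the least-worn block. The `simulate` function and its per-block counters are assumptions made up for this example; real controllers use far more elaborate policies.

```python
def simulate(num_blocks, num_writes, wear_leveling):
    """Return per-block erase counts after num_writes block rewrites."""
    erase_counts = [0] * num_blocks
    for _ in range(num_writes):
        if wear_leveling:
            # Pick the block with the fewest erase cycles used so far.
            target = min(range(num_blocks), key=erase_counts.__getitem__)
        else:
            # Naive policy: keep hammering the same block (a hot spot).
            target = 0
        erase_counts[target] += 1
    return erase_counts


naive = simulate(num_blocks=8, num_writes=800, wear_leveling=False)
leveled = simulate(num_blocks=8, num_writes=800, wear_leveling=True)
print(max(naive), max(leveled))   # 800 100
```

With the same 800 rewrites, the most-worn block sees 800 erases in the naive case but only 100 when the wear is spread across all eight blocks.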
Over-Provisioning
The idea behind over-provisioning is to have a “reserve” of spare pages and blocks (capacity) that can be used for various needs by the controller. This spare capacity is not presented to the OS, so only the drive controller knows about it. However, this spare capacity does diminish the usable capacity of the drive. For example, the drive may have 64GB of actual capacity but appear as only 50GB to the OS. The drive therefore has a spare capacity of 14GB (it is over-provisioned by 14GB). In effect you are paying for space you cannot use. However, this spare capacity can be very useful.
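The arithmetic for the 64GB/50GB example can be written out as a small helper. The function name is hypothetical, and expressing the spare space as a fraction of the visible capacity is only one convention (an assumption here); the fraction of physical capacity is shown alongside it.

```python
def over_provisioning(physical_gb, visible_gb):
    """Return (spare GB, spare / visible, spare / physical)."""
    spare_gb = physical_gb - visible_gb
    return spare_gb, spare_gb / visible_gb, spare_gb / physical_gb


spare_gb, vs_visible, vs_physical = over_provisioning(64, 50)
print(spare_gb)                  # 14
print(f"{vs_visible:.0%}")       # 28%
print(f"{vs_physical:.1%}")      # 21.9%
```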
Let’s return to the erasing/writing data scenario. The first few pages are marked as unused but haven’t been erased yet and the second data write has stored data on the pages. Now the third data write needs the remaining pages including the unused pages on the block. This triggers the cycle of copying the entire block to a cache, merging the new data into the cache, erasing the targeted block, and writing the new block from cache to the drive. But now, we have some extra space that might be useful with this third data write.
Instead of having to erase the unused portion of the block to accommodate the third data write, the controller can use some of the spare space instead. This means that the sequence of reading the entire block, merging the new data, erasing the block, and writing the entire new block can be avoided. The controller just maps spare space to be part of the drive capacity (so it is seen by the OS) and moves the unused pages to the spare capacity portion of the drive. The write then occurs using the “fresh” spare space. At some point, though, the unused pages will still have to be erased, forcing the erase/write sequence and hurting performance.
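One way to picture this remapping is a logical-to-physical table plus a pool of spare blocks. The sketch below is a simplified assumption, not a description of any real controller: the `Controller` class and its method names are invented for illustration, and copying the still-valid pages into the fresh block is left out.

```python
class Controller:
    """Toy logical-to-physical block map with a pool of spare blocks."""

    def __init__(self, visible_blocks, spare_blocks):
        self.mapping = {n: n for n in range(visible_blocks)}
        self.spare = list(range(visible_blocks, visible_blocks + spare_blocks))
        self.pending_erase = []   # dirty blocks waiting to be cleaned

    def write_needing_erase(self, logical):
        """Handle a write that would otherwise force an erase first."""
        if self.spare:
            dirty = self.mapping[logical]
            self.mapping[logical] = self.spare.pop(0)  # use a fresh block
            self.pending_erase.append(dirty)           # erase it later
            return "remapped"
        return "erased-inline"    # no spare left: pay the full cycle now

    def background_erase(self):
        """Clean deferred blocks while the drive is idle."""
        while self.pending_erase:
            self.spare.append(self.pending_erase.pop(0))


ctl = Controller(visible_blocks=4, spare_blocks=1)
first = ctl.write_needing_erase(0)    # spare available
second = ctl.write_needing_erase(1)   # spare pool is now empty
ctl.background_erase()                # idle time refills the pool
third = ctl.write_needing_erase(1)
print(first, second, third)           # remapped erased-inline remapped
```

The second write shows the catch: once the spare pool is exhausted, the expensive cycle is back on the write path until some erases are done in the background.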
In an effort to save performance there are some controllers that have logic that tries to do the erase/write sequence in the background or when the drive is not being used. While this can work in some cases, it may not help drives that are very heavily used since there isn’t much time when the drive is “quiet.”
In addition to helping performance, the spare capacity can also be used when severe hot spots or bad areas develop in the drive. For example, if a certain set of pages or even blocks has far fewer erase/write cycles remaining than most of the drive, then the controller can just map spare pages or blocks to be used instead. Moreover, the controller can watch for bad writes and use the spare capacity as a “backup” for bad spots (similar to extra blocks on hard drives). The controller can check for bad writes by doing an immediate read after the write (recall that reads are far faster than writes). If the data read back does not match the data written, then the write is considered bad. The controller then remaps that part of the drive to some spare pages or blocks within the drive.
It’s fairly obvious that over-provisioning, while using capacity, can increase the performance and the data integrity of the drive.
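The read-after-write check can be sketched in a few lines. The `FakeFlash` class stands in for the NAND array and `verified_write` for the controller logic; both are hypothetical names, and a bad block is modelled very crudely as one that stores garbage.

```python
class FakeFlash:
    """Stand-in for the NAND array; bad blocks silently corrupt data."""

    def __init__(self, bad_blocks):
        self.bad_blocks = set(bad_blocks)
        self.cells = {}

    def program(self, physical, data):
        # Garbage is modelled as None so it never matches real data.
        self.cells[physical] = None if physical in self.bad_blocks else data

    def read(self, physical):
        return self.cells.get(physical)


def verified_write(flash, mapping, spare, logical, data):
    """Write, read back, and remap to a spare block on a mismatch."""
    while True:
        physical = mapping[logical]
        flash.program(physical, data)
        if flash.read(physical) == data:     # immediate read-back check
            return physical
        if not spare:
            raise RuntimeError("no spare blocks left to remap to")
        mapping[logical] = spare.pop(0)      # retire the bad block


flash = FakeFlash(bad_blocks={2})
mapping = {0: 0, 1: 1, 2: 2}    # logical block -> physical block
spare = [3, 4]
landed = verified_write(flash, mapping, spare, logical=2, data=b"payload")
print(landed, mapping[2], spare)   # 3 3 [4]
```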
Write Amplification Avoidance
One side effect of wear leveling is that the number of writes the controller must perform to even out the wear across the cells sometimes increases. But the number of writes (erase/write cycles) is something that should be minimized, since the cells have a limited number of them. SSD drive controllers go to great lengths to minimize the number of writes. With the inclusion of over-provisioning, write amplification can be reduced by using the spare cells. The spare space can also be combined with logic that holds data in a buffer and waits for some period of time, anticipating additional data changes, before the data is actually written. This too can help reduce the number of writes.
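Write amplification is commonly expressed as the ratio of the data physically written to the flash to the data the host asked to write. A minimal sketch, with a hypothetical helper name and made-up page counts that assume 128 pages per block:

```python
def write_amplification(host_pages, flash_pages):
    """Ratio of pages programmed on flash to pages the host wrote."""
    if host_pages <= 0:
        raise ValueError("host_pages must be positive")
    return flash_pages / host_pages


# The host changes 4 pages, but the whole 128-page block is rewritten.
worst = write_amplification(host_pages=4, flash_pages=128)
# The host writes a full block's worth of pages in one go.
ideal = write_amplification(host_pages=128, flash_pages=128)
print(worst, ideal)   # 32.0 1.0
```

A ratio close to 1 means very few of the cells' limited erase/write cycles are being spent on housekeeping.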
Internal RAID
While it is not specifically called out in many drives, newer SSD drive controllers are capable of performing internal RAID. This is RAID for performance, not necessarily RAID for data reliability. Rotating media has a single drive head that actually does the reading and writing from the disk, so the IO path in the drive is a serial operation. But SSD drives do not have a drive head. Consequently, the controllers and drives can be designed such that several data operations happen in parallel.
The benefits of internal RAID are fairly obvious. A drive could use a version of RAID-0 to split the data into multiple parts for writing or reading (don’t forget that SSD drives already have fantastic read performance).
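The RAID-0 idea amounts to dealing chunks of data out round-robin so that each lane can work at the same time. The sketch below only shows the data layout, not the parallel IO itself; the function names, the four lanes, and the chunk size are all assumptions for illustration.

```python
def stripe(data, channels, chunk_size):
    """Deal fixed-size chunks of data round-robin across the channels."""
    lanes = [[] for _ in range(channels)]
    offsets = range(0, len(data), chunk_size)
    for number, offset in enumerate(offsets):
        lanes[number % channels].append(data[offset:offset + chunk_size])
    return lanes


def unstripe(lanes):
    """Reassemble the original byte stream from the striped lanes."""
    chunks = []
    for row in range(max(len(lane) for lane in lanes)):
        for lane in lanes:
            if row < len(lane):
                chunks.append(lane[row])
    return b"".join(chunks)


data = bytes(range(40))
lanes = stripe(data, channels=4, chunk_size=4)
print([len(lane) for lane in lanes])   # [3, 3, 2, 2]
print(unstripe(lanes) == data)         # True
```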
Write Buffering
Buffering writes is a fairly common technique in which the controller holds the incoming write data in a buffer and reorders the operations to better suit the SSD drive. For example, it may hold incoming data for some period of time, anticipating that more neighboring data may be forthcoming. This is especially important in the case of the block erase limitation. The controller will try to buffer data as long as possible, trying to accumulate one full block of data in the buffer before committing it to the drive. This makes writing the data much more efficient because the entire block is written at once.
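A bare-bones version of such a buffer might look like the sketch below. The `WriteBuffer` class and the 128 pages per block are assumptions for illustration; a real controller would also have to flush on a timeout or on power loss, which is ignored here.

```python
class WriteBuffer:
    """Hold incoming pages until a full block can be written at once."""

    def __init__(self, pages_per_block=128):
        self.pages_per_block = pages_per_block
        self.pending = []

    def write(self, pages):
        """Queue pages; return any full blocks that are ready to commit."""
        self.pending.extend(pages)
        ready = []
        while len(self.pending) >= self.pages_per_block:
            ready.append(self.pending[:self.pages_per_block])
            del self.pending[:self.pages_per_block]
        return ready


buf = WriteBuffer()
first = buf.write(range(100))    # not enough for a block yet
second = buf.write(range(100))   # 200 pages queued: one block is ready
print(len(first), len(second), len(buf.pending))   # 0 1 72
```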
TRIM Command
The TRIM command is a great way for SSD drives to maintain good performance by wiping pages clean when they are deleted, prior to new data being written to them. The TRIM command isn’t in the Linux kernel as far as I know, and drives that support the command are only now appearing. But TRIM support is available in Windows 7 (ouch).
The TRIM command works by telling the controller which pages are no longer in use, prompting an actual erase of those pages during the data delete step, where performance may not be as important as during a write step. In other words, when one or more pages are deleted by the application, they are erased right away rather than at the next write.
If we return to the erase/write scenario, after the second data write, the data from the first write is erased (removed). Normally the drive controller defers the actual erase step until it is absolutely necessary. Unfortunately, it becomes necessary when the very next write needs that unused space, forcing the block to go through the whole process of erase/write and impacting the write performance of the drive. The TRIM command forces the controller to do the actual erase of the unused pages during the data delete step. The additional overhead of copying the good data from the block to cache, erasing the entire block, and then copying the cache back to the block all happens during the data delete, where performance may not be as big an issue as during a data write.
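The difference comes down to when the erase cost is paid. The toy model below follows the simplified picture used in this article; the `Drive` class and its counters are invented for illustration and say nothing about how a particular drive schedules its cleanup.

```python
class Drive:
    """Toy model that counts erases by the step that triggered them."""

    def __init__(self, trim_supported):
        self.trim_supported = trim_supported
        self.stale_pages = 0
        self.erases_during_delete = 0
        self.erases_during_write = 0

    def delete(self, pages):
        if self.trim_supported:
            self.erases_during_delete += 1   # clean up at delete time
        else:
            self.stale_pages += pages        # just mark the pages unused

    def write_into_reclaimed_space(self):
        if self.stale_pages:                 # cleanup lands on the write
            self.erases_during_write += 1
            self.stale_pages = 0


with_trim = Drive(trim_supported=True)
without_trim = Drive(trim_supported=False)
for drive in (with_trim, without_trim):
    drive.delete(pages=4)
    drive.write_into_reclaimed_space()
print(with_trim.erases_during_write, without_trim.erases_during_write)  # 0 1
```

Both drives perform one erase in total; TRIM only moves it off the write path.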
Summary
SSD drives are an exciting technology for data storage and IO performance. The drives have been out for a while, and the prices are gradually starting to fall, making them more appealing for broader use. But as with all new technologies, there are benefits and limitations.
This article is just a brief overview of SSDs, starting with the basic technology, floating-gate transistors, so that we can understand the source of the limitations (and benefits) of SSD drives. Using floating-gate transistors, the storage is built into pages, then blocks, then planes, and finally into chips and drives. The benefits of SSD drives themselves have been discussed fairly pervasively:
- The seek times are extremely small because there are no mechanical parts. This gives SSD drives amazing IOPS performance.
- The performance is asymmetrical in reads and writes, with reads being amazingly fast and writes not so fast but still with very good performance.
- While not discussed in this article, because there are no moving parts in the drive, there is no danger of the drive head impacting the platters and causing the loss of data.
This article focused a bit more on the limitations of SSD drives, which are a result of the floating-gate transistors but also of the design of the NAND Flash arrays. These limitations are:
- The performance is asymmetrical in reads and writes, with reads being amazingly fast and writes not so fast but still with very good performance (this is both a feature and a limitation for SSDs).
- Floating-gate transistors, and subsequently SSD drives, have a limited number of erase/program cycles, after which they are incapable of storing any data. SLC cells have a limit of about 100,000 cycles, while MLC cells have a limit of about 5,000-10,000 cycles.
- Due to the construction of the NAND Flash chips, data can only be erased in block units (512KB) but can be written in page (4KB) units.
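To put the cycle limits and block geometry into rough numbers, here is a back-of-the-envelope sketch. The helper names are hypothetical, and the endurance figure is a best case that assumes perfect wear leveling and no write amplification, which no real drive achieves.

```python
def pages_per_block(block_kb=512, page_kb=4):
    """Number of pages that share one erase block."""
    return block_kb // page_kb


def ideal_endurance_tb(capacity_gb, cycles_per_cell, write_amplification=1.0):
    """Best-case total data writable over the life of the drive, in TB."""
    return capacity_gb * cycles_per_cell / write_amplification / 1000


print(pages_per_block())                  # 128
print(ideal_endurance_tb(64, 10_000))     # 640.0  (MLC, upper figure)
print(ideal_endurance_tb(64, 100_000))    # 6400.0 (SLC)
```

Dividing by a realistic write amplification figure shrinks these numbers accordingly, which is why the controller techniques described in this article matter so much.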
As pointed out, some of these limitations give rise to problems in SSD drives. But SSD drive manufacturers are moving to address these problems, as discussed in the article.
SSD drives are without a doubt very “cool” technology that can help solve some IO and storage challenges. But before buying a fairly expensive drive, it is good to understand the limitations of the technology so you can make an informed decision. Equally important is that understanding the limitations will help you understand any test results, either good or bad, for your workloads.