Mmmm.... bacon. This article examines two mechanisms for preventing data loss: write barriers and journal checksumming. Both are particularly important for drives with larger and larger caches. Pay attention: this can save your data bacon.
Writing file systems is a very complicated task. The basics of a file system can be written reasonably easily by those so inclined, but making a file system robust, POSIX compliant, fast, and useful for the general masses can be extraordinarily difficult. File system developers spend a great deal of time examining "corner" cases that could lead to data loss; handling them makes file systems more complicated and sometimes has an impact on performance. But these corner cases have to be addressed to prevent data loss.
One of the corner cases file systems have grappled with revolves around how journaled file systems function. Ext4 development has spent a great deal of time addressing scenarios in which journal corruption can lead to data corruption. In particular, ext4 uses journal checksumming and write barriers to help reduce the probability of data corruption from corrupted journals and out-of-order write caching on hard drives.
Journal Checksumming
A checksum is a simple way to compute a single number representing a larger block of data. That single number can later be used to detect any change to the block. Somewhat obviously, then, a checksum can be used to check the validity of a block of data such as a journal transaction.
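As a minimal illustration of the idea, here is a sketch in Python using a CRC32 checksum (the transaction data shown is purely invented for the example; ext4's on-disk format differs):

```python
import zlib

# A journal transaction's data, modeled as a block of bytes
# (a made-up example, not ext4's actual on-disk layout).
transaction = b"write inode 42: size=4096, blocks=[100, 101]"

# Compute a checksum: one small number summarizing the whole block.
checksum = zlib.crc32(transaction)

# Later, verify that the block has not changed.
assert zlib.crc32(transaction) == checksum

# Any change to the data, even a single byte, almost certainly
# yields a different checksum, so the change is detected.
corrupted = b"write inode 42: size=4096, blocks=[100, 999]"
assert zlib.crc32(corrupted) != checksum
```

The checksum is tiny compared to the data it summarizes, which is exactly why it is cheap to store alongside a journal commit.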
Recall that a journaled file system keeps a log (journal) of file operations. The file system "plays" each operation, or transaction, and once the operation is completed it is deleted from the journal. If the system crashes, such as from a power failure, then upon reboot only the journal has to be replayed to get a consistent file system (as opposed to performing a full file system check, fsck, as non-journaled file systems must).
One of the keys to a journaled file system is the definition of "completed" from the point of view of the file system. Does completed mean the data is actually on the platters? Or does it mean the data has been handed to the drive and may still be sitting in the drive cache rather than on the platters? The difference between the two comes down to the drive cache, and it has implications for the file system journal.
One key fact that many people, including myself, often forget is that the journal itself is typically kept on a disk, and often the same disk containing the file system. So any discussion about writing data to the drive and the impact of caching applies to the journal as well! Ideally, a journaled file system wants to be sure the data has made it to the actual platters before it deletes the transaction from the journal. However, drive caches can perform operations out of order for better performance, so the file system is never sure exactly when the data is actually on the disk. Given enough time, though, the data will make it to the actual platters.
When a transaction is being committed, the relevant pieces of the transaction are written to the journal (these are termed the transaction log). Once the entire transaction log has been written to the journal, a "commit block" is written, indicating that the transaction log in the journal is complete. But recall that today's drives have very large caches and can reorder write operations.
What can happen is that the "commit block" of a transaction gets written to the actual disk while other blocks of the transaction log do not (remember, writes can be reordered by the drive cache). If a power failure happens before those other blocks reach the disk, the system has to be rebooted and the journal replayed. During replay, any transaction with a commit block, indicating that its entire transaction log is in the journal, is replayed. But the pieces of that transaction log never made it to the disk; they were still in the drive cache when the power failed. The file system is therefore replaying corrupted journal data, resulting in corrupted file system data (the old "garbage in, garbage out").
Chris Mason, a well-known kernel and file system developer, created a simple test program that can trigger exactly this corruption. You basically run the program and, during the operation, pull the plug on the system. According to Chris, about 50% of the time it corrupted his ext3 file system.
One way to counter this journal corruption is to compute a checksum of the transaction log and write it as part of the commit block. If the journal then has to be replayed, the transaction log can be checked against the checksum. If they don't match, you have a corrupted journal entry, and it won't be replayed, eliminating the file system corruption. You have lost the data associated with that journal entry, but at least you have not corrupted your file system.
One might ask, "what is the probability of getting a corrupted journal?" Estimating even an approximate answer is complex, but the problem is being made worse by larger drive caches. As drives have gotten larger, drive caches have grown as well, to the point where many 2TB drives have 64MB of cache. This potentially increases the time it takes for a piece of data to actually reach the platters, because the drive has more latitude in deciding how the data is written (i.e., it increases the out-of-order-ness of the data in the cache). Consequently, the risk of a corrupted journal increases.
Journal check summing can help with data corruption issues but there are other techniques that can help as well.
As explained previously, there is a risk of the journal becoming corrupted, possibly causing file system corruption. In addition to a corrupted journal, you also run the risk of losing drive-cached data in the event of a power failure: any data in the drive cache is lost when the power fails. Adding a UPS to the system allows enough time to flush the drive cache (and potentially the system caches), solving the problem. Another way to truly make sure the data is written to disk is to turn off the drive cache.
Before you crucify me: I admit that turning off the drive cache will impact performance, potentially by a great deal, but this way you are guaranteed no data corruption due to the drive cache, because everything is written straight to the drive. If you use RAID arrays, you can possibly also get away with turning off drive caches, because the RAID controller should have a battery-backed cache that holds the data requests through a power outage. So you may not see as much of a performance hit with a RAID controller and the drive caches turned off.
An alternative to turning off the write cache on hard drives is to use something called write barriers. Write barriers basically flush the drive cache at appropriate times. In particular, a write barrier should happen before the journal commit, to make sure all of the transaction log is on the disk, and another write barrier should happen after the commit, to make sure the journal entry is complete (theoretically). Because the disk cache is being forced to flush, write barriers do have an impact on performance. However, it is generally felt that the performance penalty from write barriers is less than that from turning off the drive cache.
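The file system issues barriers internally, but the ordering they enforce has a rough application-level analogue using fsync(): force the log blocks to stable storage before the commit record, then force the commit record out too. A minimal sketch, with an invented file name and record contents (and note that whether fsync() actually flushes the drive cache depends on barriers being honored down the stack):

```python
import os

def commit_transaction(path, log_blocks, commit_record):
    """Write log blocks, flush, write commit record, flush.
    The two fsync() calls play the role of the two barriers."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for block in log_blocks:
            os.write(fd, block)
        os.fsync(fd)    # "barrier" 1: log blocks reach stable storage first
        os.write(fd, commit_record)
        os.fsync(fd)    # "barrier" 2: commit record reaches stable storage
    finally:
        os.close(fd)

# Hypothetical usage: journal.bin stands in for the on-disk journal.
commit_transaction("journal.bin", [b"block-A", b"block-B"], b"COMMIT")
```

The point of the ordering is that a commit record can never be durable while its log blocks are not, which is precisely the failure mode described above.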
Write barriers have been in the kernel for some time, but file systems don't always support them, and some distributions or file systems don't use them by default. For example, in many distributions write barriers are not turned on by default for ext3. You need to add the option "barrier=1" to /etc/fstab to make sure they are turned on ("barrier=0" turns them off). There are also file systems that don't support write barriers at all. The way to find out is to try "barrier=1" as a mount option with the file system and then examine /var/log/messages for messages mentioning barriers; you can easily tell when barriers are disabled or not used.
There are also file systems that use barriers by default. I believe ext4 uses write barriers by default, but there could be distributions that configure it not to use them. Just to be safe, it is recommended to put the "barrier=1" option in /etc/fstab to make sure that write barriers are activated.
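A hypothetical /etc/fstab entry with the option in place might look like this (the device and mount point are placeholders; substitute your own):

```
# device     mount point  type  options             dump  pass
/dev/sda2    /home        ext4  defaults,barrier=1  0     2
```

After editing, a remount (or reboot) is needed for the option to take effect.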
If you are using LVM (Logical Volume Manager) or md (Linux software RAID), then until fairly recently write barriers would not work correctly (basically, they were ignored). However, all of that has changed with recent kernels. As of the 2.6.29 kernel, write barriers are respected by LVM; prior to that kernel they were ignored, and I'm not sure whether Red Hat or SUSE kernels, which contain some backported capabilities, respect write barriers or not. If you are using md (software RAID), then as of 2.6.33 all md devices support barrier requests.
You may not have wanted to get this in-depth with your file system, but the techniques covered here are pretty important to understand, particularly because there is a risk of data corruption. Journal checksumming can definitely help stop a corrupted journal from contaminating a file system. In ext4, journal checksumming is turned on by default, but it's always good to make sure by specifying the option explicitly in /etc/fstab.
Another technique that helps with data corruption is write barriers. They force the disk to flush the cache at certain points which can really help in reducing the risk of data corruption. But at the same time there is a performance impact from using write barriers.
A third alternative to help reduce the risk of data corruption is to turn off the write cache on your drives. For single drives this could greatly reduce performance but with a properly battery-backed RAID card with cache, the impact may not be as severe.
So you have a choice: reduce the risk of data corruption and accept reduced performance, or go for more performance and run an increased risk of data corruption. I don't know of a way to compute what that risk might be, but many file system developers are saying it is likely much greater than you think. Ultimately the decision is yours; I hope this article has broached the subject and that you will do your homework.
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).