
Saving Your Data Bacon with Write Barriers and Journal Check Summing

Mmmm.... bacon. This article examines two mechanisms to prevent data loss -- write barriers and check summing. Both can be particularly important for drives with larger and larger caches. Pay attention: This can save your data bacon.

Writing file systems is definitely a very complicated task. The basics of a file system can be written reasonably easily by those so inclined, but making a file system robust, POSIX compliant, fast, and useful for the general masses can be extraordinarily difficult and complex. File system developers spend a great deal of time examining “corner” cases that could lead to data loss. Handling these cases makes file systems more complicated and sometimes has an impact on performance, but they have to be addressed to prevent data loss.

One of the corner cases that file systems have grappled with revolves around how journaled file systems function. The ext4 developers have spent a great deal of time addressing possible scenarios in which journal corruption can lead to data corruption. In particular, ext4 uses journal check summing and write barriers to help reduce the probability of data corruption from journals and out-of-order write caching on hard drives.

Journal Check Summing

A checksum is a simple way to compute a single number representing a larger block of data. This single number can be used to check that particular block of data for any changes. Somewhat obviously, checksums can be used to check the validity of a block of data such as a journal transaction.
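To make the idea concrete, here is a minimal sketch in Python. It assumes CRC32 (via the standard zlib module) as the checksum purely for illustration, and the sample data and the helper names checksum and is_valid are made up for this example.

```python
import zlib


def checksum(block: bytes) -> int:
    """Return a 32-bit CRC of the block (CRC32 is an illustrative choice)."""
    return zlib.crc32(block) & 0xFFFFFFFF


def is_valid(block: bytes, expected: int) -> bool:
    """True if the block still matches the checksum recorded earlier."""
    return checksum(block) == expected


# Hypothetical block of data standing in for a journal transaction.
original = b"journal transaction: update inode 42"
stored = checksum(original)

# Flip a single bit to simulate the block changing after the checksum was taken.
damaged = bytearray(original)
damaged[0] ^= 0x01
damaged = bytes(damaged)

print(is_valid(original, stored))  # unchanged block
print(is_valid(damaged, stored))   # block with one flipped bit
```

The point is only that a small number computed when the data is written can later tell you whether the data is still what you wrote.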

Recall that a journaled file system keeps a log (journal) of the file operations. Then the file system “plays” each operation or transaction, and once the operation is completed it is deleted from the journal. If the system crashes, such as from a power failure, then upon reboot only the journal has to be replayed to get a consistent file system (as opposed to performing a file system check, fsck, to ensure the file system is consistent for non-journaled file systems).

One of the keys to a journaled file system is the definition of “completed” from the point of view of the file system. Does completed mean the data is actually on the platters? Or does it mean that the data has been given to the disk and can be in drive cache rather than the actual platters? The difference between these two is a result of the drive cache and can have implications for the file system journal.

One key feature that many people forget is that the journal is also typically kept on a disk, and many times on the same disk that contains the file system. Any discussion about writing data to the drive and the impact of caching is also true for the journal itself! This is an important fact that many people, including myself, often forget. Ideally a journaled file system wants to be sure that the data has made it to the actual platters before it deletes the transaction from the journal. However, drive caches can perform operations out of order for better performance, so the file system is never sure exactly when the data is actually on the disk. But, given enough time, the data will make it to the actual disk platters.

When a transaction is being committed to the journal the relevant pieces of the transaction are written to the journal (these are termed the transaction log). Once the entire transaction log has been written to the journal, then a “commit block” is written indicating that the transaction log in the journal is complete. But recall that today’s drives have very large caches and can put write operations out of order.

What can happen is that the “commit block” of a transaction is written to the actual disk while the other blocks of the transaction log have not yet been written (remember that writes can be out of order because of the drive cache). If a power failure happens before the other blocks are written to the disk, then the system has to be rebooted and the journal is replayed. When the journal is replayed, the transactions that have a commit block written, indicating that the entire transaction log is in the journal, are replayed. However, some pieces of the transaction log never made it to the disk because they were still in the drive cache when the power failed. This means that the file system is replaying corrupted journal data, resulting in corrupted file system data (the old “garbage in, garbage out”).

Chris Mason, a well known kernel and file system developer, created a simple test program that can cause data corruption through this process. You basically run the program and during the operation pull the plug on the system. According to Chris, about 50% of the time it corrupted his ext3 file system.

A way to counter the journal corruption is to compute a checksum of the transaction log and write it as part of the commit block process. Then if the journal has to be replayed, it can check the transaction log against the checksum. If they don’t match, then you have a corrupted journal entry and it won’t be replayed, which eliminates that source of file system data corruption. You have lost the data associated with the journal entry but at least you have not corrupted your file system.
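The scenario can be sketched with a toy model in Python. This is not how ext4 lays out its journal; the dictionary used as a commit block, the tiny 16-byte blocks, and the function names are all assumptions made for illustration. The sketch only shows the logic: the commit block carries a checksum of the transaction log, and replay is refused when the blocks found on disk do not match it.

```python
import zlib
from typing import Dict, List, Optional


def crc_of(blocks: List[bytes]) -> int:
    """Running CRC32 over all blocks of a transaction log."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc & 0xFFFFFFFF


def make_commit_block(log_blocks: List[bytes]) -> Dict[str, int]:
    """Toy commit block: block count plus checksum of the whole log."""
    return {"count": len(log_blocks), "crc": crc_of(log_blocks)}


def safe_to_replay(on_disk: List[bytes], commit: Optional[Dict[str, int]]) -> bool:
    """Replay only if a commit block exists and the log matches its checksum."""
    if commit is None:
        return False
    if len(on_disk) != commit["count"]:
        return False
    return crc_of(on_disk) == commit["crc"]


# The transaction the file system intended to write.
log = [b"A" * 16, b"B" * 16, b"C" * 16]
commit = make_commit_block(log)

# Crash case: the commit block reached the platters, but the second log
# block was still in the drive cache, so the disk holds stale bytes there.
after_crash = [log[0], b"\x00" * 16, log[2]]

print(safe_to_replay(log, commit))          # complete transaction
print(safe_to_replay(after_crash, commit))  # incomplete transaction is skipped
```

Without the checksum, the second case would look like a perfectly good transaction because its commit block is present.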

One might ask the question, “what is the probability of getting a corrupted journal?” Trying to estimate an approximate answer is complex, but the problem is being made worse by larger drive caches. As drives have gotten larger the drive cache has grown as well, to the point where many 2TB drives have 64MB of cache. This potentially increases the amount of time it takes for a piece of data to actually make it to the platter, because the drive can make decisions about how the data is actually written to the drive (i.e. it increases the out-of-order-ness of the data in the cache). Consequently, the risk of a corrupted journal increases.

Journal check summing can help with data corruption issues but there are other techniques that can help as well.

Write Barriers

As explained previously, there is a risk of the journal becoming corrupted, possibly causing file system corruption. In addition to a corrupted journal, you also run the risk of losing drive-cached data in the event of a power failure. Any data that is in the drive cache is lost when the power fails. Adding a UPS to the system allows enough time to flush the drive cache (and potentially the system caches), solving the problem. Another way to truly make sure the data is written to disk is to turn off the drive cache.

Before you crucify me, I admit that turning off the drive cache will impact performance, potentially by a great deal, but this way you are guaranteed no data corruption due to the drive cache because everything is written straight to the platters. If you use RAID arrays you can possibly also get away with turning off drive caches because the RAID controller should have a battery-backed cache that caches the data requests and can hold them through a power outage. So you may not see as much of a performance hit using a RAID controller and turning off the drive cache.

An alternative to turning off the write cache on hard drives is to use something called write barriers. Write barriers basically flush the drive cache at appropriate times. In particular, a write barrier should happen before the journal commit to make sure all of the transaction log is on the disk. Then a write barrier will happen after the commit to make sure the journal entry is correct (theoretically). But, since the disk cache is being forced to flush, write barriers can have an impact on performance. However, it is generally felt that the performance penalty from write barriers is less than that resulting from turning off the drive cache.
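The ordering described above can be imitated from user space, which may help make it concrete. The sketch below is only an analogy, not the kernel's barrier code: it uses os.fsync() as a stand-in for "flush before going on", and whether fsync() actually forces the drive cache to the platters depends on the file system, the barrier setting, and the drive. The function name commit_transaction, the b"COMMIT" marker, and the record layout are assumptions invented for this example.

```python
import os
import tempfile
import zlib


def commit_transaction(fd: int, log_blocks) -> bytes:
    """Write log blocks, flush, write a commit record, flush again."""
    # 1. Write the transaction log blocks.
    for block in log_blocks:
        os.write(fd, block)

    # 2. "Barrier" before the commit: do not go on until the log is flushed.
    os.fsync(fd)

    # 3. Write the commit record, carrying a checksum of the log.
    crc = 0
    for block in log_blocks:
        crc = zlib.crc32(block, crc)
    commit = b"COMMIT" + crc.to_bytes(4, "big")
    os.write(fd, commit)

    # 4. "Barrier" after the commit: flush the commit record as well.
    os.fsync(fd)
    return commit


# Small demonstration against a temporary file standing in for the journal.
fd, path = tempfile.mkstemp()
try:
    commit = commit_transaction(fd, [b"block-1\n", b"block-2\n"])
finally:
    os.close(fd)

with open(path, "rb") as f:
    journal_bytes = f.read()
os.remove(path)
```

The two flushes are where the performance cost comes from: each one makes the writer wait instead of letting the cache absorb the write.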

Write barriers have been in the kernel for some time, but file systems don’t always support them and some distributions or file systems don’t use them by default. For example, in many distributions, write barriers are not turned on by default for ext3. You need to add the option “barrier=1” to /etc/fstab to make sure it is turned on (“barrier=0” turns it off). There are also file systems that don’t support write barriers. The way to find out is to try “barrier=1” as a mount option with the file system and then examine /var/log/messages and look for messages about barriers. You can easily tell when barriers are disabled or not used.

There are also file systems that use barriers by default. I believe ext4 uses write barriers by default but there could be distributions that configure it not to use them. Just to be safe it is recommended to put the “barrier=1” option in /etc/fstab to make sure that write barriers are activated.
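If you want to check what a mounted file system reports, a small Python sketch like the following can pull a barrier-related option out of a line in the /etc/fstab or /proc/mounts format (device, mount point, type, options, and so on). The function name barrier_setting and the sample lines are made up for illustration. One assumption to keep in mind: the kernel may not list an option that is simply left at its default, so a result of None only means "not listed", not "barriers are off".

```python
from typing import Optional


def barrier_setting(mount_line: str) -> Optional[str]:
    """Return the barrier-related option in a mount-table line, or None."""
    fields = mount_line.split()
    if len(fields) < 4:
        return None  # not a well-formed mount-table line
    # The fourth field holds the comma-separated mount options.
    for option in fields[3].split(","):
        if option in ("barrier", "nobarrier") or option.startswith("barrier="):
            return option
    return None


# Hypothetical sample lines in /proc/mounts format.
print(barrier_setting("/dev/sda1 / ext3 rw,barrier=1 0 0"))
print(barrier_setting("/dev/sdb1 /data ext4 rw,relatime 0 0"))
```

To look at the live system, you could feed each line of /proc/mounts through the same function.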

If you are using LVM (Logical Volume Manager) or md (Linux software RAID), then in the fairly recent past write barriers would not work correctly (basically write barriers were ignored). However, all of that has changed with some fairly recent kernels. As of the 2.6.29 kernel, all write barriers will be respected by LVM. Prior to that kernel, write barriers were ignored by LVM, but I’m not sure whether the Red Hat or SUSE kernels, which contain some backported capabilities, respect write barriers or not. If you are using the Linux md capability (software RAID), then as of 2.6.33 all of the md devices support barrier requests.
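As a rough aid, the version thresholds above can be turned into a quick check. This Python sketch compares a kernel release string against 2.6.29 (LVM) and 2.6.33 (md); the function names and the sample release strings are assumptions for illustration. It only reasons about the version number, so it cannot tell you anything about distribution kernels with backported changes.

```python
import re
from typing import Tuple

# Thresholds taken from the text above.
MINIMUM_KERNEL = {
    "lvm": (2, 6, 29),
    "md": (2, 6, 33),
}


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '2.6.32-71.el6'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch) if patch else 0


def barriers_respected(release: str, layer: str) -> bool:
    """True if the version number alone meets the threshold for the layer."""
    return parse_kernel_release(release) >= MINIMUM_KERNEL[layer]


print(barriers_respected("2.6.32-71.el6", "lvm"))
print(barriers_respected("2.6.32-71.el6", "md"))
```

On a running system, platform.release() from the standard library returns the release string to pass in.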

Summary

You may not have wanted to get so in-depth with your file system but the techniques covered here are pretty important to understand particularly because there is a risk of data corruption. Journal check summing can definitely help stop a corrupted journal from contaminating a file system. In ext4 journal check summing is turned on by default but it’s always good to be sure it is by putting it in your /etc/fstab file.

Another technique that helps with data corruption is write barriers. They force the disk to flush the cache at certain points which can really help in reducing the risk of data corruption. But at the same time there is a performance impact from using write barriers.

A third alternative to help reduce the risk of data corruption is to turn off the write cache on your drives. For single drives this could greatly reduce performance but with a properly battery-backed RAID card with cache, the impact may not be as severe.

So you have a choice between reducing the risk of data corruption and getting reduced performance or you can go for more performance and run an increased risk of data corruption. I don’t know of a way to compute what the risk might be but many file system developers are saying that it is likely much greater than you think. Ultimately the decision is yours so I hope this article has broached the subject and you will do your homework.

Comments on "Saving Your Data Bacon with Write Barriers and Journal Check Summing"

stevemadere

How about putting the journal on a separate disk with the write cache turned off? Performance should be better than a single drive with write cache turned on because the journal disk only writes linearly (80 Mb/sec sustained typically) and the data disk can cache to its heart’s content. This seems like it might provide a ‘perfect’ solution if it weren’t for that ‘little’ problem of using up one of your disk drive bays and one of your controller ports (SATA or SCSI).

Yet another arrangement that can be a practical compromise is to place any cold storage (rarely accessed and typically in large contiguous blocks when it is accessed) on the same physical drive with the journal for your hot storage FS. The data for the hot storage FS is on a separate dedicated drive. This way, you’re not wasting drive space or bays but you do have to do the homework up front of figuring out which of your data fits the hot access profile and which fits the cold access profile.

aotto

Activating write barriers or disabling the disk cache are both things that most readers would be reluctant to do because of the reduced performance that will most certainly result.

Am I wrong to assume that any hard drive worth its salt has a capacitor on it that can power the drive long enough to commit any uncommitted parts of the local drive cache upon power failure?

With a properly designed hardware device, the need for write barriers should be redundant. With fewer cache flushes on the device, sustained performance should be much higher.

Also, using SSDs could be another way to get around this. At the very least, having one small SSD to hold the filesystem journal boosts performance a whole lot.

jafcobend

Thank you for the article. It has caused me to think quite a bit. Before I begin I would ask, “PLEASE, PLEASE, PLEASE proof read!” I’m still not sure how a couple of the sentences were supposed to be read. I haven’t spent any time investigating how the on-drive caches work so some of my thoughts may be from pure ignorance.

1. Does the on-board drive cache add that much in performance? Does the extra 3GB of system RAM that I have not trump whatever cache improvements I would get from the on-drive cache? I thought the schedulers, especially with the ability to pick the one that best fits your workload, would do a better job of caching and ordering writes than whatever mechanism is on the drive. I think I’m going to investigate turning off the on-drive caches on my system just to find out.

2. It seems from a theoretical standpoint that caching should not be done on the drive if you have an OS that is designed well at all, which I’m fairly confident Linux is. :-) The drive can never know the correct ordering required to push the data to platter to maintain data integrity. Nor can drives bring the kind of cache sizes to bear that a computer can.

3. Putting a file system log on another drive without caching ensures the integrity of the log but not of the data that it represents, specifically in the case of expanding files or new files. The data written to the file may not have made it to the data drive while the log says it did. So even though you have space allocated, the data within it is erroneous. And then again there are no guarantees with the log on the same drive and on-drive caching enabled.

4. Are the drives guaranteed to flush the caches before the system powers off? I would assume that the drive would flush the cache when it receives a power down signal. But that also assumes that the drive is given enough time to do so before the system power goes off. Perhaps the caps mentioned by a previous poster are real indeed?

5. Do all drives with cache support write barriers and cache control?

I hope this sparks some interesting discussion. But it seems to me that regardless of whatever slowdown you might get, you are better off letting Linux figure out the best way to order writes than you are letting the drive do it, except for the case of battery-backed caches.

I am a firm believer in decently long-lived battery backups regardless. Any computer handling data that is worth anything, and especially if it’s shared on a network, MUST have a decent battery backup, which is checked regularly to make sure the battery is still viable. But then during the spring and fall in my area the 60 MPH winds will cause numerous power failures in a single day.

Thanks again for the article.

laytonjb

@jafcobend:
I have to admit that this article was difficult to write because it is so technical and so involved. But I felt it was an important topic to at least broach so I went for it. If you have problems with any particular sentence or paragraph, please point it out. I did my best to proof the article; I can’t tell you how many times I re-read it :) But I’m sure I missed something (actually one person just pointed out a typo in a past article).

Now on to the questions… But before I answer them let me say that I don’t know that much about how drives work internally. I don’t know how the cache interacts with the kernel, etc. (what is really interesting is that the new Seagate drive with the built-in SSD apparently can “learn” and move certain blocks onto the SSD, so drives have far more intelligence than I knew about).

1. It’s always good to experiment with various options (write barriers, disabling the drive cache). It helps us understand the performance implications and then we can make an informed decision about choices we make. But to be honest, I don’t know the relative impact of drive cache vs. system memory. The only “gotcha” is that system memory can’t generally be allocated for IO; the kernel will do what it wants with it (I wish there was a way to force that, need to ask the kernel gods about that). BTW, if you have any data to share, let’s hear about it! You can always write a quick article for Linux Mag :)

2. Try turning off the drive cache. Performance isn’t always that great. Plus if the cache is disabled then the kernel will have to “pause” more often for the drive to return. This will definitely impact performance.

3. Great observation.

4. The drives will flush as part of the shutdown. I forgot the details but there was some discussion on the ext4 mailing list where Larry McVoy (I hope I got his name correct from memory) pointed out that until the file system is unmounted, the drive may not truly flush its cache. I admit that I don’t know the details of sync() or fsync() well enough to know when data is truly flushed from cache.

5. I believe that all drives (within reason) obey the write barrier.

Thanks for the comments! Greatly appreciated. And again, if you have particular sentences or passages that seem goofy, let me know. Either I didn’t explain things well or I made mistakes.

Jeff

jtmcdole

@jafcobend

#3: There are different levels of journaling. The basic level of journaling is to log the file system metadata; you take a small performance hit for some security. You’re asking about full journaling, in which ALL user data is written to the log first. This can be done with some file systems, but you take a much bigger hit as the data has to be written to the disk twice.

For people who are really worried about their data, you could invest some time in parity files (PAR2 for example). I’m really shocked that the ext3 journal didn’t have checksums!

jab1

(I just found this interesting article!)

In a slightly ironic twist, not long after they finally got the I/O barriers working also with LVM/MD in 2.6.33, now they’re getting rid of them. See http://lkml.org/lkml/2010/9/3/199

No need to worry about data integrity with volatile write caches though; the barriers are being replaced with a (simpler) interface to issue cache flush and FUA commands to the devices. Beyond being simpler, the big win with the new code is avoiding IO queue draining. In the current ordered barrier code, ordering is ensured by draining the queue before flushing the cache, but in practice all file systems (well, reiserfs could optionally use the ordered barrier code for ordering) already take care of ordering themselves by waiting for completion before submitting dependent IO’s. So the draining turns out to be unnecessary, and it can have a rather big performance impact.

jab1

@jafcobend:

1) It’s not so much about increasing the amount of cache memory, which as you note is rather insignificant compared to all the RAM the kernel can use for the page cache. Rather it’s about reducing command latency. I.e. instead of “kernel issues write -> device writes stuff to disk -> device signals completion” you have “kernel issues write -> device writes to cache memory -> device signals completion”. Also, presumably the drive has the best information about where its head is at the moment, and is thus in the best position to decide in which order to serve I/O commands. Another way to reduce the impact of command latency is to have multiple outstanding commands; SCSI has had this since, well, forever with something called TCQ, which perhaps explains why SCSI devices typically don’t have volatile write caches. SATA nowadays has something roughly equivalent called NCQ, but it came on the scene after volatile write caches were already the norm in the (S)ATA world.

2) In Linux ordering is handled by the filesystems waiting for IO’s to complete before issuing dependent IO’s. The block layer, and the device command queue for devices supporting such a thing, are free to reorder IO’s in any way they see fit.

4) Sadly, I know of no traditional drive with capacitors. If it had that, it would be awesome; we could safely mount our filesystems with barrier=0 and still be safe. But, I’m sure there are mechanisms to ensure caches are flushed when shutting down; I don’t know if the OS explicitly has to do that when unmounting, or if hw handles it itself.

5) There are rumors about drives which don’t honor cache flush commands, but AFAIK more or less all drives nowadays do honor them. Also, to be pedantic, neither the SCSI nor SATA standards know anything about write barriers, they are purely a software concept in the Linux kernel (implemented via queue draining and cache flushing). However see my previous post about how they are being replaced.

grabur

Thank you. I always enjoy your articles.
