Don't know what a journaling filesystem is? You may wish you did the next time the power goes out in your machine room. We explain what they do and walk you through installing the reiserfs journaling filesystem.
You’re a good system administrator. Your machines and network run smoothly, you back things up, and everything is under control. But what if the power supply and backup power fail on your server? Things could get ugly when you finally restart the system, run fsck to check and repair the filesystems, and wait, and wait, and wait.
Provided, of course, that fsck can repair your hard drive. Hey, you tell yourself, this kind of power failure is a remote possibility and, after all, it’s not a perfect world.
No, it’s not a perfect world; but it would be a little more perfect if you had a journaling filesystem.
For most of Linux’s existence, ext2fs has been its standard filesystem. ext2fs has served long and well, but it is not well-poised to handle the new and upcoming challenges Linux faces in the commercial arena. Because ext2fs is a static filesystem, it does not guarantee that all updates to your hard drive are performed safely. This is a major stumbling block preventing widespread use of Linux as a database server, among other things.
Journaling filesystems are superior to static filesystems when it comes to guaranteeing data integrity and even when it comes to flat-out filesystem performance. Replacing the ext2fs static filesystem with a journaling filesystem will ultimately be a big win for all Linux users.
How Filesystems Work
A filesystem stores data on your hard drive by determining where each file’s blocks (chunks of the file’s content) should be stored and by maintaining a table of the locations of those blocks. The data structure specifying the location of a block (or a set of blocks) is called an inode (index node). The particular methods for allocating and retrieving file blocks determine the overall performance of reads and writes to disk and the reliability of the filesystem itself.
|Figure One: Generic example of directory and file inodes.|
Static filesystems such as ext2fs maintain a map of inodes, which point to data and directory blocks. Each inode has a number that uniquely identifies it. Directory blocks contain a table associating a list of inode numbers with the names of the files and other directories stored in that directory. A file’s inode contains information describing the file. This information includes the file’s inode number, metadata (such as ownership and permission information), size of the file, date the file was last accessed and/or modified, and a list of each of the file’s data blocks. For large files, this list can also contain indirect blocks, which are blocks that themselves contain lists of blocks allocated to the file. In turn, indirect blocks can contain lists of double-indirect blocks, and double-indirect blocks can point to triple indirect blocks. (See Figure One.)
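You can inspect these structures on any running Linux system: ls -i prints a file's inode number, and stat dumps the metadata held in the inode. A quick demonstration (the file name is arbitrary):

```shell
# Create a small file and inspect its inode (demo.txt is an arbitrary name)
echo "hello" > /tmp/demo.txt

# ls -i prints the inode number before the file name
ls -i /tmp/demo.txt

# stat shows the metadata stored in the inode: size, block count,
# ownership, permissions, and access/modification times
stat /tmp/demo.txt
```

Note that the file's name appears nowhere in this metadata; the name-to-inode mapping lives in the directory block, which is exactly why one inode can be hard-linked under several names.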
|Figure Two: An inode allocated to test.file and the directory it’s stored in.|
What happens when you change the contents of the file test.file? The inode for test.file lists three data blocks, which reside at disk locations 3110, 3111, and 3506. (See Figure Two.) The gaps between these blocks are indicative of fragmentation, probably because the data blocks between 3111 and 3506 were already allocated to other files. In order to read the entire file, the hard drive seeks to the 3110 area of the disk’s surface, reads two blocks, then seeks over to the 3500 area and reads the final block. (In general, filesystems actually use a readahead strategy, where more blocks than those actually requested are read from the disk and cached for later use.)
If you modified the second and third blocks of this file, the filesystem would read those blocks, make your changes, and then rewrite the blocks at their previous locations: 3111 and 3506. If you added two blocks to the file, the filesystem would find two free blocks on the disk, allocate those for the new data, and write the data there. (See Figure Three.)
|Figure Three: A sample ext2fs inode after the file has been modified.|
Suppose you are in the midst of updating a directory entry, and you have just modified 23 file entries in the fifth block of a giant directory. Just as the disk is in the middle of writing this block, there is a power outage. Because the write was not completed, you now have a corrupted block.
When a Linux box is rebooted, it runs a program called fsck (filesystem consistency check) that walks through the entire filesystem, validating all entries and making sure that blocks are allocated and referenced correctly. fsck would find the corrupted directory entry and attempt to repair it. However, there is no guarantee that fsck will actually be able to repair the damage. Quite often, it cannot. In this situation, all of the entries in the corrupted directory can be “lost”; that is, they get linked into the filesystem’s lost+found directory. Blocks placed in lost+found are still in use, but there is no way to know where they were originally referenced from.
For large filesystems, running fsck can take what seems to be forever when you and your users are waiting for a system to come back up. On a machine with many gigabytes of files, fsck can run for up to 20 minutes per filesystem. Filesystems used for database storage also have their own verification routines that must be run before the system can be made available to users. During this time, the system is unavailable, causing what is for some installations an unacceptable amount of downtime.
Journaling Filesystems and Linux
Journaling and logging filesystems solve many of these problems. Journaling and logging filesystems can either keep track of the changes to a file’s “metadata” (information such as ownership, creation dates, and so on), or to the data blocks associated with a file, or to both, rather than maintaining a single static snapshot of the state of a file.
In the example we used earlier of modifying the blocks in the middle of a file and then adding new blocks to the end, a journaling filesystem would first store the pending changes (modified and new blocks) in a special section of the disk known as the “log.” The filesystem would then update the actual file and directory inodes using the data from the log, and would then mark that log operation as having been completed (“committed,” in logging terms).
Whenever a file is accessed, the last snapshot of the file is retrieved from the disk and the log is consulted to see if any uncommitted changes have been made to the file since the snapshot was taken. Every so often, the filesystem will update file snapshots and record the changes in the log, thereby “trimming” the log and reducing access time. Committing operations from the log and synchronizing the log and its associated filesystem is called a checkpoint.
|Figure Four: Inodes before and after changes to test.file with the log record showing the entry as committed.|
Journaling and logging filesystems get around the problem of inconsistencies introduced during a system crash by using the log. Before any on-disk structures are changed, an “intent-to-commit” record is written to the log. The directory structure is then updated and the log entry is marked as committed. (See Figure Four.) Since every change to the filesystem structure is recorded in the log, filesystem consistency can be checked by looking in the log without the need for verifying the entire filesystem. When disks are mounted, if an intent-to-commit entry is found but not marked as committed, then the file structure for that block is checked and fixed if necessary.
After a crash, filesystems can come online almost immediately because only the log entries after the last checkpoint need to be examined. Any changes in the log can be quickly “replayed,” and any corruption on the disk will always correspond to the last change added to the log. The log can then be truncated since it is no longer valid, and no data is lost except for whatever changes were being logged when the system went down. A standard static filesystem that would take 10 to 20 minutes to fsck, such as a heavily populated directory tree or a database partition requiring subsequent validation, can be brought back online by a journaled filesystem in a few seconds.
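The intent-to-commit/commit cycle can be sketched in a few lines of shell, with one ordinary file standing in for the log and another for the on-disk data. All file names and the record format here are invented for illustration; this is not how any real journaling filesystem lays out its log:

```shell
# Simulate write-ahead logging with two ordinary files: one standing in
# for the filesystem data, one for the log. Names and record format are
# invented for illustration.
LOG=/tmp/demo-journal.log
DATA=/tmp/demo-data.txt

echo "old contents" > "$DATA"
: > "$LOG"                          # start with an empty log

# 1. Record the intent to commit before touching any on-disk structure
echo "INTENT: write 'new contents' to $DATA" >> "$LOG"

# 2. Apply the change to the filesystem itself
echo "new contents" > "$DATA"

# 3. Mark the operation as committed
echo "COMMIT" >> "$LOG"

# Recovery: if the last record is a COMMIT, the structures are consistent
# and the log can be trimmed; an unmatched INTENT would trigger a replay
# of that one pending change instead of a full filesystem scan.
if [ "$(tail -n 1 "$LOG")" = "COMMIT" ]; then
    : > "$LOG"                      # checkpoint: trim the log
fi
```

The key property is the ordering: because the INTENT record always reaches the log before the data is touched, a crash at any point leaves either a committed change or a logged intent that recovery can finish or discard.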
Unfortunately, the benefits of journaling don’t come for free. Logging generally requires more disk writes because you have to first append log records to the log, then replay them against the filesystem. However, in practice, the system can operate more efficiently by using its “free time” to commit entries from the log and checkpoint the file system records. Also, because logs are stored separately on the disk from filesystem data and are only appended to, logging changes happen much faster than actually making those changes.
Journaling filesystems are the future of local disk storage. High-quality, time-tested journaling filesystems that have proved their worth on other Unix platforms are being challenged by new entrants such as Linux’s own reiserfs. (See The Contenders sidebar, pg. 61). Such new filesystems are designed to be free of historical baggage while encompassing the latest, most powerful developments in file system design and theory. Regardless of which journaling filesystem “wins” in terms of local Linux storage, the systems administrators and users of the systems that adopt journaling filesystems are the real winners.
Journaling and logging file systems are a hot topic on all Unix platforms today, not just Linux, but Linux seems to have the widest selection from which to choose. There are multiple offerings from commercial software vendors who have released journaled filesystems for Linux, and there are a few contenders from the open source community. We decided to take a look at the state of these offerings.
Silicon Graphics, Inc. (SGI) has released its excellent XFS journaling filesystem for Linux under the GPL. XFS is a fast, solid 64-bit filesystem, which means that it can support large files (9 million terabytes) and even larger filesystems (18 million terabytes). Unfortunately, simply releasing the source for XFS doesn’t mean that it’s ready for prime time — the code is still being purged of commands specific to IRIX, SGI’s version of Unix.
In February 2000, IBM made its excellent JFS journaling filesystem available for Linux under an open source license. JFS is an enterprise-class filesystem that has proven itself in huge installations and has a good chance of becoming very successful in the Linux world. As with XFS, just releasing the source for JFS doesn’t mean that it’s ready for prime time — the code is still being purged of commands specific to AIX, IBM’s version of Unix.
Veritas File System
Journaling and logging filesystems have been around for quite some time. UFS, Solaris’ native filesystem, has always had the same problems as ext2fs, but Veritas Software offers a popular journaling filesystem for Solaris. Earlier this year, Veritas announced plans to port its journaling filesystem to Linux, but it will not be open source.
reiserfs has been under development for about two years. Hans Reiser, the primary author of reiserfs, has recently secured funding from commercial companies such as MP3, BigStorage.com, SuSE, and Ecila.com. These companies all need better, more flexible filesystems yesterday, and can immediately channel early beta user experience back to the developers.
Finally, there is the ext3fs journaling filesystem, which is currently under development by Red Hat superhacker Stephen Tweedie. One of the major advantages of ext3fs is that it is designed to make the migration from static ext2fs filesystems to ext3fs as easy as possible. Tweedie’s initial proposal implemented ext3fs by simply adding logging capabilities to ext2fs, storing the log as an ordinary file in that filesystem, an approach that is incredibly slow. In general, the ext3fs filesystem still needs some work in the performance department, but Tweedie says it is “extremely reliable.”
Hands On: A Look at the reiserfs Journaling Filesystem
reiserfs is a new, general-purpose filesystem for Linux that is designed for flexibility and efficiency. The current version of reiserfs, 3.5.16, is quite stable. Versions of reiserfs are available for the Linux 2.2, 2.3 (Beta), and 2.4 (Beta) kernels.
The Design of reiserfs
Journaling was added to reiserfs in 1999, and in some cases it still exacts a slight performance penalty in the interest of increased reliability and faster restart times. In its current incarnation for the 2.2.14/15 Linux kernels, reiserfs can be slower than ext2fs when dealing with files between 1 KB and 10 KB, but on average it is substantially faster. (The average file size on Unix servers is about 91 KB.)
reiserfs supports filesystem plug-ins that make it easy to create your own types of directories and files. This guarantees reiserfs a place in the Linux filesystems of the future by making it easy to extend reiserfs to support the requirements for protocols that are still being finalized, such as streaming audio and video. For example, a system administrator can create a special filesystem object for streaming audio or video files, and then create her own special item and search handlers for the new object types. The content of such files can already be stored in TCP/IP packet format, reducing processing latency during subsequent transmission of the actual file.
The version of the reiserfs filesystem for Linux 2.2 kernels is not yet 64-bit enabled, but this sort of cleanup is part of the planned development once the Linux 2.3 and 2.4 kernels are more stable.
Before installing reiserfs, the system administrator should decide which parts of his filesystem tree should be converted into journaling filesystems. Read-only directories do not need a journaling filesystem. Thus, for a standard Red Hat installation, the directories /boot, /root, /usr, /usr/local, /etc, and /dev do not need to be rebuilt for journaling.
Let’s assume you would like to put /home and /oracle_data under reiserfs’ control. The first step is to back up all of the files in these directories to tape or to another disk.
If you did not install the kernel sources when you installed Red Hat, do it now. reiserfs is installed as patches to the kernel source code. Then the kernel needs to be recompiled and installed under /boot.
Assuming you have kernel 2.2.14 (the standard kernel shipped with Red Hat 6.2) or 2.2.15 (available as an update from Red Hat’s web site), get the patch linux-2_2_14-reiserfs-3_5_19-patch.gz or linux-2.2.15-reiserfs-3.5.19-patch.gz. It is important to get the reiserfs patches for the kernel whose source code you have installed; recompiling your kernel to add reiserfs support will not work if you obtain the patches for the wrong kernel source tree. To download these patches go to:
Then type cd /usr/src and patch your kernel by executing the command:
zcat filename | patch -p0
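If you have never applied a compressed patch before, you can rehearse the zcat | patch pipeline in a scratch directory first. The file names below are made up for the demonstration, but the mechanics are identical to patching the kernel tree:

```shell
# Rehearse the zcat | patch pipeline on a scratch tree before touching
# /usr/src. All file names here are invented for the demonstration.
mkdir -p /tmp/patchdemo && cd /tmp/patchdemo
echo "version 1" > demo.c
cp demo.c demo.c.orig
echo "version 2" > demo.c

# Build a unified diff and compress it, the same form in which the
# reiserfs patches are shipped ("|| true" because diff exits 1 on changes)
diff -u demo.c.orig demo.c > demo.patch || true
gzip -f demo.patch

# Restore the original file, then apply the compressed patch kernel-style
cp demo.c.orig demo.c
zcat demo.patch.gz | patch -p0 demo.c

cat demo.c    # now reads "version 2"
```

The -p0 option tells patch to use the path names in the diff unmodified, which is why the command must be run from /usr/src when you patch the real kernel source.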
Please be sure to enable reiserfs support during the Filesystem section of the make config or make xconfig step.
Activating a reiserfs Filesystem for /home and /oracle_data
Once you have installed your new reiserfs-enabled kernel, you’ll need to cd to the fs/reiserfs/utils directory of the new kernel source tree and type make depend, make, and make install. This will build and install the reiserfs-related utilities such as mkreiserfs, resize_reiserfs, and fsck.reiserfs.
Note: when installing fsck.reiserfs for Linux 2.2.14, you’ll have to either fix the Makefile or rename the installed binary from /sbin/reiserfsck to /sbin/fsck.reiserfs.
After you build and install the reiserfs utilities, you can then proceed to make a reiserfs filesystem with the mkreiserfs tool (found in /sbin). Create two reiserfs filesystems on /dev/hda2 and /dev/hda3 by typing:
# mkreiserfs /dev/hda2
# mkreiserfs /dev/hda3
Next, mount the two newly formatted partitions by typing:
# mount /dev/hda2 /home
# mount /dev/hda3 /oracle_data
Now restore the backups to their appropriate directories and — presto! — you have a journaling filesystem.
reiserfs has several mount options:
* -notail causes the filesystem to work faster, especially for small appends to small files. However, this option also wastes more disk space.
* -genericread causes reiserfs to use the generic file read code used by most Linux filesystems. Sometimes it can improve performance, sometimes not. If you are doing lots of small reads or lots of seeks, turn it on.
* If you are using md to spread reiserfs over multiple disks, turn off REISERFS_READ_LOCK (simply comment out #define REISERFS_READ_LOCK in linux/include/linux/reiserfs_fs.h).
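To have the new filesystems mounted automatically at every boot, you can also add entries to /etc/fstab. The lines below are a sketch only: the device names follow the example partitions above, and whether you want the notail option depends on your workload:

```
# /etc/fstab entries for the two example partitions (device names and
# options are illustrative; notail trades disk space for speed)
/dev/hda2   /home          reiserfs   defaults,notail   1 2
/dev/hda3   /oracle_data   reiserfs   defaults          1 2
```

The final field tells fsck the pass number at boot; with the reiserfs utilities installed as fsck.reiserfs, the standard boot-time check will dispatch to the right tool.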
In about three months of running reiserfs, our laboratory experienced a few hard-to-trace problems that could have been related to reiserfs. Oracle died a few times with mysterious error messages related to data files that resided on reiserfs volumes. The errors were not reproducible, however, and they have been forwarded to Oracle R&D for further analysis.
Large updates on Oracle tables that reside on reiserfs volumes take a bit longer than on ext2fs filesystems. The increased security, recoverability, and faster restart times, however, more than make up for any slower operation.
Moshe Bar is an Israeli systems administrator and OS researcher. His Web site is at http://www.moelabs.com.