Linux is well known as a robust and capable operating system. Linux systems seldom crash outright, and when they do, the cause is usually a hardware fault or a buggy experimental driver.
Even so, Linux isn’t perfect. A variety of causes — a forgotten root password, corrupted partition tables, or a corrupt filesystem — can lead to serious problems, up to and including an inability to boot the computer.
When such problems occur, it’s time to reach for your first aid kit. Not prepared? Scrub in and learn how to handle the worst. Welcome to the Linux ER.
Emergency Recovery Systems
The first step in correcting severe system problems is preparation. It’s best to prepare your tools and plan well before a problem occurs. After all, you don’t want to spend hours looking for emergency software on the Internet after your computer falls ill. It’s better to have an emergency response system (or several) ready to use. You can then boot that system in a minute or two and begin the recovery work.
What sorts of emergency recovery systems exist, though? Quite a few types are available:
*Side-by-side installations. You can set up a second Linux installation on the same system. You can use any Linux distribution you like, but the distribution you use for the primary system is probably the best choice. You can then use the secondary system’s tools, including its package management system, to recover from problems in the primary system. This approach is easy to implement, but it is vulnerable to certain types of problems, such as a failed hard disk, which could wipe out your emergency system as well as the primary system.
*Distribution installation tools. Most Linux distributions’ installation tools double as emergency recovery tools. Typically, you select an option when booting the installer, or boot from a disc other than the first one, to enter the emergency recovery system. These systems have the advantage of handling distribution-specific problems well, but they tend to be inflexible compared to some other options.
*CD-based distributions. Distributions such as Knoppix (http://www.knoppix.org), System Rescue CD (http://www.sysresccd.org), and FIRE (http://biatchux.dmzs.com) are designed to boot and run entirely from a CD-ROM. Some mainstream distributions, such as SuSE (http://www.novell.com/linux/suse), provide similar “demo” versions of Linux. These systems may be the most flexible emergency recovery tools, because they provide enough tools to do most recovery tasks. On the other hand, they tend to be rather sluggish and can be tricky to customize.
*Small sub-CD distributions. A few distributions are available for use on media smaller than CD-ROMs. One notable example is ZipSlack, a minimalist variant of Slackware (http://www.slackware.org) intended to be installed on a Zip disk, although it can also be installed on LS-120 and other small disks, or even used on a small hard disk partition. Such a distribution is probably adequate for many emergency recovery tasks if you’re comfortable working exclusively with the command line.
*Floppy-based distributions. At the extreme of the size scale are floppy-based distributions, such as muLinux (http://mulinux.sunsite.dk) and Tom’s Root/Boot Disk (http://www.toms.net/rb/). Such distributions are particularly handy if the system you’re recovering lacks a CD-ROM drive or other higher-capacity removable disk. They tend to be very minimalist, though, and are likely to lack the tools necessary for at least some tasks.
Within each of these categories, quite a few options are available. Some of these tools are tailored to specialized tasks. Indeed, emergency recovery can be considered such a specialized task. If you simply pick a tool at random, you might find that it’s not suitable for emergency recovery, although it might be perfectly suited for, say, dedicated router duty. Thus, you should look into these options and familiarize yourself with at least a couple of these systems.
Preparing an Emergency Kit
You should prepare a Linux emergency kit and keep it handy. This kit should contain several items:
*System description. Record critical data on your Linux system on paper. Data to record should include your partition layout (as displayed by typing fdisk -l /dev/hda from within your normal Linux installation); your disk’s cylinder/head/sector (CHS) geometry (again, as reported by fdisk); and the filesystems you’ve used on your partitions and their mount points (as recorded within /etc/fstab). You might also want to note the distribution (including version number) you’ve installed. Do not record any passwords on paper, though. Doing so is a security risk, and you can reset passwords without knowing them, as described shortly.
*Linux emergency system. Have at least one Linux emergency system at hand. Ideally, your kit should have a selection of emergency tools, where each may be suited to a particular diagnosis. Be sure that your emergency system can read the filesystems you use on your main installation and has disk tools (such as fsck programs) for that filesystem.
*Extra software. If your emergency system lacks some key program, put it on a floppy disk, a Zip disk, a CD-R, or some other medium you can use from the emergency system.
*Instructions. Read any instructions that come with your emergency systems, and print out anything that’s particularly vital.
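As a minimal sketch of gathering the filesystem part of the system description, the shell function below prints the device, mount point, and filesystem type of each /etc/fstab entry so that you can print the result and file it with your kit. The function name summarize_fstab is hypothetical, and the sketch assumes the usual whitespace-separated fstab layout.

```shell
#!/bin/sh
# Print device, mount point, and filesystem type for each fstab entry.
# summarize_fstab is a hypothetical helper name, not a standard tool.
summarize_fstab() {
    fstab=${1:-/etc/fstab}
    # Skip blank lines and comment lines; fields are whitespace-separated.
    awk 'NF >= 3 && $1 !~ /^#/ { printf "%-20s %-15s %s\n", $1, $2, $3 }' "$fstab"
}
```

Typing summarize_fstab with no argument reads /etc/fstab. The partition layout and CHS geometry still come from fdisk -l /dev/hda, run as root; redirecting both outputs to one file gives you a single page to print.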
You should perform test boots of your emergency systems to be sure that they all work and can read the filesystems you use on your main system. If a system is easily customized, you might want to add software you need or that you simply like.
Once the system is prepared, set it aside, but be sure to test it periodically. Removable media can fail over time, and a working configuration can become unusable because of changes to your primary system. (For instance, a change in filesystem data structures might obsolete an emergency system’s support for a particular filesystem.)
One particularly vexing problem is a forgotten password. In most cases, you can correct this problem by logging in as root and using the passwd utility, as in passwd jones to change jones’ password.
Unfortunately, this approach won’t work if you’ve forgotten the root password itself. To fix this problem, you must boot an emergency system with no password or its own password. You can then mount the main system’s root filesystem and edit its /etc/shadow file. Be sure to edit the file on the main system, not on your emergency system! Look for the line that holds the root password, which begins with root:.
The encrypted password appears in the second colon-delimited field of that line, and in most cases it begins with the string $1$.
You can do one of two things to correct a forgotten root password:
*Copy a password field from an account whose password you do remember. When you reboot, you can then use that password to log into the computer as root. This is the superior option if you know at least one other password on the computer (or on another computer, since you can copy the password field from another system’s /etc/shadow file).
*Delete the password field. This sets a null password, which lets you log in as root without a password.
In either case, but particularly in the second case, you should use passwd to change the root password immediately upon rebooting the computer back into its normal software. In fact, you might want to disconnect the computer from the network when you reboot, to minimize the slim chance that a cracker will be able to break in during the brief period of vulnerability when the system has no root password.
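The second option, deleting the password field, can be sketched as a small shell function. This is only a sketch under stated assumptions: the main system’s root filesystem is mounted at /mnt/main (a hypothetical mount point), and the function name clear_password_field is likewise hypothetical. It keeps a copy of the original file so that you can undo the change.

```shell
#!/bin/sh
# Blank the password field (the second colon-delimited field) for one account.
# clear_password_field is a hypothetical helper name.
# Example call: clear_password_field /mnt/main/etc/shadow root
clear_password_field() {
    shadow_file=$1
    user=$2
    # Keep a backup with the same permissions before changing anything.
    cp -p "$shadow_file" "$shadow_file.bak" || return 1
    # Rewrite the file, emptying field 2 only on the matching account's line.
    awk -F: -v OFS=: -v user="$user" '$1 == user { $2 = "" } { print }' \
        "$shadow_file.bak" > "$shadow_file"
}
```

Because the backup copy still holds the old password hashes, delete it once you have confirmed that you can log in and have set a new root password with passwd.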
Recovering Corrupted Partition Tables
Occasionally, a hard disk’s partition table may become damaged, albeit leaving the data in the partitions themselves unharmed. This happens most often because of careless use of fdisk or another low-level disk utility (perhaps even a non-Linux disk utility). This problem can be particularly frustrating because the data is intact, but inaccessible. Fortunately, if you’re prepared, the corrective measure is relatively straightforward. If not, recovery is still possible, but trickier.
Ideally, you’ll have information on the CHS geometry and partition table of the disk from before the problem occurred. If so, you can recover the partitions by using Linux’s fdisk.
1.Launch fdisk as root.
2.Type p in fdisk to check the partition table. Depending on the cause of the problem, some or all of the partitions that should be present will be missing, or the system might return complete gibberish. Note the CHS geometry reported by fdisk.
3.If the CHS geometry doesn’t match the CHS geometry you’ve previously recorded, change it by typing x to enter the expert menu and then using the c, h, and s options to change the number of cylinders, heads, and sectors, respectively, to match the values you recorded. When you’re done, type r to return to the main menu.
4.Type p again. If fdisk reports nonsense partitions, delete them using the d option.
5.Use the n option to add back your original partitions. You may need to use this option several times. Be sure to restore the partitions to the precise sizes you’ve recorded. If you don’t, some or all of your partitions will be inaccessible, and you might even damage the data they contain.
6.Type p to review your partitions. Compare the output to the partition table you’ve recorded from before the problem occurred. If you note any discrepancies, correct them.
7.When you’re done, type w to write the changes to disk and exit from fdisk.
If you make a mistake during this process, you can either try to correct it (say, by deleting and re-creating a partition you created with the wrong size) or exit from fdisk without saving changes by typing q. If you do the latter, you’ll have to start over again.
When you’re done, you should be able to mount and use the partitions. Be sure to check them all. If some of them don’t work, it may be that they’ve been damaged, or it could be that you’ve entered some information incorrectly. Go back into fdisk and check your work. If it seems correct, perhaps using fsck will help, as described shortly.
If you don’t have the original disk CHS geometry and partition layout information, you may still be able to recover the disk’s data. The trick is to use GNU Parted (http://www.gnu.org/software/parted/parted.html), a flexible partition management tool that supports partition creation, deletion, resizing, and other features. (The July 2003 “Guru Guidance” column described GNU Parted as a partition-resizing tool; that article is available online at http://www.linux-mag.com/2003-07/guru_01.html.) If your emergency system doesn’t support GNU Parted, you’ll have to place a binary on a partition or removable disk that’s accessible from your emergency system.
To recover partitions using GNU Parted, follow these steps:
1.Launch GNU Parted by typing parted /dev/hda. (Change /dev/hda to the device identifier for the hard disk you want to rescue.)
2.Type rescue start end, where start and end are the approximate start and end points of the partition. These points don’t need to be exact, and if in doubt, you should err on the side of making the space too large. If GNU Parted finds a filesystem in that area, it reports details and asks you if you want to create a partition for the filesystem. Respond affirmatively.
3.Repeat the previous step as necessary to recover all your partitions.
4.Exit from GNU Parted.
Unfortunately, GNU Parted’s rescue command is rather unreliable. It sometimes works, but if you know the exact start and end points of the partitions you want to recover, you should use Linux’s fdisk instead; it is much simpler and more reliable in that case.
Overcoming Filesystem Failures
Filesystem failures result when critical filesystem data structures, such as directories, inodes, and free space bitmaps, become damaged. This can happen because of kernel bugs, power outages at inopportune times, buggy low-level disk utilities, or mistakes by a system administrator (or even an ordinary user, if that user has write access to the filesystem’s device file, as is common with floppy disks).
A failed filesystem may show no problems in the partition table, but the filesystem either won’t mount or exhibits strange behavior when mounted, such as an inability to access certain files or gibberish appearing in directory listings. (These symptoms can also be caused by failure of the disk medium.)
Before proceeding, you may want to back up the affected partition. To do so, use the dd command, as in dd if=/dev/hda5 of=/mnt/spare/hda5.img. This command backs up the /dev/hda5 partition to the /mnt/spare/hda5.img file. (Of course, the target location must have enough free disk space to hold the backup.) This backup can also take some time, but before attempting to recover truly serious filesystem corruption, making such a backup is wise. Occasionally, disk recovery tools can actually make matters worse, so having a backup enables you to try again with other options or other tools. To restore the backup, type dd if=/mnt/spare/hda5.img of=/dev/hda5, making appropriate substitutions for the backup image file and the target device filename.
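The backup step can be wrapped in a short function that also confirms the image matches the source. This is a minimal sketch; the function name backup_partition is hypothetical, and it assumes the partition is not mounted (or at least not changing) while the copy runs.

```shell
#!/bin/sh
# Copy a partition (or any file) to an image and confirm the copy matches.
# backup_partition is a hypothetical helper name.
# Example call: backup_partition /dev/hda5 /mnt/spare/hda5.img
backup_partition() {
    src=$1
    img=$2
    # bs=64k only speeds up the copy; the data copied is the same.
    dd if="$src" of="$img" bs=64k || return 1
    # cmp exits non-zero if the image differs from the source.
    cmp "$src" "$img"
}
```

The comparison roughly doubles the time the backup takes, but it tells you before you start repairs whether the image is one you can rely on.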
The usual solution to filesystem failures is to run the Linux filesystem check program, fsck. (In truth, this program is merely a front-end to other filesystem-specific programs, such as fsck.ext2 and fsck.reiserfs, for ext2fs/ext3fs and ReiserFS, respectively. Some of these names, in turn, are symbolic links to other programs; for instance, fsck.ext2 is a symbolic link to e2fsck.) fsck is designed to check the filesystem for errors and to correct those errors.
In its most basic form, you type fsck device, where device is the device filename for the filesystem, such as /dev/hda5. You can also use the -t fstype option to force fsck to treat the filesystem as the specified type, such as ext2 or reiserfs. Of course, you can also run the filesystem-specific utility directly if you want to do this.
Many filesystems support filesystem-specific fsck options. You can either pass these to fsck after a double dash, as in fsck /dev/hda5 -- -f to pass the -f option, or call the filesystem-specific check program directly. Table One shows some of the more useful of these options.
If you use XFS, you should be aware that the fsck.xfs program does nothing. To check an XFS filesystem, you should use xfs_check, and to correct XFS problems, you should use xfs_repair. Both programs accept various options, but they’re intended for advanced users.
The fsck programs for ext2fs, ext3fs, and JFS all first check the filesystem’s clean-unmount flag, which tells the system whether the filesystem was cleanly unmounted. If it was, these programs perform a very truncated check in order to save time. To perform a full check on such a partition, you must specify the -f option.
The ext2fs and ext3fs fsck program enables you to specify a backup superblock. The superblock is a particularly sensitive filesystem data structure; if it’s damaged, access to the rest of the disk becomes impossible, or at least unreliable. For this reason, ext2fs and ext3fs store backups of the superblock at strategic locations throughout the filesystem. Ordinarily, fsck looks for and uses the primary superblock, but you can tell it to use a backup superblock by specifying a location with the -b option.
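The location of the first backup superblock depends on the filesystem’s block size. The sketch below works it out, assuming the filesystem was created with the default number of blocks per group (eight times the block size); the function name first_backup_superblock is hypothetical.

```shell
#!/bin/sh
# Print the block number of the first backup superblock for an ext2/ext3
# filesystem, assuming the default of (8 x block size) blocks per group.
# first_backup_superblock is a hypothetical helper name.
first_backup_superblock() {
    block_size=$1
    blocks_per_group=$((block_size * 8))
    # With 1 KiB blocks the first data block is block 1; otherwise it is block 0.
    if [ "$block_size" -eq 1024 ]; then
        first_data_block=1
    else
        first_data_block=0
    fi
    echo $((blocks_per_group + first_data_block))
}
```

For a filesystem with 1 KiB blocks this prints 8193, so e2fsck -b 8193 /dev/hda5 would try the first backup. If the filesystem was created with non-default parameters, the locations differ; running mke2fs with the -n option and the original creation options reports them without writing to the disk, but take great care not to omit -n.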
No matter what filesystem you’re using, a full check is likely to take some time, probably several minutes, or conceivably over an hour on a very large disk. (Small filesystems, such as those on /boot partitions, can usually be checked in a few seconds, though.) With luck, this check corrects any problems you’re having with the disk. If it fails, restore your backup and try again with other options. You may also want to read the man page for your filesystem-specific fsck tool; perhaps a more advanced option will help you work around the problem.
Most Linux configurations automatically call fsck at boot time. In most cases, this call results in just a quick check, and possibly a recovery using the journal, in the case of journaling filesystems. Running fsck manually can result in a more complete check of the filesystem, particularly if you use the -c option and omit the -p option.
Surviving a Hard Disk Crash
One of the most troubling types of failure is a hard disk crash. In an extreme case, a disk crash means that the data on the disk become completely inaccessible, at least short of sending the hardware off to a data-recovery outfit — a course of action that’s likely to be quite expensive.
In less extreme cases, the disk may begin manifesting problems before it fails outright. These problems can look very much like filesystem failures. In fact, in a very real sense they are filesystem failures; they’re just filesystem failures that are caused by hardware failures.
Your best clues that you’re dealing with a hardware problem are bad blocks reported by fsck when you pass it the -c filesystem-specific parameter to check for bad blocks, error messages about disk failures in the dmesg output, and failures reported by low-level disk diagnostics. (All major hard disk manufacturers provide such tools on their web sites. They’re typically DOS programs that come with DOS boot floppy images, so you can run them on x86 Linux systems, although you’ll need to reboot the computer.)
If you suspect your hard disk is failing, back it up immediately and replace it as soon as possible. If you’re lucky, you’ll be able to transfer your system directly from the failing hard disk to a replacement with little or no loss of data:
1.Copy your kernel file (/vmlinuz, /boot/bzImage, or whatever it happens to be called) to a floppy disk, or, if it won’t fit on one, to a CD-R or other high-capacity removable disk.
2.Prepare a DOS boot floppy that also holds LOADLIN. This DOS program can boot a Linux kernel. If you don’t have DOS handy, try FreeDOS (http://www.freedos.org). Be sure this boot floppy can read whatever medium you used to store the kernel image. (This may mean adding CD-ROM drivers, if you stored the kernel on a CD-R.)
3.Shut down the computer.
4.Install both hard disks in the computer.
5.Boot to an emergency Linux system or to the original Linux system.
6.Prepare partitions on the new hard disk using fdisk and mkfs, GNU Parted, or other tools of your choice.
7.Mount the original partitions (if necessary) and the newly-prepared partitions.
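8.Transfer data from the old disk to the new one. You can do this with tar, one partition at a time, by changing into the directory corresponding to the partition you want to move and typing tar cvplf - ./ | (cd /dest/dir; tar xplf -), where /dest/dir is the mount point of the destination partition. (The l option keeps tar from crossing into other filesystems; in newer versions of GNU tar, l has a different meaning, and the equivalent option is --one-file-system.) Repeat this step for each partition.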
9.If you’ve changed your partition layout or numbers, modify /etc/fstab on the new system as appropriate. Note that the device filenames will change based on how you’ll reconfigure the hard disk in Step 11 (below), so set the entries according to the target device filenames, not the current ones.
10.Shut down the computer.
11.Remove the old hard disk and reconfigure the new one to take on its new identity to replace the old hard disk.
12.Insert the DOS floppy you prepared in Step 2.
13.Boot the computer into DOS.
14.Use LOADLIN to boot your kernel. For instance, you might type LOADLIN bzimage ro root=/dev/hda2 to boot the bzimage file as the kernel and point it to /dev/hda2 as the root filesystem. This should bring up your Linux system. If it doesn’t boot or works strangely, you’ll need to troubleshoot the system.
15.Re-install your boot loader (LILO or GRUB). For LILO, typing lilo as root should do the job, assuming your partition layout is the same as it was on the first disk. For GRUB, typing grub-install should work, again assuming the partition layout of the new disk is the same as it was on the old disk. If your partition layout has changed, you must edit your LILO or GRUB configuration file before re-installing the boot loader.
16.Remove the floppy disk from the computer and reboot. This step is intended to test the newly-reinstalled boot loader; the system should now boot from the hard disk. If it doesn’t, start again from Step 12 and review your boot loader configuration.
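The data transfer in Step 8 can be sketched as a reusable shell function. This is a minimal sketch rather than a definitive procedure: the function name copy_tree is hypothetical, both directories are assumed to exist already, and the function stays on one filesystem only when the installed tar advertises the --one-file-system option (GNU tar does).

```shell
#!/bin/sh
# Copy a directory tree with tar, preserving permissions (p).
# copy_tree is a hypothetical helper name.
# Example call: copy_tree /home /mnt/newhome
copy_tree() {
    src=$1
    dest=$2
    opt=
    # Stay on one filesystem when this tar lists the option in its help text.
    if tar --help 2>/dev/null | grep -q -e '--one-file-system'; then
        opt=--one-file-system
    fi
    (cd "$src" && tar cpf - $opt .) | (cd "$dest" && tar xpf -)
}
```

If your tar lacks the option, copying the root partition this way also descends into whatever is mounted beneath it, including the destination itself, so in that case copy from a directory where the old root partition is mounted on its own.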
More serious disk crashes present additional complications. Frequently, you won’t be able to copy all of the files, although you may be able to copy enough to get a working system on a new disk. If this is the case, you should take notes of any files that you can’t copy and replace them from their original sources, say, by re-installing an RPM or Debian package file or by recovering from a system backup.
In the most extreme cases, you won’t be able to copy any files, or at least not enough to make the effort worthwhile. In this case, you’ll need to resort to your backups, if you have them. If you have no backups, you’ll need to re-install the system from scratch. You might want to then concentrate recovery efforts on your /home directory or any other directory holding important local files. Perhaps you’ll be able to copy some of these files, even if you can’t recover enough system files to create a bootable system.
Whether you’re dealing with a complete hard disk failure or a less severe problem, having emergency recovery tools on hand and being familiar with them can save you a lot of time and trouble when a serious problem occurs. Thus, taking some time now to prepare for such problems can pay off many times over in the future.
Roderick W. Smith is the author or co-author of twelve books, including Linux Power Tools and Linux in a Windows World, as well as the author of Linux Magazine’s “Guru Guidance” column. He can be reached at