Using Kexec and Kdump

Kexec can spawn a kernel-over-a-kernel without the overhead of boot firmware, while kdump can reliably collect a crash-dump using the services of kexec.

Last time, you learned to insert dynamic kernel probes using
kprobes. This month, let’s continue to
look at more kernel serviceability features, such as kexec and
kdump, which were introduced in recent versions of the 2.6 kernel.

Kexec uses the image overlay philosophy
of the UNIX exec system call to spawn a new
Linux over a running Linux, without the overhead of boot firmware.
Kexec has different uses including fast reboot, but one of its more
interesting uses is kdump. Capturing a dump
after a kernel crash is inherently unreliable, since kernel code
that accesses the dump device may be in an unstable state. Kdump
circumvents this problem by collecting the dump after booting into
a healthy kernel via kexec.

All About Kexec

Kexec replaces the running kernel image with a new kernel image
without going through boot firmware. This can save several seconds
of reboot time, since boot firmware is usually responsible for
walking buses and recognizing devices. Less reboot latency
translates to less system downtime — a main motivation for
implementing kexec. However, kdump is the most popular user of
kexec today. Kdump works in tandem with kexec to collect reliable
crash dumps.

Here are the preparations needed before you can use kexec with
your kernel:

1. Compile and boot into a
kernel that has kexec support. For this, turn on “Kexec system
call” under “Processor type and features” in the configuration
menu. Prepare the second kernel that is to be kexec-ed. This kernel
can be the same as the first kernel.

2. Download the kexec-tools package source tarball from
http://www.xmission.com/~ebiederm/files/kexec/,
patch it with the kdump patch from http://lse.sourceforge.net/kdump/
(needed for
using kdump in the next section), and build. The current versions
are kexec-tools-1.101 and kexec-tools-1.101-kdump.patch. This produces a
user-space tool called kexec.

The kexec command is invoked in two stages: the first loads the
new kernel image into the buffers of the running kernel, while the
second actually overlays the running kernel. Here’s how to
use it:

1. Load the second (overlay)
kernel using the kexec command:

bash> kexec -l /usr/src/linux-

bzImage is the
second kernel, hdaX
is the root device, and myinitrd.img is the initial ramdisk. The kernel
implementation of this stage is mostly architecture-independent. At
the heart of this stage is the sys_kexec_load()
system call. The kexec command uses the
services of this system call to load the new kernel image into the
running kernel’s buffers.

2. Next, boot into the second kernel:

bash> kexec -e

kexec abruptly starts the new kernel
without shutting down the operating system. To shut down prior to
reboot, add the above command to the bottom of the halt script
(usually /etc/rc.d/rc0.d/S01halt) and invoke halt instead.

The implementation of this second stage depends on the
particular architecture. The crux of this stage is a
reboot_code_buffer that contains assembly code to put
the new kernel in place to boot.
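The two stages described above can be sketched as a small wrapper. This is only an illustration: the function names, the DRY_RUN switch, and the example paths are assumptions made here, not part of kexec-tools. By default it prints the commands instead of running them, because the second stage replaces the running kernel immediately.

```shell
#!/bin/bash
# Print the command when DRY_RUN is 1 (the default), run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Stage 1 loads the new image; stage 2 overlays the running kernel.
# The kernel, root device and initrd arguments are caller-supplied.
kexec_reboot() {
    local kernel=$1 rootdev=$2 initrd=$3
    run kexec -l "$kernel" --append="root=$rootdev" --initrd="$initrd" || return 1
    run kexec -e
}

# Hypothetical paths, shown only to demonstrate the call.
kexec_reboot /boot/bzImage /dev/hdaX /boot/myinitrd.img
```

Setting DRY_RUN=0 as root would actually perform the reboot, without a clean shutdown, so the dry run is the safer default.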

kexec bypasses the initial kernel code
that invokes services of boot firmware, and directly jumps to the
protected mode entry point. An important challenge in implementing
kexec is the interaction between the kernel
and the boot firmware (BIOS on x86-based
systems, Open Firmware on Power-based
machines). On x86 systems, information such as the e820 memory map,
which the BIOS passes to the kernel, needs
to be supplied to the kexec-ed kernel as well.
Kexec with Kdump

The kexec invocation semantics are
somewhat special when it’s used in tandem with kdump. In this case, kexec is
required to automatically boot a new kernel when the running kernel
encounters a panic. If the running kernel (called the “crash kernel” or
the “first kernel”) crashes, the new kernel (called the “capture
kernel”) is booted to reliably capture the dump. A typical call
syntax is:

bash> kexec -p /usr/src/linux-
    --args-linux --append="root=/dev/hdaX irqpoll"

The -p option asks kexec to trigger the reboot when a kernel panic occurs.
A vmlinux ELF (Executable and Linking Format) kernel image has to be used as the
capture kernel; kexec doesn’t yet support
the bzImage format (which relocates after
boot). Since vmlinux is a general ELF
boot image and since kexec is theoretically
operating system-agnostic, you need to specify via the
--args-linux option that the following arguments have to
be interpreted in a Linux-specific manner.

The capture kernel boots asynchronously during a kernel crash,
so device drivers using shared interrupts may fatally express their
unhappiness during boot up. To be nice to them, specify
irqpoll in the command line passed to the
capture kernel.
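Before relying on the panic-time reboot, it helps to confirm that a capture kernel is actually loaded. The sketch below assumes a kernel that exposes /sys/kernel/kexec_crash_loaded, a sysfs file found on kernels later than the ones this article was written against. The function name and its optional path argument are illustrative choices.

```shell
#!/bin/bash
# Succeed if the flag file exists and contains 1.
# The path can be overridden, which also makes the check easy to test.
crash_kernel_loaded() {
    local flag_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

if crash_kernel_loaded; then
    echo "capture kernel is loaded"
else
    echo "no capture kernel loaded"
fi
```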

The capture kernel requires access to the kernel memory of the first
(crashed) kernel to generate a complete dump, so the capture kernel cannot
simply overwrite the first kernel, as was done in the non-kdump case. The running kernel needs to reserve a memory
region to run the capture kernel. To mark this region:

* Boot the first kernel with the
command line parameter crashkernel=64M@16M
(or other suitable size and start values).

* While configuring the capture
kernel, set CONFIG_PHYSICAL_START to the
same start address assigned above (16 MB in this case). If you
kexec into the capture kernel and peek inside /proc/meminfo, you’ll find
that the reserved region (64 MB in this case) is all the
physical memory that this kernel can see.
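To make the arithmetic behind crashkernel=64M@16M concrete, here is a minimal sketch that converts a size@start specification into byte values and into the hexadecimal address that would go into CONFIG_PHYSICAL_START. The helper names are made up for this example, and it assumes both the size and the start are present and use only K, M, or G suffixes.

```shell
#!/bin/bash
# Convert a value such as 64M or 512K to bytes (K, M and G suffixes only).
to_bytes() {
    local value=$1 number=${1%[KMGkmg]}
    case $value in
        *[Kk]) echo $(( number * 1024 )) ;;
        *[Mm]) echo $(( number * 1024 * 1024 )) ;;
        *[Gg]) echo $(( number * 1024 * 1024 * 1024 )) ;;
        *)     echo $(( number )) ;;
    esac
}

# Split "size@start" and print "start_bytes end_bytes size_bytes".
crashkernel_region() {
    local spec=$1 size start
    size=$(to_bytes "${spec%@*}")
    start=$(to_bytes "${spec#*@}")
    echo "$start $(( start + size )) $size"
}

# Print the start address in the hex form used for CONFIG_PHYSICAL_START.
physical_start_hex() {
    local start
    start=$(to_bytes "${1#*@}")
    printf '0x%x\n' "$start"
}

crashkernel_region 64M@16M    # prints: 16777216 83886080 67108864
physical_start_hex 64M@16M    # prints: 0x1000000
```

So the reserved region in the example spans physical addresses from 16 MB up to 80 MB.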

Now that you’re comfortable with kexec and kdump, let’s
delve further into kdump and use it to
analyze some real world kernel crashes.

Kdump to the Rescue

An image of system memory captured after a kernel crash or hang
is called a crash dump. Analyzing a crash
dump can yield valuable clues for post mortem analysis of kernel
problems. However, obtaining a dump after a kernel crash is
inherently unreliable since the storage driver responsible for
logging data onto the dump device might be in an undefined
state.

Until the advent of kdump, Linux Kernel Crash Dump (LKCD) was the popular mechanism
to obtain and analyze dumps. LKCD uses a temporary dump device
(such as the swap partition) to capture the dump. It then
warm-reboots back to a healthy state and copies the dump from the
temporary device to a permanent location. A tool called
lcrash is used to analyze the dump. The
disadvantages of LKCD include:

* Even copying the dump to a
temporary device might be unreliable on a crashed kernel.

* Dump device configuration is

* The reboot might be slow since
swap space can be activated only after the dump has been safely
saved away to a permanent location.

* LKCD is not part of the
mainline kernel, so installing the proper patches for your kernel
version is a hurdle.

By comparison, kdump isn’t
burdened with these shortfalls. It eliminates indeterminism by
collecting the dump after booting into a new kernel via
kexec. Also, since memory state is preserved
after a kexec reboot, the memory image can
be accurately accessed from the capture kernel.

Let’s first get the preliminary kdump setup out of the way:

1. Ask the running kernel to
kexec into a capture kernel on encountering a panic, as described
in the previous section. The capture kernel should additionally
have CONFIG_KDUMP and CONFIG_HIGHMEM turned on (both these options sit inside
“Processor type and features” in the kernel configuration menu).

2. Once the capture kernel
boots, copy the collected dump information from /proc/vmcore to a file on your hard disk:

bash> cp /proc/vmcore /dump/vmcore.dump

You can also save other information, such as the raw memory snapshot
of the crashed kernel, via /dev/oldmem.
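Since /proc/vmcore presents the old kernel's memory as an ELF core file, a quick sanity check after the copy is to confirm that the saved file begins with the ELF magic bytes. The sketch below is one way to do that; the function names are illustrative, and the check says nothing about whether the rest of the dump is intact.

```shell
#!/bin/bash
# Succeed if the file starts with the four ELF magic bytes 0x7f 'E' 'L' 'F'.
is_elf() {
    local magic
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}

# Copy a dump from src to dest and confirm the copy looks like an ELF file.
# Both paths are supplied by the caller.
save_vmcore() {
    local src=$1 dest=$2
    cp "$src" "$dest" && is_elf "$dest"
}

# Typical use from the capture kernel (paths as in the text above):
#   save_vmcore /proc/vmcore /dump/vmcore.dump || echo "dump looks damaged"
```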

3. Boot back into the first
kernel. You are now ready to start dump analysis.

Let’s use the collected dump file and the crash tool to analyze some example kernel crashes.
Introduce this bug into the interrupt handler of the Real
Time Clock (RTC) driver found at drivers/char/rtc.c:

irqreturn_t rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
+  volatile int *integerp = (volatile int *)0xFF;  /* Bogus address */
+  int integerd = *integerp;  /* Bad memory reference! */

   spin_lock (&rtc_lock);
   /* … */

Trigger execution of the handler by enabling interrupts via the
hwclock command:

bash> hwclock

Save /proc/vmcore onto /dump/vmcore.dump, reboot back into the first (crashed)
kernel, and start analysis using the crash
tool. Of course, in a real world situation, the dump will be
captured at a customer site, while the analysis might be done at a
support center.

bash> crash /usr/src/linux- /dump/vmcore.dump
crash 4.0-2.24
      KERNEL: /usr/src/linux-
    DUMPFILE: /root/vmcore.dumpfile
        CPUS: 1
        DATE: Mon May 29 04:15:49 2006
      UPTIME: 00:17:22
LOAD AVERAGE: 0.82, 1.02, 0.87
       TASKS: 63
    NODENAME: localhost.localdomain
     VERSION: #9 Sun May 28 17:55:16 IST 2006
     MACHINE: i686  (599 Mhz)
      MEMORY: 1 GB
       PANIC: "Oops: 0000 [#1]" (check log for details)

Examine the stack trace to get a handle on the crash:

crash> bt
PID: 0      TASK: c03a32e0  CPU: 0   COMMAND: "swapper"
 #0 [c0431eb8] crash_kexec at c013a8e7
 #1 [c0431f04] die at c0103a73
 #2 [c0431f44] do_page_fault at c0343381
 #3 [c0431f84] error_code (via page_fault) at c010312d
    EAX: 00000008  EBX: c59a5360  ECX: c03fbf90  EDX: 00000000  EBP: 00000000
    DS:  007b      ESI: 00000000  ES:  007b      EDI: c03fbf90
    CS:  0060      EIP: f8a8c004  ERR: ffffffff  EFLAGS: 00010092
 #4 [c0431fb8] rtc_interrupt at f8a8c004
 #5 [c0431fc4] handle_IRQ_event at c013de51
 #6 [c0431fdc] __do_IRQ at c013df0f

The stack trace points the needle of suspicion at rtc_interrupt(). Let’s disassemble the surrounding
instructions at the address gleaned from above:

crash> dis 0xf8a8c000 5
0xf8a8c000 <rtc_interrupt>:     push   %ebx
0xf8a8c001 <rtc_interrupt+1>:   sub    $0x4,%esp
0xf8a8c004 <rtc_interrupt+4>:   mov    0xff,%eax
0xf8a8c009 <rtc_interrupt+9>:   mov    $0xc03a6640,%eax
0xf8a8c00e <rtc_interrupt+14>:  call   0xc0342300 <_spin_lock>

The instruction at address 0xf8a8c004 is
attempting to load the contents of address 0xff
into the EAX register, which is clearly
the invalid dereference that caused the crash.

If you use the irq command, you can
figure out the identity of the interrupt that was in progress
during the time of the crash. In this case, the output shows that
the culprit is indeed the RTC interrupt handler:

crash> irq
    IRQ: 8
handler: f8a8c000  <rtc_interrupt>
            flags: 20000000  (SA_INTERRUPT)
             mask: 0
             name: f8a8c29d  "rtc"

crash> quit

Let’s now shift gears to a case where the kernel freezes,
rather than generating an “oops.” Consider the following buggy driver
init() routine:

static int __init
mydrv_init (void)
{
  spin_lock (&mydrv_wq.lock);  /* Usage before initialization! */
  spin_lock_init (&mydrv_wq.lock);

  /* … */
}

The code is erroneously using a spin lock before initializing
it. Effectively, the CPU spins forever trying to acquire the lock,
and the kernel appears to hang. Let’s debug this problem
using kdump. In this case, there will be no
auto trigger since there is no panic, so force a crash dump by
pressing the magic sysrq key combination,
Alt-Sysrq-c. You may need to enable sysrq by
writing a 1 to /proc/sys/kernel/sysrq:

bash> echo 1 > /proc/sys/kernel/sysrq
bash> modprobe mydrv
/* HANG inside mydrv_init() */

… Triggering Crash Dump …
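Whether the keyboard combination works depends on the value in /proc/sys/kernel/sysrq. The sketch below assumes the bitmask interpretation used by later kernels, where 1 enables every SysRq function and bit 0x8 enables the debugging dumps that the crash function belongs to; on the kernels discussed here the value is simply 0 or 1. The function name is made up for this example.

```shell
#!/bin/bash
# Succeed if the given /proc/sys/kernel/sysrq value permits the crash ('c')
# function from the keyboard: either 1 (everything) or bit 0x8 set.
sysrq_allows_crash() {
    local value=$1
    [ "$value" -eq 1 ] || [ $(( value & 8 )) -ne 0 ]
}

if [ -r /proc/sys/kernel/sysrq ]; then
    if sysrq_allows_crash "$(cat /proc/sys/kernel/sysrq)"; then
        echo "Alt-SysRq-c is enabled"
    else
        echo "Alt-SysRq-c is disabled; write 1 to /proc/sys/kernel/sysrq"
    fi
fi

# On a machine without a usable keyboard, root can request the same crash
# with: echo c > /proc/sysrq-trigger   (this crashes the running kernel)
```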

Save the dump to disk after kexec boots
the capture kernel, boot back to the original kernel, and run
crash on the saved dump:

bash> crash vmlinux vmcore.dump
       PANIC: "SysRq : Trigger a crashdump"
         PID: 2115
     COMMAND: "insmod"
        TASK: f7c57000  [THREAD_INFO: f6170000]
         CPU: 0

Test the waters by checking the identity of the process that was
running during the time of the crash. In this case, it was
apparently insmod (of mydrv.o):

crash> ps
 2171   2137   0  f6bb7000  IN   0.5   11728   5352  emacs-x
 2214      1   0  f6b5b000  IN   0.1    2732   1192  login
 2230   2214   0  f6bb0550  IN   0.1    4580   1528  bash
> 2261   2230   0  c596f550  RU   0.0    1572    376  insmod

The stack trace doesn’t yield useful information, except
that it blames the sysrq keypress for causing the crash:

crash> bt
PID: 2115   TASK: f7c57000  CPU: 0   COMMAND: "insmod"
 #0 [c0431e68] crash_kexec at c013a8e7
 #1 [c0431eb4] __handle_sysrq at c0254664
 #2 [c0431edc] handle_sysrq at c0254713

Let’s peek at the log messages generated by the crashed
kernel. The log command reads the messages
from the kernel printk ring buffer on the
dump file:

crash> log
BUG: soft lockup detected on CPU#0!

Pid: 2261, comm:               insmod
EIP: 0060:[<c010ec1b>] CPU: 0
EIP is at delay_pmtmr+0xb/0x20
 EFLAGS: 00000246    Tainted: P       ( #11)
EAX: 5caaa48c EBX: 00000001 ECX: 5caaa459 EDX: 00000012
ESI: 02e169c9 EDI: 00000000 EBP: 00000001 DS: 007b ES: 007b
CR0: 8005003b CR2: 08062017 CR3: 35e89000 CR4: 000006d0
 [<c01fede9>] __delay+0x9/0x10
 [<c0200089>] _raw_spin_lock+0xa9/0x150
 [<f893d00d>] mydrv_init+0xd/0xb2 [wqdrv]
 [<c0136565>] sys_init_module+0x175/0x17a2
 [<c015d834>] do_sync_read+0xc4/0x100
 [<c013ce4d>] audit_syscall_entry+0x13d/0x170
 [<c0105578>] do_syscall_trace+0x208/0x21a
 [<c0102f05>] syscall_call+0x7/0xb
SysRq : Trigger a crashdump

The log offers two useful pieces of debug information. First, it
lets you know that a soft lockup was detected on the crashed
kernel. The kernel detects this as follows: a kernel watchdog
thread runs once a second and touches a per-cpu time-stamp
variable. If the system locks up, the watchdog thread can’t
update this time-stamp. An update check is carried out during timer
interrupts using softlockup_tick() (defined
in kernel/softlockup.c). If the watchdog
time-stamp is more than 10 seconds old, softlockup_tick() concludes that a soft
lockup has occurred and emits a kernel message.
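The check described above boils down to comparing two timestamps. The following is a user-space simulation of that comparison, not the kernel's code: the function name, its arguments, and the use of plain integer seconds are all assumptions made for illustration.

```shell
#!/bin/bash
SOFTLOCKUP_THRESHOLD=10   # seconds, as described in the text

# Compare the time the watchdog last touched its timestamp with the current
# time. Print the kernel's message and succeed if the gap exceeds the threshold.
softlockup_check() {
    local touched=$1 now=$2 cpu=${3:-0}
    if [ $(( now - touched )) -gt "$SOFTLOCKUP_THRESHOLD" ]; then
        echo "BUG: soft lockup detected on CPU#$cpu!"
        return 0
    fi
    return 1
}

# Watchdog ran 5 seconds ago: no lockup reported.
softlockup_check 100 105 || echo "CPU#0 looks healthy"
# Watchdog has not run for 11 seconds: lockup reported.
softlockup_check 100 111
```

In the hang above, the CPU spinning inside mydrv_init() never lets the watchdog thread run, so the gap keeps growing until the message appears in the log.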

Secondly, the log frowns on mydrv_init().
So, let’s look at the disassembly of the code region
surrounding mydrv_init+0xd:

crash> dis f893d000 5
dis: WARNING: f893d000: no associated kernel symbol found
0xf893d000:     mov    $0xf89f1208,%eax
0xf893d005:     sub    $0x8,%esp
0xf893d008:     call   0xc0342300 <_spin_lock>
0xf893d00d:     movl   $0xffffffff,0xf89f1214
0xf893d017:     movl   $0xffffffff,0xf89f1210

The return address in the stack is 0xf893d00d, so the kernel is hanging inside the previous
instruction, which is a call to spin_lock().
If you correlate this with the earlier source snippet and look
it in the eye, you can see the error sequence spin_lock()/spin_lock_init() staring sorrowfully back at
you. Fix the problem by swapping the sequence.

You can also use crash to peek at data
structures of interest, but be aware that memory regions that were
swapped out during the crash are not part of the dump. In the above
example, you can, say, look at mydrv_wq as follows:

crash> rd mydrv_wq 100
f892c200:  00000000 00000000 00000000 00000000   ................
f892c230:  2e636373 00000068 00000000 00000011   scc.h...........

The GNU Debugger (gdb) is integrated with crash,
so you can pass commands from crash to
gdb for evaluation. For example, you can use
gdb’s p
command to print data structures.

Looking at the Sources

Architecture-dependent portions of kexec
reside in arch/your-arch/kernel/machine_kexec.c and arch/your-arch/kernel/relocate_kernel.S. The generic
parts live in kernel/kexec.c (and
include/linux/kexec.h). Peek inside
arch/your-arch/kernel/crash.c and
arch/your-arch/kernel/crash_dump.c for the
kdump implementation.

Sreekrishnan Venkateswaran is a development manager
at IBM India. His recent projects include putting Linux onto
pervasive and medical grade devices. You can reach Krishnan at
