Expand Your Debugging Toolkit with Kernel Probes

Tired of booting a debug kernel? Kprobes can intrude into your kernel code and extract debug information or apply run time medication.

Current releases of the Linux 2.6 kernel have new features such as kprobes and jprobes to support reliability, availability, and serviceability (or RAS; see the sidebar of the same name for more information). Kprobes can intrude into a kernel function to apply a patch or extract debug information, and are useful additions to your debugging repertoire. You’ll find kprobes essential when investigating inexplicable behavior at a customer site, especially when you don’t have the option to reboot the system.

In this column, let’s learn how to use kprobes using a handful of examples. Next month, we’ll continue and look at other facets of Linux RAS, such as kexec and kdump.

Kprobes

Kprobes can save you the trouble of building and booting a debug kernel. Using kprobes, you can dynamically dump kernel data structures or insert code into a running kernel. You can, for example, add a few printk() calls inside the Linux scheduler on the fly, without recompiling the kernel. In fact, you could even patch a bug on a Mars rover without rebooting it.

To insert a kprobe into a kernel function:

1. Turn on CONFIG_KPROBES during kernel configuration. Kprobes has recently moved from “Kernel Hacking” to “Instrumentation Support” in the kernel configuration menu.

2. Write a kernel module that registers a kprobe at the instruction of interest. You need to register a pre-handler that kprobes runs just before executing the probed instruction, and a post-handler that kprobes runs after executing the probed instruction. You can also supply a fault-handler, which runs if a fault is detected while executing the pre- or post-handlers (since you don’t want to “oops” because of a debugging bug).

When a kprobe is registered, it saves the probed instruction and replaces it with an instruction that generates a breakpoint (int3 on x86-based systems). When the breakpoint is hit, the kernel generates a die notification (notifier chains were discussed in a previous column). Kprobes inserts itself into the die notifier chain to get notified when breakpoints are hit.

Once notified, kprobes executes the registered pre-handler. Next, it single-steps through a copy of the probed instruction. (To stay consistent on SMP systems, it leaves the breakpoint in place rather than swapping the original instruction back in before single-stepping.) Finally, kprobes executes the post-handler. The pre- and post-handler windows are your hooks into the process. The handlers can be registered and unregistered dynamically, so serviceability is not merely static at compile time, but programmable during run time.
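To make the sequence concrete, here is a toy user-space model of it. It is only a sketch: the toy_ names are invented for illustration and are not kernel APIs, and "executing" an instruction simply means adding its byte value to an accumulator. What it does show is the order of events: save the original byte, plant a breakpoint, and on a hit run the pre-handler, step the saved copy, then run the post-handler.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BREAKPOINT 0xCC /* same byte value as the x86 int3 opcode */

/* Hypothetical per-probe state for the toy model */
struct toy_probe {
    size_t offset;          /* index of the probed "instruction" */
    unsigned char saved;    /* copy of the original byte */
    int pre_calls;          /* how often the pre-handler ran */
    int post_calls;         /* how often the post-handler ran */
};

/* Arm the probe: save the original byte, then overwrite it */
void toy_register(unsigned char *text, size_t len, struct toy_probe *p)
{
    assert(p->offset < len);
    p->saved = text[p->offset];
    text[p->offset] = TOY_BREAKPOINT;
    p->pre_calls = 0;
    p->post_calls = 0;
}

/* Disarm the probe: put the original byte back */
void toy_unregister(unsigned char *text, struct toy_probe *p)
{
    text[p->offset] = p->saved;
}

/* "Execute" the text; p may be NULL when no probe is armed */
unsigned toy_run(const unsigned char *text, size_t len, struct toy_probe *p)
{
    unsigned acc = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (p && i == p->offset && text[i] == TOY_BREAKPOINT) {
            p->pre_calls++;   /* pre-handler window */
            acc += p->saved;  /* step the saved copy; text[] stays patched */
            p->post_calls++;  /* post-handler window */
        } else {
            acc += text[i];
        }
    }
    return acc;
}
```

Running the same text with and without the probe armed should produce the same accumulator value, which is the point: the probed code behaves as before, and the handlers get a look in on the way through.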

Here’s an example. Consider the code snippet in Listing One, a kernel thread that frees npages number of pages to the free memory pool each time a SIGUSR1 signal is delivered to it. Assume that you’re at a customer site to debug a problem reported with this code. Specifically, you notice that bad things happen whenever npages becomes greater than 10; as a repair, you want to add a run time patch to limit the number of pages freed in any single chunk to 10.

LISTING ONE: mydrv.c, the problem code to patch with kprobes

int npages = 0;
EXPORT_SYMBOL (npages);

static int memwalkd (void *unused)
{
    long curr_pfn = (64*1024*1024 >> PAGE_SHIFT);
    struct page *curr_page;
    siginfo_t info;
    int sig = 0, i;
    /* ... */

    daemonize ("memwalkd"); /* kernel thread */

    sigfillset (&current->blocked);
    allow_signal (SIGUSR1);

    for (;;) {
        /* Dequeue a signal if it's pending */
        if (signal_pending (current)) {
            sig = dequeue_signal (current, &current->blocked, &info);

PointA:
            /* Free npages pages when SIGUSR1 is received */
            if (sig == SIGUSR1) {

PointB:
                /* Problem manifests when npages exceeds 10 */
                /* Let's apply run time medication here via kprobes */
                for (i = 0; i < npages; i++, curr_pfn++) {
                    /* ... */
                }
            }
            /* ... */
        }
        /* ... */
    }
}

Listing Two uses kprobes to insert a patch at kallsyms_lookup_name("memwalkd")+0xaa that limits npages to 10. To figure out how to arrive at this probe address, look again at Listing One. You want the patch to be inserted at Point B. To calculate the kernel address at Point B, disassemble the contents of mydrv.o using objdump. The output should resemble Figure One.

FIGURE ONE: Disassembling the problematic code in mydrv.o
 $ objdump -D mydrv.o

 mydrv.o:     file format elf32-i386

 Disassembly of section .text:

 00000000 <memwalkd>:
 0: 55                    push   %ebp
 1: bd 00 40 00 00        mov    $0x4000,%ebp
 6: 57                    push   %edi
 7: 56                    push   %esi
 8: 53                    push   %ebx
 9: bb 00 f0 ff ff        mov    $0xfffff000,%ebx
 e: 81 ec 90 00 00 00     sub    $0x90,%esp
 …
 ; Point A
 7a: 83 f8 0a              cmp    $0xa,%eax
 7d: 74 2b                 je     aa <memwalkd+0xaa>
 7f: 83 f8 09              cmp    $0x9,%eax
 82: 75 cc                 jne    50 <memwalkd+0x50>
 …
 a9: c3                    ret    
 ;Point B
 aa: a1 00 00 00 00        mov    0x0,%eax
 af: 85 c0                 test   %eax,%eax
 b1: 0f 8e b5 00 00 00     jle    16c <memwalkd+0x16c>
 b7: 81 fd 7b f6 00 00     cmp    $0xf67b,%ebp
 …  
 fa: a1 00 00 00 00        mov    0x0,%eax

If you try to match the C code of Listing One to the disassembled code above, you can associate Point A and Point B with kernel addresses. kallsyms_lookup_name() locates the address of memwalkd(), and 0xaa is the offset of Point B. Hence, you should apply the kprobe at kallsyms_lookup_name("memwalkd")+0xaa.
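The arithmetic is simple, but two details are worth getting right: check the looked-up address before adding the offset (a failed lookup plus 0xaa is no longer zero), and add the offset in bytes. The sketch below works on plain integers in user space; probe_address() and scaled_address() are hypothetical helpers, and the sample address is made up. On x86 the opcode type is one byte wide, so pointer arithmetic and byte arithmetic agree there; scaled_address() shows what pointer arithmetic would give if the opcode type were wider.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Symbol address plus a byte offset, as read from the disassembly.
 * Returns 0 if the symbol lookup failed, so the caller can still
 * detect the failure after the offset has been applied.
 */
uintptr_t probe_address(uintptr_t symbol_addr, uintptr_t byte_offset)
{
    if (symbol_addr == 0)
        return 0;
    assert(byte_offset <= UINTPTR_MAX - symbol_addr); /* no wraparound */
    return symbol_addr + byte_offset;
}

/*
 * What "(opcode_type *) symbol_addr + count" evaluates to: pointer
 * arithmetic advances by count elements, not count bytes.
 */
uintptr_t scaled_address(uintptr_t symbol_addr, uintptr_t count,
                         size_t opcode_size)
{
    return symbol_addr + count * opcode_size;
}
```

With a one-byte opcode type the two helpers return the same address; with a hypothetical four-byte opcode type, scaled_address() lands 0x2a8 bytes past the symbol instead of 0xaa.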

LISTING TWO: Registering kprobe handlers

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/kallsyms.h>
#include <linux/sched.h>

extern int npages; /* See Listing One */

/* Per-probe structure */
static struct kprobe bandaid;

/* Pre Handler: Invoked before running probed instruction */
int bandaid_pre (struct kprobe *p, struct pt_regs *regs)
{
    if (npages > 10) npages = 10;
    return 0;
}

/* Post Handler: Invoked after running probed instruction */
void bandaid_post (struct kprobe *p, struct pt_regs *regs, unsigned long flags)
{
    /* Nothing to do */
}

/* Fault Handler: Invoked if the pre/post handlers encounter a fault */
int bandaid_fault (struct kprobe *p, struct pt_regs *regs, int trapnr)
{
    return 0;
}

int init_module (void)
{
    int retval;
    unsigned long addr;

    /* Fill the kprobe structure */
    bandaid.pre_handler = bandaid_pre;
    bandaid.post_handler = bandaid_post;
    bandaid.fault_handler = bandaid_fault;

    /* Look up the function; check the result before adding the offset */
    addr = kallsyms_lookup_name ("memwalkd");
    if (!addr) {
        printk ("Bad Probe Point\n");
        return -1;
    }

    /* Arrive at the target address as explained */
    bandaid.addr = (kprobe_opcode_t *) (addr + 0xaa);

    /* Register the kprobe */
    if ((retval = register_kprobe (&bandaid)) < 0) {
        printk ("register_kprobe error, return value=%d\n", retval);
        return -1;
    }
    return 0;
}

void cleanup_module (void)
{
    unregister_kprobe (&bandaid);
}

MODULE_LICENSE ("GPL"); /* You can't link the kprobes API
                           unless your user module is GPL'ed */

Once you register the probe, memwalkd() in Listing One is equivalent to this:

 static int memwalkd (void *unused)
 {
 /* ...*/ 
 for (;;) {
 /* ... */

 PointA:
 /* Free npages pages when SIGUSR1 is received */
 if (sig == SIGUSR1) {

 PointB: 
 if (npages > 10) npages = 10; /* The medicated patch! */

 for (i=0; i < npages; i++, curr_pfn++) {
 /* ... */
 }
 }
 /* ... */
 }
 /* ... */
 }

Whenever npages is assigned a value greater than 10, the kprobe patch resets it to 10, sidestepping the problem.
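One C detail is easy to get wrong in registration checks like the one in Listing Two: the comparison operator binds tighter than assignment, so where the parentheses go decides what ends up in retval. The user-space sketch below uses a made-up stand-in, fake_register(), that always fails with -22; it is not the kprobes API.

```c
/* Hypothetical stand-in for a registration call that fails */
int fake_register(void)
{
    return -22;
}

/* Comparison happens first, so retval receives its result (0 or 1) */
int check_unparenthesized(void)
{
    int retval;

    if ((retval = fake_register() < 0))
        return retval;
    return 0;
}

/* Assignment happens first, so retval keeps the real error code */
int check_parenthesized(void)
{
    int retval;

    if ((retval = fake_register()) < 0)
        return retval;
    return 0;
}
```

Both versions detect the failure, but only the second one has the actual return value available to print in the error message.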

In the next two sections, let’s look at a couple of helper facilities that make it easier to use kprobes during function entry and exit.

JProbes

A jprobe is a specialized kprobe. It eases the work of adding a probe when the point to investigate is at the entry to a kernel function.

The jprobe handler assumes the same prototype as the probed function. It’s also invoked with the same argument list as the probed function, and you can easily access the passed parameters. (If you used a kprobe instead of a jprobe, imagine the hassles your probe handler would have, wading through the dark alleys of the function stack to extract function arguments! Worse, the code that delves into the stack to elicit argument values is highly function-specific, not to mention being architecture-dependent and unportable.)

To learn to use jprobes, let’s look at an example. Assume that you’re debugging a network device driver (which is built as part of the kernel) by looking at the printk() messages it’s generating. The driver is emitting crucial values in octal (base 8), but to your horror, the driver writer has introduced a typo in the print format string, coding %O instead of %o. The small error leaves you “blind,” because all you can see are messages such as “Number of Free Receive buffers=%O”.

Jprobes to the rescue. You can fix this in a few seconds, without recompiling or rebooting the kernel. First, have a peek at printk() defined in kernel/printk.c:

 asmlinkage int printk (const char *fmt, ...)
 {
 va_list args;
 int r;

 va_start(args, fmt);
 r = vprintk(fmt, args);
 va_end(args);
 return r;
 }

Let’s add a simple jprobe at the entry to printk() and transform all occurrences of %O into %o. Listing Three does the job.

LISTING THREE: Registering jprobe handlers

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/kallsyms.h>

/* Jprobe entrance to printk */
asmlinkage int jprintk (const char *fmt, ...)
{
    /* Rewrite each "%O" to "%o"; leave any other 'O' alone */
    for (; *fmt; ++fmt) {
        if (fmt[0] == '%' && fmt[1] == 'O')
            * (char *) (fmt + 1) = 'o';
    }
    jprobe_return ();
    return 0;
}

/* Per-probe structure */
static struct jprobe jprobe_eg = {
    .entry = (kprobe_opcode_t *) jprintk
};

int init_module (void)
{
    int retval;

    jprobe_eg.kp.addr = (kprobe_opcode_t *)
        kallsyms_lookup_name ("printk");

    if (!jprobe_eg.kp.addr) {
        printk ("Bad Probe Point\n");
        return -1;
    }

    /* Register the Jprobe */
    if ((retval = register_jprobe (&jprobe_eg)) < 0) {
        printk ("register_jprobe error, return value=%d\n", retval);
        return -1;
    }
    printk ("Jprobe registered.\n");
    return 0;
}

void cleanup_module (void)
{
    unregister_jprobe (&jprobe_eg);
}

MODULE_LICENSE ("GPL");

The jprobe handler needs to have the same prototype as printk(). Both functions are marked with the asmlinkage tag that asks the compiler to leave arguments on the stack, rather than in registers.

When Listing Three invokes register_jprobe() to register the jprobe, a kprobe is inserted at the beginning of printk(). When this probe is hit, kprobes replaces the saved return address with that of the registered jprobe handler, jprintk(). It then copies a portion of the stack and returns, thus passing control to jprintk() with printk()’s argument list. When jprintk() calls jprobe_return(), the original call state is restored and printk() continues to execute normally.

Once you insert this jprobe user module, the network driver no longer emits useless messages announcing %O buffers. Instead, it prints much more useful information, such as “Number of Free Receive buffers=12”.
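The string rewrite at the heart of the handler is easy to try out in user space before you load anything into a kernel. The function below is a sketch, and fix_octal_spec() is a hypothetical name. It changes only an O that directly follows a %, skips a literal %%, and makes no attempt to handle flags or field widths between the % and the conversion character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Rewrite every "%O" in a writable format string to "%o".
 * Returns the number of conversions that were fixed.
 */
int fix_octal_spec(char *fmt)
{
    int fixed = 0;
    char *p = fmt;

    assert(fmt != NULL);
    while ((p = strchr(p, '%')) != NULL) {
        if (p[1] == '\0')
            break;          /* lone '%' at the end of the string */
        if (p[1] == 'O') {
            p[1] = 'o';
            fixed++;
        }
        p += 2;             /* skip '%' and the character after it */
    }
    return fixed;
}
```

Called on a writable copy of “Number of Free Receive buffers=%O”, it returns 1 and leaves the capital letters elsewhere in the message alone.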

Return Probes

A return probe (or a kretprobe in kprobes terminology) is another specialized kprobe helper. A return probe eases the work of inserting a kprobe when you need to probe the return point of a function. (If you use a kprobe to investigate function return points, you might need to register them at multiple places since a function can return via multiple code paths. However, if you use return probes, you need to insert only a single probe, rather than register, say, 20 kprobes to cover a function’s 20 return paths.)

The function tty_open(), defined in drivers/char/tty_io.c, has seven return paths. One of them is success; the others are -ENXIO, -ENODEV, and errors returned by sub-functions. A single return probe suffices to alert you to failures regardless of the associated code path. Listing Four implements this return probe.

LISTING FOUR: Registering a return probe handler

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/kallsyms.h>

/* Kretprobe at exit from tty_open */
static int kret_tty_open (struct kretprobe_instance *kreti,
                          struct pt_regs *regs)
{
    /* The EAX register contains the function return value
       on x86 systems */
    if ((int) regs->eax) {
        /* tty_open() failed. Announce the return code */
        printk ("tty_open returned %d\n", (int) regs->eax);
    }
    return 0;
}

/* Per-probe structure */
static struct kretprobe kretprobe_eg = {
    .handler = (kretprobe_handler_t) kret_tty_open
};

int init_module (void)
{
    int retval;

    kretprobe_eg.kp.addr = (kprobe_opcode_t *)
        kallsyms_lookup_name ("tty_open");

    if (!kretprobe_eg.kp.addr) {
        printk ("Bad Probe Point\n");
        return -1;
    }

    /* Register the kretprobe */
    if ((retval = register_kretprobe (&kretprobe_eg)) < 0) {
        printk ("register_kretprobe error, return value=%d\n", retval);
        return -1;
    }

    printk ("Kretprobe registered.\n");
    return 0;
}

void cleanup_module (void)
{
    unregister_kretprobe (&kretprobe_eg);
}

MODULE_LICENSE ("GPL");

When Listing Four invokes register_kretprobe(), an internal kprobe is inserted at the beginning of tty_open(). When this probe gets hit, the internal kprobe handler replaces the function return address with that of a special routine (called a trampoline in kprobes terminology). Look at arch/your-arch/kernel/kprobes.c for the implementation of the trampoline.

When tty_open() returns via any of its seven return paths, control returns to the trampoline instead of the caller function. The trampoline invokes the kretprobe handler (kret_tty_open()) registered by Listing Four, which prints the return value if tty_open() hasn’t returned successfully.
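The idea of catching every return at one choke point can be pictured with a user-space analogy. This is only an illustration: toy_open() is a made-up function with several return paths, and toy_trampoline() is an ordinary wrapper, whereas the real trampoline is reached by rewriting the return address on the stack.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical function with more than one return path */
int toy_open(int minor)
{
    if (minor < 0)
        return -ENXIO;
    if (minor > 255)
        return -ENODEV;
    return 0;
}

/* Hypothetical bookkeeping for the single return hook */
struct toy_retprobe {
    int (*target)(int);   /* function being watched */
    int failures;         /* number of nonzero returns seen */
    int last_error;       /* most recent nonzero return value */
};

/* Every return from the target, whatever its path, passes through here */
int toy_trampoline(struct toy_retprobe *rp, int arg)
{
    int ret;

    assert(rp != NULL && rp->target != NULL);
    ret = rp->target(arg);
    if (ret != 0) {       /* the "handler": note the failure */
        rp->failures++;
        rp->last_error = ret;
    }
    return ret;           /* the caller sees the original value */
}
```

One hook observes both failure paths and the success path without knowing how many return statements the target contains, which is the convenience a return probe offers over a set of hand-placed kprobes.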

Limitations

Kprobes has its limitations. Some of them are obvious. You won’t, for example, see desired results if you insert a probe inside an inline function. And of course, you can’t probe kprobes code itself.

Kprobes is more useful for applying probes inside the base kernel. If the subject code is part of a dynamically loadable module, you might as well rewrite and recompile your module rather than write and compile a new module to “kprobe” it. However, you might still want to use kprobes if bringing down the module is unacceptable.

There are less obvious limitations, too. Here’s one of them: Optimizations are done at compile time, while kprobes are inserted during run time. So, the effect of inserting instructions via kprobes isn’t equivalent to adding code in the original source files. For example, the code snippet…

 volatile int *integerp = (volatile int *) 0xFF;
 int integerd = *integerp;

… is reduced by the compiler to…

 mov 0xff, %eax

You can’t easily use kprobes to sneak in between those two lines of C code, allocate a word of memory, point integerp to the allocated word, and circumvent the crash.

Looking at the Sources

The kprobes implementation consists of a generic portion defined in kernel/kprobes.c (and include/linux/kprobes.h), and an architecture-dependent part found in arch/your-arch/kernel/kprobes.c (and include/asm-your-arch/kprobes.h).

Peek inside Documentation/kprobes.txt for some examples on kprobes, jprobes, and return probes.

Sreekrishnan Venkateswaran has been working for IBM India for over ten years. His recent projects include putting Linux onto a wristwatch, a cellphone, and a pacemaker programmer. You can reach Krishnan at krishhna@gmail.com.
