Peering Into Linux

How to find Linux performance problems using SystemTAP.
Linux has a new way to solve both performance and functional problems: the System Tracing And Profiling framework, or SystemTAP. SystemTAP simplifies tracing of the Linux kernel, collecting data that can be analyzed separately and at a later time. With SystemTAP, there’s no need to instrument, rebuild, reinstall, and reboot kernels (often repeatedly) to collect vital data.
SystemTAP (http://sourceware.org/SystemTAP/) is based on the Kprobes technology (http://sourceforge.net/projects/dprobes) available in the Linux kernel. SystemTAP provides a simple awk- like scripting language to define kernel instrumentation called probe points. A simple command-line interface activates probe points of interest. SystemTAP lets system administrators and developers gather real time performance data that’s crucial to solving problems in production systems. With SystemTAP, You can peer into some of the operating system’s inner workings.

Static and Dynamic Tracing

A tracing facility records a sequence of system events and/or function calls. For example, you might follow function calls to watch both parameters and return codes. Tracing can have two forms: static tracing and dynamic tracing.
Static tracing predefines a set of trace points. As you might expect, static tracing isn’t always suitable to solve specific performance problems, as it records a small amount of predefined data at relatively few predefined points. Typically, static trace points are pre-built into software and can only be turned on or off. For instance, each system call might have two trace points, one after entering the API call (the pre-trace point), and one directly before leaving the API call (post-trace point). The pre-trace point provides for inspection of API parameters, while the post-trace point reveals the result (return code).
Dynamic tracing is more flexible, as it allows you to capture the right data in the right routine. Dynamic tracing adds probes where they are needed and can instrument a program without a restart. Additionally, dynamic tracing imposes no performance overhead when data collection is not needed, unlike static tracing. Dynamic tracing also lets you localize probe points to precise areas of interest.
SystemTAP uses dynamic tracing. It allows the experimenter to set and change trace points on-the-fly, and depending on what is captured in the trace, allows precise measurements of the time taken by various operating system activities.
Listing One demonstrates one of the static tracing methods available for Linux. The code is based on the IRQ processing for the x86 architecture found in the kernel source code file linux/arch/i386/kernel/irq.c file. Line 443 adds an entry called TRACE_IRQ_ENTRY() to start the tracing for the IRQ routine. The entry on line 460, called TRACE_IRQ_EXIT(), ends the tracing for the IRQ routine.
LISTING One: IRQ trace points

443 TRACE_IRQ_ENTRY(irq, !(user_mode(regs)));
445 status = 1; /* Force the “do bottom halves” bit */
455 if (!(action->flags & SA_INTERRUPT))
456 __cli();
458 irq_exit(cpu, irq);
462 return status;

As a comparison, Figure One shows the steps that a probe script goes through to dynamically add one or more trace points into the kernel. A probe script is first parsed and expanded and eventually translated into C source code. The code is then compiled, yielding a kernel module that, once loaded, places the script’s defined probes dynamically into the kernel using Kprobes. When the probes are no longer needed, the kernel module is unloaded. Then the fun starts — analyzing the data produced by the trace.
FIGURE ONE: The steps required to dynamically add trace points to the kernel

Preparing to Use SystemTAP

To use SystemTAP, you must enable certain features in your kernel and install some additional software.
SystemTAP requires Kprobes, which is enabled under the “Instrumentation Support& gt; Kprobes.” Kprobes gives SystemTAP the capability to insert probes into the kernel. SystemTAP also needs RelayFS, which can be enabled by choosing “General Setup& gt; Kernel& gt; User space relay support.” RelayFS is a file system that provides an efficient mechanism to transfer large amounts of data with very low overhead. RelayFS is one method that SystemTAP uses to capture the trace data from the kernel and move it to user space. (SystemTAP’s default mode of transportation of kernel data is the procfs.)
There are three RPMs required to use SystemTAP: kernel-debuginfo, kernel-devel, and SystemTAP.
*kernel-debuginfo provides debug symbols for a standard kernel RPM that you already have installed. The SystemTAP tools require the debug symbols.
*kernel-devel provides the kernel headers and Makefiles needed to build modules against the kernel package.
*Finally, the SystemTAP RPM includes the translator, stap, and runtime libraries.
SystemTAP is available now in Red Hat Linux, Fedora Core, and Gentoo Linux, and soon will be available in SuSE SLES 10. The example system used for this article was Fedora Core 5 on an x86 machine. kernel-debuginfo for Fedora Core 5 can be downloaded from http://download.fedora.redhat.com/pub/fedora/linux/core/5/i386/debug/.
To make installation easy, let’s use yum, as in yum install package. For example, the command the command yum install abiword automatically Abiword and all of its prerequisites and dependencies. To install kernel-debuginfo and kernel-devel, type (as root):
# yum install kernel-debuginfo
# yum install kernel-devel
The SystemTAP package can be selected during the install process (it’s located in the Development Tools package grouping.)

Adding a Probe

You don’t need a deep understanding of the Linux kernel to apply SystemTAP. There are several tapsets that have been or are being developed that abstract away the kernel implementation. Tapsets are libraries of script or C code used to extend the capability of a basic script. Tapsets make it easier for an administrator to extract information about kernel events from a running system, effectively tracing the machine’s kernel at that point in time.
The simplest kind of tapset is one that uses the ordinary script language to define new probes or auxiliary functions for invocation by an end-user script or another tapset. Let’s look focus on a couple of probes, so you can see how probes are created and how data is captured from the kernel.
Let’s add a probe to the sys_open() kernel routine. Listing Two contains the source for sys_open(), found in /usr/src/linux/fs/open.c file. Listing Three shows a probe that captures the number of open calls occurring on the system.
LISTING TWO: The Linux kernel’s sys_open() routine
01asmlinkage long sys_open(const char __user * filename, int flags, int mode)
03 char * tmp;
04 int fd, error;
06#if BITS_PER_LONG != 32
07 flags |= O_LARGEFILE;
09 tmp = getname(filename);
10 fd = PTR_ERR(tmp);
11 if (!IS_ERR(tmp)) {
12 fd = get_unused_fd();
13 if (fd >= 0) {
14 struct file *f = filp_open(tmp, flags, mode);
15 error = PTR_ERR(f);
16 if (IS_ERR(f))
17 goto out_error;
18 fd_install(fd, f);
19 }
21 putname(tmp);
22 }
23 return fd;
26 put_unused_fd(fd);
27 fd = error;
28 goto out;
The global variable count_for_sys_open holds the count. Variables may be declared global, so that they can be shared among all probes; globals stay as long as the entire SystemTAP session is active. (There is one namespace for all global variables, regardless of which probe script file creates them. Take care when creating the names of your global variables and choose unique names.)
LISTING THREE: sys_open.stp, a probe to count the number of system open calls

global count_for_sys_open

probe kernel.function(“sys_open”)

probe begin
print(“starting sys open probe”)

probe end
print(“ending sys_open probe”)
printf(ìsys_open() was called %d times\nî, count_for_sys_open)

The stap command compiles a script such as Listing Three into a loadable kernel module and loads the module. The message starting sys open probe displays when the module is loaded and indicates that data is being collected. There are different ways to stop data collection. Perhaps the simplest is to enter Control-C. When data collection stops, SystemTAP removes the defined probe and prints the number of times that sys_open() was called. For example:
# stap sys_open.stp
starting sys open probe
… Press Control-C …
ending sys_open probe
sys_open() was called 1022 times
Listing Four contains the source for the sys_read() and sys_write() kernel routines located in /usr/src/linux/fs/read_write.c. Both routines have a parameter called count, which is the requested number of bytes to read or write, depending on which of the routines is called. Let’s capture the value of count in a second probe, shown in Listing Five. Each call returns the number of bytes read or written as output.
LISTING FOUR: The source code for the Linux sys_read() and sys_write() kernel routines

asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count)
struct file *file;
ssize_t ret = -EBADF;
int fput_needed;

file = fget_light(fd, &fput_needed);
if (file) {
loff_t pos = file_pos_read(file);
ret = vfs_read(file, buf, count, &pos);
file_pos_write(file, pos);
fput_light(file, fput_needed);

return ret;

asmlinkage ssize_t sys_write(unsigned int fd, const char __user * buf, size_t count)
struct file *file;
ssize_t ret = -EBADF;
int fput_needed;

file = fget_light(fd, &fput_needed);
if (file) {
loff_t pos = file_pos_read(file);
ret = vfs_write(file, buf, count, &pos);
file_pos_write(file, pos);
fput_light(file, fput_needed);

return ret;

The probe in Listing Five is a little more involved than the previous example. Probes are placed in sys_read() and sys_write(). Global variable count_sys_read holds the number of times the sys_read() routine is called, while global count_sys_write() holds the number of times the sys_write routine() is called. On line 3, sys_read_bytes_req is holds the total number of requested bytes to read; on line 4, sys_rd_bytes captures the total number of bytes read. Line 5 creates sys_write_bytes_req, which holds the total number of requested bytes to write. On the next line, sys_wr_bytes holds the total number of bytes written.
On line 11, the parameter count, which is passed into sys_read(), is added to sys_read_bytes_req and thus keeps track of the requested total bytes to read. Line 26 performs a similar operation to store the requested total number bytes to write.
LISTING FIVE: sys_read_write.stp creates two probes, one for sys_read() and another for sys_write()
01global count_sys_read
02global count_sys_write
03global sys_read_bytes_req
04global sys_rd_bytes
05global sys_write_bytes_req
06global sys_wr_bytes
08probe kernel.function(“sys_read”)
10 ++count_sys_read;
11 sys_read_bytes_req += $count;
14probe kernel.function(“sys_read”).return
16 rnb = retval();
18 if (rnb == -1)
19 print(” +++Read not successful+++ \n”);
20 else
21 sys_rd_bytes += rnb;
24probe kernel.function(“sys_write”)
26 ++count_sys_write;
27 sys_write_bytes_req += $count;
30probe kernel.function(“sys_write”).return
32 wnb = retval();
34 if (wnb == -1)
35 print(” ***Write not successful*** \n”);
36 else
37 sys_wr_bytes += wnb;
40probe begin
42 print(“starting probe sys_read and sys_write \n”)
45probe end
47 print(“ending probe sys_read and sys_write \n”)
48 printf(“%d bytes requested to read \n”,sys_read_bytes_req)
49 printf(“Total bytes read %d \n”,sys_rd_bytes)
50 printf(“%d calls to sys_read() \n”,count_sys_read)
51 printf(“%d bytes requested to write \n”,sys_write_bytes_req)
52 printf(“Total bytes written %d \n”,sys_wr_bytes)
53 printf(“%d calls to sys_write() \n”,count_sys_write)
The results of running the probe would be similar to below.
# stap sys_read_write.stp
starting probe sys_read and sys_write
ending probe sys_read and sys_write
6696757 bytes requested to read
Total bytes read 6696757
329 calls to sys_read()
120020 bytes requested to write
Total bytes written 120020
155 calls to sys_write()
The output indicates that the number of read calls (329) is more than the number of write calls (155) going on in this system. The total number of bytes read, 6696757 is significantly more than the number of bytes written, 120020.

Finding Functions to Probe

Using stap, there’s an easy way to find all of the functions that can be probed in the kernel. To list the functions that can be probed, use the last-pass option to the translator, –p2.
# stap –p2 –e ’probe kernel.function("*") {}’ \
| sort | uniq

Convenience, Smarts on Demand

SystemTAP is a dynamic instrumentation infrastructure that enables novices and as well as experts to gather performance data to solve system problems. It’s lightweight and ideal for troubleshooting performance problems that arise during interactions between applications and the kernel. Data from kernel structures can also be included in a probe. Better yet, SystemTAP doesn’t require re-instrumentation, compilation, installation, and reboot usually required to collect data on the kernel.

Steve Best works in the IBM Linux Technology Center in Beijing, China. Steve has worked on Linux projects since 1999, doing extensive operating system development work in file systems, internationalization, and security. Steve is the author of the book Linux Debugging and Performance Tuning, and is currently working on Linux-related products. He can be contacted at class="emailaddress">sbest@us.ibm.com.

Comments are closed.