Profiling and Tracing

Profiling helps find code bottlenecks. Trace tools provide the ability to extract and interpret details about code execution.
Last month’s column showed you how to use debuggers to debug kernel code. This month, let’s look at a bunch of other tools you should have in your Linux kernel debugging repertoire.

Kernel Profiling using OProfile

A profiler identifies those portions of code that burn more CPU cycles. Profilers come in different flavors. The OProfile kernel profiler uses hardware assist to gather profile data, while the gprof application profiler relies on compiler assist to collect profile information.
OProfile is included in the 2.6 kernel. (If you’re using a 2.4 kernel, you will need to patch your sources with OProfile support. Visit http://oprofile.sourceforge.net to download the patches.) Profiling is done by sampling data at regular intervals using hardware performance counters supported by many processors. The performance counters can be programmed to count an event, such as the number of cache misses. On some laptops where the processor does not support performance counters, OProfile can get limited information by collecting data during timer events.
OProfile consists of:
*A kernel layer that collects profile information. To enable OProfile in your kernel, set CONFIG_PROFILING, CONFIG_OPROFILE, and CONFIG_APIC to y and recompile.
*The oprofiled daemon and a suite of tools that perform detailed analysis, including opcontrol, opreport, and op_help. (These tools are included in newer distributions like Fedora, but if your distribution doesn’t have them, you can download pre-compiled binaries.)
To explore the basics of kernel profiling, let’s simulate a bottleneck in the file system layer and use OProfile to detect it. The code of interest is the part of the file system layer that reads directories (function vfs_readdir() in fs/readdir.c).
First, use opcontrol to configure OProfile:
$ opcontrol --setup \
  --vmlinux=/path/to/kernelsources/vmlinux \
  --event=GLOBAL_POWER_EVENTS:100000:0x01:1:1
The --event option directs OProfile to collect samples during GLOBAL_POWER_EVENTS (time during which the processor is not stopped). The numerals denote the sampling count in clock cycles, a unit mask filter, kernel space counting, and user space counting, respectively. If you’d like to sample N times every second, your sample count should be approximately cpu_speed/N. A larger count generates a finer profile, but also results in more CPU overhead. The events supported by OProfile depend on the processor that is used:
$ opcontrol -l
GLOBAL_POWER_EVENTS: (counter: 0, 4)
        time during which processor is not stopped (min count: 3000)
BRANCH_RETIRED: (counter: 3, 7)
        retired branches (min count: 3000)
MISPRED_BRANCH_RETIRED: (counter: 3, 7)
        retired mispredicted branches (min count: 3000)
BSQ_CACHE_REFERENCE: (counter: 0, 4)
...

After configuring OProfile, run a benchmarking tool or a utility that stresses the part of the kernel that you’d like to profile. (The site http://lbs.sourceforge.net maintains a list of benchmarking projects on Linux.) For example, to exercise the Virtual File System (VFS) layer, recursively list all of the files in the system.
$ opcontrol --start
$ ls -lR /
$ opcontrol --dump
opcontrol --start starts data collection; opcontrol --dump saves the data.
Use opreport to look at the profiling results.
$ opreport -l /path/to/kernelsources/vmlinux

CPU: P4 / Xeon, speed 2992.9 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (count cycles when processor is active)
count 100000
samples  %        symbol name
914506   24.2423  vgacon_scroll
406619   10.7789  do_con_write
273023    7.2375  vgacon_cursor
206611    5.4770  __d_lookup
...
1380      0.0366  vfs_readdir
...
1        2.7e-05  vma_prio_tree_next
The % column gives a measure of each function’s load on the system. Time spent in vgacon_scroll can be ignored, since it’s the time consumed printing output to the console. Note that vfs_readdir(), the function of interest, sits far down the list: it’s still inexpensive.
Given that baseline, let’s now create a bottleneck in the VFS code by adding a one-millisecond delay to vfs_readdir(), as shown in Listing One.
LISTING ONE: Simulating a “bottleneck” in vfs_readdir()

int vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
        struct inode *inode = file->f_dentry->d_inode;
        int res = -ENOTDIR;

        /* Introduce a one-millisecond bottleneck (HZ=1000) */
        unsigned long timeout = jiffies + 1;
        while (time_before(jiffies, timeout));
        /* End of bottleneck */

        /* ... */

Compile the kernel with this change and recollect the profile. The new data will look like Figure One.
FIGURE ONE: Viewing the impact of a bottleneck
$ opreport -l /path/to/kernelsources/vmlinux

CPU: P4 / Xeon, speed 2993.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (count cycles when processor is active)
count 100000
samples  %        symbol name
6178015  57.1640  vfs_readdir
1065197   9.8561  vgacon_scroll
479801    4.4395  do_con_write
...

As you can see, the bottleneck is clearly reflected in the profiled data. vfs_readdir() has now jumped to the top of the list!
You can use OProfile to get a whole lot more information. For example, you could gather the percentage of data cache line misses.
(Caches are fast memory closest to the processor. Fetches into the cache are done in units of the processor cache line (32 bytes for the Pentium 4). If the data that you need is not already present in the cache (a cache miss), the processor must fetch it from main memory, burning more CPU cycles. However, subsequent accesses to that memory (and to the surrounding bytes pulled into the same cache line) remain fast until the corresponding cache line gets invalidated.)
You can configure OProfile to count the number of cache misses by profiling your kernel code for the event BSQ_CACHE_REFERENCE (for Pentium 4). You can then tune your code (possibly by realigning fields in data structures) to achieve better cache utilization.
On a Pentium 4 machine, a 2.6.9 kernel yielded the L2 cache miss data shown in Figure Two. The unit mask 0x100 denotes an L2 cache miss.
FIGURE TWO: L2 cache misses in the 2.6.9 kernel
$ opcontrol --setup --event=BSQ_CACHE_REFERENCE:50000:0x100:1:1
$ opcontrol --start
$ ls -lR /
$ opcontrol --dump
$ opreport -l /path/to/kernelsources/vmlinux

CPU: P4 / Xeon, speed 2993.68 MHz (estimated)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit)
with a unit mask of 0x100 (read 2nd level cache miss) count 50000
samples  %        symbol name
73       29.6748  find_inode_fast
59       23.9837  __d_lookup
27       10.9756  inode_init_once
If you run OProfile on different kernel versions and look at the corresponding change logs for each kernel, you might be able to figure out reasons for code changes in different parts of the kernel.

Profiling Applications

If you only need to profile an application process without profiling the kernel code that might get executed on its behalf, you can use gprof instead of OProfile. gprof relies on additional code generated by the compiler to profile C, Pascal, or Fortran programs.
Let’s use gprof to profile the code snippet in Listing Two.
LISTING TWO: Profiling a small program using gprof

main()
{
        int i;

        for (i = 0; i < 10; i++) {
                if (!do_task()) {            /* Perform task */
                        do_error_handling(); /* Handle errors */
                }
        }
}
Compile the code with the -pg option to have the compiler include extra code that generates a call graph profile when the program runs. The -g option generates symbolic information.
$ gcc -pg -g -o myprog myprog.c
$ ./myprog
Running ./myprog produces gmon.out, a call graph of myprog. Run gprof to view the profile, shown in Figure Three.
FIGURE THREE: Sample results from gprof
$ gprof -p -b myprog

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 65.17      2.75     2.75        2     1.38     1.38  do_error_handling
 34.83      4.22     1.47       10     0.15     0.15  do_task
This shows that the error path was followed twice during execution. You can tune the code to reduce or eliminate traversals of the error path, and then re-run gprof to generate new profiling data.

The Linux Trace Toolkit

Tracing gives insight into behavioral problems that manifest during interaction between different code modules.
A common way to obtain execution traces is to use printk(). While printk() is perhaps the most heavily used method for kernel debugging (there are around 60,000 printk() statements in the 2.6.11 source tree), it’s not sophisticated enough for high-volume tracing.
The Linux Trace Toolkit (http://www.opersys.com/LTT/) is a powerful tool that lets you obtain complex system level traces with minimum overhead. LTT is not yet part of the mainstream kernel, but was recently included as a release candidate in a kernel patch by Andrew Morton, one of the maintainers of the kernel tree (see the 2.6.11-rc1-mm1 patch, downloadable from ftp://ftp.kernel.org/). Unlike OProfile, which collects data by sampling events at regular intervals, LTT gives exact traces of events when they occur.
LTT consists of four components:
*The core kernel trace module that provides trace services to the rest of the kernel. The collected traces are copied to a kernel buffer.
*Code that makes use of the trace services. These are inserted at points where you want trace dumps.
*A trace daemon that pulls the trace information from the kernel buffer to a permanent location in the file system.
*And utilities like tracereader and tracevisualizer that interpret raw trace data into a human readable form. If you are developing code for an embedded device that has no GUI support, you can transparently run these tools on another machine.
Let’s learn to use LTT with the help of a simple example. Assume that you’re seeing data corruption when your application reads from a device. You would first want to figure out whether the device itself is sending bad data or whether a kernel layer is introducing the corruption. To do that, let’s dump data packets and data structures at the device driver level. Listing Three initializes the LTT events that are to be generated.
LISTING THREE: Creating LTT Events

#include <linux/trace.h>

int data_packet, driver_data; /* Trace Events */

/* Driver init entry point */
static int __init mydriver_init(void)
{
        /* ... */

        /* Event to dump packets received from the device */
        data_packet = trace_create_event("data_pkt", /* ... */);

        /* Event to dump a driver structure */
        driver_data = trace_create_event("dvr_data", /* ... */);

        /* ... */
}

Next, let’s add trace hooks to dump received packets and data structures when the driver reads data from the device. This is done in Listing Four in the device driver read() entry point.
LISTING FOUR: Getting Trace Dumps

struct mydriver_data driver_data; /* Private device structure */

/* Driver read entry point */
ssize_t
mydriver_read(struct file *file, char *buf,
              size_t count, loff_t *ppos)
{
        char *buffer;

        /* Read numbytes bytes of data from the device into buffer */

        /* ... */

        /* Dump the data packet. If you see the problem only
           under certain conditions, say, when the packet length is
           greater than a value, use that as a filter */
        if (condition) {
                /* See Listing Three for the definition of data_packet */
                trace_raw_event(data_packet, numbytes, buffer);
        }

        /* Dump driver data structures */
        if (some_other_condition) {
                /* See Listing Three for the definition of driver_data */
                trace_raw_event(driver_data, sizeof(driver_data), &driver_data);
        }

        /* ... */
}

Compile and run this code as part of the kernel or as a module. Remember to turn on trace support in the kernel by setting CONFIG_TRACE while configuring the kernel.
The next step is to start the trace daemon.
$ tracedaemon -ts60 /dev/tracer mylog.txt mylog.proc
/dev/tracer is the interface used by the trace daemon to access the trace buffer; -ts60 asks the daemon to run for 60 seconds; mylog.txt is the file where you want to store the generated raw trace; and mylog.proc is where you want to save the system state obtained from /proc. Current versions of LTT use a mechanism called relayfs instead of the /dev/tracer device for efficiently transferring data from the kernel trace buffer to user space.
Now run the application that reads the (corrupt) data from the device.
$ ./application
mylog.txt should now be populated with the requested traces. The generated raw trace can be analyzed using the tracevisualizer tool. Choose the Custom Events option and search for data_pkt and dvr_data events. The output will look like Figure Four.
FIGURE FOUR: The output of tracevisualizer
Event Time SECS MICROSEC PID Length Description
data_pkt 1,110,563,008,742,457 0 27 12 43 AB AC 00 01 0D 56
data_pkt 1,110,563,008,743,151 0 27 01 D4 73 F1 0A CB DD 06
dvr_data 1,110,563,008,743,684 0 25 0D EF 97 1A 3D 4C
The traced data is seen in the last column. The time stamp shows the instant when the trace was collected. If the data looks corrupt, the device itself could be sending bad data. Otherwise, the problem must be in a higher kernel layer and can be further isolated by obtaining traces from a different point in the data flow path.
If you want to perform limited tracing of a single Linux application, you can use the strace utility instead of LTT. strace uses the ptrace support in the kernel to intercept system calls. It prints out a list of system calls made by your application, along with the corresponding arguments and return values.

Sreekrishnan Venkateswaran has been working for IBM India since 1996. His recent Linux projects include putting Linux onto a wristwatch, a PDA, and a pacemaker programmer. You can reach Krishnan at krishhna@gmail.com.
