Intro to IO Profiling of Applications

One of the sorely missing aspects of storage is analyzing and understanding the IO patterns of applications. This article will examine some techniques for performing IO profiling of an application to illustrate what information you can gain.

One of the classic problems in designing storage solutions is that we don’t know what kind of IO performance applications need. We may have some empirical knowledge, “switching to SSD’s improved by wall clock time by 25%,” or, “adding an extra drive improved my application performance by 11%.” While this information is helpful to some degree what is missing is the understanding of why performance improved and by extension, what are the driving mechanisms behind the performance improvement. Perhaps equally important for the application developers is that understanding the driving forces of IO performance for your application can be used to improve the IO performance, if needed.

This article is an introduction to using a common Linux system administration tool – strace. Strace output from an application can be used to produce a huge amount of information about the IO behavior of your application.

Strace is your Friend

The goal behind IO profiling is to understand the application not necessarily to understand the storage system itself. The information is to be about the application – how does it perform writes, how does it perform reads, how much lseek happens, what kind of IOPS performance does the application require, etc. However, to obtain this information was can’t divorce ourselves completely from the storage system but we can at least gain something beyond an empirical knowledge.

The tool that we’re going to use is strace. This tool is commonly used for understanding .so libraries that are added during runtime as well as gathering other debugging information about misbehaving applications. What strace does is trace system calls and signals. Since the system is doing the IO strace can be used to trace what is happening with IO syscalls calls. Moreover, you can use strace to gather additional information about the system calls (syscall) such as how much time was spent performing the syscall. All of this information can be used to profile an application’s IO patterns.

Strace has a large number of options (just run “man strace”). Some of them are relevant to this discussion and most are not. The specific options used in this article are:

% strace -T -ttt -o strace.out 


The “-T” option cause strace to print out the elapsed time of the system call at the end of the line in the form, <0.000197>. This is amount of time used to complete the syscall. The option “-ttt” causes microsecond timing but does so in seconds since the epoch (makes life easier for applications that run a long time). The last option, “-o”, sends the output to a file – in this case “strace.out.”

Quick strace example

Let’s look at a quick example of using strace. Here is a simple C code that creates an array then dumps it to a file using the function fwrite().

#include 
#define BUFSIZE 100000
int main(int argc, int *argv[] )
{
   int i;
   float buf[BUFSIZE];
   char filename[128];
   FILE *myfile;

   for (i=0; i < BUFSIZE; i++)
      buf[i] = 2.5 + BUFSIZE * i;
   sprintf(filename, "testfile");
   myfile = fopen(filename, "w");
   fwrite(buf, sizeof(float), BUFSIZE, myfile);
   fclose(myfile);
}


The code takes an array of float variables and dumps the entire array to a file ("testfile") using the fwrite function.

The code is built using gcc 4.3.2 (Ubuntu 4.3.2-lubuntul2) with no options. The strace output using the options "-T -ttt" is the following:

1267394331.467245 execve("./test1c", ["./test1c"], [/* 37 vars */]) = 0 <0.000211>
1267394331.467773 brk(0)                = 0x951e000 <0.000012>
1267394331.467862 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) <0.000016>
1267394331.467973 mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb80d8000 <0.000014>
1267394331.468055 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) <0.000016>
1267394331.468137 open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000017>
1267394331.468212 fstat64(3, {st_mode=S_IFREG|0644, st_size=121820, ...}) = 0 <0.000012>
1267394331.468349 mmap2(NULL, 121820, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb80ba000 <0.000014>
1267394331.468405 close(3)              = 0 <0.000011>
1267394331.468480 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) <0.000016>
1267394331.468573 open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3 <0.000021>
1267394331.468655 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340g\1"..., 512) = 512 <0.000014>
1267394331.468750 fstat64(3, {st_mode=S_IFREG|0755, st_size=1425800, ...}) = 0 <0.000012>
1267394331.468873 mmap2(NULL, 1431152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7f5c000 <0.000014>
1267394331.468934 mmap2(0xb80b4000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x158) = 0xb80b4000 <0.000022>
1267394331.469009 mmap2(0xb80b7000, 9840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb80b7000 <0.000017>
1267394331.469078 close(3)              = 0 <0.000011>
1267394331.469152 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f5b000 <0.000014>
1267394331.469215 set_thread_area({entry_number:-1 -> 6, base_addr:0xb7f5b6b0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0 <0.000013>
1267394331.469382 mprotect(0xb80b4000, 8192, PROT_READ) = 0 <0.000016>
1267394331.469441 mprotect(0x8049000, 4096, PROT_READ) = 0 <0.000019>
1267394331.469506 mprotect(0xb80f5000, 4096, PROT_READ) = 0 <0.000014>
1267394331.469560 munmap(0xb80ba000, 121820) = 0 <0.000021>
1267394331.471100 brk(0)                = 0x951e000 <0.000012>
1267394331.471153 brk(0x953f000)        = 0x953f000 <0.000013>
1267394331.471224 open("testfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3 <0.000035>
1267394331.471325 fstat64(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000013>
1267394331.471446 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb80d7000 <0.000013>
1267394331.471509 write(3, "\0\0 @@Q\303G\240PCHP|\222HPP\303HP$\364H(|\22I(\346*I"..., 397312) = 397312 <0.001326>
1267394331.472913 write(3, "\0\25\240N\r\30\240N\33\33\240N(\36\240N5!\240NB$\240N"..., 2688) = 2688 <0.000031>
1267394331.473014 close(3)              = 0 <0.000014>
1267394331.473067 munmap(0xb80d7000, 4096) = 0 <0.000018>
1267394331.473133 exit_group(0)         = ?


Sorry about the line wraps but I think you can read the output well enough.

There are a few interesting things to note about the strace output. First, the open() syscall is used even for .so libraries (sharable object - dynamic link libraries). The lines 1 to 25 do a bunch of "setup" work to start the application, allocate memory, read and load any needed functions from any .so libraries. Finally, on line 26 we get to the point where our application starts to do some IO. Line 26 opens the file and assigns the file descriptor "unit" of 3. Then line 29 actually writes 397,312 bytes of data to the file and line 30 writes 2,688 bytes to the file. The sum of the two writes is 400,000 bytes which is what we expected given that the test system is 32 bits, we're using floats, and we have an array of 100,000 elements (4 bytes each). Finally, the application closes the file on line 31 and the application stops shortly after that line.

You can also see that for the write() syscalls, the first few bits of data that are being written inside the quotations in the strace output. For this example, the data is binary so the data doesn't much sense. But if the data is text or non-binary data, then you might be able to read the first part of the data. Strace does not list all of the data being written to the file. The very last number, after the equals sign, is the number of bytes actually written to the file descriptor. For this example, there are two write syscalls. The first one writes 397,312 bytes and the second writes 2,688 bytes. The total of the fwrite functions is 400,000 bytes (as it should be).

Another interesting observation, and one that I can't explain, is why the compiler forces the single fwrite command into two write() syscalls. Through some experimentation, if the number of float elements is set with anything up to 1,024 elements (4,096 bytes) then you will get a single write() syscall in the strace output. However, for 1,025 elements and up (at least to 1,000,000 elements), you will get two write() syscalls in the strace output. My only thoughts are that 4,096 Bytes (4KB) is the virtual memory pagesize as well as the sector size on the drive. I'm sure it has something to do with the virtual memory pagesize but I haven't been able to determine why (if you have any information please post it!).

Further Analysis of the strace Output

Comments on "Intro to IO Profiling of Applications"

dougalb

would you not consider \’oprofile\’ or \’systemtap\’ also effective tools for gathering IO inforamtion?

Reply
laytonjb

@dougalb,

Those are system profilers, not application profiles. I\’m after the ability to look at what the application wants/needs. I don\’t think either can do that.

I haven\’t used oprofile before on systems though. Looks interesting.

I saw some things on lwn.net that indicate systemtap is not a favorite of many kernel hackers. I think there is something new being developed.

Jeff

Reply
jsoda

the write is broken up into two calls because fwrite is a library call that executes its own logic before calling write(2). if you replace fopen/fwrite/fclose with open/write/close and update your code accordingly, i suspect you\’ll see one write.
page size writing alignments do not come into play at this level.

Reply
laytonjb

@jsoda,

Thanks for the insight! Any idea why fwrite splits the write?

Thanks!

Jeff

Reply
john.fusco

The -c option is useful for profiling as well. It doesn\’t give you elapsed time with the -T option, unfortunately. However, it does give you counts which is helpful.

I use it on running processes when I want to know what an unresponsive process is doing (if anything).

Reply
gordonmessmer

Your understanding of strace seems to be slightly flawed. It does not trace \”function calls\” in libraries. Instead it traces system calls to the kernel. If you wanted to trace library function calls you might use ltrace instead. That\’s an important distinction since there is a difference between the write() function call provided by glibc and the write() system call provided by the kernel.

The dual-write call is not a function of your compiler. Your request is passed as a single call to glibc\’s fwrite(), which then decides how to pass it to the kernel. It appears to pass one block of data which is composed of as many full filesystem blocks (gathered by the fstat() call) as possible, followed by any remaining data. Beyond that, I don\’t know why it doesn\’t pass the whole request to the kernel and let Linux figure it out.

You contend that there are no other tools for gathering this information, but that\’s not quite true. :)

As others have mentioned, Oprofile can give you a good idea of both what your application and your system are doing during its execution. You can also use tools like gprof and valgrind (with callgrind) to get \”real\” profiling information and use tools that are already available for graphing and visualizing the resulting data. Hopefully that will save you some time when writing the upcoming articles you mentioned at the end of your article.

I\’m very glad to see this written, and look forward to additional articles. I fully agree that profiling is important, but widely neglected.

Reply
laytonjb

@gordonmessmer

You are correct that my writing about strace was flawed. Believe it or not I actually understand strace but I didn\’t explain it well :) Thanks for pointing that out. I\’ve been so close to profiling for so long I just forgot how to explain it.

I didn\’t know about Oprofile until it was mentioned here. I will definitely give it a whirl to see what it can do for profiling applications.

Once again – thanks for pointing how my flawed explanation of strace. I will fix that ASAP (but it takes a while to propagate).

Thanks!

Jeff

Reply
jirihorky

Hi Jeff,
Nice article. I have written something similar with a GUI interface and ability to trace child processes too (-f switch). It also plots the access pattern diagram for each file. If you are interested, please check it out on http://code.google.com/p/ioapps/

Reply

Hmm. This article is a bit old, but I stumbled across it. Having written my own fwrite routine in 1982, I can make a good guess why glibc’s fwrite splits the write into two parts. First fwrite() is buffering writes, most likely in blocks of 4096 bytes. When you write enough data to fwrite, at some point (probably when it is more than 4096 bytes), it will pass that data directly to the system via a write() call, but it will up to the largest multiple of 4096 bytes. The rest of the data, in your case 2688 bytes, it keeps in its buffer waiting for more data. In your case, the fclose() flushes those remaining 2688 bytes to disk in a second call to the kernel.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>