Intro to IO Profiling of Applications

One of the sorely missing aspects of storage work is analyzing and understanding the IO patterns of applications. This article examines some techniques for performing IO profiling of an application to illustrate the kind of information you can gain.
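As a quick reminder of how the trace data discussed below can be captured, a typical strace invocation looks something like the line below (the trace file name app.strace and the binary ./app are just placeholders). The -ttt option prints an absolute timestamp at the start of each syscall, -T appends the elapsed time of each syscall in angle brackets, and -o sends the trace to a file so it doesn't get mixed in with the application's own output.

strace -ttt -T -o app.strace ./app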

The really cool part of using strace is that you have a great deal of information at your fingertips. For example, you have the amount of data written or read in each syscall and the amount of time it took to perform that syscall. You can use this information to compute the throughput of each individual write() or read() system call. You can also sum the amount of data written or read to produce the total amount of data moved. In addition, you can count the number of read() and write() syscalls in a given amount of time to compute the Read IOPS or Write IOPS. Moreover, you can count the total number of IO syscalls to get the Total IOPS.

There is a great deal you can do with this information, including performing a statistical analysis of the results. You can look at the average and standard deviation of the sizes of the read() and write() syscalls. You can look at the distribution of those “sizes” (i.e. the amount of data in each syscall). You can examine histograms (results as a function of time) for reading, writing, IOPS, movement of the file offset (lseek), and so on. In other words, there are a huge number of things you can do with the data; a small sketch of how to start automating this kind of analysis is shown below.
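The short Python script below walks through a trace captured with strace -ttt -T and accumulates the byte counts and elapsed times of the read() and write() syscalls. The trace file name and the exact line format it expects are assumptions on my part (strace output varies a bit between versions), so treat it as a starting point rather than a finished tool.

# summarize_io.py - rough sketch: summarize read()/write() calls in a strace trace
# Assumes the trace was produced with something like: strace -ttt -T -o app.strace ./app
# and that IO lines look roughly like:
#   1267394331.467245 write(3, "..."..., 397312) = 397312 <0.001326>
import re
import sys

# timestamp, syscall name, bytes returned, elapsed time in <>
pattern = re.compile(r'^(\d+\.\d+)\s+(read|write)\(.*\)\s*=\s*(\d+)\s*<(\d+\.\d+)>')

stats = {'read':  {'bytes': 0, 'secs': 0.0, 'count': 0},
         'write': {'bytes': 0, 'secs': 0.0, 'count': 0}}

with open(sys.argv[1]) as trace:
    for line in trace:
        m = pattern.match(line)
        if not m:
            continue                      # skip non-IO or unparsed lines
        ts, name, nbytes, elapsed = m.groups()
        stats[name]['bytes'] += int(nbytes)
        stats[name]['secs']  += float(elapsed)
        stats[name]['count'] += 1

for name in ('read', 'write'):
    s = stats[name]
    if s['count'] == 0:
        continue
    avg  = float(s['bytes']) / s['count']
    tput = s['bytes'] / s['secs'] if s['secs'] > 0 else 0.0
    print("%s: %d calls, %d bytes total, %.1f bytes/call average, %.3f MB/s"
          % (name, s['count'], s['bytes'], avg, tput / 1.0e6))

Run against the example trace discussed below, a script like this should reproduce the per-call throughputs and the total amount of data written (397,312 + 2,688 = 400,000 bytes in this case) without any manual arithmetic.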

Let’s walk through the strace output and examine some of the statistics for the IO syscalls. First, let’s examine the first write() syscall. The elapsed time is 0.001326 seconds (1,326 microseconds) and the amount of data written was 397,312 bytes, so the throughput was:

397312 Bytes / 0.001326 secs. = 299,631,975.867 Bytes/s = 299.632 MB/s


Don’t forget that we are just measuring what the OS gives us through strace. We don’t have any way to measure what is going on inside the storage stack using strace (to my knowledge). It is very likely that what we are seeing is the amount of time it took to write the data (0.4 MB) to the system cache or the disk cache, not the time it took to write the data to the drive. But, perhaps most important, we are seeing the amount of time the application took to write the data. Since we are interested in the IO behavior from the perspective of the application, this amount of time is very pertinent to our investigation.

We can do the same computation for the second write():

2688 Bytes / 0.000031 secs = 86,709,677.4194 Bytes/s = 86.7097 MB/s

The total elapsed time for the application is also easy to compute. You just take the difference between the timestamp of the last syscall plus its elapsed time (line 33) and the timestamp of the first syscall (line 1). Since the last syscall did not report any elapsed time, it is simply the start time of the last syscall:

 1267394331.473133 - 1267394331.467245 = 0.005888 seconds (5,888 microseconds).

We can also compute the total amount of time spent in IO syscalls. This gets a little more involved since we need to add the times for the open(), fstat64(), write(), and close() syscalls. Skipping the open() syscalls for files that the application does not explicitly open or close results in the following:

0.000035 (line 26) + 0.000013 (line 27) + 0.001326 (line 29) + 0.000031 (line 30) + 0.000014 (line 31) = 0.001419 seconds (1419 microseconds)


We can then determine how much time was spent doing IO versus everything else:

(0.001419 / 0.005888) * 100 = 24.1%


So for this application, almost 25% of the time is spent performing IO. Since the application does nothing but IO, except for filling up the array, 25% is quite understandable.

The last thing I will examine in this article is the IOPS. The basic definition of Write IOPS is the number of write() syscalls in a given amount of time, usually measured as the number of write IO operations per second. The same is true for Read IOPS, except that it counts read() syscalls. Finally, the Total IOPS includes all IO syscalls.

The Write IOPS is fairly easy to compute. Since the application runs for less than 1 second, you just count the number of write() syscalls to get the Write IOPS.

Write IOPS = 2 IOPS


The Total IOPS is found in a similar manner, by counting the number of IO-related syscalls; a small script that automates the counting is sketched below the result.

Total IOPS = 5 IOPS
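If you would rather not count the syscalls by hand, a few more lines of Python will tally the IO-related syscalls in a trace. Again, the trace file name is my own placeholder, and the list of syscall names is just the set that shows up in this example (open, read, write, close, lseek, fstat64).

# count_io_syscalls.py - count the IO-related syscalls in a strace -ttt -T trace
import re
import sys
from collections import Counter

io_calls = ('open', 'read', 'write', 'close', 'lseek', 'fstat64')
counts = Counter()

with open(sys.argv[1]) as trace:
    for line in trace:
        m = re.match(r'^\d+\.\d+\s+(\w+)\(', line)   # timestamp, then the syscall name
        if m and m.group(1) in io_calls:
            counts[m.group(1)] += 1

for name, count in counts.items():
    print("%s: %d" % (name, count))
print("Total IO syscalls: %d" % sum(counts.values()))

For an application that runs longer than a second, you would divide these counts by the total elapsed time computed earlier to turn them into true per-second rates.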


This application is about as simple as you can get, so these IOPS values are very, very small.

Summary

Understanding of the IO profile or IO behavior of applications is almost non-existent, yet it is one of the most important things you need when designing a storage system. If you understand how your application(s) perform IO, then you can determine the performance requirements of your storage. How much capacity? How much throughput? How many IOPS? How much lseek capability? How much time is spent performing IO? And on and on. Empirical evidence such as “adding a disk improved throughput by 11%” can help, but it falls short of truly capturing the storage requirements of your application. Moreover, if the requirements aren’t to your liking (or your budget), and you own the application and have the source, you can modify the application to adjust the requirements.

In this article, strace is used as the mechanism to gather the IO syscall information from the application. Yes, running strace impacts the behavior of the application, and if you dump the strace output to the same file system being used by the application, you are impacting the timing behavior. But, to be bluntly honest, there are no other tools for gathering this IO syscall information (to my knowledge). In addition, strace does not require the application to be modified to use a separate library for tracing IO syscalls, nor does it require that a new library be installed to capture this information (which would also impact IO behavior). Finally, strace is easy to use and can produce some extremely useful information (as we saw).

This article is the first in a series examining the use of strace to gather a great deal of IO statistics about an application. It’s a quick introduction, but I hope it shows that you can generate a great deal of information from simple computations. The example application was trivial, but I wanted to show how you go about computing statistics and IO measurements such as throughput and IOPS.

In the next article, I will discuss how to use strace for MPI (Message Passing Interface) applications commonly encountered in HPC (High Performance Computing). Future articles will also present tools for processing the strace output files, since performing the math for several hundred thousand IO syscalls can get a little tedious (and time-consuming).

Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Fry’s enjoying the coffee and waiting for sales (but never during working hours).

Comments on "Intro to IO Profiling of Applications"

dougalb

Would you not consider ‘oprofile’ or ‘systemtap’ also effective tools for gathering IO information?

Reply
laytonjb

@dougalb,

Those are system profilers, not application profilers. I’m after the ability to look at what the application wants/needs. I don’t think either can do that.

I haven’t used oprofile before on systems though. Looks interesting.

I saw some things on lwn.net that indicate systemtap is not a favorite of many kernel hackers. I think there is something new being developed.

Jeff

Reply
jsoda

The write is broken up into two calls because fwrite is a library call that executes its own logic before calling write(2). If you replace fopen/fwrite/fclose with open/write/close and update your code accordingly, I suspect you’ll see one write.
Page-size writing alignments do not come into play at this level.

Reply
laytonjb

@jsoda,

Thanks for the insight! Any idea why fwrite splits the write?

Thanks!

Jeff

Reply
john.fusco

The -c option is useful for profiling as well. It doesn’t give you elapsed time with the -T option, unfortunately. However, it does give you counts, which is helpful.

I use it on running processes when I want to know what an unresponsive process is doing (if anything).

Reply
gordonmessmer

Your understanding of strace seems to be slightly flawed. It does not trace “function calls” in libraries. Instead it traces system calls to the kernel. If you wanted to trace library function calls you might use ltrace instead. That’s an important distinction since there is a difference between the write() function call provided by glibc and the write() system call provided by the kernel.

The dual-write call is not a function of your compiler. Your request is passed as a single call to glibc’s fwrite(), which then decides how to pass it to the kernel. It appears to pass one block of data which is composed of as many full filesystem blocks (gathered by the fstat() call) as possible, followed by any remaining data. Beyond that, I don’t know why it doesn’t pass the whole request to the kernel and let Linux figure it out.

You contend that there are no other tools for gathering this information, but that’s not quite true. :)

As others have mentioned, Oprofile can give you a good idea of both what your application and your system are doing during its execution. You can also use tools like gprof and valgrind (with callgrind) to get “real” profiling information and use tools that are already available for graphing and visualizing the resulting data. Hopefully that will save you some time when writing the upcoming articles you mentioned at the end of your article.

I’m very glad to see this written, and look forward to additional articles. I fully agree that profiling is important, but widely neglected.

Reply
laytonjb

@gordonmessmer

You are correct that my writing about strace was flawed. Believe it or not I actually understand strace but I didn’t explain it well :) Thanks for pointing that out. I’ve been so close to profiling for so long I just forgot how to explain it.

I didn’t know about Oprofile until it was mentioned here. I will definitely give it a whirl to see what it can do for profiling applications.

Once again, thanks for pointing out my flawed explanation of strace. I will fix that ASAP (but it takes a while to propagate).

Thanks!

Jeff

Reply
jirihorky

Hi Jeff,
Nice article. I have written something similar with a GUI interface and ability to trace child processes too (-f switch). It also plots the access pattern diagram for each file. If you are interested, please check it out on http://code.google.com/p/ioapps/

Reply

Hmm. This article is a bit old, but I stumbled across it. Having written my own fwrite routine in 1982, I can make a good guess why glibc’s fwrite splits the write into two parts. First, fwrite() buffers writes, most likely in blocks of 4096 bytes. When you write enough data to fwrite, at some point (probably when it is more than 4096 bytes), it will pass that data directly to the system via a write() call, but only up to the largest multiple of 4096 bytes. The rest of the data, in your case 2688 bytes, it keeps in its buffer waiting for more data. In your case, the fclose() flushes those remaining 2688 bytes to disk in a second call to the kernel.

Reply
