IO Profiling of Applications: strace_analyzer

In the last couple of articles we have talked about using strace to help examine the IO profile of applications (including MPI applications; think HPC). But strace output can contain hundreds of thousands of lines. In this article we talk about using a tool called strace_analyzer to help sift through the strace output.

Strace is a very useful tool for examining the IO profile of applications, and it comes standard on every Linux distro. However, as we’ll see in this article, strace can produce hundreds of thousands of lines of output. Trying to develop statistics and trends from files of this size is virtually impossible to do by hand.

In this article, we will take a look at a tool that performs a statistical analysis of strace output: strace_analyzer. This tool can take an individual strace file that has been created with the “-T -ttt” options and produce a statistical analysis of the IO portion of the strace. It also produces .dat and .csv (comma-delimited, for spreadsheets) files that can be used for plotting.
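To make the input format concrete: each line of output produced with “-T -ttt” carries an absolute timestamp at the start and the elapsed syscall time in angle brackets at the end. Here is a minimal Python sketch (illustrative only; strace_analyzer itself is written in Perl) that pulls those fields out of one line:

```python
import re

# Matches a line like (hypothetical example):
#   1262304000.123456 write(3, "x"..., 2048) = 2048 <0.000041>
# "-ttt" supplies the leading epoch timestamp; "-T" supplies the
# elapsed time in angle brackets at the end of the line.
LINE_RE = re.compile(
    r'^(?P<ts>\d+\.\d+)\s+'            # absolute timestamp (-ttt)
    r'(?P<call>\w+)\((?P<args>.*)\)'   # syscall name and arguments
    r'\s*=\s*(?P<ret>-?\d+)'           # return value
    r'.*<(?P<dur>\d+\.\d+)>'           # elapsed time (-T)
)

def parse_line(line):
    """Return (timestamp, syscall, return value, duration) or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return (float(m.group('ts')), m.group('call'),
            int(m.group('ret')), float(m.group('dur')))

sample = '1262304000.123456 write(3, "x"..., 2048) = 2048 <0.000041>'
result = parse_line(sample)
```

Summing the durations of all matched write() lines, for example, gives the total time the process spent in write() syscalls.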

Using the strace_analyzer

strace_analyzer is a Perl app that parses through the strace file looking for the following IO functions:

  • open
  • read
  • write
  • close
  • lseek
  • llseek
  • lseek64
  • stat
  • stat64
  • fstat
  • chmod
  • access
  • rename
  • mkdir
  • getdents
  • fcntl
  • unlink
  • fseek
  • rewind
  • ftell
  • fgetpos
  • fsetpos
  • fclose
  • fsync
  • creat
  • readdir
  • rewinddir
  • scandir
  • telldir
  • flock
  • lockf

Obviously this list does not include all IO functions but it is a work in progress.

The first step in using strace_analyzer is to create an strace output file using the “-T -ttt” options. I also used the “-o” option to send the data to a file. Then you can run the strace_analyzer to process the file with the following command:

./ -dat -short < strace.out > summary.out

There are several command-line options, with “-dat” and “-short” being used in the previous example. You can always type “./ -help” to get a listing of the options. For this example, “-dat” produces data files for plotting tools such as grace or simple Python scripts that use matplotlib. The “-short” option produces a shorter summary output file that doesn’t contain all the details of a particular IO function.

When you run the strace_analyzer it also produces a file called “strace_analyzer.log” that contains some additional information about the run, primarily for open() syscalls (more on that later). The code also produces a subdirectory called “RESULTS_DAT” that contains the .dat and .csv files. Depending upon the application there can be a large number of files, and some of them can be fairly large. For applications with a large number of IO functions, the .csv files sometimes cannot be loaded into a spreadsheet because there are too many rows. In addition, plotting the data in a spreadsheet can also be impossible because many spreadsheet programs limit the number of data points that can be plotted.
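When the .csv files are too big for a spreadsheet, the .dat files can instead be pulled into a short Python script. A minimal sketch (the file name under RESULTS_DAT would be one of the tool's output files; the sample file written here is just to make the sketch self-contained):

```python
# Write a tiny sample file so this sketch runs on its own; in practice
# you would point load_xy() at one of the files under RESULTS_DAT/
# (the exact file names depend on the traced application).
with open('sample.dat', 'w') as f:
    f.write('# syscall-number  bytes\n0 1024\n1 2048\n2 512\n')

def load_xy(path):
    """Read a two-column, whitespace-delimited .dat file."""
    xs, ys = [], []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) < 2 or parts[0].startswith('#'):
                continue  # skip comments, blanks, malformed lines
            xs.append(float(parts[0]))
            ys.append(float(parts[1]))
    return xs, ys

xs, ys = load_xy('sample.dat')

# Plotting is then a one-liner (requires matplotlib):
#   import matplotlib.pyplot as plt
#   plt.plot(xs, ys); plt.xlabel('syscall number'); plt.ylabel('bytes')
#   plt.savefig('write_sizes.png')
```

This avoids the spreadsheet row and data-point limits entirely, since matplotlib has no trouble with hundreds of thousands of points.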

The output from the strace_analyzer is divided into a few main groups.

  • Time statistics (e.g. how much wall clock time was spent doing IO, etc.)
  • IO syscall Count
  • Write Statistics
  • Read Statistics
  • Close Statistics
  • Open Statistics
  • lseek activity statistics
  • IOPS Statistics
  • File Statistics (on a per file basis)

Without the “-short” option you get a great deal more detail in each group. With the “-short” flag you get more of a summary.

Rather than walk through the strace_analyzer step by step and open myself to ridicule of my coding style, my choice of languages, my lack of regular expressions for parsing, and other assorted programming criticisms, let’s apply the strace_analyzer to an example so we can see how the tool is used and what we can learn by using it. By the way – I’m open to patches and criticism, but they should be constructive. If you don’t like my coding style, that’s fine, but help me understand why a different coding style is better. Otherwise, talk to the hand, because the ears aren’t listening (and the eyes aren’t reading and the fingers are moving the file to the trash).

Examining Abaqus/Explicit

The application I will be examining is called Abaqus/Explicit and is developed, supported, and marketed by Simulia. It is what is called a finite element program and is used for examining dynamic events. For example, it can be used to examine how a container of laundry detergent behaves when it’s dropped from a grocery shelf onto the floor, or what happens to a cell phone when it’s dropped onto a floor, or when a bullet penetrates the wing of an aircraft, or, an even more fun analysis, how a diaper behaves when it is “loaded” to capacity. Abaqus/Explicit is one of the leading applications for dynamic analysis and is a common HPC application.

For this example, the e5 data set will be used. e5 is a simple benchmark in which a stiffened steel plate is subjected to a high-intensity blast load. The application with the e5 benchmark was run under the auspices of the HPC Advisory Council, on the Osiris cluster that the Council uses: Dell SC 1435 servers with two quad-core AMD Opteron™ 2382 SE processors per node and a Mellanox ConnectX DDR InfiniBand fabric. The storage for the problem is fairly simple: the master node of the cluster had a simple SCSI hard drive that was exported via NFS over a GigE network to all of the compute nodes.

Abaqus/Explicit uses MPI so the application can run in parallel on multiple processors. For this example, it was run on two nodes with 8 cores each, for a total of 16 processes. Each MPI process created its own strace output file, and strace_analyzer was then run on each file. The tables and plots below are taken from the strace_analyzer output.

As mentioned previously, the first output group from the strace output is a timing summary. Table 1 below contains the Total Run Time in seconds of each of the 16 processes as well as the total time in seconds spent performing IO using the list of IO functions previously mentioned. The last column of data is the percentage of time the application spent doing IO relative to the total run time.

Table 1 – Time Statistics for the 16 Processes

Process   Total Run Time (secs)   IO Time (secs)   % Time for IO
12419 424.719 0.7889 0.185%
12420 425.292 0.0991 0.023%
12421 425.373 0.1222 0.028%
12422 425.433 0.1155 0.027%
12423 424.517 0.1023 0.024%
12424 425.291 0.1250 0.029%
12425 425.331 0.1293 0.030%
12427 425.291 0.1355 0.032%
14297 418.912 2.2275 0.532%
14298 418.827 2.3856 0.570%
14299 425.785 2.6418 0.620%
14300 425.112 2.3226 0.546%
14301 418.706 2.1369 0.510%
14302 424.769 2.1317 0.502%
14303 419.200 2.0271 0.486%
14304 418.868 1.8862 0.450%

Notice that the amount of time spent doing IO is very small compared to the total wall clock time. The largest amount of time is 2.64 seconds, or 0.62% of the total run time. It’s probably safe to say that this particular application is not IO bound.

The next output group is an IO syscall count. The output lists the IO syscalls strace_analyzer recognizes and how many times each was used during the run of the application. Table 2 below contains the IO syscall counts for all 16 processes.

Table 2 – IO syscall Count

Process access lseek fcntl stat unlink open close fstat read mkdir getdents write
12419 14 19,689 31 136 1 287 297 148 6,349 69 8 74,328
12420 5 3,150 5 78 0 241 243 103 1,584 0 8 1,870
12421 5 3,159 5 78 0 241 243 104 1,581 0 8 1,882
12422 5 3,155 5 78 0 241 243 104 1,579 0 8 1,880
12423 5 3,159 5 78 0 241 243 104 1,581 0 8 1,882
12424 5 3,149 5 78 0 241 243 104 1,577 0 8 1,876
12425 5 3,147 5 78 0 241 243 104 1,581 0 8 1,870
12427 5 3,149 5 78 0 241 243 104 1,583 0 8 1,870
14297 5 3,149 5 78 0 246 248 104 1,588 0 8 1,870
14298 5 3,147 5 78 0 246 248 104 1,586 0 8 1,870
14299 5 3,517 5 78 0 246 248 104 1,586 0 8 1,880
14300 5 3,160 5 78 0 246 248 104 1,587 0 8 1,882
14301 5 3,156 5 78 0 246 248 104 1,585 0 8 1,880
14302 5 3,163 5 78 0 246 248 104 1,588 0 8 1,884
14303 5 3,149 5 78 0 246 248 104 1,588 0 8 1,870
14304 5 3,147 5 78 0 246 248 104 1,586 0 8 1,870

This table gives us some interesting information. For example, the first process (12419) has about 6 times more lseek() syscalls than any other process. It also has about 4 times the read() syscalls of any other process. Perhaps more importantly, it has almost 40 times the number of write() syscalls of any other process.

In the world of MPI and parallel programs there are many ways to perform IO. One of the most classic methods is to have the first MPI process, called the rank-0 process, do all of the reading and writing for the application. The other MPI processes receive their data from, and send their data to, the rank-0 process. This approach ensures that the MPI processes don’t overwrite each other’s data.

From the results in Table 2 it appears that the first process (12419) is the rank-0 MPI process and it is doing the vast majority of the IO for the application. There is other output from strace_analyzer to help support this supposition as we’ll see later. But at this point it’s pretty obvious that process 12419 is the rank-0 process and is doing the vast majority of the reading and writing for the application.

If indeed process 12419 is the rank-0 process, then why are the other processes doing so many read() and write() syscalls? We’ll see more of that later and discover that read() and write() are very general syscalls that are used in a variety of ways: they operate on any file descriptor, including pipes and sockets, not just files.

Another interesting observation is that for process 12419, for every 3.77 write() syscalls there is an lseek() syscall. An lseek moves the file pointer within the file to a new location. Depending upon how far the file pointer is repositioned, any data stored in the drive cache, the VFS page cache, the IO scheduler cache, or any other cache may no longer be useful. Basically it hurts throughput (sometimes even kills it).
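The 3.77 figure falls straight out of the Table 2 counts for process 12419; a quick sanity check in Python:

```python
# write() and lseek() counts for process 12419, from Table 2.
writes = 74328
lseeks = 19689

ratio = writes / lseeks  # just under 3.8 write() calls per lseek()
print(f"{ratio:.2f} write() syscalls per lseek()")
```

Note that the other processes show the opposite pattern: with roughly 3,150 lseek() calls against roughly 1,870 write() calls, they seek more often than they write.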

The next output group contains the write syscall statistics. Below is a sample of the summary for process 12419:

-- Write Statistics --

-- File sizes for write() syscall --

IO Size Range         		     Number of syscalls
(  1)    0KB <   <    1KB		          12999
(  2)    1KB <   <    8KB		          61049
(  3)    8KB <   <   32KB		            268
(  4)   32KB <   <  128KB		              0
(  5)  128KB <   <  256KB		              0
(  6)  256KB <   <  512KB		              0
(  7)  512KB <   < 1000KB		              0
(  8) 1000KB <   <   10MB		              0
(  9)   10MB <   <  100MB		              0
( 10)  100MB <   <    1GB		              0
( 11)    1GB <   <   10GB		              0
( 12)   10GB <   <  100GB		              0
( 13)  100GB <   <    1TB		              0
( 14)    1TB <   <   10TB		              0 

Total number of Bytes written = 134,122,801 (134.122801  MB)
Number of Write syscalls =   74,328 

Average (mean) Bytes per syscall = 1,804.76345605253 (bytes) (0.00180476345605253  MB)
   Standard Deviation: 1,137.06808888182 (bytes) (0.00113706808888182  MB)
   Mean Absolute Deviation: 1,016.65380745704 (bytes) (0.00101665380745704  MB)
Median Bytes per syscall =         2,048 (bytes) (0.002048 MB)
   Median Absolute Deviation: 473.416424457721 (bytes) (0.000473416424457721  MB) 

Time for slowest write syscall (secs) = 0.050185
   Line location in file: 749638 

Smallest write syscall size: 1 (Bytes)
Largest write syscall size: 14336 (Bytes)
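As a rough sketch of how statistics like those above can be computed, here is some Python (not strace_analyzer’s actual code, which is Perl) that derives the same measures from a list of write() sizes:

```python
import statistics

def write_size_stats(sizes):
    """Mean, standard deviation, mean absolute deviation, median, and
    median absolute deviation for a list of write() sizes in bytes."""
    mean = statistics.fmean(sizes)
    # Population standard deviation; whether strace_analyzer uses the
    # population or the sample form is an assumption here.
    stdev = statistics.pstdev(sizes)
    mad = sum(abs(s - mean) for s in sizes) / len(sizes)
    med = statistics.median(sizes)
    medad = statistics.median(abs(s - med) for s in sizes)
    return mean, stdev, mad, med, medad

# Hypothetical write() sizes, in bytes:
mean, stdev, mad, med, medad = write_size_stats([1, 2048, 2048, 4096])
print(mean, med)
```

The gap between the mean (about 1,805 bytes) and the median (2,048 bytes) in the output above is itself informative: the distribution of write sizes is skewed by a large number of very small writes.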

Comments on "IO Profiling of Applications: strace_analyzer"


I really enjoyed that read: what a useful tool. I will definitely be using this from now on. Many thanks!



Very nice tool! I have recently written a very similar tool to do IO profiling based on strace. It is a GUI application with a core written in C (for speed) and the GUI in Python. The IOprofiler has integrated histogram and access-pattern plotting. However, it doesn’t (yet) provide all the statistics this tool can.

Check it out here, if you are interested:

It can also replay all the traces to become a standalone benchmark with the same IO workload pattern as your original application.

Jeff, I am sorry I did not continue on your code, but at least the ioreplay part had to be written in C (and thus all the file descriptor handling as well). I’ve also integrated some more logic for file descriptor-to-file-name mapping.

Jiri Horky


Dear Jeffrey

Can I get a copy of the Python version, please?



> Versions and Source for Python
> Coming soon – if you need it now, please contact me.


Thank you for sharing!

