In the last couple of articles, we have talked about using strace to help examine the IO profile of applications (including MPI applications; think HPC). But strace output can contain hundreds of thousands of lines. In this article, we talk about using a tool called strace_analyzer to help sift through the strace output.
Strace is a very useful tool for examining the IO profile of applications, and it comes standard on every Linux distro. However, as we’ll see in this article, strace can produce hundreds of thousands of lines of output. Trying to develop statistics and spot trends in files of this size is virtually impossible to do by hand.
In this article, we will take a look at a tool that does a statistical analysis of strace output: strace_analyzer. This tool takes an individual strace file that has been created with the “-T -ttt” options and produces a statistical analysis of the IO portion of the trace. It also produces data files and .csv files (comma-delimited files that can be read by spreadsheets) that can be used for plotting.
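With those two options, each line of the trace carries the wall clock time at which the syscall was issued (from “-ttt”) and the time spent in the call (from “-T”), so a line of strace output looks roughly like the following (an illustrative line, not one taken from the runs discussed below):

1282057661.329598 write(3, "..."..., 2048) = 2048 <0.000043>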
Using the strace_analyzer
strace_analyzer is a Perl script that parses the strace file looking for specific IO functions; a rough sketch of this kind of parsing, in Python, appears after the list. The IO functions it looks for are:
- open
- read
- write
- close
- lseek
- llseek
- lseek64
- stat
- stat64
- fstat
- chmod
- access
- rename
- mkdir
- getdents
- fcntl
- unlink
- fseek
- rewind
- ftell
- fgetpos
- fsetpos
- fclose
- fsync
- creat
- readdir
- rewinddir
- scandir
- telldir
- flock
- lockf
Obviously this list does not include every IO function, but the tool is a work in progress.
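To give a sense of what strace_analyzer has to do, here is a minimal sketch, in Python rather than Perl, of tallying a handful of these syscalls in an strace output file. This is not strace_analyzer itself, just an illustration of the kind of parsing involved; the file name strace.out is only a placeholder.

# count_syscalls.py - rough sketch of tallying IO syscalls in an strace file
# (illustrative only; strace_analyzer is a Perl script with far more logic)
import re
import sys
from collections import Counter

# With "strace -T -ttt -o strace.out", each line starts with a timestamp,
# then the syscall name and its arguments, e.g.
#   1282057661.329598 write(3, "...", 2048) = 2048 <0.000043>
line_re = re.compile(r'^\d+\.\d+\s+(\w+)\(')

io_calls = {'open', 'read', 'write', 'close', 'lseek', 'stat', 'fstat',
            'access', 'unlink', 'mkdir', 'getdents', 'fcntl'}

counts = Counter()
with open(sys.argv[1] if len(sys.argv) > 1 else 'strace.out') as f:
    for line in f:
        m = line_re.match(line)
        if m and m.group(1) in io_calls:
            counts[m.group(1)] += 1

for name, n in counts.most_common():
    print('%-10s %d' % (name, n))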
The first step in using strace_analyzer is to create an strace output file using the “-T -ttt” options; I also used the “-o” option to send the data to a file. Then you can run strace_analyzer to process the file with the following command:
./strace_analyzer_ng_0.08.pl -dat -short < strace.out > summary.out
There are several command line options, with “-dat” and “-short” being used in the previous example. You can always type “./strace_analyzer_ng_0.08.pl -help” to get a listing of the options. For this example, “-dat” produces data files for plotting tools such as grace or simple Python scripts that use matplotlib. The “-short” option produces a shorter summary output file that doesn’t contain all the details for each IO function.
When you run strace_analyzer it also produces a file called “strace_analyzer.log” that contains some additional information about the run, primarily for open() syscalls (more on that later). The code also produces a subdirectory called “RESULTS_DAT” that contains the .dat and .csv files. Depending upon the application there can be a large number of files, and some of them can be fairly large. For applications that issue a large number of IO syscalls, the .csv files sometimes cannot be loaded into a spreadsheet because they have too many rows. In addition, trying to plot the data in a spreadsheet can also be impossible because many spreadsheet programs limit the number of data points that can be plotted.
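If a spreadsheet chokes on the larger files, a few lines of matplotlib will do the job instead. The sketch below is just an example of the approach: it assumes a hypothetical two-column file RESULTS_DAT/write_sizes.dat (elapsed time and bytes written per call); the actual file names and column layouts in RESULTS_DAT vary, so adjust to match what the tool generates for your run.

# plot_dat.py - quick plot of one of the RESULTS_DAT files with matplotlib
# (the file name and column layout below are assumptions; check your own output)
import matplotlib.pyplot as plt

times, sizes = [], []
with open('RESULTS_DAT/write_sizes.dat') as f:   # hypothetical file name
    for line in f:
        if not line.strip() or line.startswith('#'):
            continue
        cols = line.split()
        times.append(float(cols[0]))
        sizes.append(float(cols[1]))

plt.plot(times, sizes, '.')
plt.xlabel('Elapsed time (s)')
plt.ylabel('write() size (bytes)')
plt.title('write() sizes over time')
plt.savefig('write_sizes.png')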
The output from the strace_analyzer is divided into a few main groups.
- Time statistics (e.g. how much wall clock time was spent doing IO, etc.)
- IO syscall Count
- Write Statistics
- Read Statistics
- Close Statistics
- Open Statistics
- lseek activity statistics
- IOPS Statistics
- File Statistics (on a per file basis)
Without the “-short” option you get a great deal more detail in each group; with the “-short” flag you get more of a summary.
Rather than walk through the strace_analyzer code step by step and open myself up to ridicule of my coding style, my choice of language, my lack of regular expressions for parsing, and other assorted programming criticisms, let’s apply strace_analyzer to an example so we can see how the tool is used and what we can learn by using it. By the way, I’m open to patches and criticism, but they should be constructive. If you don’t like my coding style, that’s fine, but help me understand why a different coding style is better. Otherwise, talk to the hand, because the ears aren’t listening (and the eyes aren’t reading and the fingers are moving the file to the trash).
Examining Abaqus/Explicit
The application I will be examining is called Abaqus/Explicit and is developed, supported, and marketed by Simulia. It is a finite element program used for examining dynamic events. For example, it can be used to examine how a container of laundry detergent behaves when it’s dropped from a grocery shelf onto the floor, what happens to a cell phone when it’s dropped onto a floor, what happens when a bullet penetrates the wing of an aircraft, or, for an even more fun analysis, how a diaper behaves when it is “loaded” to capacity. Abaqus/Explicit is one of the leading applications for dynamic analysis and it is a common HPC application.
For this example, the e5 data set will be used. e5 is a simple benchmark in which a stiffened steel plate is subjected to a high intensity blast load. The e5 benchmark was run under the auspices of the HPC Advisory Council on its Osiris cluster, which consists of Dell SC 1435 servers with two Quad-Core AMD Opteron 2382 SE processors per node and a Mellanox ConnectX DDR InfiniBand fabric. The storage for the problem is fairly simple: the master node of the cluster had a single SCSI hard drive that was exported via NFS over a GigE network to all of the compute nodes.
Abaqus/Explicit uses MPI so the application can run in parallel on multiple processors. For this example, it was run on two nodes with 8 cores each for a total of 16 processes. Each MPI process created its own strace output file, and strace_analyzer was then run on each process’s output. The tables and plots below come from that strace_analyzer output.
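There are several ways to get a separate strace file per MPI process; one simple approach (a sketch of the general idea, not necessarily how the runs above were traced) is to launch each rank through a small wrapper that execs strace with a per-process output file:

#!/usr/bin/env python
# trace_wrapper.py - hypothetical per-rank wrapper, used as:
#   mpirun -np 16 ./trace_wrapper.py ./my_app arg1 arg2 ...
import os
import sys

out = 'strace.%d.out' % os.getpid()              # one strace file per MPI process
cmd = ['strace', '-T', '-ttt', '-o', out] + sys.argv[1:]
os.execvp('strace', cmd)                         # replace the wrapper with strace

strace_analyzer is then run separately on each of the resulting strace.<pid>.out files.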
As mentioned previously, the first output group from strace_analyzer is a timing summary. Table 1 below contains the total run time in seconds of each of the 16 processes as well as the total time in seconds spent performing IO using the list of IO functions previously mentioned. The last column is the percentage of time the application spent doing IO relative to the total run time.
Table 1 – Time Statistics for the 16 Processes
| Process | Total Run Time (secs) | IO Time (secs) | % Time for IO |
|---------|-----------------------|----------------|---------------|
| 12419 | 424.719 | 0.7889 | 0.185% |
| 12420 | 425.292 | 0.0991 | 0.023% |
| 12421 | 425.373 | 0.1222 | 0.028% |
| 12422 | 425.433 | 0.1155 | 0.027% |
| 12423 | 424.517 | 0.1023 | 0.024% |
| 12424 | 425.291 | 0.1250 | 0.029% |
| 12425 | 425.331 | 0.1293 | 0.030% |
| 12427 | 425.291 | 0.1355 | 0.032% |
| 14297 | 418.912 | 2.2275 | 0.532% |
| 14298 | 418.827 | 2.3856 | 0.570% |
| 14299 | 425.785 | 2.6418 | 0.620% |
| 14300 | 425.112 | 2.3226 | 0.546% |
| 14301 | 418.706 | 2.1369 | 0.510% |
| 14302 | 424.769 | 2.1317 | 0.502% |
| 14303 | 419.200 | 2.0271 | 0.486% |
| 14304 | 418.868 | 1.8862 | 0.450% |
Notice that the amount of time spent doing IO is very small compared to the total wall clock time. The largest IO time for any process is about 2.64 seconds (process 14299), or 0.62% of its total run time (2.6418 / 425.785 ≈ 0.0062). It’s probably safe to say that this particular application is not IO bound.
The next output group is an IO syscall count. The output lists the IO syscalls that strace_analyzer recognizes and how many times each syscall was used during the run of the application. Table 2 below contains the IO syscall counts for all 16 processes.
Table 2 – IO syscall Count
| Process | access | lseek | fcntl | stat | unlink | open | close | fstat | read | mkdir | getdents | write |
|---------|--------|-------|-------|------|--------|------|-------|-------|------|-------|----------|-------|
| 12419 | 14 | 19,689 | 31 | 136 | 1 | 287 | 297 | 148 | 6,349 | 69 | 8 | 74,328 |
| 12420 | 5 | 3,150 | 5 | 78 | 0 | 241 | 243 | 103 | 1,584 | 0 | 8 | 1,870 |
| 12421 | 5 | 3,159 | 5 | 78 | 0 | 241 | 243 | 104 | 1,581 | 0 | 8 | 1,882 |
| 12422 | 5 | 3,155 | 5 | 78 | 0 | 241 | 243 | 104 | 1,579 | 0 | 8 | 1,880 |
| 12423 | 5 | 3,159 | 5 | 78 | 0 | 241 | 243 | 104 | 1,581 | 0 | 8 | 1,882 |
| 12424 | 5 | 3,149 | 5 | 78 | 0 | 241 | 243 | 104 | 1,577 | 0 | 8 | 1,876 |
| 12425 | 5 | 3,147 | 5 | 78 | 0 | 241 | 243 | 104 | 1,581 | 0 | 8 | 1,870 |
| 12427 | 5 | 3,149 | 5 | 78 | 0 | 241 | 243 | 104 | 1,583 | 0 | 8 | 1,870 |
| 14297 | 5 | 3,149 | 5 | 78 | 0 | 246 | 248 | 104 | 1,588 | 0 | 8 | 1,870 |
| 14298 | 5 | 3,147 | 5 | 78 | 0 | 246 | 248 | 104 | 1,586 | 0 | 8 | 1,870 |
| 14299 | 5 | 3,517 | 5 | 78 | 0 | 246 | 248 | 104 | 1,586 | 0 | 8 | 1,880 |
| 14300 | 5 | 3,160 | 5 | 78 | 0 | 246 | 248 | 104 | 1,587 | 0 | 8 | 1,882 |
| 14301 | 5 | 3,156 | 5 | 78 | 0 | 246 | 248 | 104 | 1,585 | 0 | 8 | 1,880 |
| 14302 | 5 | 3,163 | 5 | 78 | 0 | 246 | 248 | 104 | 1,588 | 0 | 8 | 1,884 |
| 14303 | 5 | 3,149 | 5 | 78 | 0 | 246 | 248 | 104 | 1,588 | 0 | 8 | 1,870 |
| 14304 | 5 | 3,147 | 5 | 78 | 0 | 246 | 248 | 104 | 1,586 | 0 | 8 | 1,870 |
This table gives us some interesting information. For example, the first process (12419) issues roughly 6 times as many lseek() syscalls as any other process. It also issues about 4 times as many read() syscalls as any other process. Perhaps more importantly, it issues roughly 40 times as many write() syscalls as any other process.
In the world of MPI and parallel programs there are many ways to perform IO. One of the most classic methods is to have the first MPI process, called the rank-0 process, do all of the reading and writing for the application. The other MPI processes send their data to, and receive their data from, the rank-0 process. This approach ensures that the MPI processes don’t overwrite each other’s data.
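Here is a minimal sketch of that rank-0 pattern, written with mpi4py purely for illustration (this is not how Abaqus/Explicit is implemented; it simply shows the pattern, and the file name results.dat is made up):

# rank0_io.py - sketch of the classic "rank 0 does all the IO" pattern
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# every rank computes its own piece of the result
local = 'result chunk from rank %d\n' % rank

# rank 0 gathers the pieces and is the only process that touches the file
chunks = comm.gather(local, root=0)
if rank == 0:
    with open('results.dat', 'w') as f:
        for chunk in chunks:
            f.write(chunk)

Run under mpirun, only the rank-0 process would show the write() traffic for results.dat in its strace output, which is the kind of imbalance Table 2 shows.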
From the results in Table 2 it appears that the first process (12419) is the rank-0 MPI process and that it is doing the vast majority of the reading and writing for the application. Other output from strace_analyzer, which we’ll see later, supports this supposition.
If process 12419 is indeed the rank-0 process, then why are the other processes making so many read() and write() syscalls? We’ll see more of that later and discover that read() and write() are very common syscalls that get used in a variety of ways.
Another interesting observation is that, for process 12419, there is one lseek() syscall for roughly every 3.77 write() syscalls. An lseek() moves the file pointer to a new location within the file. Depending upon how far the file pointer is repositioned, data already staged in the drive cache, the VFS page cache, the IO scheduler, or any other cache may no longer be useful because the access pattern is no longer sequential. Basically, frequent seeking hurts throughput (and sometimes kills it).
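For reference, the kind of application code that generates interleaved lseek() and write() syscalls is a seek-then-write loop; here is a rough Python illustration (the record size, record count, file name, and 1:1 lseek-to-write ratio are all made up for simplicity):

# seek_write.py - illustrates a write pattern that interleaves lseek() and write()
import os

RECORD_SIZE = 2048                               # made-up record size
records = [b'x' * RECORD_SIZE for _ in range(10)]

fd = os.open('out.dat', os.O_WRONLY | os.O_CREAT, 0o644)
for i, rec in enumerate(records):
    os.lseek(fd, i * RECORD_SIZE, os.SEEK_SET)   # shows up as lseek() in the strace
    os.write(fd, rec)                            # shows up as write()
os.close(fd)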
The next output group contains the write syscall statistics. Below is a sample of the summary for process 12419:
----------------------
-- Write Statistics --
----------------------
-- File sizes for write() syscall --
IO Size Range Number of syscalls
=======================================================
( 1) 0KB < < 1KB 12999
( 2) 1KB < < 8KB 61049
( 3) 8KB < < 32KB 268
( 4) 32KB < < 128KB 0
( 5) 128KB < < 256KB 0
( 6) 256KB < < 512KB 0
( 7) 512KB < < 1000KB 0
( 8) 1000KB < < 10MB 0
( 9) 10MB < < 100MB 0
( 10) 100MB < < 1GB 0
( 11) 1GB < < 10GB 0
( 12) 10GB < < 100GB 0
( 13) 100GB < < 1TB 0
( 14) 1TB < < 10TB 0
-- WRITE SUMMARY --
Total number of Bytes written = 134,122,801 (134.122801 MB)
Number of Write syscalls = 74,328
Average (mean) Bytes per syscall = 1,804.76345605253 (bytes) (0.00180476345605253 MB)
Standard Deviation: 1,137.06808888182 (bytes) (0.00113706808888182 MB)
Mean Absolute Deviation: 1,016.65380745704 (bytes) (0.00101665380745704 MB)
Median Bytes per syscall = 2,048 (bytes) (0.002048 MB)
Median Absolute Deviation: 473.416424457721 (bytes) (0.000473416424457721 MB)
Time for slowest write syscall (secs) = 0.050185
Line location in file: 749638
Smallest write syscall size: 1 (Bytes)
Largest write syscall size: 14336 (Bytes)
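As a quick sanity check on those numbers (my arithmetic, not part of the tool’s output): 134,122,801 bytes spread over 74,328 write() calls works out to about 1,805 bytes per call, which matches the reported mean, and the 2,048-byte median is consistent with the histogram above, where the vast majority of writes fall in the 1KB to 8KB bucket.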
Comments on "IO Profiling of Applications: strace_analyzer"
I really enjoyed that read: what a useful tool. I will definitely be using this from now on. Many thanks!
Hi,
very nice tool! I have recently written a very similar tool to do IO profiling based on strace. It is a GUI application with a core written in C (for speed) and the GUI in Python. The IOprofiler has integrated histogram and access pattern plotting; however, it doesn’t (yet) provide all the statistics this tool can.
Check it out here, if you are interested: http://code.google.com/p/ioapps/
It can also replay all the traces, which turns it into a standalone benchmark with the same IO workload pattern as your original application.
Jeff, I am sorry not to have continued on your code, but at least the ioreplay part had to be written in C (and thus all the file descriptor handling as well). I’ve also integrated some more logic for the file descriptor-to-file name mappings.
Jiri Horky
Dear Jeffrey
Can I get a copy of the Python version, please?
Regards,
Yonny
> Versions and Source for Python
> Coming soon – if you need it now, please contact me.