IO Profiling of Applications: strace_analyzer

In the last couple of articles we have talked about using strace to help examine the IO profile of applications (including MPI applications; think HPC). But strace output can contain hundreds of thousands of lines. In this article we talk about the using a tool called strace_analyzer to help sift through the strace output.

Recall that I used the “-short” flag. If you don’t use this flag you will get considerably more output in this section. Using the “-short” flag is a good place to start and perhaps you can try leaving out that flag at some point to get more information.

The first section of this output group presents the distribution of the size of the write() syscalls. For example, the first line lists the number of write() syscalls that have more than 0KB and less than 1KB of data. In this case it’s 12,999 system calls.

The second section of the output group contains a statistical summary of the write system calls. It lists a great deal of information:


  • Total amount of data written in bytes (sum of all write() system calls)
  • Total number of write() syscalls
  • Average (mean) of the number of bytes per write() syscall (in this case it means the Arithmetic Mean
  • Standard Deviation of the number of bytes per write() syscall using the mean (in this case it’s the Sample Standard Deviation
  • Mean Absolute Deviation of the number of bytes per write() syscall (this is the Absolute Deviation using the mean)
  • The median number of bytes in a write() syscall (Definition of median)
  • The Median absolute deviation of the number of bytes in a write() syscall (this is the Absolute Deviation using the median)
  • The time for the slowest write() syscall in seconds
  • The location in the strace file of the slowest write() syscall (line number in the file)
  • The smallest number of bytes in a write() syscall
  • The largest number of bytes in a write() syscall

Notice that I’m using the decimal definition for Megabyte rather than the binary definition which would be listed as MiB. An upcoming option in strace_analyzer will allow you to flip between the two with decimal being the default.

The write output group is created for every process for Abaqus/Explicit (16 processes). Table 3 below contains the write() syscall “size” distribution for all 16 processes. The table is truncated for 128KB to 256KB because none of the processes had a write() syscall greater than 256KB.

Table 3 – Write syscall Interval Statistics

Process 0 < < 1KB 1KB < < 8KB 8KB < < 32KB 32KB < < 128KB 128KB < < 256KB
12419 12,999 61,049 368 0 0
12420 72 1,796 0 0 0
12421 72 1,808 0 0 0
12422 72 1,806 0 0 0
12423 72 1,808 0 0 0
12424 72 1,802 0 0 0
12425 72 1,796 0 0 0
12427 72 1,796 0 0 0
14297 72 1,796 0 0 0
14298 72 1,796 0 0 0
14299 72 1,806 0 0 0
14300 72 1,808 0 0 0
14301 72 1,806 0 0 0
14302 72 1,810 0 0 0
14303 72 1,796 0 0 0
14304 72 1,796 0 0 0

The idea that process 12419 is the rank-0 process doing all the IO for the application is further supported by this table. Process 12419 does an enormous number of write() syscalls less than 1KB relative to the other processes. The same is true for the other write syscall “size intervals”.

The second section of the write output group is summarized in Table 4 below.

Table 4 – Write Statistical Summary

Process Total Bytes written Mean Bytes per
syscall
Standard Deviation
(bytes)
Mean Absolute
Deviation
(bytes)
Median Bytes
per syscall
Median Absolute
Deviation (bytes)
Slowest Write
time (secs)
12419 134,122,801 1,804.76 1,137.07 1,016.65 2,048 473.42 0.0502
12420 7,362,795 3,941.54 771.62 3,222.57 4,096 154.46 0.0012
12421 7,411,947 3,942.53 769.25 3,225.41 4,096 153.48 0.0024
12422 7,403,755 3,942.36 769.64 3,224.94 4,096 153.54 0.0020
12423 7,411,947 3,942.53 769.25 3,225.41 4,096 153.48 0.0013
12424 7,387,371 3,942.03 770.43 3,223.99 4,096 153.97 0.0018
12425 7,362,795 3,941.54 771.62 3,222.57 4,096 154.46 0.0012
12427 7,362,795 3,941.54 771.62 3,222.57 4,096 154.46 0.0020
14297 7,362,790 3,941.54 771.63 3,222.56 4,096 154.46 0.0016
14298 7,362,789 3,941.54 771.64 3,222.46 4,096 154.46 0.0503
14299 7,403,757 3,942.36 769.64 3,224.94 4,096 153.64 0.0014
14300 7,411,947 3,942.36 769.54 3,225.41 4,096 153.48 0.0017
14301 7,403,748 3,942.36 769.66 3,224.92 4,096 153.64 0.0017
14302 7,420,139 3,942.69 768.86 3,225.99 4,096 153.31 0.0012
14303 7,362,792 3,941.54 771.64 3,222.56 4,096 154.46 0.0016
14304 7,362,792 3,941.54 771.64 3,222.56 4,096 154.46 0.0016

Again we can see that process 12419 is likely to be the rank-0 process since it writes 18 times more data than any other process. What is also interesting is that the average (mean) number of bytes per write() syscall is approximately 1.8KB. Many people expect HPC applications to do large writes but this particular application does a great deal of very small writes with average being just 1.8KB. Moreover the mean number of bytes per write() syscall is only 2,048 bytes (2 KiB).

The next output group is very similar to the write output group but covers read() syscalls. A sample of the output from strace_analyzer for this output group for process 12419 is below.

---------------------
-- Read Statistics --
--------------------- 

-- File sizes for read syscalls() -- 

IO Size Range         		     Number of syscalls
=======================================================
(  1)    0KB <   <    1KB		           3819
(  2)    1KB <   <    8KB		           2523
(  3)    8KB <   <   32KB		              7
(  4)   32KB <   <  128KB		              0
(  5)  128KB <   <  256KB		              0
(  6)  256KB <   <  512KB		              0
(  7)  512KB <   < 1000KB		              0
(  8) 1000KB <   <   10MB		              0
(  9)   10MB <   <  100MB		              0
( 10)  100MB <   <    1GB		              0
( 11)    1GB <   <   10GB		              0
( 12)   10GB <   <  100GB		              0
( 13)  100GB <   <    1TB		              0
( 14)    1TB <   <   10TB		              0 

-- READ SUMMARY --
Total number of Bytes read = 10,240,234 (10.240234  MB)
Number of Read syscalls =    6,349 

Average (mean) Bytes per syscall = 1,612.8892739014 (bytes) (0.0016128892739014  MB)
   Standard Deviation: 1,842.22279620697 (bytes) (0.00184222279620697  MB)
   Mean Absolute Deviation: 1,761.4177699287 (bytes) (0.0017614177699287  MB)
Median Bytes per syscall =         4,096 (bytes) (0.004096 MB)
   Median Absolute Deviation: 2,506.53740746574 (bytes) (0.00250653740746574  MB) 

Time for slowest read syscall (secs) = 0.000226
   Line location in file: 2382 

Smallest read syscall size: 0 (Bytes)
Largest read syscall size: 14336 (Bytes)

As you can see this group looks very similar to the write() output group.

Table 5 below presents the read syscall interval "size" summary which lists the number of read() syscalls that read data within the specified interval. As with the write syscall summary, no data is listed in the table for greater than a 256KB read() syscall because the count is zero.

Table 5 - Read syscall Interval Statistics

Process 0 < < 1KB 1KB < < 8KB 8KB < < 32KB 32KB < < 128KB 128KB < < 256KB
12419 3,819 2,523 7 0 0
12420 227 1,357 0 0 0
12421 227 1,354 0 0 0
12422 227 1,352 0 0 0
12423 227 1,354 0 0 0
12424 227 1,350 0 0 0
12425 227 1,354 0 0 0
12427 227 1,356 0 0 0
14297 232 1,356 0 0 0
14298 232 1,354 0 0 0
14299 232 1,354 0 0 0
14300 232 1,355 0 0 0
14301 232 1,353 0 0 0
14302 232 1,356 0 0 0
14303 232 1,356 0 0 0
14304 232 1,354 0 0 0

The suspected rank-0 process does about 17 times more read() syscalls with data larger than 0KB and smaller than 1KB. What is also interesting is that the other processes do a number of read() syscalls even though it appears they don't read or write much data.

Table 6 below lists the statistical read() syscall summary for all 16 processes. it is very similar to Table 4 but for read() syscalls.

Table 6 - Read Statistics Summary

Process Total Bytes Mean Bytes per
syscall
Standard Deviation
(bytes)
Mean Absolute
Deviation
(bytes)
Median Bytes
per syscall
Median Absolute
Deviation (bytes
Slowest Write
time (secs)
12419 10,240,234 1,612.89 1,842.22 1,761.42 4,096 2,506.547 0.0002
12420 5,611,044 3,542.33 1,349.55 2,500.97 4,096 553.97 0.0002
12421 5,598,755 3,541.27 1,350.61 2,499.74 4,096 554.73 0.0134
12422 5,590,563 3,540.57 1,351.33 2,498.94 4,096 555.43 0.0138
12423 5,598,756 3,541.28 1,350.61 2,499.75 4,096 554.72 0.0001
12424 5,582,372 3,539.87 1,352.04 2,498.12 4,096 556.13 0.0135
12425 5,598,756 3,541.28 1,350.61 2,499.75 4,096 554.72 0.0135
12427 5,606,948 3,541.98 1,349.90 2,500.57 4,096 554.02 0.0135
14297 5,607,058 3,530.89 1,362.13 2,488.24 4,096 565.11 0.0423
14298 5,598,865 3,530.18 1,362.84 2,487.43 4,096 565.82 0.1441
14299 5,598,866 3,530.18 1,362.84 2,487.43 4,096 565.82 0.1441
14300 5,602,962 3,530.54 1,362.48 2,487.83 4,096 565.46 0.0667
14301 5,594,769 3,529.82 1,363.19 2,487.02 4,096 566.18 0.0446
14302 5,607,058 3,530.89 1,362.13 2,488.24 4,096 565.11 0.0554
14303 5,607,058 3,530.89 1,362.13 2,488.24 4,096 565.11 0.0618
14304 5,598,866 3,530.18 1,362.84 2.487.43 4,096 565.82 0.0197

The suspected rank-0 process, process 12419, does about twice the total amount of reading as do the other processes. What is also interesting is that the average (mean) bytes per read() syscall is 1.6KB while for the other processes its about 3.5KB.

The next three output groups; close statistics, open statistics, and lseek statistics, are usually not very interesting when using the "-short" option. The output from the strace_analyzer for these output groups from process 12419 is presented below

----------------------
-- Close Statistics --
---------------------- 

-- CLOSE SUMMARY --
Total number of close syscalls = 297
Average time for close syscalls (secs) = 0.000006
Maximum Time for close syscalls (secs) = 0.024903
   Line location in file: 166719 

---------------------
-- Open Statistics --
--------------------- 

-- OPEN SUMMARY --
Total number of open syscalls = 287
Average time for open syscalls (secs) = 0.000000
Maximum Time for open syscalls (secs) = 0.000039
   Line location in file: 2351 

------------------------------------
-- lseek unit activity Statistics --
------------------------------------ 

unit 	Number of lseeks
========================
4   	               1
7   	               2
8   	               1
9   	             464
10  	               5
11  	            2706
12  	               1
14  	           16506
15  	               3

For both open() and close() syscalls the summary lists the total number of respective syscalls, the average (median) time for the close() and open() syscalls, the time for the slowest open() or close() syscall and where that syscall is located in the strace file.

The lseek summary is a little different. It lists the number of lseek calls by their file descriptor (fd). In the brief table the file descriptor is listed as a "unit" (guess that shows my Fortran roots).

The next to last output group is for the IOPS Statistics. This is a little more interesting because it examines three IOPS measures: (1) Write IOPS, (2) Read IOPS, and (3) Total IOPS. IOPS stands for the Input/Output Operations per Second - literally how many IO operations per second. The strace_analyzer computes these statistics by counting the number of write() syscalls, read syscalls(), and all IO syscalls (those that it tracks), in a given amount of time. For example, it computes Write IOPS by counting the number of write() syscalls in 1 second intervals during the run of the application. So it counts the number of write() syscalls in the first second of the application, then the second second, then the third second, and so on. The Read IOPS and Total IOPS are computed in a similar fashion.

The IOPS output group lists the maximum Write IOPS, Read IOPS, and Total IOPS, and during what second "time" interval it occurs. It also computes the Average Write IOPS, Average Read IOPS, and Average Total IOPS by counting all of the respective IO syscalls and dividing by the total run time of the application. A sample of the IOPS output group for process 12419 is shown below:

---------------------
-- IOPS Statistics --
--------------------- 

Maximum Write IOPS = 4339  occurred at 423 seconds
Average Write IOPS = 2752 

Maximum Read IOPS = 2652  occurred at 8 seconds
Average Read IOPS = 244 

Maximum Total IOPS = 10042  occurred at 9 seconds
Average Total IOPS = 3753 

*****************
Final IOPS report
Average Write IOPS = 2752
Average Read IOPS = 244
Average Total IOPS = 3753
*****************

You can see that the strace_analyzer lists the maximum IOPS (Write, Read, Total) as well as when they occurred along with the average IOPS (again for Write, Read, and Total).

A summary of the IOPS for the 16 processes is contained below in Table 7. It lists the maximum IOPS measures and when they occurred during the run as well as the average IOPS measures.

Table 7 - IOPS Statistics

Process Average Write
IOPS
Average Read
IOPS
Average Total
IOPS
Max Write
IOPS
Max Write
IOPS Time (secs)
Max Read
IOPS
Max Read
IOPS Time (secs)
Max Total
IOPS
Max Total
IOPS Time (secs)
12419 2,752 244 3,753 4,339 423 2,652 8 10,042 9
12420 311 396 1,213 898 10 1,335 9 2,676 9
12421 313 395 1,216 904 10 1,332 9 2,670 9
12422 313 394 1,215 903 10 1,330 9 2,666 9
12423 313 395 1,216 904 9 1,355 8 2,742 8
12424 312 394 1,213 901 10 1,328 9 2,662 9
12425 311 395 1,212 898 10 1,332 9 2,670 9
12427 311 395 1,213 898 10 1,334 9 2,674 9
14297 373 529 1,216 898 417 1,334 2 36,602 2
14298 467 528 1,458 899 2 1,332 2 1,458 2
14299 313 317 914 903 424 1,332 9 3,972 9
14300 313 396 1,219 904 9 1,333 8 2,672 8
14301 469 528 1,462 904 2 1,331 2 4,474 2
14302 376 397 1,220 906 8 1,334 8 4,484 8
14303 373 539 1,459 898 3 1,352 2 2,724 2
14304 467 528 1,458 899 2 1,332 2 4,466 2

The largest values for Write IOPS, Read IOPS, and Total IOPS and their averages all occurred with process 12419, the rank-0 process.

The final output group is a per file statistics section. It lists the file primary and distinctive files opened by the application that are not associated with communication such as sockets, the number of write bytes, the average write throughput, the number of read bytes, and the average read throughput. A simple of this output group is shown below (sorry for the width but strace_analyzer writes out the entire path for the file since that can be important).

Filename                                                  Read Bytes	Avg. Bytes/sec   Write Bytes	   Avg. Bytes/sec
=========================================================================================================================
/application/Simulia/benchmark/e5_dellamd_8core.prt                0	134,350,403.33     3,079,008	   384,279,539.18
/dev/infiniband/uverbs0                                            0	 71,436,679.29	       5,260	     1,304,000.23
/sys/class/infiniband_verbs/uverbs0/abi_version                    2	    816,798.94	           0	                0
/proc/cpuinfo                                                  6,248	  1,239,084.77	           0	                0
/etc/protocols                                                 4,096	143,908,378.00	           0	                0
/application/Simulia/benchmark/e5_dellamd_8core.use.1              0	292,571,428.57	       4,060	    11,938,059.90
/application/Simulia/benchmark/e5_dellamd_8core.pac.1      1,880,064    478,023,617.34	           0	                0
/application/Simulia/benchmark/e5_dellamd_8core.sel.1         12,288	379,259,259.26	       8,192	   463,238,095.24
/application/Simulia/benchmark/e5_dellamd_8core.abq.1      3,653,632	763,882,173.43     7,430,144	   398,649,798.81
/application/Simulia/benchmark/e5_dellamd_8core.sta                0	             0	       4,440	    11,999,119.00
/application/Simulia/benchmark/e5_dellamd_8core.msg.1              0	             0	         177	     6,321,428.57
/application/Simulia/benchmark/e5_dellamd_8core.odb        1,473,576	 82,406,285.85   123,591,248	   253,690,560.35
/proc/12441/status                                               765	725,553,148.58	           0	                0

From this listing we can tell that most of the reading takes place in three files:


  • /application/Simulia/benchmark/e5_dellamd_8core.pac.1 (1.88 MB)
  • /application/Simulia/benchmark/e5_dellamd_8core.abq.1 (3.65 MB)
  • /application/Simulia/benchmark/e5_dellamd_8core.odb (1.47 MB)

But most of the writing takes place in the files,

  • /application/Simulia/benchmark/e5_dellamd_8core.odb (123.59 MB)
  • /application/Simulia/benchmark/e5_dellamd_8core.abq.1 (7.43 MB)
  • /application/Simulia/benchmark/e5_dellamd_8core.prt (3.08 MB)

The /application directory is NFS mounted on all of the nodes. Hence the vast majority of the IO takes place over NFS.

The strace_analyzer also produces a large number of files for plotting if you use the "-dat" option. This information can be very useful because it can give you an idea of what the application is doing as a function of time. Since it's fairly obvious that process 12419 is the rank-0 process and does virtually all of the IO for the application and we want to keep this article smaller than a novel, I will plot only the results for process 12419.

The first plot I like to examine is the distribution of write() syscalls. Figure 1 below plots the size of the write() syscalls on the x-axis and the number of write() syscalls for that size on the y-axis.

write_syscall_distribution.png

Figure 1: Write syscall size distribution

This plots supplements the table in the write output group of the output. You can see that the vast majority of the write() syscalls are all below 4,000 bytes. But the peak occurs at around 2,000 bytes (actually it's 2,048 bytes).

People usually think HPC applications do large streaming writes but this plot shows that this HPC application does all of it's writes() in very small increments (2,048 bytes). If you couple this with the knowledge that there are a large number of lseek syscalls, it's fairly obvious that this application does not do large streaming writes.

We can create the same plot for read syscalls as in Figure 2 below.

read_syscall_distribution.png

Figure 2: Read syscall size distribution

As with the write syscalls, the read syscalls are all very small - smaller than about 4,000 bytes. The vast majority of them are all smaller than 2,000 but the large number of read system calls are right around 4,000 bytes (actually it's 4,096 bytes).

After examining the distribution of write and read syscall sizes, I tend to plot the time history of the application write() and read() syscalls. Actually I like to plot the write syscall size vs. time above a plot of the Write IOPS vs. time. This plot allows me to look for increased IOPS activity relative to the size of the data in the write syscall. Figure 3 below is this plot for process 12419.

write_syscall_iops.png

Figure 3: Write syscall size time history (top) and Write IOPS time history (bottom)

We can see that at regular intervals the application wrote small amounts of data (less than 2,000 bytes per syscall) but it also wrote larger amounts of data in syscalls that are just a bit larger than 14,000 bytes. These intervals are correlated with increased Write IOPS. The peak Write IOPS occurs near the end of the run but there is a secondary peak near the beginning (could be writing the initial data to the output file).

After examining the write syscall sizes, I also like to examine the write throughput vs. time relative to the Write IOPS. This type of plot can help explain the IO behavior a bit more. For example, if there is a great deal of throughput and little IOPS then the application may be doing large streaming writes (checking the previous plot will also help determine that). Or the plot can help explain why you are getting a large Write IOPS value without much throughput where perhaps there are small writes with a large number of lseeks in between. Figure 4 plots the write throughput time history in MB/s on top and the Write IOPS time history on the bottom.

write_throughput_iops.png

Figure 4: Write throughput time history in MB/s (top) and Write IOPS time history (bottom).

The write throughput is correlated to the Write IOPS and the peaks occur at regular intervals. The peaks never exceed 1.8GB/s which seems very high for a single disk using NFS over GigE but you have to remember that there are data caches at multiple levels in the system. The IO scheduler has a data cache, the file system has a cache, and the drives have a cache. So the 1.8GB/s may not be the actual performance to the drive. However, I am interested in the performance the application is experiencing. It is very important to note that this plot will be dependent upon the storage system, OS, distro, etc., that you use for running the actual application. This means that this plot will change if you run the application on a different system.

I also plot the cumulative write syscall size as a function of time. In general this plot is useful because it can tell you when most of the write IO occurs. Figure 5 below plots the cumulative write syscall size in MB versus time. Please note that this is the cumulative data for all files used in the application. So it may not match the size of the files.

write_cumulative_syscall.png

Figure 5: Cumulative Write syscall history.

The application is writing small amounts of data at periodic intervals. The amount of data is fairly small but you can see that it happens at regular intervals. These regular intervals make sense since the application is simulating the time history of a dynamic event. So it will write out the state of the object being simulated at fixed time intervals.

We can do the same thing for reads as we just did for writes. The same plot as Figure 3 can be created for read syscalls. This is shown below in Figure 6.

read_syscall_iops.png

Figure 6: Read syscall and IOPS history.

This plot is a little different than the write plot (Figure 3). The application does most of the reading at the beginning of the run. This makes sense since the application has to load the input data before running. After the first few seconds of the application's run time, we can see that it does small reads at periodic intervals, closely matching the write intervals. These intervals also match the increased Read IOPS, although the values are very small (the maximum Read IOPS occurs at the beginning as one would expect).

The read throughput and Read IOPS are shown in Figure 7 below.

read_throughput_iops.png

Figure 7: Read throughput and IOPS history.

The peak read throughput occurs near the beginning of the run as expected. The peak read throughput is about 1.4 GB/s. Again we used a single drive over NFS with GigE so 1.4 GB/s is a bit unreasonable indicating that again caching effects are involved (perhaps from the IO scheduler). After the initial read, the application reads at specified intervals. This intervals corresponds to small jumps in the Read IOPS as show in the bottom plot of Figure 7.

The cumulative read plot is show in Figure 8. This plot is the cumulative read total for all files.

read_cumulative_syscall.png

Figure 8: Cumulative Read syscall history.

From the plot we can see that almost all of the reading happens at the beginning of the application. The reading that happens at certain intervals as show in Figures 6 and 7, also show here, but don't add much to total number of bytes read.

Strace_analyzer produces some additional data files for plotting of the file offset. The data file contains the location of the file pointer relative to the start of the file in bytes (this is the file offset) as a function of time. This type of plot is very useful for examining the access pattern of the application. Does the application access the data sequentially? Or does it do something different such as reading backwards and then writing ahead?

Figure 9 plots the file offset for the file e5_dellamd_8core.odb. This file is the primary output file for the application. Since the rank-0 process (12419) does the IO for the other processes, it makes sense to plot the offset for this file on process 12419.

offset_odb.png

Figure 9: File Offset for e5_dellamd_8core.odb.

What is very interesting in this figure is that at certain intervals the file offset is moved. It appears that the file offset is moved to the beginning of the file via lseek() syscalls (most likely) and moved forward either through reading, writing, or lseek syscalls. So at specified intervals the application lseeks to the beginning of the file and then either reads, writes, lseeks, or some combination of them to near the end of the file where data is written to the file.

While IO is not a big part of the execution of this application this access pattern does not help throughput. As time progresses, the size of the output file grows. Moving the file offset to the beginning pretty much guarantees that the data that is in the disk cache, the VFS cache, the IO scheduler cache, and the file system cache is worthless. This can be detrimental to performance.

Summary Observations of Abaqus/Explicit

There have been a number of tables and plots of the application as well as some comments. I wanted to pull everything together to give a summary of what we learned about this application.


  • The application is definitely not IO limited
  • It appears that the rank-0 process does all of the IO for the application
  • The application reads and writes data at certain intervals. During these intervals the application lseeks to the beginning of the file and the moves the file offset to where it writes data. These intervals correspond to increased in Write IOPS, Read IOPS, and Total IOPS.

One of the curious things I didn't comment on earlier was that if the the rank-0 process is doing all of the IO, why do the other processes read and write data? We can gain some insight by looking at the file statistics output group for one of the non-zero processes. For example, here is the listing for process 14297 (again, my apologies for the width).

Filename                                              Read Bytes   Avg. Bytes/sec   Write Bytes	   Avg. Bytes/sec
=================================================================================================================
/proc/14317/status                                           766   138,766,889.62	      0	                0
/dev/infiniband/uverbs0                                        0    65,677,860.45	  5,260	     1,332,205.04
/sys/class/infiniband_verbs/uverbs0/abi_version                2       896,428.57	      0	                0
/proc/cpuinfo                                              6,248     1,268,732.18	      0	                0
/application/Simulia/benchmark/e5_dellamd_8core.use.9          0    64,269,784.08	  1,007	    47,999,342.28
/application/Simulia/benchmark/e5_dellamd_8core.pac.9  1,884,160   495,357,157.13	      0	                0
/application/Simulia/benchmark/e5_dellamd_8core.sel.9     12,288   395,114,845.94	  8,192	   497,371,428.57
/application/Simulia/benchmark/e5_dellamd_8core.abq.9  3,612,672   687,962,921.56     7,348,224	   537,920,980.74
/application/Simulia/benchmark/e5_dellamd_8core.msg.9          0	        0	    107	    11,888,888.89

From the list of files you see that this process is actually reading and writing to files which look to be specific to that process.


  • /application/Simulia/benchmark/e5_dellamd_8core.use.9
  • /application/Simulia/benchmark/e5_dellamd_8core.pac.9
  • /application/Simulia/benchmark/e5_dellamd_8core.sel.9
  • /application/Simulia/benchmark/e5_dellamd_8core.abq.9
  • /application/Simulia/benchmark/e5_dellamd_8core.msg.9

But you also see that it reads other files such as those in the /proc file system as well as the /dev file system (related to the InfiniBand network). So while the rank-0 process does most of the IO for the application, the other processes read and write to their own files.

Another interesting observation that one can make is that the number of open() syscalls do not match the number of close() syscalls? The explanation is fairly easy - sometimes the applications don't actually close the files they use.

Work in Progress Comments

The strace_analyzer is a work in progress and is being applied to other examples. Consequently, you will see lines of code commented out that have been used to debug the application (I loathe debuggers so I use print() statements). Over time these lines will disappear.

A current limitation of the application is that all the data structures contained the information from the strace file are stored in memory. So if your strace file is too large or has too many IO syscalls, then you can cause the application to swap. In the future, a database version of the application will be available so that much larger strace files can be examined (the largest file in this example had about 1.2 million lines and the system used for the analysis have 3GB of memory and the OS didn't swap).

If the strace files are getting too large you can use an additional option with strace to only capture the syscalls of interest. For example:

strace -T -ttt -e trace=open,close,read,write,lseek,lseek64,mkdir -o file.out 

The additional option "-e" allows you to specify which syscalls you want to track eliminating syscalls that don't have anything to do with IO and reducing the size of the strace files.

Summary

This article, while a bit long, hopefully showed you that there is a way to process large strace files to examine IO behavior of applications. A simple application called strace_analyzer was written to do precisely that - examine strace files for IO syscalls and develop statistics around those syscalls. The application is a work in progress and it's still very early, but it has proven to be useful in several projects. The application is released under GPL v2 so you are free to use it as you like. If you have patches or comments, please send them along.

The strace_analyzer was applied to an MPI based HPC application in this article to illustrate what could be done. Using the data produced by the strace_analyzer we could examine lots of details about the IO pattern of the application. Plus we were able to plot the behavior of the IO patterns over time (a picture is worth a thousand words type of thing).

One final comment about IO profiling of applications - using strace to examine the IO syscalls in an application does change the timing behavior of the application. There is an overhead associated with using strace that changes the timings of the application. Moreover, the strace files have to be written to a file system somewhere, usually separate from the one being used by the application. This too can have an impact on the application behavior. However, strace is easy to use, comes with all distros, produces a huge amount of information, and is very useful for understanding how the application is performing IO. So it is something of a tradeoff; gaining insight into the IO behavior of the application while perhaps slightly altering it's timings. My personal opinion is that this is a worthwhile trade.

Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/linux-mag.com/site/www/htdocs/wp-content/themes/linuxmag/single.php on line 62