Using strace for Fun and Profit

New tool for taking the output from strace_analyzer and creating a simulation of the I/O pattern of your code on different storage hardware and file systems.

Using strace to look at the I/O pattern of codes, particularly MPI codes, makes my life a lot easier. I use strace to examine the bandwidth of the read and write functions performed during the execution of a code. I also look at other statistics in the strace output, such as the number of I/O function calls, the sizes of the I/O operations (reads and writes), and the fastest and slowest I/O function calls, all of which are fairly useful for analyzing the I/O pattern of codes.

But the tough part of doing this is taking the strace output, computing statistics, and looking for patterns. In this column I decided to tackle the task of writing an strace analyzer. I'm also going to start using it to examine the strace output of an ISV code.

The Basics of the strace Analyzer

There isn't room to explain all of the intricacies of the strace analyzer, which I will be calling strace_analyzer (sorry, but I'm not very creative when it comes to names). I'll talk about some of the basic pieces of strace_analyzer, but I can't go into the gory details given the length of the column. Plus, I want to actually exercise the analyzer on an ISV code.

Since the analyzer will be processing what is basically text, I chose Perl as the language. I'm not a Perl programmer, but with the help of Google and Joe Landman at Scalable Informatics, I managed to write an analyzer that produced the results in this column (plus, I wanted to learn some Perl anyway).

As I mentioned in a previous article, the way to get useful information out of strace is to use the following options:

/usr/bin/strace -tt -o /tmp/strace.out.$$ <path-to-code>/<executable> $@

The first option (-tt) gives you microsecond timings for the function calls, and the second option (-o) writes the output to a specific file. From the resulting output we can compute some useful statistics.
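
To give you an idea of what the raw data looks like, here are a few lines of the kind you will find in the output file (the timestamps, file name, and file descriptor are made up for illustration):

10:35:25.403223 open("output.dat", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
10:35:25.403675 write(3, "..."..., 400000) = 400000
10:35:25.410128 close(3) = 0

The timestamp at the start of each line is what -tt buys you, and the number after the equals sign is the return value of the call, which for read() and write() is the number of bytes actually transferred.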

What Should an Analyzer Produce?

Before I started writing strace_analyzer, I sat down and asked myself what I wanted to see. I decided to focus on these I/O function calls first.

List of I/O Function Calls of Interest

  1. read()
  2. write()
  3. close()
  4. open()
  5. llseek() and lseek()
  6. stat()

Based on these functions, here’s the list of information I wanted to see.

List of Output Information

  1. Read and write bandwidth (bytes/s and MB/s)
  2. Number of open() and close() function calls, as well as the rate at which they occur
  3. Slowest and fastest function calls (from the list above)
  4. Names of the files that were opened
  5. Number of function calls for each file

So the analyzer computes these various metrics for a given strace output file and writes the results to standard output (stdout).
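
To give you a flavor of what the analyzer does under the hood, here is a stripped-down Perl sketch. This is not strace_analyzer itself, just an illustration of the approach: it assumes strace was run with -tt, counts the I/O function calls from the list above, and totals the read() and write() bytes so it can report a rough average bandwidth over the run.

#!/usr/bin/perl
# Minimal sketch of strace parsing (an illustration, not the real tool).
# Assumes strace was run with -tt so each line starts with HH:MM:SS.usec.
use strict;
use warnings;

my %count;                  # number of calls for each I/O function
my %bytes;                  # total bytes moved by read() and write()
my ($first_t, $last_t);     # first and last timestamps seen (in seconds)

while (my $line = <>) {
    # A typical line:  14:05:07.123456 write(3, "..."..., 400000) = 400000
    next unless $line =~ /^(\d+):(\d+):(\d+\.\d+)\s+(\w+)\(.*\)\s*=\s*(-?\d+)/;
    my ($hh, $mm, $ss, $func, $ret) = ($1, $2, $3, $4, $5);

    my $t = $hh * 3600 + $mm * 60 + $ss;
    $first_t = $t unless defined $first_t;
    $last_t  = $t;

    next unless $func =~ /^(open|close|read|write|lseek|_llseek|stat)$/;
    $count{$func}++;

    # For read() and write(), the return value is the number of bytes moved.
    if (($func eq 'read' or $func eq 'write') and $ret > 0) {
        $bytes{$func} += $ret;
    }
}

die "No strace lines found\n" unless defined $first_t;

my $elapsed = ($last_t - $first_t) || 1;    # avoid dividing by zero
foreach my $func (sort keys %count) {
    printf "%-8s %10d calls\n", $func, $count{$func};
}
printf "read : %12d bytes (%.2f MB/s average)\n",
       $bytes{read}  || 0, ($bytes{read}  || 0) / 1.0e6 / $elapsed;
printf "write: %12d bytes (%.2f MB/s average)\n",
       $bytes{write} || 0, ($bytes{write} || 0) / 1.0e6 / $elapsed;

You would run a sketch like this the same way as strace_analyzer, for example ./io_sketch.pl strace.out.5213 (the script name is made up). A real version also has to cope with messier details, such as calls that strace splits across "unfinished" and "resumed" lines, which is part of what I'm glossing over here.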

To run the analyzer, you just type its name followed by the name of the strace output file:

% ./strace_analyzer strace.out.5213

The analyzer then writes the information to standard output. By default, it also echoes every line of the strace output file, so if the file is long, I recommend redirecting stdout to a file:

% ./strace_analyzer strace.out.5213 > \
  strace.out.5213.analyzed

Then you can look at the file strace.out.5213.analyzed to get the information you need. I haven't really pimped out the tool yet, so if you don't want to see the strace output echoed to the screen, you can modify the code and comment out the printf line that writes each line from the strace file to standard output.

Table 1 summarizes just part of the results from applying the tool to the first file (strace.out.5213) from the MPI code in the last column.

TABLE 1: Summary of Results for Sample MPI Output

Statistic                            Value
Number of open() Function Calls      5
Number of read() Function Calls      26
Number of write() Function Calls     20
Number of llseek() Function Calls    2
Number of stat() Function Calls      5
Number of close() Function Calls     12
Time for slowest read()              0.052859 secs.
Time for slowest write()             0.005340 secs.
Time for slowest open()              0.000252 secs.
Time for slowest close()             0.000164 secs.

The tool also produces a breakdown of the amount of data either read or written. Table 2 contains the reads and Table 3 contains the writes.

TABLE 2: Read File Sizes

Size Range                    Number of read() Calls
(1) < 1KB                     26
(2) 1KB to 8KB                0
(3) 8KB to 1MB                0
(4) > 1MB                     0
Total Number of Bytes Read    2,811 Bytes

TABLE 3: Write File Sizes

Size Range                       Number of write() Calls
(1) < 1KB                        19
(2) 1KB to 8KB                   0
(3) 8KB to 1MB                   0
(4) > 1MB                        1
Total Number of Bytes Written    400,613 Bytes

If you recall the simple MPI example from the previous column, you will notice the single write() of 400,000 bytes. The code simply opened a file and wrote 400,000 bytes of dummy data, so we would expect to see about one write() function call. But as you can see in the tables, there is quite a bit of small I/O going on as well. For example, there are 19 write() function calls that are less than 1KB in size, and there are 26 read() function calls smaller than 1KB. So it looks like MPI codes do a fair amount of I/O in the background, outside your code (at least MPICH2 did for this simple example).

Applying strace_analyzer to an ISV Code

Using strace_analyzer against the output from a simple code is not really a big deal, but I wanted to show you what it could do. Now I want to apply it against an strace file from an ISV code.

I don't really want to name the ISV or the code, since the purpose of this example is to illustrate how to apply strace_analyzer to an ISV code, not to start a long discussion about the code itself. While such a discussion could be a good thing, I've also seen people use strace information to start bad-mouthing a code, and that's not what I want.

The ISV code I want to analyze is what is commonly called an "Implicit" Finite Element Method (FEM) code. There are several that are very popular. The reason I want to use an Implicit FEM code is that these codes do a lot of I/O, so they will give us a ton of I/O information to work with.

The particular problem solved is not important, but it was run on a single node with two sockets and a dual-core CPU in each socket, for a total of four cores. I used only a single core, though, to make processing the strace output file a bit easier. So let's reproduce Table 1 for this ISV code (shown as Table 4 below). Recall that this table contains statistics on the I/O functions used.

TABLE 4: Summary of Results for ISV strace File

Statistic                            Value
Number of open() Function Calls      169
Number of read() Function Calls      143,618
Number of write() Function Calls     267,719
Number of llseek() Function Calls    173,439
Number of stat() Function Calls      0
Number of close() Function Calls     176

Unfortunately, the strace file wasn't in the correct format for strace_analyzer, so I couldn't get the times for the read() and write() operations.

Table 5 contains the read function statistics and Table 6 contains the write function statistics.

TABLE 5: Read File Sizes

Size Range                    Number of read() Calls
(1) < 1KB                     888
(2) 1KB to 8KB                10
(3) 8KB to 1MB                142,708
(4) > 1MB                     12
Total Number of Bytes Read    9,518,258,252 Bytes

TABLE 6: Write File Sizes

Size Range                       Number of write() Calls
(1) < 1KB                        211,580
(2) 1KB to 8KB                   1,199
(3) 8KB to 1MB                   54,934
(4) > 1MB                        6
Total Number of Bytes Written    3,361,660,572 Bytes

What Do the strace Statistics Tell Us?

In addition to collecting the I/O statistics from the strace file, you have to look critically at the results. In the case of the simple MPI code, we noticed that the MPI implementation, in this case MPICH2, did a few really small (less than 1KB) read() and write() operations. This is not something that many people notice. But this was a simple MPI code, so there were no real surprises.

In the case of the ISV code, we see something quite different. There are very large numbers of read(), write(), and lseek() function calls. In my eyes, this means the code is doing a lot of I/O, particularly since it ran for only 214 seconds. That works out to an average of about 671 read(), 810 lseek(), and 1,251 write() function calls every second.
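
If you want to sanity check counts like these yourself, a quick grep against the raw strace file gets you into the right neighborhood, since each call name appears right after the timestamp on its line (the file name below is just a placeholder, and 'lseek(' also catches the _llseek() variant):

% grep -c ' read(' strace.out.isv
% grep -c ' write(' strace.out.isv
% grep -c 'lseek(' strace.out.isv

The counts won't be exact if a call name happens to appear inside a string argument, but they are close enough to confirm what the analyzer reports.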

If we look at the size statistics, we can see that a great deal of I/O is performed: 3.36GB in writes and 9.52GB in reads. Over the 214-second run, that is an average of 15.7 MB/s for writes and 44.48 MB/s for reads. This is quite a bit of I/O, given that these are averages over the whole run.

We can also see that most of the write I/O is done in small chunks of less than 1KB, but most of the reads are done in chunks of 8KB to 1MB. This tells me that the code does a lot of small I/O for writes (not a good thing for performance), while the reads are done in larger chunks, which helps performance.

I can't include any sections of the strace file here because of length, but in looking through it I can see many lseek() functions mixed in among the read() and write() functions. This usually means the code performs some I/O operation (a read or write) and then moves the file pointer to a new position. That can slow down the I/O because the file pointer has to move between operations. If you couple this with the large number of write() functions, it becomes obvious that this code's I/O is dominated by small chunks of data (not great for performance).

Here is a quick summary of the observations I made from the statistics and looking at the actual strace data:

Observations of I/O Pattern in ISV code

  1. The code does a lot of read(), write(), and seek() function calls. The large number of seeks is somewhat unusual.
  2. Writes are dominated by small chunks (less than 1KB). This is not good for performance.
  3. Reads are dominated by larger chunks (8KB to 1MB). This is better for performance.
  4. There are many seek() functions in between reads and writes. This is generally not good for performance.

My observations are consistent with what people have seen for almost all of the “Implicit” FEM codes. These codes do a great deal of I/O and quite a bit of it is dominated by smaller data sizes for reads and writes. They also do a lot of seek() functions to move the file pointer. I could write another column about why the codes do this, but let me just say that this is typical for these codes and the decisions behind this general I/O pattern were thought out.

Next Steps

While I would love to continue writing about strace and using it to analyze the I/O pattern of your codes, I think I would be falling down a rat hole by running strace_analyzer on various codes and looking at the results. We can learn a lot from doing this, but I think it's more appropriate to leave that as homework for you.

I hope you have learned something from these columns on using strace to analyze the I/O pattern of your codes. My goal was to show you how easy it is to get some data on the I/O patterns and then use that data to collect some statistics about the I/O pattern. The effort in doing this can only help you, particularly when you start to look for storage hardware and file systems for your HPC systems.

You can download the strace_analyzer tool here. I’m going to continue to develop it to add things such as CSV output (so you can analyze and plot anything you want in a spreadsheet), histogram output, and an output format for a new tool I hope to develop.

This new tool will be able to take the output from strace_analyzer and create a simulation of the I/O pattern of your code using dummy data. You can then take the resulting simulation and try it on different storage hardware and file systems to see how they perform on the I/O portion of your code. Not a bad idea, eh? Keep watching the URL and this column for updates on how I'm doing.

If you want to help, please don't hesitate to grab the code and hack away. Just let me know if you come up with anything cool by sending me email. I also watch the Beowulf mailing list, so if you want to post something there, I'll probably see it. In the meantime, happy stracing!
