strace: The Friend You Never Knew You Had

While strace is often used for troubleshooting and debugging, you can also use strace to get started on examining the I/O pattern of your serial codes.

One of the handiest tools used by admins to track down problems is strace, a debugging tool that traces system calls and shows where programs encounter problems. While strace is often used for troubleshooting and debugging, you can also use strace to get started on examining the I/O pattern of your serial codes.

With High Performance Computing (HPC) applications, it’s always better to have more information about the behavior of the code than too little. Knowing the behavior of your code means knowing how it scales, if it’s memory bandwidth intensive, if it’s floating point intensive, how integer intensive is it, how much message passing is done, what kind of messages are passed, what size messages are passed, and on and on. With this information you can then start to configure your cluster to match your code.

Configuring or designing a cluster to match your code is a very different approach than in the past. With previous high performance systems, you had to modify your code to match the machine. This meant that every time a new series came out, you had to modify and tune your code, which is a long and laborious process. But now, clusters give you so many options that you can tune the hardware and the software to match your code. However, to do this, you need to know your code (s) very well.

One aspect of knowing your code that many people either miss, skip, or forget, is the I/O pattern. Many times the reason they don’t know is that they are not quite sure how to determine the I/O pattern and develop I/O requirements from them. In some cases, they don’t even know how different I/O rates affect the run time of their code. In this and the next several columns, I hope to give you a basic tool and some tips to examine your I/O patterns and develop basic metrics from them.

I’m sure there are some people who already know how to use strace to examine the I/O of a code, but for those who don’t know how, let’s walk through a simple example based on a simple serial code.

Simple Serial Example

The code below is a simple C program that just creates an array of floats and writes it to a file. Sure it’s a really simple code, but I want to show you how to use strace to look at the I/O pattern of the code.

#include 
#define BUFSIZE 100000
int main(int argc, int *argv[] ){
   int i;
   float buf[BUFSIZE];
   char filename[128];
   FILE *myfile;
   for (i=0; i < BUFSIZE; i++)
      buf[i] = 2.5 + BUFSIZE * i;
   sprintf(filename, "testfile");
   myfile = fopen(filename, "w");
   fwrite(buf, sizeof(float), BUFSIZE, myfile);
   fclose(myfile);
}

I compiled the code using gcc on my home box which has a simple software RAID-1 device with a couple of 40GB Seagate drives and uses Ext3. We will use strace to run the resulting binary. In general, strace is used to capture the function calls from the binary. This includes calls to I/O functions such as open, llseek, write, read, and close. The command line I will use is,

strace -tt -o ./serial.strace ./serial

Notice that I named the binary serial.

I used two options with strace. The first option, -tt gives microsecond timings for all function calls. The second option, -o serial.strace, sends the strace output to a file called serial.strace. The strace output is too long for this column, but I will show you a few lines from the output and how you can derive some metrics from it.

Analyzing the strace Output

The I/O pattern for the code is really simple. It just opens a file, writes the floating data to the file and then closes the data. It's pretty simple, but in this case, there are actually two writes. The first one writes about 397KB of data and the second writes only about 2.7KB. So we have a reasonably large write followed by a very small write. More complicated codes will have different I/O patterns (more on that in future columns). Let's take a closer look at the I/O function calls.

We can ignore most of the stuff in the output since we're really interested in the I/O part of it. The relevant I/O portion (s) of the output are below:

14:47:03.559980 open("testfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
14:47:03.560250 write(3, "\0\0 @@Q\303G\240PCHP|\222HPP\303HP$\364H(|\22I(\346*I"..., 397312) = 397312
14:47:03.562485 write(3, "\0\25\240N\r\30\240N\33\33\240N(\36\240N5!\240NB$\240N"..., 2688) = 2688
14:47:03.562613 close(3)

The first line that I've listed opens the file. The next two lines are important because they actually write out the data. You can find out what the function write does by looking in the man pages, but you may have to use the command man 2 write to get the correct man pages.

The function write writes data to a file descriptor (basically a file). You can see the first few bits of data that are being written inside the quotations in the strace output. For this example, the data is binary so the data displayed here doesn't make much sense. But if the data is text or non-binary data, then you might be able to read the first part of the data. strace does not list all of the data being written to the file. The very last number, after the equals sign, is the number of bytes actually written to the file descriptor. For this example, there are two writes. The first one writes 397,312 bytes and the second writes 2,688 bytes. The total of the fwrite functions is 400,000 bytes.

The fwrite function calls in the strace output file also contain more useful information. The time that the particular function begins is printed before the function. I used the -tt option which gives the time including microseconds. If we look at the time for the next function (the next line in the output), we can take the difference between the times to get the elapsed time to perform the previous function.

In this example, it's the amount of time it took to perform the fwrite function. Usually it's good enough to look at the seconds and microseconds to determine the amount of time for the write or read function. In this case the elapsed time to perform the first fwrite function is 2,235 microseconds. We get this by comparing the timestamp of the start of the first write (14:47:03.560250) and the timestamp (14:47:03.562485) on the start of the second write.

We also know that the amount of data is 397,312 bytes. So by doing the math (397,312 bytes+ 0.002235 seconds) we find that the resulting I/O rate is 177,768,233 bytes/sec. This is the same as 178 MB/s, assuming that a MB is 1,000,000 bytes.

The amount of time the second fwrite function took is 128 microseconds. The amount of data written is 2,688 bytes so the I/O rate is 2.1x10^7 bytes/sec. This is the same as 21 MBs, a bit slower than the previous write.

Now you can go through and compute the I/O rate for all read and write functions in your strace output! (Doesn't this cry out for automation?).

Observations

Let's compare the I/O rates for the two fwrite functions. The first fwrite writes out a reasonable amount of data- 397KB. But the second fwrite only writes about 2.7KB, a much smaller amount of data. Now compare the two I/O rates. The first I/O rate is about 178 MB/s and the second I/O rate is only about 21 MB/s. Why is there such a big difference?

A number of factors influence the I/O rate. The latency of the drive where it seeks to a certain location on the disk, the actual speed that the drive can write the data on the disk, how fast the file system can translate the data into blocks for the drive controller, any llseek function calls that move the data location to a different point in the file. And believe it not, the amount of data written has an impact on the I/O rate.

For very small amounts of data the amount of time to write the data is almost the same as the seek time or the latency of the drive. This is similar to sending messages across a network. That is, for very small packets, the time to send the data is about the same as the latency of the network.

Examining the time for the second fwrite, we can see that the time is very short and the amount of data is much much smaller than the first fwrite. We can observe that writing small amounts of data results in a very low I/O rate. Consequently, we can conclude that it is good to avoid small writes if possible. This is because small amounts of data have low I/O rates, which can slow down our code (s).

What Have We Learned?

I hope this column has introduced you to using strace to examine the I/O pattern of your code. For this simple example we found out that the resulting binary does a reasonably large write (397KB) followed by a very small write (2.7KB). The strace output can also be used to examine the I/O rates of your code. For this simple serial code we found that the first fwrite function writes data at about 178 MB/s (caching effects included) and the second, much smaller fwrite, has an I/O rate of 21 MB/s.

From these numbers it can be observed that it's a good idea to avoid small writes because the I/O rate is so low. So if your code does a great deal of small writes, and there are many applications that do this, your overall I/O throughput will be very low.

In the next column I will show you how to use strace with an MPI code. We'll also continue to look at more complicated I/O patterns. Until then take a few moments and use strace to look at the I/O of you favorite applications. Just be careful because strace can produce a huge amount of data.

Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/linux-mag.com/site/www/htdocs/wp-content/themes/linuxmag/single.php on line 62