IO Profiling of Applications: MPI Apps

In the last article we looked at using strace to examine the IO patterns of simple serial applications. In the High Performance Computing (HPC) world, parallel applications are typically built with MPI (Message Passing Interface). This time around we discuss how to apply strace to these parallel applications.

A Simple Example

Let’s start with a simple example from the MPI-2 book by Bill Gropp, et al. In Chapter 2 the authors present a simple example of an MPI code where each process of a total of N processes writes data to an individual file (this is usually referred to as N-N IO). I modified the code to write more data than originally presented. Here is the modified C code:

/* example of parallel Unix write into separate files */
#include <stdio.h>
#include "mpi.h"
#define BUFSIZE 100000

int main(int argc, char *argv[])
{
   int i, myrank, buf[BUFSIZE];
   char filename[128];
   FILE *myfile;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
   for (i=0; i < BUFSIZE; i++)
      buf[i] = myrank * BUFSIZE + i;
   sprintf(filename, "testfile.%d", myrank);
   myfile = fopen(filename, "w");
   fwrite(buf, sizeof(int), BUFSIZE, myfile);
   fclose(myfile);
   MPI_Finalize();
   return 0;
}
I don't want to cover the MPI functions used in the code in this article, but you can see that the basic code is almost the same as the serial code in the previous article. Each MPI process creates its own output file ("testfile.#", where "#" is the rank of the MPI process) and writes some data to it.

The two scripts, main.sh and code1.sh, that are used to run the application are fairly simple. The main script, main.sh, looks like the following:

mpirun -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh

I didn't hard-code the path to mpirun in the script but I could easily do that to make sure I get the correct mpirun (this can sometimes be a problem in the HPC world). However, I did hard-code the path to the code script to make sure I executed the script I wanted.

The "code1.sh" script contains the actual command line: it runs the MPI executable, but under strace.

/usr/bin/strace -T -ttt -o /tmp/strace.out.$$ /home/laytonj/TESTING/test1_mpi "$@"

where "test1_mpi" is the name of the MPI executable I created from the above code.

Next, you run the "main.sh" script. When the job is finished you have to go to each node used in the run and copy the files from /tmp back to some more permanent file system. You could write all of the strace output files to a central file system, but then you run the risk that two processes on different nodes get the same PID, so their output files would collide. The chances of this are fairly small, but I don't like to take that chance.
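One way around that PID-collision risk is to put the hostname, not just the PID, in the strace output name inside the wrapper script. This is a minimal sketch of such a variant of code1.sh; the strace line is shown commented out since it is unchanged from the article apart from the file name:

```shell
#!/bin/bash
# Hypothetical variant of code1.sh: include both hostname and PID in
# the strace output name so logs from different nodes can never
# collide, even when written to a shared file system.
OUTFILE="/tmp/strace.out.$(hostname).$$"
# /usr/bin/strace -T -ttt -o "$OUTFILE" /home/laytonj/TESTING/test1_mpi "$@"
echo "$OUTFILE"
```

With names like these, the output directory can safely be a central file system instead of the node-local /tmp.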

Analyzing the Strace Output

When I ran the code using the "main.sh" script on my simple quad-core desktop, four output files were created in the /tmp directory:

  • strace.out.4301
  • strace.out.4302
  • strace.out.4303
  • strace.out.4304

In the serial equivalent of the MPI code presented here, the strace file contained only 33 lines (the number of lines can vary depending upon your distribution, compiler, etc.). For the MPI example, each strace file contained 1,240 lines! A great deal of the output is related to MPI - loading MPI .so libraries, running MPI functions, making network connections, etc.
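One simple way to skip past the MPI noise is to filter each log down to just the file-IO syscalls before reading it. A minimal sketch (the function name io_lines and the particular syscall list are my own choices):

```shell
# Print only the file-IO syscalls from strace -T -ttt logs given as
# arguments (or on stdin), dropping library-loading and network noise.
io_lines() {
  grep -E ' (open|openat|read|write|close|fstat|lseek)\(' "$@"
}
```

For example, "io_lines strace.out.4301" trims away most of the MPI-related lines and leaves the opens, reads, and writes we actually care about.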

Let's pull out some highlights from the first strace output file, strace.out.4301. Note that line numbers are included to illustrate relative locations of interesting data in the file.

      1 1267931716.455334 execve("/home/laytonjb/ARTICLES/STORAGE039/test1_mpi", ["/home/laytonjb/ARTICLES/STORAGE0"...], [/* 78 vars */]) = 0 <0.000105>
      2 1267931716.455687 brk(0)                = 0x602000 <0.000021>
      3 1267931716.455748 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f09bd9b5000 <0.000009>
      4 1267931716.455817 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f09bd9b4000 <0.000022>
      5 1267931716.455875 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) <0.000016>
      6 1267931716.455952 open("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/tls/x86_64/libmpi.so.0", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000023>
      7 1267931716.456020 stat("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/tls/x86_64", 0x7fffe1b72300) = -1 ENOENT (No such file or directory) <0.000011>
      8 1267931716.456070 open("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/tls/libmpi.so.0", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000013>
      9 1267931716.456124 stat("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/tls", 0x7fffe1b72300) = -1 ENOENT (No such file or directory) <0.000011>
     10 1267931716.456180 open("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/x86_64/libmpi.so.0", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000024>
     11 1267931716.456242 stat("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/x86_64", 0x7fffe1b72300) = -1 ENOENT (No such file or directory) <0.000021>
     12 1267931716.456308 open("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/libmpi.so.0", O_RDONLY) = 3 <0.000012>
     13 1267931716.456505 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\243\1\0\0\0\0\0"..., 832) = 832 <0.000015>
     14 1267931716.456565 fstat(3, {st_mode=S_IFREG|0755, st_size=818907, ...}) = 0 <0.000011>


   1166 1267931718.542514 open("testfile.0", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 10 <0.000026>
   1167 1267931718.542567 fstat(10, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0 <0.000004>
   1168 1267931718.542603 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f09bd253000 <0.000007>
   1169 1267931718.542629 write(10, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 397312) = 397312 <0.000792>
   1170 1267931718.543460 write(10, "\0\204\1\0\1\204\1\0\2\204\1\0\3\204\1\0\4\204\1\0\5\204\1\0\6\204\1\0\7\204\1\0"..., 2688) = 2688 <0.000019>
   1171 1267931718.543501 close(10)             = 0 <0.000006>


   1231 1267931718.546397 open("/openmpi-sessions-laytonjb@test64_0/3400466433/0", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOENT (No such file or directory) <0.000007>
   1232 1267931718.546443 munmap(0x7f09bcc2b000, 2106776) = 0 <0.000014>
   1233 1267931718.546499 munmap(0x7f09bce2e000, 2101136) = 0 <0.000010>
   1234 1267931718.546547 munmap(0x7f09bd02f000, 2115040) = 0 <0.000011>
   1235 1267931718.546582 munmap(0x7f09bbfe7000, 2103408) = 0 <0.000010>
   1236 1267931718.546619 munmap(0x7f09bbde5000, 2102824) = 0 <0.000009>
   1237 1267931718.546651 munmap(0x7f09ba9a2000, 2111512) = 0 <0.000010>
   1238 1267931718.546682 munmap(0x7f09b92d3000, 2138064) = 0 <0.000010>
   1239 1267931718.546712 munmap(0x7f09b90c7000, 2146256) = 0 <0.000011>
   1240 1267931718.546766 exit_group(0)         = ?

So you can see that the first 1,165 lines do all kinds of things, mostly related to MPI but also starting the application itself. Then at line 1,166 the local file testfile.# is opened. Lines 1,169 and 1,170 actually perform the write() syscalls to the output file.

While not pertinent to the discussion of using strace, notice that the file opened on line 1,166 is named "testfile.0". The name uses the rank of the particular MPI process - in this case, the 0-th (zero-th) process. The other processes create files named "testfile.1", "testfile.2", and "testfile.3".

As with the serial example in the previous article, let's walk through the strace output and examine some of the statistics for the IO functions, starting with the write() syscalls. The elapsed time for the first write() syscall is 0.000792 seconds (792 microseconds). The amount of data written was 397,312 bytes (the same as the serial case). So the throughput was,

397312 Bytes / 0.000792 secs. = 501,656,565.7 Bytes/s = 501.66 MB/s

We can do the same computation for the second write():

2688 Bytes / 0.000019 secs = 141,473,684.2 Bytes/s = 141.47 MB/s

The total elapsed time for the application is also easy to compute.

1267931718.546766 - 1267931716.455334 = 2.091432 seconds
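These throughput calculations are easy to script. Here is a sketch (the function name write_mbs is my own) that pulls the byte count and the <elapsed> field out of each write() line in an strace -T -ttt log and prints the rate:

```shell
# For each write() in an strace -T -ttt log (file arguments or stdin),
# print the bytes written, elapsed seconds, and throughput in MB/s.
write_mbs() {
  awk '/ write\(/ {
    # the return value follows the "="; the elapsed time is the
    # <...> field at the end of the line
    for (i = 1; i <= NF; i++) if ($i == "=") bytes = $(i + 1)
    secs = $NF
    gsub(/[<>]/, "", secs)
    if (secs > 0)
      printf "%d bytes / %s s = %.2f MB/s\n", bytes, secs, bytes / secs / 1e6
  }' "$@"
}
```

Applied to strace.out.4301, this reproduces the roughly 501.66 MB/s and 141.47 MB/s figures computed by hand above.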

Summary

While getting strace output is generally easy, in this article we found that not to be the case for general MPI codes. We had to create two scripts so we could get the strace output from each MPI process (which is what we really want) rather than from the mpirun command used to start the application. Once those scripts are in place, getting the strace output for any number of MPI processes is quite easy. (Note: you can always add some lines to the scripts to copy the strace files back to your home directory or some centralized location.)
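As a concrete version of that note, a small helper like this at the end of code1.sh would copy each node's log off /tmp; the function name and destination directory are my own choices, and the hostname is appended so names from different nodes stay unique:

```shell
# Hypothetical tail of code1.sh: copy this node's strace log off /tmp
# to a per-user directory (assumed to be on a shared file system),
# tagging it with the hostname to keep the names unique.
save_log() {
  src="$1"
  dest="$HOME/strace-logs"
  mkdir -p "$dest"
  cp "$src" "$dest/$(basename "$src").$(hostname)"
}
# in code1.sh this would be called as:  save_log "/tmp/strace.out.$$"
```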

One thing that is fairly obvious is that even simple codes can produce a great deal of output. Going from a simple serial code to an MPI code increased the number of lines by a factor of almost 40. Most of that addition is the startup and shutdown code for the application, which is much the same for any application but can grow with the number of processes used. Just imagine a real application that could produce several hundred thousand lines of strace output.

What happens if you have a much more complicated application? Would you examine the IO syscalls by hand? The whole examination process cries out for some sort of automation.
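As a first step toward that automation, here is a sketch (the function name summarize is my own) that reduces each strace file to a single line: the number of write() calls, the total bytes written, and the total time spent in write():

```shell
# Print one summary line per strace -T -ttt log: write() call count,
# total bytes written, and total elapsed time spent in write().
summarize() {
  for f in "$@"; do
    awk -v name="$f" '/ write\(/ {
      n++
      for (i = 1; i <= NF; i++) if ($i == "=") bytes += $(i + 1)
      t = $NF
      gsub(/[<>]/, "", t)
      secs += t
    } END {
      printf "%s: %d write() calls, %d bytes, %f s\n", name, n, bytes, secs
    }' "$f"
  done
}
# e.g.:  summarize strace.out.*
```

The same pattern extends easily to read(), open(), and the other IO syscalls, which is exactly the kind of per-syscall accounting a real analysis tool would do.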
