In the last article we looked at using strace to examine the IO patterns of simple serial applications. In the High Performance Computing (HPC) world, applications use MPI (Message Passing Interface) to run in parallel across many processes. This time around we discuss how to apply strace to parallel MPI applications.
A Simple Example
Let’s start with a simple example from the MPI-2 book by Bill Gropp, et al. In Chapter 2 the authors present a simple example of an MPI code where each process of a total of N processes writes data to an individual file (this is usually referred to as N-N IO). I modified the code to write more data than originally presented. Here is the modified C code:
/* example of parallel Unix write into separate files */
#include "mpi.h"
#include <stdio.h>

#define BUFSIZE 100000

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* rank of this process */

    /* fill the buffer with values unique to this rank */
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;

    /* each rank writes its own file: testfile.0, testfile.1, ... */
    sprintf(filename, "testfile.%d", myrank);
    myfile = fopen(filename, "w");
    fwrite(buf, sizeof(int), BUFSIZE, myfile);
    fclose(myfile);

    MPI_Finalize();
    return 0;
}
I won't cover the MPI functions used in the code in this article, but you can see that the basic code is almost the same as the serial code in the previous article. This program has each MPI process create its own output file ("testfile.#", where "#" is the rank of the MPI process) and write some data to it.
The two scripts, main.sh and code1.sh, that are used to run the application are fairly simple. The main script, main.sh, looks like the following:
#!/bin/bash
mpirun -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh
I didn't hard-code the path to mpirun in the script but I could easily do that to make sure I get the correct mpirun (this can sometimes be a problem in the HPC world). However, I did hard-code the path to the code script to make sure I executed the script I wanted.
The "code1.sh" script that actually contains the command line we want to execute the MPI executable but also using strace.
#!/bin/bash
/usr/bin/strace -T -ttt -o /tmp/strace.out.$$ /home/laytonj/TESTING/test1_mpi $@
where "test1_mpi" is the name of the MPI executable I created from the above code.
Next, you run the "main.sh" script. When the job is finished you have to go to each node used in the run and copy the files from /tmp back to a more permanent file system. You could write all of the strace output files directly to a central file system, but you run the risk that two processes on different nodes end up with the same PID, so their output files would collide. The chances of this are fairly small, but I don't like to take that chance.
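One way to sidestep both chores (shown here only as a sketch) is to tag the output file with the node's hostname in addition to the PID and have the script copy the file back when the run finishes. This assumes a directory shared by every node; the strace_results directory name is purely illustrative:
#!/bin/bash
# tag the strace output with the hostname and PID to avoid collisions across nodes
OUT=/tmp/strace.out.$(hostname).$$
/usr/bin/strace -T -ttt -o $OUT /home/laytonj/TESTING/test1_mpi "$@"
# copy the result back to a (hypothetical) shared directory
cp $OUT /home/laytonj/TESTING/strace_results/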
Analyzing the Strace Output
When I ran the code using the "main.sh" script on my simple quad-core desktop, four output files were created in the /tmp directory:
- strace.out.4301
- strace.out.4302
- strace.out.4303
- strace.out.4304
In the serial equivalent of the MPI code presented here, the strace file contained only 33 lines (the number of lines can vary depending upon your distribution, compiler, etc.). For the MPI example each strace file contained 1,240 lines! A great deal of the output is related to MPI - loading MPI .so libraries, running MPI functions, network connections, etc.
Let's pull out some highlights from the first strace output file, strace.out.4301. Note that line numbers are included to illustrate relative locations of interesting data in the file.
1 1267931716.455334 execve("/home/laytonjb/ARTICLES/STORAGE039/test1_mpi", ["/home/laytonjb/ARTICLES/STORAGE0"...], [/* 78 vars */]) = 0 <0.000105>
2 1267931716.455687 brk(0) = 0x602000 <0.000021>
3 1267931716.455748 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f09bd9b5000 <0.000009>
4 1267931716.455817 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f09bd9b4000 <0.000022>
5 1267931716.455875 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) <0.000016>
6 1267931716.455952 open("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/tls/x86_64/libmpi.so.0", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000023>
7 1267931716.456020 stat("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/tls/x86_64", 0x7fffe1b72300) = -1 ENOENT (No such file or directory) <0.000011>
8 1267931716.456070 open("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/tls/libmpi.so.0", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000013>
9 1267931716.456124 stat("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/tls", 0x7fffe1b72300) = -1 ENOENT (No such file or directory) <0.000011>
10 1267931716.456180 open("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/x86_64/libmpi.so.0", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000024>
11 1267931716.456242 stat("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/x86_64", 0x7fffe1b72300) = -1 ENOENT (No such file or directory) <0.000021>
12 1267931716.456308 open("/home/laytonjb/bin/openmpi-1.4-gcc-4.1.2/lib/libmpi.so.0", O_RDONLY) = 3 <0.000012>
13 1267931716.456505 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\243\1\0\0\0\0\0"..., 832) = 832 <0.000015>
14 1267931716.456565 fstat(3, {st_mode=S_IFREG|0755, st_size=818907, ...}) = 0 <0.000011>
...
1166 1267931718.542514 open("testfile.0", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 10 <0.000026>
1167 1267931718.542567 fstat(10, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0 <0.000004>
1168 1267931718.542603 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f09bd253000 <0.000007>
1169 1267931718.542629 write(10, "\0\0\0\0\1\0\0\0\2\0\0\0\3\0\0\0\4\0\0\0\5\0\0\0\6\0\0\0\7\0\0\0"..., 397312) = 397312 <0.000792>
1170 1267931718.543460 write(10, "\0\204\1\0\1\204\1\0\2\204\1\0\3\204\1\0\4\204\1\0\5\204\1\0\6\204\1\0\7\204\1\0"..., 2688) = 2688 <0.000019>
1171 1267931718.543501 close(10) = 0 <0.000006>
...
1231 1267931718.546397 open("/openmpi-sessions-laytonjb@test64_0/3400466433/0", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOENT (No such file or directory) <0.000007>
1232 1267931718.546443 munmap(0x7f09bcc2b000, 2106776) = 0 <0.000014>
1233 1267931718.546499 munmap(0x7f09bce2e000, 2101136) = 0 <0.000010>
1234 1267931718.546547 munmap(0x7f09bd02f000, 2115040) = 0 <0.000011>
1235 1267931718.546582 munmap(0x7f09bbfe7000, 2103408) = 0 <0.000010>
1236 1267931718.546619 munmap(0x7f09bbde5000, 2102824) = 0 <0.000009>
1237 1267931718.546651 munmap(0x7f09ba9a2000, 2111512) = 0 <0.000010>
1238 1267931718.546682 munmap(0x7f09b92d3000, 2138064) = 0 <0.000010>
1239 1267931718.546712 munmap(0x7f09b90c7000, 2146256) = 0 <0.000011>
1240 1267931718.546766 exit_group(0) = ?
So you can see that the first 1,165 lines do all kinds of things, mostly related to MPI but also starting the application itself. Then at line 1,166 the local file testfile.# is opened. Lines 1,169 and 1,170 actually perform the write() syscalls to the output file.
While not pertinent to the discussion of using strace, notice that the file opened on line 1,166 is named "testfile.0". The code uses the rank of the particular MPI process - in this case, the 0-th (zeroth) process. The other processes will create files named "testfile.1", "testfile.2", and "testfile.3".
As with the serial example in the previous article, let's walk through the strace output and examine some of the statistics for the IO functions, starting with the write() syscalls. The elapsed time for the first write() syscall is 0.000792 seconds (792 microseconds). The amount of data written was 397,312 bytes (the same as the serial case). So the throughput was,
397312 Bytes / 0.000792 secs. = 501,656,565.7 Bytes/s = 501.66 MB/s
We can do the same computation for the second write():
2688 Bytes / 0.000019 secs = 141,473,684.2 Bytes/s = 141.47 MB/s
The total elapsed time for the application is also easy to compute.
1267931718.546766 - 1267931716.455334 = 2.091432 seconds
Summary
While getting strace output is generally easy, in this article we found that it takes a bit more work for MPI codes. We had to create two scripts so we could get the strace output from each MPI process (which is what we really want) rather than from the mpirun command used to start the application. Once those scripts are created, getting the strace output for any number of MPI processes is quite easy. (Note: you can always add a few lines to the scripts to copy the strace files back to your home directory or some centralized location, as sketched earlier.)
One thing that is fairly obvious is that even simple codes can produce a great deal of output. Going from a simple serial code to an MPI code increased the number of lines by a factor of almost 40. Most of that increase comes from the MPI startup and shutdown activity, which is roughly the same for any application but can grow with the number of processes used. Just imagine a real application that could produce several hundred thousand lines of strace output.
What happens if you have a much more complicated application? Would you examine the IO syscalls by hand? The whole examination process cries out for some sort of automation.
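As a minimal sketch of what that automation could look like, the following script (write_stats.sh is a name made up for this example) sums the bytes and the elapsed times reported for the write() syscalls in a single strace output file and prints an aggregate throughput:
#!/bin/bash
# quick-and-dirty pass over one strace output file produced with -T -ttt
awk '/ write\(/ {
    n = split($0, a, "= ")                # the return value follows "= "
    split(a[n], b, " ")                   # first token of that piece is the byte count
    bytes += b[1]
    if (match($0, /<[0-9.]+>$/))          # elapsed time comes from the -T field
        secs += substr($0, RSTART + 1, RLENGTH - 2)
}
END {
    printf "write(): %d bytes in %f secs = %.2f MB/s\n",
           bytes, secs, bytes / secs / 1000000
}' "$1"
Running it as "./write_stats.sh /tmp/strace.out.4301" would then report the combined write() throughput for that process.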
Comments on "IO Profiling of Applications: MPI Apps"
Your use of an intermediate shell script to kick off your code gives undefined behavior according to the MPI-2 standard, section 5.3.2, "Starting Processes and Establishing Communication."
It'll probably work just fine on many implementations, but there's always the chance it'll bomb outright. Or bomb in subtle ways.
NOTE: When using $@, it should normally be enclosed in double quotes: "$@". With the quotes, bash will quote each individual argument when expanding the list of arguments. As you wrote it, without the quotes, any script arguments containing spaces will be improperly parsed for the executable; with the double quotes all is well.
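For example, a small test script (hypothetical, not part of the scripts above) shows the difference:
#!/bin/bash
# called as: ./args.sh "file name with spaces" other
echo "unquoted \$@:"
for a in $@; do echo "  [$a]"; done      # word-splits into 5 arguments
echo "quoted \"\$@\":"
for a in "$@"; do echo "  [$a]"; done    # preserves the original 2 arguments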
FWIW, you can use /tmp/strace.out.$OMPI_COMM_WORLD_RANK. This is an Open MPI-specific way to know which MPI process you are — each process will have a unique value of $OMPI_COMM_WORLD_RANK in the range [0,N).
This might be slightly more friendly than a random PID number.
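For example, code1.sh could be adapted along these lines (a sketch that assumes Open MPI; the fallback to the PID when the variable is not set is an extra assumption, not part of the original suggestion):
#!/bin/bash
# use the Open MPI rank if it is set, otherwise fall back to the PID
TAG=${OMPI_COMM_WORLD_RANK:-$$}
/usr/bin/strace -T -ttt -o /tmp/strace.out.$TAG /home/laytonj/TESTING/test1_mpi "$@"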
@rhysu – didn't know that. So far I've used this approach with Open MPI, Intel MPI, HP MPI (before it was sold), Platform MPI, MVAPICH, and MPICH2. So pretty much all of them worked at some point. But as you point out there are no guarantees.
Jeff S. – any comment on why that section is in the MPI-2 standard?
@tjrob – thanks for the bash-fu. It shows my lack of serious bash skills :) But to make sure I understand: if I have a bunch of arguments separated by spaces, then the list gets expanded incorrectly for the executable? I've run codes before using this method and the arguments had spaces, so I'm wondering if the executable just parsed the arguments correctly by chance or if there is something else going on.
@jsquyres – Good point. It would also be good to echo that env variable to the output file so that we know how the PID-files match to the ranks.
Thanks!
Jeff
@tjrob's comment:
Use double quotes to prevent word splitting. [2] An argument enclosed in double quotes presents itself as a single word, even if it contains whitespace separators.
More here: http://tldp.org/LDP/abs/html/quotingvar.html