In the last article we looked at using strace to examine the IO patterns of simple serial applications. In the High Performance Computing (HPC) world, applications use MPI (Message Passing Interface) to create parallel applications. This time around we discuss how to attack parallel applications using strace.
Tuesday, March 16th, 2010
A Simple Example
Let’s start with a simple example from the MPI-2 book by Bill Gropp, et al. In Chapter 2 the authors present a simple example of an MPI code where each process of a total of N processes writes data to an individual file (this is usually referred to as N-N IO). I modified the code to write more data than originally presented. Here is the modified C code:
/* example of parallel Unix write into separate files */
#include <stdio.h>
#include "mpi.h"

#define BUFSIZE 100000

int main(int argc, char *argv[])
{
    int i, myrank, buf[BUFSIZE];
    char filename[128];
    FILE *myfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;
    sprintf(filename, "testfile.%d", myrank);
    myfile = fopen(filename, "w");
    fwrite(buf, sizeof(int), BUFSIZE, myfile);
    fclose(myfile);
    MPI_Finalize();
    return 0;
}
I won't cover the MPI functions used in the code in this article, but you can see that the basic structure is almost the same as the serial code in the previous article. The program has each MPI process create its own output file ("testfile.#", where "#" is the rank of the MPI process) and write some data to it.
The two scripts, main.sh and code1.sh, that are used to run the application are fairly simple. The main script, main.sh, looks like the following:
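A minimal sketch of such a launcher script (the process count, machine file, and paths here are illustrative assumptions, not the original values):

```shell
#!/bin/bash
# main.sh -- sketch of the launcher script (paths are illustrative).
# Launch four MPI processes; each one runs the code1.sh wrapper script.
mpirun -np 4 -machinefile ./machines /home/user/code1.sh
```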
I didn't hard-code the path to mpirun in the script but I could easily do that to make sure I get the correct mpirun (this can sometimes be a problem in the HPC world). However, I did hard-code the path to the code script to make sure I executed the script I wanted.
The "code1.sh" script contains the command line that actually runs the MPI executable, but under strace.
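A plausible sketch of code1.sh (the strace options and paths are assumptions; the -T option, which records the elapsed time of each syscall, is suggested by the per-syscall timings discussed later):

```shell
#!/bin/bash
# code1.sh -- sketch of the per-process wrapper (paths are illustrative).
# $$ expands to the PID of this script, so each MPI process gets a
# uniquely named strace output file in /tmp on its own node.
/usr/bin/strace -T -o /tmp/strace.out.$$ /home/user/test1_mpi
```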
where "test1_mpi" is the name of the MPI executable I created from the above code.
Next, you run the "main.sh" script. When the job is finished you have to go to each node used in the run and copy the files from /tmp back to a more permanent file system. You could write all of the strace output files directly to a central file system, but then you run the risk that two MPI processes on different nodes have the same PID, so their output files would collide. The chances of this are fairly small, but I don't like to take that chance.
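The copy-back step is easy to script. Here is a sketch that only prints the scp commands it would run (the node names and destination directory are assumptions; remove the "echo" to actually perform the copies):

```shell
#!/bin/bash
# Hypothetical sketch: gather the per-process strace files from each
# compute node into one directory on a permanent file system.
DEST=$HOME/strace_results
mkdir -p "$DEST"
for node in node01 node02 node03 node04; do
    echo scp "$node:/tmp/strace.out.*" "$DEST/"
done
```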
Analyzing the Strace Output
When I ran the code using the "main.sh" script on my simple quad-core desktop, four output files, one per MPI process, were created in the /tmp directory (named strace.out.<pid>; the first of them was strace.out.4301).
In the serial equivalent of the MPI code presented here, the strace file contained only 33 lines (the number of lines can vary depending upon your distribution, compiler, etc.). For the MPI example, each strace file contained 1,240 lines! A great deal of the output is related to MPI: loading the MPI shared (.so) libraries, running MPI functions, making network connections, etc.
Let's pull out some highlights from the first strace output file, strace.out.4301. Note that line numbers are included to illustrate relative locations of interesting data in the file.
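Based on the discussion that follows, the interesting lines take roughly the following form (the arguments, buffer contents, and timings other than those quoted below are illustrative reconstructions, not copied from the original trace):

```
1166 open("testfile.0", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
...
1169 write(3, "\0\0\0\0\1\0\0\0\2\0\0\0"..., 400000) = 397312 <0.000792>
1170 write(3, "..."..., 2688) = 2688
```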
So you can see that the first 1,165 lines do all kinds of things, mostly related to MPI but also starting the application itself. Then at line 1,166 the local file testfile.# is opened. Lines 1,169 and 1,170 actually perform the write() syscalls to the output file.
While not pertinent to the discussion of using strace, notice that the file opened on line 1,166 is named "testfile.0". The code builds the file name from the rank of the MPI process; this trace is from the 0-th (zero-th) process. The other processes create files named "testfile.1", "testfile.2", and "testfile.3".
As with the serial example in the previous article, let's walk through the strace output and examine some of the statistics for the IO functions, starting with the write() syscalls. The elapsed time for the first write() syscall was 0.000792 seconds (792 microseconds), and the amount of data written was 397,312 bytes (the same as in the serial case). So the throughput was about 501.7 MB/s (397,312 bytes / 0.000792 s, using 1 MB = 10^6 bytes).
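The arithmetic can be checked with a quick one-liner, using the byte count and elapsed time just quoted:

```shell
# First write(): 397,312 bytes in 0.000792 seconds, reported in MB/s
# (1 MB = 10^6 bytes).
awk 'BEGIN { printf "%.1f MB/s\n", 397312 / 0.000792 / 1e6 }'
```

Roughly 500 MB/s from a desktop is a strong hint that the write landed in the page cache rather than on the disk itself.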
While getting strace output is generally easy, that is not quite the case for MPI codes. We had to create two scripts so we could capture the strace output from each MPI process (which is what we really want) rather than from the mpirun command used to start the application. Once those scripts are in place, getting the strace output for any number of MPI processes is quite easy. (Note: you can always add some lines to the scripts to copy the strace files back to your home directory or some other centralized location.)
One thing that is fairly obvious is that even simple codes can produce a great deal of output. Going from a simple serial code to an MPI code increased the number of lines by a factor of almost 40. Most of that additional output comes from the startup and shutdown of the MPI runtime, which is much the same for any application but can grow with the number of processes used. Just imagine a real application that could produce several hundred thousand lines of strace output.
What happens if you have a much more complicated application? Would you examine the IO syscalls by hand? The whole examination process cries out for some sort of automation.
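As a tiny first step toward such automation, a shell pipeline can tally how many times each syscall appears in an strace file. The sample log below is fabricated for illustration; in practice you would point the pipeline at one of the /tmp/strace.out.<pid> files:

```shell
# Create a tiny fabricated strace-style log so the pipeline has input.
cat > /tmp/strace.sample <<'EOF'
open("testfile.0", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
write(3, "..."..., 400000) = 397312
write(3, "..."..., 2688) = 2688
close(3) = 0
EOF

# Tally syscall names: keep everything before the first '(' on each
# line, then count occurrences of each name.
cut -d'(' -f1 /tmp/strace.sample | sort | uniq -c | sort -rn
```

Summing the byte counts and elapsed times of the write() lines with awk would be the natural next step.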