IO Profiling of Applications: MPI Apps

In the last article we looked at using strace to examine the IO patterns of simple serial applications. In the High Performance Computing (HPC) world, applications use MPI (Message Passing Interface) to create parallel applications. This time around we discuss how to attack parallel applications using strace.

Strace is one of those all-purpose tools that can be used for debugging problems on your system(s). It can also be used for digging into the IO profile of applications – even if you don’t have the source code (but with Linux you should always have access to the source). The last article showed how strace can be used to gather a great deal of information about the IO behavior of applications.

Strace is useful here because, for the vast majority of applications on Linux, IO is ultimately performed through system calls, usually reached via library functions. Strace can therefore record the details of each system call (syscall), such as its arguments and return value, in a form that is very useful.

This article discusses using strace for MPI (Message Passing Interface) applications that are common to HPC. Along the way we learn a bit more about using strace.

MPI Overview

This article is not specifically about MPI but for those that may not be familiar with it let’s do a quick 50,000 foot fly-by.

MPI is an API that allows programs on different systems to communicate with one another and send data back and forth. MPI is a standard set of functions that allow you to send information from one system to another (point-to-point) or to send data from a single system to many systems or vice-versa (collective operations). The systems can be on the same physical hardware (e.g. SMP) or they can be distributed (distinct hardware). As long as the programs can open a communication connection of some type, they can share data.

The basic concepts in MPI are fairly simple. For example, if you have an application running on one system and it needs to exchange data with an application running on a different system, then MPI can be used for exchanging data. The “sending” application calls the function “MPI_Send” to send the data to the target system. The target system uses a function, “MPI_Recv”, to receive the data. Typically, the application on the sending system and the receiving system is actually the same binary running on different systems, with some code that determines which system is the “sender” and which is the “receiver”.

There are many tutorials that will teach you how to write MPI code. In addition, there are some very good MPI libraries, such as Open MPI, MPICH2, and MVAPICH, that provide the needed functions for a variety of communication protocols and networks. These include TCP/IP, InfiniBand, and Myrinet MX to name just a few.

MPI applications are executed in several ways. Probably the most common method is that the application is executed once for every core on a system. So a quad-core system would have four instances of the application started. If we have three systems, each with four cores, then we could start 12 instances of the application. When the various instances of the application start they communicate with each other to establish who’s who and where everyone is located, etc. There can be some synchronization between applications as well to make sure they are all in lock-step. Then the applications start computing and sending/receiving data back and forth until the overall application is finished.

Using strace with MPI codes

MPI codes, while a bit more complicated than serial codes, don’t necessarily have to be difficult to use with strace. Ideally, we would like to have one strace output for every MPI process (assuming there are no forks or vforks in the code). This includes having one output for each process even on the same system. So if we had four cores on a node, we would want four strace output files per node. The reason we want one output file per MPI process is so we can tell which MPI process is performing I/O, how much I/O it performs, and how well that I/O performs.

Usually MPI codes are launched by using mpirun or mpiexec or something equivalent that comes with the MPI library. But the problem is that if you try to use strace with this command you end up getting the strace of mpirun or mpiexec itself, not the strace of the actual application, which is what you want. So we need a way to use strace and separate the output files for each process.

For the example below, I’ll be using Open MPI. Open MPI has a utility to start codes called mpirun. A sample command line for Open MPI to run an MPI code is:

mpirun -machinefile ./MACHINEFILE -np 4 <path-to-code>/<executable> <code-options>

where MACHINEFILE is the name of the file containing a list of the machines (host names) being used, <path-to-code> is the path to where the executable is located, <executable> is the name of the actual executable, and <code-options> are any command-line arguments to the executable.
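For readers who have not seen one, a machine file is just a text file with one host per line. The sketch below writes an Open MPI style file for three quad-core systems, like the ones in the overview, and adds up the slots. The host names node01 to node03 and the temporary file are assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical host names; "slots" is the Open MPI hostfile keyword for
# the number of processes a host may run.
machinefile="$(mktemp)"
cat > "$machinefile" <<'EOF'
node01 slots=4
node02 slots=4
node03 slots=4
EOF

# Sum the slots to see how many MPI processes the file can hold.
total_slots=$(awk '{
    for (i = 2; i <= NF; i++)
        if ($i ~ /^slots=/) { split($i, kv, "="); total += kv[2] }
} END { print total + 0 }' "$machinefile")

echo "total slots: $total_slots"
rm -f "$machinefile"
```

With a file like this, the value given to -np would normally not exceed the total number of slots.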

To use strace with an MPI application the first thing people might try is to change the command line to look like:

/usr/bin/strace mpirun -machinefile ./MACHINEFILE -np 4 <path-to-code>/<executable> <code-options>

but all this does is run strace against mpirun, not against the executable as we want. How do we fix this?

The way I run strace against an MPI binary is to convert the single command line into two scripts. The first script is for the mpirun command and the second script is for the actual MPI executable. The first script, which I’ve named “main.sh”, is fairly easy:

#!/bin/bash
mpirun -machinefile ./MACHINEFILE -np 4 <path-to-script>/code1.sh <code-options>

It’s not too different than the mpirun command line previously presented except rather than specify the executable, I specify a script, “code1.sh”, and I give the path to this second script. The second script, which I’ve named code1.sh in this example, is for the actual MPI executable plus strace.

#!/bin/bash
/usr/bin/strace -T -ttt -o /tmp/strace.out.$$ <path-to-code>/<executable> $@

In this second script all of the strace action takes place. As with the serial code I use the “-ttt” option to get microsecond timing using seconds since the epoch, the “-T” option to get the elapsed time of each syscall, and the “-o” option to specify the strace output file. In this case, I’m sending the output to /tmp and naming it strace.out.$$. The $$ after strace.out. is a special bash variable that contains the process ID (PID) of the script. Since each instance of the script on a node gets a unique PID, we will have a separate strace file for each MPI process.
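If PID-named files are hard to match to MPI ranks, one possible variation is to name the file after the rank instead. The sketch below assumes Open MPI, which sets the environment variable OMPI_COMM_WORLD_RANK for each process it launches. The helper name trace_output_name is made up for illustration, and it falls back to $$ when the variable is not set.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts): pick the strace
# output file name for this process.
trace_output_name() {
    # OMPI_COMM_WORLD_RANK is specific to Open MPI; other MPI libraries use
    # different variables, so fall back to the PID of this shell ($$).
    local tag="${OMPI_COMM_WORLD_RANK:-$$}"
    printf '/tmp/strace.out.%s\n' "$tag"
}

# Show the name this process would use.
echo "strace output would go to: $(trace_output_name)"
```

In code1.sh, the argument to -o could then be written as "$(trace_output_name)" in place of /tmp/strace.out.$$.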

The second bit of bash knowledge is the option $@ at the end of the script. This is a predefined bash variable that contains all of the command line options after the script code1.sh. These are the command-line arguments for the actual executable. $@ will contain arg1, arg2, arg3, and so on. It’s important to make sure you understand how to use $@. So let’s look at a really quick example.

There is an I/O benchmark called IOR from Lawrence Livermore Labs that has a number of arguments you can pass to the code that describe the details of how to run the benchmark. Here’s an example:

IOR -r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o <file location>

where IOR is the name of the executable. Don’t worry about what all of the options mean, but let me point out one option. The option “-N 4” tells the code to use four MPI processes. You can change the value of 4 to correspond to what the scheduler defines. Now how do we pass these arguments to the script that actually runs the code?

Sticking with the IOR example the main.sh script would look like the following:

#!/bin/bash
mpirun -machinefile ./MACHINEFILE -np 4 /home/laytonj/TESTING/code1.sh \
-r -w -a MPIIO -b 25m -N 4 -s 25 -t 10m -v -o <file location>

Notice how I’ve taken the command-line arguments and put them in the main.sh script. With the $@ bash predefined variable in the code script (code1.sh), the options from the main script are passed to the code script. The code script doesn’t change at all (except for the name of the binary):

#!/bin/bash
/usr/bin/strace -T -ttt -o /tmp/strace.out.$$ /home/laytonj/TESTING/IOR $@

The only thing that changed was the name of the binary from code1 to IOR. So if you want to change the arguments to a code you have to modify the main script. Even if your code doesn’t have any command-line arguments I would recommend just leaving $@ in the code for future reference.
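One caveat about $@: unquoted, bash splits any argument that contains spaces into several words before handing it to the executable. The sketch below is a stand-alone demonstration of the difference between $@ and "$@". It uses a hypothetical count_args helper and made-up arguments, and it does not run strace or MPI.

```shell
#!/bin/bash
# Hypothetical helper: report how many arguments it received.
count_args() {
    echo "$#"
}

# Stand-ins for code1.sh forwarding its arguments in two ways.
forward_unquoted() {
    count_args $@      # word splitting applies to each argument
}
forward_quoted() {
    count_args "$@"    # each argument is passed through unchanged
}

# Two arguments; the second contains a space (made-up file name).
unquoted=$(forward_unquoted -o "/tmp/my file")
quoted=$(forward_quoted -o "/tmp/my file")

echo "unquoted: $unquoted arguments"   # prints "unquoted: 3 arguments"
echo "quoted:   $quoted arguments"     # prints "quoted:   2 arguments"
```

With that in mind, a more defensive last line for code1.sh would end in "$@" rather than $@. For arguments without spaces, such as the IOR options above, both forms behave the same.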

Just a quick note here; Brian Mueller from Panasas was the bash script expert who taught me the “bash-fu” (thanks Brian!).


Comments on "IO Profiling of Applications: MPI Apps"

rhysu

Your use of an intermediate shell script to kick off your code gives undefined behavior according to the MPI-2 standard section 5.3.2: Starting Processes and Establishing Communication:

MPI does not say what happens if the program you start is a shell script and that shell script starts a program that calls MPI_INIT. Though some implementations may allow you to do this, they may also have restrictions, such as requiring that arguments supplied to the shell script be supplied to the program, or requiring that certain parts of the environment not be changed.

It’ll probably work just fine on many implementations, but there’s always the chance it’ll bomb outright. Or bomb in subtle ways.

tjrob

NOTE: When using $@, it should normally be enclosed in double quotes: "$@". With the quotes, bash will quote each individual argument when expanding the list of arguments. As you wrote it, without the quotes, any script arguments containing spaces will be improperly parsed for the executable; with the double-quotes all is well.

jsquyres

FWIW, you can use /tmp/strace.out.$OMPI_COMM_WORLD_RANK. This is an Open MPI-specific way to know which MPI process you are — each process will have a unique value of $OMPI_COMM_WORLD_RANK in the range [0,N).

This might be slightly more friendly than a random PID number.

laytonjb

@rhysu – didn’t know that. So far I’ve used this approach with Open MPI, Intel MPI, HP MPI (before it was sold), Platform MPI, MVAPICH, and MPICH2. So pretty much all of them worked at some point. But as you point out there are no guarantees.

Jeff S. – any comment on why that section is in the MPI-2 standard?

@tjrob – thanks for the bash-fu. It shows my lack of serious bash skills :) But to make sure I understand: if I have a bunch of arguments separated by spaces, then it gets expanded incorrectly for the executable? I’ve run codes before using this method and the arguments had spaces – so I’m wondering if the executable just parsed the arguments correctly by chance or if there is something else going on.

@jsquyres – Good point. It would also be good to echo that env variable to the output file so that we know how the PID-files match to the ranks.

Thanks!

Jeff

rpnabar

@tjrob’s comment:

Use double quotes to prevent word splitting. An argument enclosed in double quotes presents itself as a single word, even if it contains whitespace separators.

More here: http://tldp.org/LDP/abs/html/quotingvar.html
