HPC cluster optimization is often simple. Avoiding assumptions is hard.
Last month, when I began writing about optimization, I assumed I could cover the high points in a few pages of text. As I began writing, though, I realized my prose would exceed my monthly quota, and I still had plenty more to talk about. My intention was to provide an overview of some very complex topics, and provide enough key works and context to let you Google yourself to the level of optimization master. This month, I attempt to neatly finish this important topic by looking at some knobs and switches that you may want to try with your applications. But first, a parable.
There was man who had three sons. Two of the sons went off and had nothing to do with computers, so that was the end of them. The third son, decided to trade a donkey and about $600 for a small Intel Core-Duo system to investigate the art of multi-core. Along the way, he also included a new 250 GB SATA disk drive to add rapid storage and retrieval, as he was told by the pharisees.
When the time came to use the SATA drive, the son assumed that everything would just work, which it did. Alas, when he saw the actually performance was 3.91 MB/sec, he was troubled and wondered if he could get the donkey back. Upon hearing of his sadness, a blind man said, “Did you check your BIOS?” The son immediately set out to check his BIOS only to find that safe Legacy mode was the default. Upon returning to his father he said, “I’m now getting 64.20 MB/sec out of my SATA drive! ” His father replied, “Where the hell is the donkey?”
The first lesson is clear, don’t trade livestock for computers. The second lesson is a bit more subtle: Just because a system works, it may not be working well. And, in the parable above, things were working 16 times slower than they should have been.
Forget the Past
When I think back about my experiences with cluster performance, I conclude the biggest impediment to better performance is assumptions. I don’t really care to count the number of times I just assumed something was working optimally because it was working. In addition, there are what I call folklore assumptions based on anecdotal evidence or notions that seem to or indeed may have held some truth at some moment in time. Folklore assumptions are the worst kind of assumptions, because such errant convictions often get ingrained in the common wisdom.
For example, I recall one HPC user, adept at Unix RISC, who was stunned that a Pentium II performed better than his SGI box. Conventional wisdom told him that the x86 was a slow CISC chip, while the MIPS was a fast RISC chip â€” the latter was simply faster, no contest. At one point in time, his notion was certainly correct. Yet, when he actually ran his application, he was surprised by the results and even thought something was wrong with the Pentium II system.
Shared Memory and TCP
Another notion floating around is that multi-core MPI communication via shared memory is much faster than TCP on the same processor. This hypothesis certainly sounds reasonable and logical, which means it’s veracity must be tested.
Last year, I ran the NAS parallel benchmark (Size B) on some Intel dual-core Pentium D systems. I used eight one-processor nodes connected by Gigabit Ethenet (GigE). yielding a total of sixteen cores. I also used LAM/MPI version 7.1.1. Like most MPIs, LAM has the option to use memory-to-memory copying when two processes are on the same node. I ran the test suite two ways, first with just TCP communication, and then with TCP+ SYSV (System V- style style memory transfers).
Table One lists the winning method for sixteen MPI processes. There is no clear best method, and most interesting, the results were very close to each other. This outcome wasn’t unique, either. I have seen these kinds of results for other Intel multi-core systems. I don’t know the exact reason for this behavior, but I would have assumed (wrongly) that shared memory always trumps TCP communications.
TABLE ONE: The winning LAM/MPI communication method for eight dual-core nodes
Building Your Application
Believe it or not, there are a lot of factors that affect how your application runs on a cluster. Before examining these runtime aspects, there are two areas worth considering: the compiler, and the libraries used by your application.
Using a commercial compiler can often increase your single process performance. Indeed, there are enough option combinations that testing all of them could take years. Plus, option settings intended to increase performance may actually decrease performance. Just because an optimization worked on another code, it may not work for yours.
In general, the current versions of gcc and gfortran provide good performance. (However, g77 is no longer part of the GNU tool chain and has been replaced by the almost Fortran 95-compliant gfortran. Furthermore, gfortran is not g95. gfortran is the Fortran 95 compiler that is part of the GCC tool set, and g95 is another Fortran 95 compiler, which is based on GCC.)
In the past, g77 had some limitations, such as parsing all the non-standard, but omnipresent vendor additions to the Fortran 77 standard. With the advent of gfortran, many codes that would not compile with g77 now run happily on clusters. The performance of gfortran is actually better than g77 and may be adequate for many users. In the past, the limited scope and performance of g77 almost mandated that a cluster user purchase a commercial compiler suite to build fast Fortran codes. In some cases this may be still be true, but gfortran has closed the gap.
Of course, most commercial compilers offer free trial versions, so testing your code with these compilers is often well worth the effort. One interesting resource is Polyhedron Software’s Fortran benchmark page, where many of the popular Fortran compilers are compared across sixteen benchmarks. If you look at this page, you will notice that there is no “best at everything” compiler and the right/wrong compiler choice may increase/decrease performance by a factor of two or more in some cases.
Now that you plan to spend four weeks testing every Fortran compiler available, it may be just as important to consider where your program spends most of it’s time. This task can be done using a profiler or just looking at the code.
For example. the famous High Performance Linpack (HPL) shows little sensitivity to the type of compiler. Big performance differences come from the choice of Basic Linear Algebra (BLAS) libraries used. If you use the standard but un-optimized BLAS libraries from Netlib, you may not be pleased with the performance of your new cluster. On the other hand, if you use the Netlib Automatically Tuned Linear Algebra Software (ATLAS) package built specifically for your CPU architecture, your frown may turn upside down (after you find the magic
NB size for your cluster). Other options for BLAS libraries include the famous, hand-tuned GOTO libraries.
Both AMD and Intel have math libraries (including BLAS) for their processors as well. If you use linear algebra, FFTs, or random number generators, check into these packages. The AMD Core Math Library (ACML) is freely available in binary form. In general, before you write any new code, look for an existing and optimized math library for your application. Chances are that someone has put a large amount of time into optimizing such a library â€” that is unless you are using some newly invented mathematics or are convinced that you write code far better than the guys at NASA. Let me know how that works out.
One of the more important library decisions is the choice of MPI. There are currently more MPI library packages than I have fingers and toes, so I’ll focus on the open source variants. The popular packages include MPICH, MPICH2, LAM/MPI, and Open MPI
The Genealogy of MPI
Back in the day, there were primarily two “MPI trees” in use: the MPICH tree developed at Argonne National Lab (ANL), and the LAM tree originally developed at Ohio State, which then moved to Notre Dame, and finally is growing comfortably at Indiana University. While these two tools continue to be very useful the MPI movers and shakers have been working on follow-on versions. ANL has introduced MPICH2, which was a virtual re-write and now supports the MPI2 specification. Open MPI is a collaborative effort by the FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI teams that has all kinds of new goodness as well (including MPI2 support). In terms of the future, most of the effort will be going into MPICH2 and Open MPI, so getting too comfortable with MPICH and LAM is not the best strategy.
Since, MPI is a “standard”, it is usually not to difficult to swap out one MPI library for another and rebuild you code. Thus, testing various MPIs is not that difficult (in theory), as most MPIs include “wrapper” functions to make sure all the required paths and options are set. Determining which MPI results in better performance is not so simple, and it often is determined by how the codes are run on the cluster.
However, I have been using both MPICH and LAM/MPI for over ten years, and I can say with some certainty that, like Fortran compilers, there is no clear MPI winner. I doubt this situation will change with the new MPI versions.
Running Your Code
While MPI provides a portable way to write parallel codes, there is no specification how an MPI implementation is to run codes. Indeed, one of the complaints about MPI is the fact the the user must specify the number of processors (or processes) to run in parallel using the mpirun command (or in newer versions the mpiexec command).
One assumption to check is that mpirun uses the fastest interconnect on your cluster. If your cluster uses GigE exclusively, then this assumption is probably safe. If on the other hand, you are using Myrinet or Infiniband, you may need a specific MPI version. MPI does not natively know or care what is moving the messages between processes. Obviously, if your applications responds to a fast interconnect, accidentally running over GigE will show reduced run times. With other applications it may not be as clear.
There are some other issues that the mpirun command needs to sort out as well. For instance, how to start processes on the remote nodes? How does the user specify the nodes to use? And how are processes mapped on processors?
While the first two issues are certainly MPI-dependent, neither affects performance. The third issue can drastically effect performance, however. Some further background is helpful at this point.
Adroit readers will notice that when I talk about MPI on clusters, I often talk about processes and not processors. The reason for this careful wording is that MPI on clusters has no real concept of processors. The mpirun command starts processes on a given host. If the host is the same for each process, all your processes are started on the same host. (An often forgotten aspect of this is that simple MPI codes can be run and debugged on a single workstation or laptop.)
Most MPI’s try to map a number of processes (the
â€“np argument) to a list of nodes. mpirun simply moves down the list starting a process on each node. If it gets to the bottom of the list and it still has more processes to map, it goes right back to the top and starts another process on the first node.
Obviously, if mpirun has less processes than nodes, it doesn’t use the extra nodes. In addition almost all MPIs have an option to specify how many processors (or cores) are on each node. In general, by specifying the core count for the node, mpirun starts that many processes per node. For example if you have four cores per node, and tell mpirun you need 16 processes, mpirun starts four processes on each node.
The end user has the option to make some of these decisions, but know the consequences. Is it best to run my code in a dispersed (on process per node) or in packed (multi-process per node) configuration? Presumably, if you run two or more MPI processes on the same node, the lot will use shared memory for sending message instead of TCP. That’s a safe assumption, right? The problem is that now you have have two or more processes that may need to share the interconnect on the node. In addition, all the MPI processes probably have the same memory access patterns, thus putting extra contention on the memory sub-system.
If I had one process per node, at least I might have less interconnect traffic memory usage on the node with which to contend. That is if the processes sharing the node with me are good neighbors. If they are interconnect and memory hogs, it may be better to pack my job in a smaller number of nodes.
This situation was tolerable in old dual processor (single core days). But now in the multi-core world, things aren’t so clear. Indeed, other issues now surface. If I have dual socket, dual core node (a total of four cores), is it best to run one, two, or four processes per node. And, if I run two processes, is it best to place them on two separate processors?
If you are noticing that there are more questions than answers, you now have a grasp of optimization in the multi-core age. As I have shown in past columns, on Intel platforms at least, random process placement can result in large performance variations.
Now the really good news: Other than Open MPI, none of the MPIs mentioned have any control over where a process gets placed on a node. The decision as to what core gets what process is up to Linux. Open MPI provides some naive process affinity, but right now, the performance you get on a multi-core system can be more a matter of luck than cunning programing and optimization. Now that is a comforting thought. As I understand it, Intel’s commercial MPI does optimal placement of processes. If you are interested in a little more certainty, this may be a way to proceed.
While the current situation sounds a bit hard to tame, the solution is to play with a few of these parameters and not assume that just because your code runs, it is running optimally. Many of the mpirun changes are extremely simple to try.
And That’s Not All
Once again, I’ve run out of room and there is plenty more to talk about. One important point to remember is that newer MPI versions are almost like compilers when it comes to running the codes. A few moments with the man pages is quite helpful in most cases.
In the case of Open MPI, one of the design goals was to have a totally modular system so that the type of interconnect could be changed at run-time without recompiling the code. As such, there are a multitude of switches, knobs, and buttons, available for Open MPI.
At this point, I realize that with all the possible tweaks, optimizing your code sounds like a lot of work. Well, it can be and, in my opinion, that is why optimization is not done very often. One of the important issues to determine is how much improvement is good enough.
I assume that like many things, the “80/20″ rule applies to optimization.– that is, 80 percent of the performance comes from 20 percent of the effort, and the last 20 percent improvement will take 80 percent of the effort. Of course like all good, time-tested assumptions, this one is ripe for testing.