In June of 2006 the Sandia National Laboratoriesâ€™ 8960-processor Thunderbird Linux cluster was number 6 on the Top500 list. The system logged 38.3 TFLOPS and I'm guessing dimmed the lights a bit when it was running. Skip ahead to November 2006, and while there was some reshuffling of the contenders, Thunderbird managed to stay at the number 6 spot by achieving 53 TFLOPS. What happened?
In June of 2006 the Sandia National Laboratoriesâ€™ 8960-processor Thunderbird Linux cluster was number 6 on the Top500 list. The system logged 38.3 TFLOPS and I’m guessing dimmed the lights a bit when it was running. Skip ahead to November 2006, and while there was some reshuffling of the contenders, Thunderbird managed to stay at the number 6 spot by achieving 53 TFLOPS. What happened?
Of course, one might assume a processor or memory upgrade, but the hardware was the same. Software optimization is really what happened. By replacing and tuning the Infiniband stack they were able to achieve a 38% increase in performance! Standard Top500 disclaimers apply, but in this case, they were really competing against themselves.
In the June 2006 run they achieved, 4.27 GFLOPS per processor, by optimizing their system they were able to add 14.70 TFLOPS which was like adding 3,443 processors to the cluster. In other words, had they not optimized, they would need close to thirty five hundred additional processors to archive 53 TFLOPS. At an average cost of $3000 for a dual processor node, that works out to over $5 million dollars. I’m thinking somebody should be getting a pat on the back over at Sandia.
There is another aspect to this optimization thing that is often not considered in such success stories. That is power and cooling. Most would agree that we now live in Warming Epoch and as good citizens who should be concerned about our planet.
The average dual-socket cluster node currently requires around 300 watts of power. By finding 3443 “extra” processors, Sandia was able to reduce the power needed to archive 53 TFLOPS by 500 Kilowatts.
Of course, had they not optimized, they probably would not buy the extra nodes to maintain their spot on the list, thus spending more money and using more power, but the cluster is now more efficient In terms of real work (throughput) and FLOPS/watt they are much better than they were in the past. The lesson is clear — optimization can make a big difference.
Where to Begin
Of course everyone want to optimize their cluster applications, not for the Top500, but rather because that is what HPC computer geeks do. After all, that is why we are here. The problem is optimization can take place at many levels. In particular, there are often stellar optimizations that occur at the program level. These often include algorithm changes, careful measurements, and general head scratching about exactly where your code may be bottle-necked.
Unfortunately, these types of optimizations tend to be application specific and are not easily discussed within the confines single article. Besides, there are some simpler things we can try first because in addition to wanting to blazing fast optimizations we also want the quick and easy approach like a -O6 option that works for cluster codes.
Maybe someday there will be such and optimization argument, but today we need to roll up our sleeves and tweak our cluster engines by hand. Fortunately, there are some things that are easily changed like compiler options, NFS and kernel options, MPI versions, and even Ethernet parameters. And of course, multi-core adds another set of variables that can be changed.
Drawing a Baseline
Optimization implies increased performance. Therefore we need some kind of baseline with which to measure progress. Keep in mind, that not every change we can make will result in improvement and in some cases changes may degrade performance. In addition, there may be some routine upgrades that may negatively effect performance as well. If there is no “baseline” then we may not know this. A baseline benchmark should be performed when the cluster is first put into service. Preferably, manufacturers/vendors should provide such data as well. As a cluster administrator it will serve you well to have a “day one” snapshot of cluster performance. In the event of something changing or something breaking, reference data is easily at hand.
Open Benchmarks are Preferred
As a good open source citizen, I prefer source code with benchmarks. Not only does it let me compare different compile options, but it also assures me that there is no slight of hand going on behind the scenes. The computer industry is replete with new binary only benchmarks that seem to do extremely well on newly released products. Of course, open benchmarks are an invitation to change code, but as stated, our goal is to stay away from code changes and focus on what we can change with little effort.
If you had followed my Intel multi-core saga over the last 6 months, you recall that at one point I ran some tests five times and reported the standard deviation because the variation was so large. In general, one of the biggest mistakes one can is to run a benchmark (or application) one time and then take the performance as though it were written in stone. In the past, when there was one core, one memory bank, one interconnect, and one hard drive, such assumptions were a bit more safe. In the age of multi-core such assumptions may not hold. To the Linux operating system, multi-core designs by both Intel and AMD look like homogeneous processors. There are no assumptions about memory locality or pathways. Where a process is placed by the OS can have a significant impact on performance.
Recording multiple runs is time consuming and if you are investigating the variation of some optimization parameter, then benchmarking can require large investment in time. Collecting results and placing them in a spread sheet (and using the built-in standard deviation function) takes additional time. There is, therefore, a return on investment consideration that comes into play. If I have to invest 5 days of careful benchmarking to get a 5% performance increase on a program I run once a month, then maybe it is not worth it. If on the other hand, I can get a 10% performance increase and I need to run several hundred times over the next year, then it may be worth the effort.
A Benchmark Tool Bag
Now that we have some ground rules set, it is time to talk about benchmarks. First, it is important to break benchmarks into categories. To make things easy I like to use micro-benchmarks and macro-benchmarks. Micro-benchmarks usually measure a single aspect for the system (i.e. disk drive, memory, or network performance). In general these benchmarks are constructed to measure some aspect of the system. Macro-benchmarks, on the other hand are usually based on on a real application. Sometimes macro benchmarks can contain just the computational portion of a real program, but still are derived from real applications.
Second, the best macro-benchmark is your application (or suite of applications). Often time you have an intimate knowledge and feel for how the application should run.
In terms of micro benchmarks there are several programs of which I use regularly. These include, bonnie++ for hard drive performance, stream for memory performance, netperf for general network performance, netpipe for detailed network performance, unixbench for general Unix benchmarks, and lmbench for micro Linux benchmarks. (See the Sidebar for download information.)
My favorite macro-benchmarks are the NAS Suite (NAS parallel tests) and Gromacs. The NAS suite is actually eight benchmark kernels representing real application areas. All the programs except one are written in Fortran 77 (the exception is written in C). They can be compiled with almost any Fortran compiler (included g77 and the more recent gfortran) Gromacs is a real application. There is even a Gromacs benchmark suite available. Running Gromacs takes some setup work but it does stress both the floating point and networking aspects of your cluster. The NAS suite and Gromacs use MPI as well.
Obtaining the Benchmark Programs
Fiddling About with NFS Thank you for your patience.
The previous ground work is important because in the end we want to be right — right? By now the question on everybody’s mind must be, “Great stuff, macro, micro, who cares, where is the stuff we can change ?” I’m glad you asked. There are many areas where we can find “knobs to turn.” We will not go into great depth due to space limitations, so if you want to further investigate a specific topic use man and your friend Google.
First, let’s consider the OS because these parameters can effect all applications. One of the first things we check is NFS. If your cluster uses NFS, then there are some things worth investigation. While NFS optimization is a big subject, I’ll mention some things that are important for clusters.
The underlying file system can make a big difference. Most systems use the default ext3 file system but other file systems have shown better performance Also, if you use ext3, investigate the “data=journal” mode as well – SGR (Some Googling Required). In terms of NFS options, the most common parameters are the data block sizes (rsize and wsize). To examine NFS behavior, there is the nfsstat utility. Another variable is the sync or async which are set in the /etc/exports file on the NFS server. The async option yields better performance because it does not wait for the data to be written to the local hard drives, but it is more dangerous in the event of a problem on the host. If you want to live further away from the edge, then use sync.
Another important setting is the number of NFS daemons to run. There is no hard rule, but the more NFS activity the more daemons will help. A general rule is the number of processors (cores)/2 for heavy I/O and any number less for less loaded systems. The default is normally 8 daemons. There are other NFS tweaks, including the network packet size and choosing between TCP/UDP protocols, but for now we will take a look at some network options.
Many people are often surprised to learn that Ethernet has quite a few setting with which one can fiddle. I assume this is due to the fact the Ethernet has always been a plug and play kind of service. One of the easiest to change is the MTU size. The default MTU size is 1500 bytes (The MTU is the low level packet size with which Ethernet will send data). This packet sized worked well for standard and Fast Ethernet, but it may create unwanted interrupt load on the CPU with Gigabit Ethernet. This value is easily changed using ifconfig. For instance if you network card and switch supports a larger MTU size:
ifconfig eth1 down
ifconfig eth1 up mtu 4000
Will change the MTU from the default 1500 bytes to 4000 for eth1. The MTU must be changed on both ends and the switch or router must support these packet sizes. You can find out the current MTU size for your network card by running ifconfig eth1|grep MTU or running tracepath utility.
Be aware that large MTU sizes do not always provide better performance for cluster applications. If you share NFS and MPI communication over the same network, and your application sends small 100 byte messages, then the large MTU will add extra overhead to your packets.
On the hardware end, the old (but standard) Ethernet packet size is small and thus the makers of Gigabit Ethernet chip sets (e.g. Intel, Broadcom) often offer some options to help reduce the number of interrupts produced by the network card. In the simple case, a packet arrives to the network card, an interrupt is generated by the card, the CPU services the interrupt and goes back to work. As mentioned, with Gigabit Ethernet, the number of interrupts could easily swamp the processor. Intel and Broadcom have introduced timers as way to reduce the number of interrupts. The basic idea is to coalesce the packets, only issuing interrupts when the timer expires. For instance, Intel chipsets have an Interrupt Throttle Rate variable that is set by default to 8000 interrupts per second on most Linux systems. Obviously this has a definite impact on small message latency. When running Netpipe tests the single byte latency can drop from 65 microseconds to 29 microseconds by turning th! is feature off. Interestingly not all applications benefit from this change and some run much better when interrupts are throttled indicating that packet size and the number of messages per second may play a role in performance. You can set some of the options when the card module loaded in /etc/modprobe.conf for the Intel drivers or more conveniently using ethtool with the Broadcom drivers.
It is also worth investigating TCP parameters. There is a nice TCP Tuning Guide from Berkley Labs that can assist in tuning these parameters. All TCP implementations uses buffers for the transfers between nodes and the buffer sizes can effect performance. In newer Linux kernels, the kernel will auto-tune the buffer size based on the communication pattern it encounters. Even with this auto-tuning you probably want to increase the maximum tunable size of the buffers. You can effect the changes by simply running a sysctl -p.
One final thought about Ethernet. I have seen a number of clusters that pay handsomely for the latest compute nodes but skimp on the Gigabit switch. A cheap switch is, well, a cheap switch and will probably leave quite a “few processors on the table.” Not all switches support large MTU (often called Jumbo packets) and some have been known to “fall down” when fully loaded. I have seen application performance increase by 25% by simply using a better switch. Managed switches do not always imply better performance and Jumbo frames are now available in almost any size switch, if you read the specification close enough. A simple test to see how well your switch works for a point-to-point communication is to run netpipe between two nodes, then replace the switch with a cross-over cable and rerun the test. Due diligence with you Ethernet switch will pay off handsomely.
Like quantum mechanics, no one really understands multi-core/multi-node programming. There are some things you can do with multi-core now, but the story is still developing. For Intel systems, you will want to make sure you have Intel processor throttling turned off (i.e. set to performance mode). In addition, you may want to investigate the taskset utility and be aware of the Portable Linux Processor Affinity library. Finally, to see the load on all your cores using top press 1.
Watch Those Assumptions
When I think back about my experiences with cluster performance, I have found that the biggest impediment to better performance are assumptions. I don’t really care to count the number of times I just assumed something was working optimally because it was working. In addition, there are what I call folklore assumptions based on anecdotal evidence or notions that seem to have at one point hold some truth. These are the worst kind of assumptions because they often get ingrained in the common wisdom. Which is why running benchmarks to confirm assumptions is important.
I recall that in one case an HPC user who grew up on the UNIX RISC side of town was stunned that an Pentium II was performing better than his SGI box. Conventional wisdom told him that X86 was slow a CISC chip and MIPS was fast RISC chip, no contest. At one point in time, this was certainly true. Yet, when he actually ran his application, he was surprised by the results and even thought something was wrong with the Pentium II system.
Shared Memory and TCP
Another notion floating around is that multi-core MPI communication by shared memory is much faster than TCP on the same processor. Certainly sounds reasonable and logical, which means this type of assumption needs to be tested.
Previously I ran the NAS parallel benchmark (Size B) on some Intel dual-core Pentium D systems. I used eight one processor nodes connected by GigE. There were then 2 cores per node for a total of 16 cores. I also used LAM/MPI version 7.1.1. Like most MPIs, LAM has the option to use memory-to-memory copying when two processes are on the same node. I ran the test suite two ways, first with just TCP communication, and then with TCP+SYSV (System V style memory transfers) Table One lists the winning method for 16 MPI processes. There is no clear best method and most interestingly, the results were all very close to each other.
Table One: The winning LAM/MPI communication method for 8 dual-core nodes
The results were not unique. I have seen these kinds of results for other Intel multi-core systems. (Notice how I was careful not to generalize to all multi-core processors.) I don’t know the exact reason for this behavior, but I would have assumed that shared memory always trumps TCP communications.
Building Your Application
Believe it or not there are a lot of variables that control how your application is run on a cluster. Before we get to these run-time aspects, there are two areas worth considering; the compiler, and the libraries used by your application.
Using a commercial compiler can often increase your single process performance. Indeed, there are enough option combinations that truly testing all of them could take years. Plus, option settings intended to increase performance may actually decrease performance. Just because an optimization worked on someone else code, do not assume it will work on yours.
In general, the current gcc and gfortran provide good performance. Note that g77 is no longer part of the GNU tool chain and has been replaced by the almost Fortran 95 compliant gfortran. Furthermore, gfortran is not g95. gfortran is the Fortran 95 compiler that is part of the GCC tool set and g95 is another Fortran 95 compiler, which is based on GCC. Got that.
In the past, g77 had some limitations such as parsing all the non-standard, but omnipresent vendor additions to the Fortran 77 standard. With the advent of gfortran, many codes that would not compile with g77 are now running happily on clusters. The performance of gfortran is actually better than g77 and may be adequate for many users. In the past, the limited scope and performance of g77 almost mandated that a cluster user purchase a commercial compiler suite to build fast Fortran codes. In some cases this may be still be true, but gfortran has closed the gap. Of course, most of the commercial compilers have free trial versions, so testing your code with these compilers is often well worth the effort. One interesting resource is the Polyhedron Software Fortran benchmark page. Many of the popular Fortran compilers are compared across sixteen benchmarks. If you l! ook at this page, you will notice that there is no “best at everything” compiler and the right/wrong compiler choice may increase/decrease performance by a factor of two or more in some cases.
Now that you are plan to spend four weeks testing every Fortran compiler available, it may be just as important to consider where your program spends most of it’s time. This task can be done by using a profiler or just looking at the code. For instance the famous High Performance Linpack (HPL) show little sensitivity to the type of compiler. Big performance differences come from the choice of BLAS (Basic Linear Algebra) libraries used. If one were to use the standard but un-optimized BLAS libraries from Netlib you may not be pleased with the performance of you new cluster. On the other hand, if you were to use the Netlib ATLAS (Automatically Tuned Linear Algebra Software) package built specifically for your CPU architecture, that frown may just turn into a smile (of course that would be after you found the magic NB size for your cluster). Other options for BLAS libraries include the fa! mous hand tuned GOTO libraries (Check the license restrictions, however).
Both AMD and Intel have math libraries (including BLAS) for their processors as well. If you use things like linear algebra, FFT’s, or random number generators, then you may want to check out these packages. The AMD Core Math Library (ACML) is freely available in binary form. In general, before you write any new code, you may want to check to see if there is an existing and optimized math library the will work for you. Chances are that someone has put a large amount of time into optimizing such a library. That is, unless you are using some newly invented mathematics and are convinced that you write code far better than the guys at NASA. Let me know how that works out.
One of the more important library decisions is the choice of MPI. There are currently more MPI library packages than I have fingers and toes, so I’ll focus on the open source packages for now. The popular packages include, MPICH, MPICH2, LAM/MPI, and Open MPI. A bit of history will be helpful here. Back in the day, there were primarily two “MPI trees” in use. The MPICH tree developed at Argonne National Lab (ANL) and Mississippi State and the LAM tree originally developed at Ohio State, which then moved to Notre Dame, and finally is growing comfortably at Indiana University. While these two tools continue to be very useful the MPI movers and shakers have been working on follow-on versions. ANL has introduced MPICH2, which was a virtual re-write and now supports the MPI2 specification. Open MPI is a collaborative effort by the FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI teams that has all kinds of new goodness as well (including MPI2 support). In terms of the future, most of the ef! fort will be going into MPICH2 and Open MPI, so getting too comfortable with MPICH and LAM is not the best strategy.
Since, MPI is a “standard”, it is usually not to difficult to swap out one MPI library for another and rebuild you code. Thus, testing various MPIs is not that difficult (in theory) as most MPIs include “wrapper” functions to make sure all the required paths and options are set. Determining which MPI results in better performance is not so simple and it often is determined by how the codes are run on the cluster.
Before we take a look at how MPIs run codes, a point must be made. I have been using both MPICH and LAM/MPI for over ten years and I can say with some certainty that, like Fortran compilers, there is no clear MPI winner. I doubt this situation will change with the new MPI versions.
Running Your Code
While MPI provides a portable way to write parallel codes, there is no specification how an MPI is to run codes. Indeed, one of the complaints about MPI is the fact the the user must specify the number of processors (or processes) to run in parallel using the mpirun command (or in newer versions the mpiexec command).
One assumption to check is that the mpirun will be using the fastest interconnect on your cluster. If your cluster uses GigE exclusively, then this assumption is probably safe. If on the other hand, you are using Myrinet or Infiniband these often have a specific MPI version that supports their interconnect. MPI does not naively know or care what is moving the messages between processes. Obviously, if your applications responds to a fast interconnect, accidentally running over of GigE will show reduced run times. With other applications it may not be as clear.
There are some other issues that the mpirun command needs to sort out as well. For instance, how to start processes on the remote nodes? how does the user specify the nodes to use?, and how are processes mapped on processors?
While the first two issues are certainly MPI dependent they are not really going to effect performance. The third issue can drastically effect performance, however. Some further background is helpful at this point. Adroit readers will notice that when I talk about MPI on clusters, I often talk about processes and not processors. The reason for this careful wording is that MPI on clusters has no real concept of processors. The mpirun commands start processes on a given host. If the host is the same for each process, then all your processes will be started on the same host. (An often forgotten aspect of this is that simple MPI codes can be run and debugged on a single workstation or laptop.)
Most MPI’s will try to map a number of processes (the -np argument) the to a list of nodes. Unless you tell mpirun otherwise by adding processor information to the list, it will simply move down the list starting a process on each node. If it gets to the bottom of the list and it still has more processes to map, it goes right back to the top and starts another process on the first node.
Obviously if mpirun has less processes than nodes, it just does not use the extra nodes. In addition almost all MPIs have an option to specify how many processors (or cores) are on each node. In general, by specifying the core count for the node, mpirun will start that many processes per node. For example if you have four cores per node, and tell mpirun you need 16 processes, then mpirun will start four processes on each node.
The end user has the option to make some of these decisions, but know the consequences. Is it best to run my code in a dispersed (one process per node) or in packed (multi-process per node) configuration? Presumably if we run two or more MPI processes on the same node, they will use shared memory for sending message instead of TCP, and we all know that is safe assumption — just checking if you are paying attention. The problem is that now I will have two or more processes that may need to share the interconnect on the node. In addition all the MPI processes probably have the same memory access patterns thus putting extra contention on the memory sub-system.
If I had one process per node, then at least I might have less interconnect traffic memory usage on the node with which to contend. That is if the processes sharing the node with me are good neighbors. If they are interconnect and memory hogs then it may be better to pack my job in a smaller number of nodes.
This situation was tolerable in old dual processor (single core days). But, now in the multi-core world, things are not so clear. Indeed, other issues now surface. If I have dual socket, dual core node (total four cores), is it best to run one, two, or four processes per node. And, if I run two processes is it best to place them on two separate processors?
If you are noticing that there are more questions than answers then you now have a grasp of optimization in the multi-core age. On some quad-core Intel platforms at least, random process placement can result in large performance variations. Now the really good news, other than Open MPI, none of the MPIs mentioned have any control over where a process gets placed on a node. The decision as to what core gets what process is up to Linux. Open MPI provides some naive process affinity, but right now, the performance you get on a multi-core system, can be more a matter of luck than cunning programing and optimization. Now that is a comforting thought. As I understand Intel MPI (commercial) does optimal placement of processes so if you are interested in a little more certainty, then this may be a way to proceed.
While the current situation sounds a bit hard to tame, the solution is play with a few of these parameters and not assume that just because your code runs, it is running optimally. Many of the mpirun changes are extremely simple to try.
And That’s Not All
In closing, remember MPI versions are almost like compilers when it comes to running the codes — a few moments with the man pages will be quite helpful in most cases. In the case of Open MPI, one of the design goals was to have a totally modular system so that the type of interconnect could be changed at run-time without recompiling the code. As such, there are a multitude of switches, knobs, and buttons, available for Open MPI.
At this point, I realize that with all the possible tweaks, optimizing your code sounds like a lot of work. Well, it can be and, in my opinion, that is why optimization is not done very often. One of the important issues to determine is how much improvement is good enough. I assume that like many things, the 80-20 rule applies to optimization. That is 80% of the performance comes from 20% of the effort, and the last 20% improvement will take 80% of the effort. Of course like all good time tested assumptions, this one is ripe for testing.