Hybridizing MPI applications with cores and GP-GPUs. Is this a good idea?
Multi-core is on my mind again. I can’t help it. The other day, I was thinking about benchmarking and what I wrote about in Good Enough Will Have To Do. Then it hit me: a possible way out of the multi-core (and GP-GPU) quagmire. Before I reveal my somewhat obvious solution, I need to set the stage.
The typical MPI program is a collection of processes that communicate via messages. These processes can live on the same multi-core node, on another node, or a combination of both. Before multi-core, there were one or two processes per node. The user often had some control over where the MPI processes would go: either dispersed (one per node) or compact (two per node). And, more importantly, the user usually knew what arrangement worked best with his or her codes. With multi-core this has changed a bit.
The user is now pretty much at the mercy of wherever the processes land. Processor (or core) affinity can pin a process to a specific core, but it introduces another set of problems on shared nodes. In my opinion, such fine-grained control should not be the user's responsibility.
Let’s move on to GP-GPU and clusters. Even if you sweep the MPI/multi-core issues under the rug and just run on any available cores, there is the issue of GP-GPUs in clusters. How does one adapt an MPI code to a node with GP-GPUs? If you give each MPI process the ability to use a GP-GPU, then you need to make sure that the processes are balanced so that the GP-GPU resources (which can vary from cluster to cluster) are used effectively (e.g. if one node has two GP-GPU processes and another has six, then things are not balanced). Languages like CUDA and OpenCL do not address the cluster model.
Having thought about this issue, I believe there is a solution to this mess. It is not the best of solutions, but it is workable.
It came to me when I was thinking about running HPL on a cluster. (HPL, High Performance Linpack, is the program used to rank computers on the Top500 list.) When you run HPL on an eight-core node, you do not run eight MPI processes; you run one multi-threaded MPI process. Therefore, if I have a cluster with 128 nodes (each with eight cores), my MPI job has 128 processes (i.e. mpirun -np 128 …). I don't run 1024 MPI processes because the threaded implementation of HPL provides better performance. Then I got hit with the obvious stick: threads (OpenMP or OpenCL/CUDA) on nodes and messages between nodes could solve the problem.
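To make the process counts above concrete, here is a minimal Python sketch of the arithmetic. The function name launch_plan and its parameters are made up for illustration; it only counts ranks and threads and does not launch anything.

```python
def launch_plan(nodes, cores_per_node, hybrid):
    """Return (mpi_ranks, threads_per_rank) for a job that fills every core.

    Hypothetical helper for illustration only.
    """
    if hybrid:
        # One MPI rank per node; threads cover the cores inside the node.
        return nodes, cores_per_node
    # Pure MPI: one single-threaded rank per core.
    return nodes * cores_per_node, 1


for hybrid in (False, True):
    ranks, threads = launch_plan(nodes=128, cores_per_node=8, hybrid=hybrid)
    label = "hybrid" if hybrid else "pure MPI"
    print(f"{label}: {ranks} ranks x {threads} threads = {ranks * threads} cores")
```

Both layouts occupy the same 1024 cores; the hybrid one simply hands 128 ranks to MPI and leaves the eight cores per node to threads. With OpenMP, the per-rank thread count is commonly set through the OMP_NUM_THREADS environment variable.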
The idea is simple in concept. If there is one process per node, then it can manage the node's resources, whether they are cores or GP-GPUs. From a programming standpoint, the MPI structure of the code may not need to change. What would need to change are the inner loops, but it may not be that simple. If you take a look at Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems by Ashay Rane and Dan Stanzione of the Fulton High Performance Computing Initiative, Arizona State University, you can see some of the issues that are involved.
There are some drawbacks to this model. You are going to have to change some code, as there is no compiler option that does it for you. My guess is the changes are going to range from trivial to "rip your program apart and rewrite it." The goal would be to structure your program so that if the node had lots of cores, you could use them, and if the node had GP-GPUs available, you could use them as well. Programs would need to be recompiled or have some kind of run-time switch. A tall order for sure, but I don't see any other way out of this in the near term. Core counts will continue to increase and GP-GPUs will continue to show up on nodes, so perhaps this approach will provide a path forward.
Of course, you can always try simple things. For instance, the Portland Group compilers support NVIDIA GP-GPU parallelization. I am sure other compiler vendors are investigating this approach as well. Almost every compiler also supports OpenMP (including the GNU compilers). And it may not take much to try a few things with your codes, as adding a few pragmas is not that difficult. It is worth trying, because I assume if it does not work out, you will let me know.
I should also mention that I do not care for mixed models. It takes a lot of programming effort and it makes optimization much more difficult. If you think about the MPI communication between nodes, it only makes sense if all the parallel parts can be done faster than doing the work sequentially on one node. That is, the parallel computation time plus the communication overhead has to be less than the time to do it on a single node. In a straightforward MPI program, the inner loops are done on a single core. If there are now eight cores working on that chunk of code, the chunk you sent to a single core just got up to eight times faster, and maybe faster still if you are using a GP-GPU. From an MPI perspective, the communication overhead now carries a much heavier weight. (The communication/compute ratio is what determines MPI efficiency.) You can gather some data and do a "back of the envelope" calculation to see how this will play out with your code, but it is often easier to figure this out with some trial-and-error testing.
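A back-of-the-envelope version of that calculation might look like the Python sketch below. It is a toy model, not a measurement: the function names and the timings (8.0 s of compute and 1.0 s of messaging per step) are invented, and it assumes the inner loops scale perfectly with thread count while communication time stays the same.

```python
def step_time(compute_s, comm_s, threads):
    """Time for one step: threaded compute plus unchanged communication.

    Assumes perfect scaling of the inner loops, which is optimistic.
    """
    return compute_s / threads + comm_s


def comm_fraction(compute_s, comm_s, threads):
    """Share of each step spent communicating."""
    return comm_s / step_time(compute_s, comm_s, threads)


# Hypothetical timings per step: 8.0 s of compute on one core, 1.0 s of messaging.
compute_s, comm_s = 8.0, 1.0
for threads in (1, 8):
    t = step_time(compute_s, comm_s, threads)
    f = comm_fraction(compute_s, comm_s, threads)
    print(f"{threads} thread(s): step {t:.1f} s, communication {f:.0%}")
```

With these made-up numbers, communication goes from about 11% of each step on one core to 50% on eight, so the step gets 4.5 times faster rather than eight. Plugging in timings measured from your own code is the point of the exercise.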
In closing, I would be interested to know what you think about this idea. I am sure there are those who have actually tried such a thing and have some good insights. As I am fond of stating, one benchmark is worth 100 opinions, unless you write about HPC, in which case you can just throw ideas and opinions at the wall and see what sticks.
By the way, I'm still doing my low-frequency twittering, and I'm up to 106 followers. If you are one of the chosen few, stay tuned, because I may reveal some utterly useless, mundane part of my life.