I Have (HPC) Issues

There, I admit it. There are certain things that send me into long rants when it comes to High Performance Computing (HPC). (We’ll skip the non-HPC issues for now.) I’ll bet you have issues as well: those things that just bug you about the state of HPC clusters. Admit it, you do. There, don’t you feel better?

In a politically correct sense, I should call my issues challenges. A more positive spin is always helpful. From a marketing perspective, issues are often called pain points. Nothing positive there. Keep telling yourself price-performance has never been this good.

So what are my issues? In the past, many of my HPC issues were tolerable, and I was hopeful that with continued effort things would get better. Then multi-core arrived and threw everything into the blender. So no, I don’t just have issues, I have multi-issues.

My list starts with people. We need more people who understand this stuff. We need more domain experts (end users) who can adapt the established methods of cluster HPC to their problem areas. For instance, more cores in a single box may be a big win for some, while others running legacy applications will find those applications seemingly blind to the extra cores. To the domain expert, more cores should mean faster applications. In some cases, they may actually mean slower applications.

In addition to domain experts, we need people who can design and manage clusters effectively. Choosing a cluster should be no more difficult than choosing any other computer system. It is no longer enough to stack up nodes with some Ethernet, as a slow interconnect may leave cores starved for bandwidth.

And finally, we need people who can write programs for a multi-core cluster environment. In terms of challenges, this is the Mount Everest of issues. In the past, writing and optimizing parallel codes was not an easy task. As a matter of fact, it was, and still is, hard. A typical MPI approach to computing was based on the assumption: “I have a bunch of processors, each with its own private memory, connected together with some type of network.”
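
To make that assumption concrete, here is a minimal sketch in C of the message-passing model, assuming a standard MPI library (the file name mpi_pingpong.c and the value 42 are my own choices): each rank owns its data privately, and nothing moves between ranks unless it is sent explicitly over the network.

/* Traditional MPI model: each rank has its own private memory and
 * data moves only by explicit messages over the network.
 * Build and run (typical):  mpicc mpi_pingpong.c -o mpi_pingpong
 *                           mpirun -np 2 ./mpi_pingpong          */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* lives in rank 0's private memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d over the network\n", value);
    }

    MPI_Finalize();
    return 0;
}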

The new assumption goes something like this: “I have a collection of shared memory SMP (multi-core) islands connected together with some kind of network.” In essence, there are now two communication paths: local through memory, and distant through the network. Programming and optimizing for this new model is now much harder. There are hybrid approaches that employ threads or OpenMP on the SMP nodes and MPI between nodes, as sketched below. This approach requires two different conceptual models in the same application. I consider this somewhat painful, but then I do have issues.
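
For the curious, here is a minimal sketch of that hybrid style, assuming an MPI library built with thread support and an OpenMP-capable compiler (the summing loop is just a stand-in workload of my choosing): OpenMP threads handle the local path through shared memory, and MPI handles the distant path over the network.

/* Hybrid model: OpenMP threads within each multi-core (SMP) node,
 * MPI messages between nodes: two conceptual models in one program.
 * Build and run (typical):  mpicc -fopenmp hybrid.c -o hybrid
 *                           mpirun -np <nodes> ./hybrid           */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;
    long local_sum = 0, global_sum = 0;

    /* Ask MPI to tolerate threads; only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Local path: threads share this node's memory */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < 1000000; i++)
        local_sum += i % 7;

    /* Distant path: combine per-node results over the network */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %ld\n", global_sum);

    MPI_Finalize();
    return 0;
}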

The programming issue is complicated by the fact that no one knows the best way to program for this new multi-core cluster paradigm. The older, more traditional MPI approach was at least a known and somewhat mature, albeit arduous, methodology. With multi-core clusters, we don’t have that luxury.

In summary, my big issue is people. We need people who understand how to use this new HPC Lego that the market is giving us. (For those who are chronologically gifted, recall Tinkertoys or Erector Sets.) In addition, we need to figure out how to program these things in a cost-effective way. Indeed, not just the HPC crowd but everybody needs to grasp this, as the parallel approach to computing is here to stay.

My issues are, well, my issues. Unfortunately, I have more. For instance, concerns about power and cooling, cluster management, parallel I/O, interconnects, co-processors (GPGPUs, FPGAs, and the like), virtualization, and grid computing are all on the table and somewhere on my list as well.

Now it’s your turn. What are your issues? You can tell me and everyone else about your issues by checking out the new polls on the Today’s HPC Clusters site. Your response is anonymous, and you may find you are not alone in your pain, sorry, challenges. And while you are at it, take the Linux Magazine Data Center Infrastructure Survey.