<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: The HPC Software Conundrum</title>
	<atom:link href="http://www.linux-mag.com/id/7701/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.linux-mag.com/id/7701/</link>
	<description>Open Source, Open Standards</description>
	<lastBuildDate>Sat, 05 Oct 2013 13:48:18 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
	<item>
		<title>By: dmpase</title>
		<link>http://www.linux-mag.com/id/7701/#comment-7862</link>
		<dc:creator>dmpase</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7701/#comment-7862</guid>
		<description>&lt;p&gt;Having spent a fair amount of time trying to solve that problem myself, I have to wonder whether it is really possible. The technical requirements of any such programming model (or language) would be that: (1) it has to be sufficiently expressive to be useful for describing a rich variety of algorithms, (2) it must be translatable into highly optimized machine code for architectures that are very different from each other, especially in their performance characteristics and trade-offs. These requirements are very much in conflict with each other. &lt;/p&gt;
&lt;p&gt;I have seen (and designed languages for) functional, object-oriented, logic and procedural parallel models. About the most promising approach I have seen so far is a hybrid approach, where major computational elements are sewn together using a data-flow, or functional, or object-oriented high-level language. The individual computational elements can then be expressed in a language that best suits the architecture. &lt;/p&gt;
&lt;p&gt;For example, your program might read in a set of matrices, perform an FFT to translate the data from the time to the frequency domain, pass the results through several high-pass and low-pass filters, then pass those results through an inverse FFT to translate it back to the time domain.&lt;/p&gt;
&lt;p&gt;The high-level operations -- FFT-&gt;filters-&gt;inverse FFT -- are invariant. They, along with descriptions of your data, represent the semantic content of your program. The individual implementations of the FFT, etc., may vary from one system to the next to take advantage of specific hardware. Think of it as a really fast implementation of LAPACK or math.h if you like.&lt;/p&gt;
&lt;p&gt;The idea that you can specify the low level details of an algorithm (e.g., matrix inversion) in a machine independent form that can be executed efficiently on architectures as different as GPGPUs and large-scale distributed clusters and BlueGene and Larrabee and TMC&#039;s CM-1 (remember them?) is ... well ... a nice idea but, IMHO, not much more than that. In order for the compiler to have enough flexibility to make efficient choices for the architecture in question, the algorithm *must* be expressed at a fairly high level, the higher the better.
&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Having spent a fair amount of time trying to solve that problem myself, I have to wonder whether it is really possible. The technical requirements of any such programming model (or language) would be that: (1) it has to be sufficiently expressive to be useful for describing a rich variety of algorithms, (2) it must be translatable into highly optimized machine code for architectures that are very different from each other, especially in their performance characteristics and trade-offs. These requirements are very much in conflict with each other. </p>
<p>I have seen (and designed languages for) functional, object-oriented, logic and procedural parallel models. About the most promising approach I have seen so far is a hybrid approach, where major computational elements are sewn together using a data-flow, or functional, or object-oriented high-level language. The individual computational elements can then be expressed in a language that best suits the architecture. </p>
<p>For example, your program might read in a set of matrices, perform an FFT to translate the data from the time to the frequency domain, pass the results through several high-pass and low-pass filters, then pass those results through an inverse FFT to translate it back to the time domain.</p>
<p>The high-level operations &#8212; FFT-&gt;filters-&gt;inverse FFT &#8212; are invariant. They, along with descriptions of your data, represent the semantic content of your program. The individual implementations of the FFT, etc., may vary from one system to the next to take advantage of specific hardware. Think of it as a really fast implementation of LAPACK or math.h if you like.</p>
<p>The idea that you can specify the low level details of an algorithm (e.g., matrix inversion) in a machine independent form that can be executed efficiently on architectures as different as GPGPUs and large-scale distributed clusters and BlueGene and Larrabee and TMC&#8217;s CM-1 (remember them?) is &#8230; well &#8230; a nice idea but, IMHO, not much more than that. In order for the compiler to have enough flexibility to make efficient choices for the architecture in question, the algorithm *must* be expressed at a fairly high level, the higher the better.</p>
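<p>[Editor's sketch, not part of the original comment.] The hybrid structure described above can be made concrete with a toy pipeline: the composition of stages is the invariant semantic content, while each stage is a placeholder that an implementation could swap for an architecture-specific version. The naive pure-Python DFT below stands in for a tuned FFT; all function names are illustrative, not any real library's API.</p>

```python
import cmath

def dft(x):
    # Naive O(n^2) discrete Fourier transform -- a stand-in for whatever
    # tuned FFT the target architecture provides.
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(X):
    # Inverse transform; note the 1/n scaling.
    n = len(X)
    return [sum(X[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]

def low_pass(X, cutoff):
    # Keep only frequency bins below `cutoff` (and their conjugate mirrors).
    n = len(X)
    return [v if j < cutoff or j > n - cutoff else 0.0 for j, v in enumerate(X)]

def pipeline(signal, cutoff):
    # The invariant high-level structure: transform -> filter -> inverse
    # transform. Each stage could be replaced by a GPU, vector, or
    # distributed implementation without changing this composition.
    return idft(low_pass(dft(signal), cutoff))
```

The point of the sketch is that only `pipeline` carries the program's meaning; `dft`, `low_pass`, and `idft` are the per-architecture pieces.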
]]></content:encoded>
	</item>
	<item>
		<title>By: cgorac</title>
		<link>http://www.linux-mag.com/id/7701/#comment-7863</link>
		<dc:creator>cgorac</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7701/#comment-7863</guid>
	<description>&lt;p&gt;&quot;...until that kind of performance is generally available to the Joe Programmer, the extra hardware is, in a sense, superfluous.&quot; - Why would you think that?  HPC programming was never, and never will be, something Joe Programmer can handle; there will be libraries and similar tools built for him, but the core work will simply have to be done by competent people.  I also see no problem with having so many APIs available: I have used all four you mentioned, and a dozen more (anyone remember p4, the precursor of MPI?), I can say I liked each one, and I see the abundance of APIs simply as a natural stage in the evolution toward an eventual ultimate programming model (or models).  But until that point is reached, there are always clients who need their codes sped up, so instead of whining about the incompatibility of the current batch of tools, I prefer simply to enjoy coding.
&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>&#8220;&#8230;until that kind of performance is generally available to the Joe Programmer, the extra hardware is, in a sense, superfluous.&#8221; &#8211; Why would you think that?  HPC programming was never, and never will be, something Joe Programmer can handle; there will be libraries and similar tools built for him, but the core work will simply have to be done by competent people.  I also see no problem with having so many APIs available: I have used all four you mentioned, and a dozen more (anyone remember p4, the precursor of MPI?), I can say I liked each one, and I see the abundance of APIs simply as a natural stage in the evolution toward an eventual ultimate programming model (or models).  But until that point is reached, there are always clients who need their codes sped up, so instead of whining about the incompatibility of the current batch of tools, I prefer simply to enjoy coding.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dubuf</title>
		<link>http://www.linux-mag.com/id/7701/#comment-7864</link>
		<dc:creator>dubuf</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7701/#comment-7864</guid>
	<description>&lt;p&gt;I think it&#039;s a question of time. Hardware development comes first, and software development always lags behind. Remember SGI&#039;s C$DOACROSS? Ideal for the main loops but also for small ones (I vaguely remember a break-even point of 400 clock cycles or 100 multiplications). Now we&#039;ve got OpenMP with many features for fine-tuning. And The Portland Group already offers its Accelerator directives/pragmas for GP-GPUs like the Tesla, so there is no need to program at the CUDA level unless you want to squeeze the last drop out of the Tesla. And from what I&#039;ve read you can mix OpenMP with Accelerator directives. What I still don&#039;t understand is the lack of high-level tools for MPI (remember BERT?). Why is there no initiative to develop high-level directives that hide most if not all of the MPI details, with a set of pre-defined communication structures such that at least the most common applications can be parallelized in an hour or so? I myself invested a few weeks in my SPMDlib on top of MPI, which provides a basis for SPMDdir, a small directive set, but, of course, I am doing research in a specific area and don&#039;t have the time to develop generic tools. &lt;/p&gt;
&lt;p&gt;I really think that directives can solve most problems efficiently, with a top-down structure MPI-OpenMP-Accelerator, but the bottom-up link, from Accelerator to MPI, might be even more important in order to decide at the MPI level what the most efficient solution might be. If Accelerator is intelligent enough to do a break-even code analysis at the CUDA level, it should be possible to do the same at the MPI level, with or without the use of OpenMP and other tools. And now back to my deadline :-)
&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>I think it&#8217;s a question of time. Hardware development comes first, and software development always lags behind. Remember SGI&#8217;s C$DOACROSS? Ideal for the main loops but also for small ones (I vaguely remember a break-even point of 400 clock cycles or 100 multiplications). Now we&#8217;ve got OpenMP with many features for fine-tuning. And The Portland Group already offers its Accelerator directives/pragmas for GP-GPUs like the Tesla, so there is no need to program at the CUDA level unless you want to squeeze the last drop out of the Tesla. And from what I&#8217;ve read you can mix OpenMP with Accelerator directives. What I still don&#8217;t understand is the lack of high-level tools for MPI (remember BERT?). Why is there no initiative to develop high-level directives that hide most if not all of the MPI details, with a set of pre-defined communication structures such that at least the most common applications can be parallelized in an hour or so? I myself invested a few weeks in my SPMDlib on top of MPI, which provides a basis for SPMDdir, a small directive set, but, of course, I am doing research in a specific area and don&#8217;t have the time to develop generic tools. </p>
<p>I really think that directives can solve most problems efficiently, with a top-down structure MPI-OpenMP-Accelerator, but the bottom-up link, from Accelerator to MPI, might be even more important in order to decide at the MPI level what the most efficient solution might be. If Accelerator is intelligent enough to do a break-even code analysis at the CUDA level, it should be possible to do the same at the MPI level, with or without the use of OpenMP and other tools. And now back to my deadline :-)</p>
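<p>[Editor's sketch, not part of the original comment.] The break-even analysis mentioned above can be reduced to a toy cost model: offload a loop only when the accelerator's per-operation advantage outweighs the fixed data-transfer cost. The function and its constants below are purely illustrative, not any compiler's actual heuristic.</p>

```python
def worth_offloading(n_ops, host_ns_per_op, accel_ns_per_op, transfer_ns):
    # Estimated wall time if the loop stays on the host CPU.
    host_time = n_ops * host_ns_per_op
    # Estimated wall time on the accelerator, including transfer overhead.
    accel_time = n_ops * accel_ns_per_op + transfer_ns
    # Offload only past the break-even point.
    return accel_time < host_time

# A large loop amortizes the transfer cost; a small one does not.
worth_offloading(1_000_000, 1.0, 0.1, 100_000)  # True in this toy model
worth_offloading(100, 1.0, 0.1, 100_000)        # False: transfer dominates
```

A directive system working at the MPI level would need the same kind of estimate, with message latency and bandwidth in place of the PCIe transfer term.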
]]></content:encoded>
	</item>
	<item>
		<title>By: grdetil</title>
		<link>http://www.linux-mag.com/id/7701/#comment-7865</link>
		<dc:creator>grdetil</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7701/#comment-7865</guid>
	<description>&lt;p&gt;In response to the concerns expressed by dmpase, I have to wonder if we&#039;re not at the same sort of crossroads as we were in the transition from assembly language to high-level languages.  It wasn&#039;t that long ago that people thought it was unthinkable that compilers could produce code as tight and efficient as a programmer could do by hand when programming close to the hardware.  CPU architectures were so vastly different as to seem irreconcilable: special purpose registers vs general purpose, different bit and byte orders, different memory and cache architectures.  How could a compiler efficiently manage all those differences better than a human being, and still make loops tight enough to fully exploit the small instruction cache&#039;s locality of reference?  Well, compilers got a whole lot better at dealing with all that, and today very few programmers seriously consider programming so close to the hardware themselves that they need to worry about these details.  Certainly in the &#039;80s, though, there were a lot of die-hards who refused to program in anything but assembly language because they thought all high-level languages were resource hogs.&lt;/p&gt;
&lt;p&gt;Maybe in another decade or two, processor cores will be managed automatically by higher-level language compilers the way registers are now, and we won&#039;t care how many general or special purpose cores (or nodes!) a system has.  It will probably mean moving to a higher-level language than what&#039;s commonly used now, so we&#039;re not programming as close to the hardware as we are now, and leaving the details of what depends on what results and what can be done in parallel (and on which cores) to the compiler and/or OS.  Worrying about such details may seem as quaint as worrying about CPU registers today.  Of course, the big question is how to get to that point.  A lot of hard work went into making today&#039;s compilers as sophisticated and efficient as they are, and this next challenge seems even more daunting.  But I wouldn&#039;t doubt that it&#039;s possible.
&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>In response to the concerns expressed by dmpase, I have to wonder if we&#8217;re not at the same sort of crossroads as we were in the transition from assembly language to high-level languages.  It wasn&#8217;t that long ago that people thought it was unthinkable that compilers could produce code as tight and efficient as a programmer could do by hand when programming close to the hardware.  CPU architectures were so vastly different as to seem irreconcilable: special purpose registers vs general purpose, different bit and byte orders, different memory and cache architectures.  How could a compiler efficiently manage all those differences better than a human being, and still make loops tight enough to fully exploit the small instruction cache&#8217;s locality of reference?  Well, compilers got a whole lot better at dealing with all that, and today very few programmers seriously consider programming so close to the hardware themselves that they need to worry about these details.  Certainly in the &#8217;80s, though, there were a lot of die-hards who refused to program in anything but assembly language because they thought all high-level languages were resource hogs.</p>
<p>Maybe in another decade or two, processor cores will be managed automatically by higher-level language compilers the way registers are now, and we won&#8217;t care how many general or special purpose cores (or nodes!) a system has.  It will probably mean moving to a higher-level language than what&#8217;s commonly used now, so we&#8217;re not programming as close to the hardware as we are now, and leaving the details of what depends on what results and what can be done in parallel (and on which cores) to the compiler and/or OS.  Worrying about such details may seem as quaint as worrying about CPU registers today.  Of course, the big question is how to get to that point.  A lot of hard work went into making today&#8217;s compilers as sophisticated and efficient as they are, and this next challenge seems even more daunting.  But I wouldn&#8217;t doubt that it&#8217;s possible.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: truly64</title>
		<link>http://www.linux-mag.com/id/7701/#comment-7866</link>
		<dc:creator>truly64</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7701/#comment-7866</guid>
		<description>&lt;p&gt;In our 1600 core cluster, ~60% of the jobs use a single core. Researchers make a choice... they can spend months or years to take old working code and try to parallelize it, or they can run the old working code on a single processor, and stage a few hundred jobs on iterations of the data set.&lt;/p&gt;
&lt;p&gt;Given the cost of modifying working code, and the lack of talented programmers who know how to do so, many researchers prefer to just run lots of single-core jobs to get their work done. Sure, many of the savvy researchers with grant-based funding have gotten their code to scale on multiple cores, but none are even considering the daunting task of migrating to a GPU, even if they understood how to do it.&lt;/p&gt;
&lt;p&gt;So your observations are very correct. In reality, very large supercomputers, like those at Los Alamos and Livermore, are very efficient at generating heat, but very inefficient at generating results due to the enormous programming effort required to take advantage of such scale.
&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>In our 1600 core cluster, ~60% of the jobs use a single core. Researchers make a choice&#8230; they can spend months or years to take old working code and try to parallelize it, or they can run the old working code on a single processor, and stage a few hundred jobs on iterations of the data set.</p>
<p>Given the cost of modifying working code, and the lack of talented programmers who know how to do so, many researchers prefer to just run lots of single-core jobs to get their work done. Sure, many of the savvy researchers with grant-based funding have gotten their code to scale on multiple cores, but none are even considering the daunting task of migrating to a GPU, even if they understood how to do it.</p>
<p>So your observations are very correct. In reality, very large supercomputers, like those at Los Alamos and Livermore, are very efficient at generating heat, but very inefficient at generating results due to the enormous programming effort required to take advantage of such scale.</p>
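<p>[Editor's sketch, not part of the original comment.] The staging pattern described above is attractive precisely because the working code never changes: each slice of the data set is an independent job, so even a few lines of standard-library Python can farm the iterations out. The `analyze` function below is a hypothetical stand-in for the researcher's unmodified single-core code.</p>

```python
from multiprocessing import Pool

def analyze(chunk):
    # Stand-in for the unmodified single-core code, run once per data slice.
    return sum(chunk)

if __name__ == "__main__":
    # No slice depends on another, so they can be staged across cores
    # (or, with a batch scheduler, across cluster nodes) freely.
    chunks = [list(range(i, i + 10)) for i in range(0, 100, 10)]
    with Pool() as pool:
        results = pool.map(analyze, chunks)
    # The combined result matches running the slices one at a time.
    print(sum(results))
```

On a real cluster the same shape shows up as an array job in the batch scheduler rather than a process pool, but the structure is identical.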
]]></content:encoded>
	</item>
	<item>
		<title>By: rpmasson</title>
		<link>http://www.linux-mag.com/id/7701/#comment-7867</link>
		<dc:creator>rpmasson</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7701/#comment-7867</guid>
	<description>&lt;p&gt;I (and the company I work for) believe that some form of heterogeneous computing is the only way to solve this dilemma (we call it &quot;circumventing the laws of physics&quot;). Take the same number of transistors used to implement a general-purpose instruction set (e.g. x86_64) and use those transistors to implement a specific algorithm. It&#039;ll *always* be more efficient (not counting infrastructure changes). That&#039;s what GPGPUs are all about, as well as where other hardware-based solutions like FPGAs come in.&lt;/p&gt;
&lt;p&gt;The challenge becomes integrating a heterogeneous computing solution into your current favorite computing platform. If you have to use some new programming language or a dialect that&#039;s harder than actually restructuring your app to take advantage of a multi-core processor, it&#039;s probably not worth it. If it&#039;s relatively easy, then you get performance/watt increases beyond what&#039;s possible with off-the-shelf processors, without totally rewriting your application.&lt;/p&gt;
&lt;p&gt;I compare it to using an attached array processor back in the &#039;80s. Worked great, but most of the time wasn&#039;t worth it. It wasn&#039;t until minisupercomputers came along, with integrated vector instructions, that you could get performance increases without working in two different development &amp; runtime environments.
&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>I (and the company I work for) believe that some form of heterogeneous computing is the only way to solve this dilemma (we call it &#8220;circumventing the laws of physics&#8221;). Take the same number of transistors used to implement a general-purpose instruction set (e.g. x86_64) and use those transistors to implement a specific algorithm. It&#8217;ll *always* be more efficient (not counting infrastructure changes). That&#8217;s what GPGPUs are all about, as well as where other hardware-based solutions like FPGAs come in.</p>
<p>The challenge becomes integrating a heterogeneous computing solution into your current favorite computing platform. If you have to use some new programming language or a dialect that&#8217;s harder than actually restructuring your app to take advantage of a multi-core processor, it&#8217;s probably not worth it. If it&#8217;s relatively easy, then you get performance/watt increases beyond what&#8217;s possible with off-the-shelf processors, without totally rewriting your application.</p>
<p>I compare it to using an attached array processor back in the &#8217;80s. Worked great, but most of the time wasn&#8217;t worth it. It wasn&#8217;t until minisupercomputers came along, with integrated vector instructions, that you could get performance increases without working in two different development &#38; runtime environments.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dmpase</title>
		<link>http://www.linux-mag.com/id/7701/#comment-7868</link>
		<dc:creator>dmpase</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7701/#comment-7868</guid>
		<description>&lt;p&gt;I appreciate the comments of grdetil. They make me pause and think about what allowed us to transition from assembler to higher level languages (HLLs). I believe it is that the HLLs transitioned to useful abstractions that compilers could work with across the variety of machines they needed to support, and at the same time reduce the effort of programming. I also think that it helps a great deal that the current architectures are still relatively similar (i.e., von Neumann load-store architectures).&lt;/p&gt;
&lt;p&gt;So, if the HLL for parallel programming were to support higher abstractions for operators, such as reductions, partial reductions, etc. -- operators that could be parallelized across a wide variety of parallel architectures -- then I can see how that might be a path for success. It takes advantage of the best features of the hybrid approach without some of its complexity.&lt;/p&gt;
&lt;p&gt;The success of HLLs over assembly came about because they reduced the complexity of programming. It allowed the programmer to stop thinking about low level details that were incidental to the algorithm. Examples of those details might include the number of registers available, whether the data to be operated on was in this register or that, whether branch conditions were handled through direct operators or by testing a condition in one instruction and branching based on a condition register in the next. &lt;/p&gt;
&lt;p&gt;In short, the level of abstraction was raised and this helped the programmer, while at the same time it stayed fairly close to the architectures they needed to support. The range in architectures is also pretty narrow, so that helps a lot.&lt;/p&gt;
&lt;p&gt;So, could a parallel language be created that would span the spectrum of parallel architectures? I&#039;m still a bit skeptical, because there are huge differences in parallel architectures, much larger than what we see separating, say, CISC from RISC. But if there is hope it is in identifying a useful set of operators, expressive enough to easily describe a large set of programs, while at the same time compilable into low level code for the available spectrum of parallel architectures. (The hybrid approach assumes no such operators exist and allows the user to define their own.)
&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>I appreciate the comments of grdetil. They make me pause and think about what allowed us to transition from assembler to higher level languages (HLLs). I believe it is that the HLLs transitioned to useful abstractions that compilers could work with across the variety of machines they needed to support, and at the same time reduce the effort of programming. I also think that it helps a great deal that the current architectures are still relatively similar (i.e., von Neumann load-store architectures).</p>
<p>So, if the HLL for parallel programming were to support higher abstractions for operators, such as reductions, partial reductions, etc. &#8212; operators that could be parallelized across a wide variety of parallel architectures &#8212; then I can see how that might be a path for success. It takes advantage of the best features of the hybrid approach without some of its complexity.</p>
<p>The success of HLLs over assembly came about because they reduced the complexity of programming. It allowed the programmer to stop thinking about low level details that were incidental to the algorithm. Examples of those details might include the number of registers available, whether the data to be operated on was in this register or that, whether branch conditions were handled through direct operators or by testing a condition in one instruction and branching based on a condition register in the next. </p>
<p>In short, the level of abstraction was raised and this helped the programmer, while at the same time it stayed fairly close to the architectures they needed to support. The range in architectures is also pretty narrow, so that helps a lot.</p>
<p>So, could a parallel language be created that would span the spectrum of parallel architectures? I&#8217;m still a bit skeptical, because there are huge differences in parallel architectures, much larger than what we see separating, say, CISC from RISC. But if there is hope it is in identifying a useful set of operators, expressive enough to easily describe a large set of programs, while at the same time compilable into low level code for the available spectrum of parallel architectures. (The hybrid approach assumes no such operators exist and allows the user to define their own.)</p>
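<p>[Editor's sketch, not part of the original comment.] A reduction is the canonical example of such an operator: because the combining operation is associative, the program fixes only the operator and operands, not the evaluation order, so a compiler is free to schedule it serially, across SIMD lanes, or across nodes. The tree-shaped reduction below illustrates one alternative schedule in plain Python.</p>

```python
from functools import reduce
import operator

def tree_reduce(op, values):
    # Pairwise (tree-shaped) reduction. Because `op` is associative, this
    # combining order yields the same result as the left-to-right serial
    # one -- exactly the freedom a parallelizing compiler exploits.
    vals = list(values)
    while len(vals) > 1:
        paired = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:            # odd element carries over to next round
            paired.append(vals[-1])
        vals = paired
    return vals[0]

# Same result as the serial reduction, different evaluation shape.
tree_reduce(operator.add, range(1, 101)) == reduce(operator.add, range(1, 101))
```

Each level of the tree is independent work, which is why the same abstract operator can map onto a GPU warp, a vector unit, or an MPI allreduce.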
]]></content:encoded>
	</item>
</channel>
</rss>