<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: The Return of the Vector Processor</title>
	<atom:link href="http://www.linux-mag.com/id/7575/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.linux-mag.com/id/7575/</link>
	<description>Open Source, Open Standards</description>
	<lastBuildDate>Sat, 05 Oct 2013 13:48:18 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
	<item>
		<title>By: lescoke</title>
		<link>http://www.linux-mag.com/id/7575/#comment-7156</link>
		<dc:creator>lescoke</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7575/#comment-7156</guid>
		<description>&lt;p&gt;Ah, yes; the bit-sliced vector processor. The year was 1989, and the desktop was evolving from 16-25 MHz 386s to 33 MHz 486s. I could pipeline an array of data through an ALU and multiplier-accumulator to apply a single-threaded signal-processing algorithm at a rate of one data element per clock cycle. The trick was staging the data flow from one processing element to another so that each had something to do on every clock cycle. Programming was done at the microcode level in a language very similar to assembly, called register transfer language (RTL); each processing element&#039;s control signals, register selects, clock enables, etc. were controlled from bits in a 200-bit instruction word.&lt;/p&gt;
&lt;p&gt;Many of the ideas used in that system have been showing up in DSPs and now GPUs; the processing power is available, but the bottleneck still appears to be keeping them fed with outside data.&lt;/p&gt;
&lt;p&gt;Les&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Ah, yes; the bit-sliced vector processor. The year was 1989, and the desktop was evolving from 16-25 MHz 386s to 33 MHz 486s. I could pipeline an array of data through an ALU and multiplier-accumulator to apply a single-threaded signal-processing algorithm at a rate of one data element per clock cycle. The trick was staging the data flow from one processing element to another so that each had something to do on every clock cycle. Programming was done at the microcode level in a language very similar to assembly, called register transfer language (RTL); each processing element&#8217;s control signals, register selects, clock enables, etc. were controlled from bits in a 200-bit instruction word.</p>
<p>Many of the ideas used in that system have been showing up in DSPs and now GPUs; the processing power is available, but the bottleneck still appears to be keeping them fed with outside data.</p>
<p>Les</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: userda</title>
		<link>http://www.linux-mag.com/id/7575/#comment-7157</link>
		<dc:creator>userda</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7575/#comment-7157</guid>
		<description>&lt;p&gt;Typo: &quot;bare&quot; should be &quot;bear&quot;.&lt;/p&gt;
&lt;p&gt;As renormalized quantum field theory tells us, fermions are clothed, not bare.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Typo: &#8220;bare&#8221; should be &#8220;bear&#8221;.</p>
<p>As renormalized quantum field theory tells us, fermions are clothed, not bare.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kalloyd</title>
		<link>http://www.linux-mag.com/id/7575/#comment-7158</link>
		<dc:creator>kalloyd</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.linux-mag.com/id/7575/#comment-7158</guid>
		<description>&lt;p&gt;Les,&lt;/p&gt;
&lt;p&gt;Ah, yes. Memories (not all good, I&#039;m afraid).&lt;/p&gt;
&lt;p&gt;There are all sorts of new questions raised by the new Fermi architecture. First, does the new memory model solve some of the &quot;particulars&quot; of keeping the computational pipeline fed and flowing? I want to see how this works with the L2 cache configuration and the special function units (who waits on what).&lt;/p&gt;
&lt;p&gt;I also wonder how the performance of the new FMA compares to the old multiply-add (MAD). It obviously fixes some of the rounding/drop-off problems - especially welcome in iterative routines.&lt;/p&gt;
&lt;p&gt;The whole architecture is so radically different from the GT200 that I think it&#039;s going to take some experimentation to find out how to make the best use of it. Lots of potential - now what can we do with that?&lt;/p&gt;
&lt;p&gt;Ken&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Les,</p>
<p>Ah, yes. Memories (not all good, I&#8217;m afraid).</p>
<p>There are all sorts of new questions raised by the new Fermi architecture. First, does the new memory model solve some of the &#8220;particulars&#8221; of keeping the computational pipeline fed and flowing? I want to see how this works with the L2 cache configuration and the special function units (who waits on what).</p>
<p>I also wonder how the performance of the new FMA compares to the old multiply-add (MAD). It obviously fixes some of the rounding/drop-off problems &#8211; especially welcome in iterative routines.</p>
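<p>The rounding difference between MAD and FMA can be sketched in a few lines of Python, emulating a fused path with exact rational arithmetic; the input values here are hypothetical, chosen only so that the intermediate product does not fit in a 53-bit double:</p>

```python
from fractions import Fraction

# Chosen so the exact product a*b = 1 + 2**-29 + 2**-60 needs more
# than 53 bits of mantissa and must be rounded.
a = 1.0 + 2.0**-30
b = 1.0 + 2.0**-30
c = -(1.0 + 2.0**-29)

# MAD-style: the product is rounded to a double first, losing the
# 2**-60 term, and then the add is rounded again.
mad = (a * b) + c

# FMA-style: compute a*b + c exactly, then round once at the end.
# Fraction arithmetic stands in for the hardware's wide intermediate.
fma = float(Fraction(a) * Fraction(b) + Fraction(c))

print(mad)  # 0.0 -- the low-order bits cancel away entirely
print(fma)  # 8.673617379884035e-19, i.e. exactly 2**-60
```

The two-rounding path returns exactly zero, while the single-rounding fused path preserves the residual term; that residual is precisely what iterative refinement routines depend on.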
<p>The whole architecture is so radically different from the GT200 that I think it&#8217;s going to take some experimentation to find out how to make the best use of it. Lots of potential &#8211; now what can we do with that?</p>
<p>Ken</p>
]]></content:encoded>
	</item>
</channel>
</rss>