My last column created some interest outside of the Linux Magazine domain. In addition to accusing me of shilling ARM processors, some readers thought my prediction of ARM-based supercomputers quite absurd. Of course, I have been wrong before, but not this time.
How can I be so sure? Let’s take a look at some evidence. Consider Exhibit A. Take a look at this ARM patch entry on the Open MPI developers list. I have it on good authority that the ARM patch was rather complex and needed no modification. Either my rant about cell phone clusters is more truth than fiction or someone out there is interested in using ARM cores to do HPC.
Moving on to Exhibit B, I give you NVidia Project Denver. The marriage of a CPU and GP-GPU makes a lot of sense. AMD seems to think so. From an HPC point of view, it is really a processor and an array processor. The array processor (like a math co-processor) does the numerical heavy lifting for the CPU. In many cases, scientific codes are dominated by array operations and the “host processor” does more bookkeeping, set-up, and tear-down than actual number crunching. In addition, the new fusion-type processors will have a shared memory design between the CPU and GPU. There will be no need to move data on and off the GPU over the PCIe bus.
Still not convinced? Have a look at the Marvell ARMADA XP processor. It has up to 4 ARM cores running at 1.6GHz with 2 MB of L2 cache and four GigE MACs, and it supports a 64-bit DDR2/DDR3 ECC memory interface. Finally, last fall, ARM announced the Cortex A15 MPCore, which can have 1-8 cores with a total address space of one terabyte. These are not phone or pad processors.
All these ARM designs give me visions of small “cell phone” sized modules with 16-32 cores, 32 GBytes of memory, and maybe an SSD plugged into a high speed backbone network. Perhaps, SiCortex was on to something – a pile of low power processors, fast interconnect, and a good software stack.
What about programming? How will these devices be programmed? In particular, will you code a multi-core/GPU phone like a cluster node? And more importantly, what tools will you use?
This is where all the hardware excitement meets the cold reality of parallel programming. We are all familiar with “Moore’s Law” (the transistor density doubles every two years). Many people have probably not heard of “May’s Law” (for the purist, you can substitute “trend” for “law”). In any case, David May states his law as follows: “Software efficiency halves every 18 months, compensating for Moore’s Law.” Think about it. Every new generation of hardware introduces some new form of hardware optimization. Usually these optimizations can be handled by compilers, which have become quite complex. Compilers hit a wall with parallel computing. When more cores started showing up, due to Moore’s Law and some laws of physics, writing software got more complex. When GP-GPUs started showing up, programming became much more difficult (i.e., it takes more work to get your problem to run efficiently on the new hardware).
For example, consider a matrix multiplication code. In almost all cases, writing a parallel matrix multiplication expands the size of the program code. Look at any OpenMP, MPI, CUDA, or OpenCL version and compare it to its serial counterpart. The least explosive and most restrictive is OpenMP, which tries to help the compiler by providing directives. The other methods tend to expand the code, much like using assembly language instead of C or Fortran does. Also, I want to be clear. I fully support the use of OpenMP, MPI, CUDA, OpenCL, etc. I just wish parallel programming were easier.
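To make the comparison concrete, here is a minimal sketch of a serial matrix multiplication next to an OpenMP version in C. The function names `matmul_serial` and `matmul_openmp` and the row-major layout are my own assumptions for illustration, not code from any particular project. It shows the "least explosive" case: the OpenMP version is the same loop nest plus one directive. An MPI or CUDA version would also need data distribution or device memory management, and that is where the code starts to grow.

```c
#include <stddef.h>

/* Serial reference: C = A * B for n x n matrices stored row-major.
   (Illustrative names and layout, assumed for this sketch.) */
void matmul_serial(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* OpenMP version: the same loops with one directive added.
   Each thread computes a disjoint set of rows of C, so no locking is needed.
   A signed loop index is used for compatibility with older OpenMP compilers. */
void matmul_openmp(size_t n, const double *a, const double *b, double *c)
{
    long rows = (long)n;

    #pragma omp parallel for
    for (long i = 0; i < rows; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[(size_t)i * n + k] * b[k * n + j];
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

With GCC, the directive takes effect when the file is compiled with `-fopenmp`; without that flag the pragma is ignored and the second function simply runs serially. That is also why OpenMP is the most restrictive choice: the directive can only describe parallelism the loop structure already exposes.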
That software always trails hardware is well known in this business. That new hardware complicates software development is less talked about. In the case of parallel computing, which is a difficult nut to crack, software tools are slow to emerge (if they emerge at all) and are getting more complex.
Are there any alternatives? Yes, but there is no clear path forward. First, let me mention that I am a big fan of functional languages like Erlang and Haskell. I have written about this concept in the past. I also like the work Portland Group is doing with their PGI Accelerator Compilers, which allow OpenMP-like compiler directives to be added to existing code. Currently they support C and Fortran on CUDA-based hardware. Intel is promoting Cilk Plus (pronounced “silk”) as a solution. It is based on the MIT Cilk Project. Cilk has some nice features, and it includes a runtime system that takes care of details like load balancing, synchronization, and communication between cores. It works by augmenting existing C/C++ codes with just three new keywords.
As software progress crawls along, I am convinced that future large scale HPC applications will include dynamic fault-tolerant runtime systems. The user needs to be lifted away from low level responsibility so they can focus on the application and not the complexity of the next hardware advance.