Summer Language Seminar: The R Project

The R statistical language is more than a plotting tool.

This time of year the HPC world seems a little slow. It is summer, many people are on vacation (I have just returned), and normally not much happens until September. That said, while writing this I read the news that Amazon has introduced EC2 for HPC. I’ll be looking into this later to get more details.

I often like to write about topical issues, particularly those that may have far-reaching effects on the HPC market. During the “HPC dry season,” I often turn to the next best thing: programming parallel computers. When I discuss parallel programming, I often talk about functional approaches versus imperative approaches. You can find some background here, but the upshot is that functional languages generally do not have looping structures and don’t require the programmer to manage “state” as in traditional “procedural” or imperative languages like C, Fortran, Python, Perl, etc. Not managing state makes parallel operation/conversion easier.

In addition, I have talked about Erlang and Haskell, two notable functional languages, as a better way to express parallelism. If you are a traditional programmer, you may find these languages downright stupid, but there are huge benefits to be gained from this stupidity. Both languages are freely available and easy to install and try.

I have also been following the R Project. Often considered a data analysis or statistics plotting language, R is also a true functional language that may have more utility in HPC than most people assume. Before I dive into my R primer, I want to be clear about performance.

There is generally a trade-off between performance and ease of use. Usually, the easier (simpler) a language is to use, the less efficient it is when executing. This loss of performance is usually due to things like the interpreted (not compiled) nature of the language and the higher abstraction level. For instance, MATLAB is a popular commercial software package that provides a high-level approach to mathematical programming. GNU Octave is a similar and somewhat compatible version. Both MATLAB and GNU Octave were developed so students did not have to learn Fortran to do calculations for engineering classes (i.e., they keep the students closer to the engineering problem and away from the details of the computer).

Understandably, HPC is about performance, and thus most programs are written in Fortran or C/C++. The idea is that the closer you can get your problem to the machine, the faster it will execute. There is, however, a class of HPC users whose needs do not allow a huge investment in programming and who thus want an “easy” way to code their applications. For these purposes, things like MATLAB, GNU Octave, and R work quite well. These high-level approaches are often slower in execution than traditional languages, but they do provide a faster “overall project completion time.”

Getting back to R. I won’t rehash the background that can be found in the on-line manual, but will mention that R is a strong functional programming language. It is also open source and easy to install. (For RPM-based systems, look for an “R-core” RPM.)

Let’s jump into an example. If you have installed R, start the interpreter by entering “R” at the command prompt. After some start-up text, you should be staring at a “>” input prompt. Let’s cover a few basics. I will add “#” to signify comments.

> 2+2     # simple math
[1] 4
> b <- 2+2 # assignment using "<-"
> b      # print that value
[1] 4
> b * 7  # use the value
[1] 28

Simple enough so far. Note the use of <- instead of =. Let's enter a vector using the c function and then square the elements.

> x <- c(1,2,3,4,5,6)   # create ordered collection (vector)
> print(x)              # alternative print command
[1] 1 2 3 4 5 6
> y <- x^2              # Square the elements of x
> y
[1]  1  4  9 16 25 36

Notice that I can do an operation on the entire vector without a looping structure. Next, consider matrix multiplication. We will create two matrices and multiply them.

> a <- matrix(c(1,2,3,4,5,6,7,8,9),nrow=3) # 3x3 matrix, filled by column
> a                                        # print the matrix
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> b <- matrix(c(9,8,7,6,5,4,3,2,1),nrow=3)
> b
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> a %*% b                                  # multiply matrix a by matrix b
     [,1] [,2] [,3]
[1,]   90   54   18
[2,]  114   69   24
[3,]  138   84   30
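
A side note on the output above: matrix fills its result column by column unless told otherwise, which is why 1, 2, 3 run down the first column of a. Here is a minimal sketch of the difference, using the byrow argument. The names vals, m_col, and m_row are made up for this illustration and are not part of the session above.

```r
vals <- 1:6

# Default behavior: values fill down each column in turn
m_col <- matrix(vals, nrow = 2)

# byrow = TRUE: values fill across each row instead
m_row <- matrix(vals, nrow = 2, byrow = TRUE)

print(m_col)
print(m_row)
dim(m_col)   # number of rows, then number of columns
```

If you want a matrix to look the way you typed it, byrow = TRUE is usually the setting to reach for.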

Let's cover a few bookkeeping issues. First, the matrix function takes the collection of values built by the c function and arranges them, column by column, into a three-row matrix (nrow=3). After both matrices are created, matrix a is multiplied by matrix b. Notice that I did not have to use any loop structures or indices to perform my operations. How the matrix multiplication was performed was out of my control. I just told the R interpreter what I wanted to do. To be complete, let's see what happens when we try to multiply two incompatible matrices.

> c <- matrix(c(1,2,3,4),nrow=2)
> c
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> a %*% c
Error in a %*% c : non-conformable arguments

As expected, the multiplication was not possible, and the R interpreter said so. Obviously, I have glossed over a large amount of R background. My intention was to illustrate the functional aspect of R as compared to Fortran or C. (Google for any Fortran or C matrix multiplication example and compare it to the R code above.)
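
To get a feel for that comparison without leaving R, here is a minimal sketch of the same product written in the index-and-loop style that a C or Fortran version would typically use. The function name matmul_loop is hypothetical, invented for this illustration, and the sketch makes no attempt at efficiency.

```r
# Hypothetical helper: multiply two matrices with explicit loops,
# in the style of an imperative C or Fortran routine.
matmul_loop <- function(x, y) {
  if (ncol(x) != nrow(y)) stop("non-conformable arguments")
  out <- matrix(0, nrow = nrow(x), ncol = ncol(y))
  for (i in seq_len(nrow(x))) {
    for (j in seq_len(ncol(y))) {
      for (k in seq_len(ncol(x))) {
        # Accumulate the dot product of row i of x and column j of y
        out[i, j] <- out[i, j] + x[i, k] * y[k, j]
      }
    }
  }
  out
}

a <- matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3)
b <- matrix(c(9,8,7,6,5,4,3,2,1), nrow = 3)

# Compare the loop version against the built-in operator
all.equal(matmul_loop(a, b), a %*% b)
```

Three nested loops, three index variables, and an accumulator, all of which the single expression a %*% b leaves to the interpreter.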

For those who are panicking, you can relax. R has a full set of control structures to satisfy your imperative urges (e.g., for, while, repeat), which is why I like to suggest the language to traditional hard-headed programmers. In some other functional languages, recursion takes the place of looping. What you may find, however, is that you do not need looping structures in R very often.
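
As a small illustration of that last point, here is a sum of squares written three ways: with a loop, with a vectorized expression, and with functions applied to the data. The variable names are only for this sketch.

```r
x <- c(1, 2, 3, 4, 5, 6)

# Imperative: a for loop that updates an accumulator (explicit state)
total_loop <- 0
for (v in x) {
  total_loop <- total_loop + v^2
}

# Vectorized: operate on the whole vector at once
total_vec <- sum(x^2)

# Functional: apply an anonymous function to each element, then reduce
total_fun <- Reduce(`+`, sapply(x, function(v) v^2))

c(total_loop, total_vec, total_fun)   # all three agree
```

The second and third forms never mention an index or update a variable in place, which is the property that matters for the parallel discussion.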

In my next installment, I'm going to look at some of the parallel additions to R. Of course, R could be modified to execute implicitly parallel operations across additional cores (or nodes), making parallel execution transparent. That is a whole other topic as well. The good news is that once you express your problem in functional terms, all kinds of interesting things can happen.
