The Global Arrays Toolkit

Continuing with the theme of simplifying parallel programming, this month’s column introduces the Global Arrays Toolkit (GA, http://www.emsl.pnl.gov/docs/global/), a suite of application programming interfaces (APIs) for handling distributed data structures.
The previous three columns discussed Unified Parallel C (UPC), an extension of C99 that supports explicit parallel execution and a shared address space. UPC provides automatic memory management across distributed (and shared) memory hardware without requiring explicit message passing.
Like UPC, GA provides a mechanism for shared-memory style programming in a distributed memory computing environment. Unlike UPC, GA works in conjunction with traditional message passing APIs — in particular, the Message Passing Interface (MPI) — to offer both shared-memory and message-passing paradigms in the same program. In both UPC and GA, data distribution information (that is, the affinity of data to processes) is available to the application so that data locality can be exploited to maximize performance.
The GA toolkit was developed at the U.S. Department of Energy’s Pacific Northwest National Laboratory (PNNL), where it is still undergoing significant development and enhancement. Version 4.0 was recently released in April 2006. GA has been in the public domain since 1994, and is used in a number of high performance computing (HPC) simulation codes. In particular, GA is used extensively by PNNL’s NWChem computational quantum chemistry package.
GA uses the Aggregate Remote Memory Copy Interface (ARMCI) library, an API that provides general-purpose, efficient, and portable remote memory access (RMA) operations through one-sided communications. ARMCI utilizes network communications interfaces — including popular low-latency, high-bandwidth interfaces like Myrinet, Quadrics, Giganet, and InfiniBand — as well as shared memory to achieve the best possible performance for RMA operations. ARMCI is available on supercomputers (such as the IBM SP; Cray’s X1, T3E, SV1, and J90; Fujitsu’s VX/VPP and Primepower; the NEC SX-5; and the Hitachi SR8000), Unix servers and workstations, and clusters of Unix, Linux, and Windows boxes.
In addition, GA works in conjunction with Memory Allocator (MA), a collection of library routines that perform dynamic memory allocation for C, Fortran, and mixed-language applications. GA relies upon MA to provide all of its dynamically allocated local memory. MA provides both heap and stack memory management along with memory availability and utilization information and statistics. MA supports both C and Fortran data types and offers debugging and verification mechanisms. It’s supported on a wide variety of supercomputers and Unix, Linux, and Windows boxes.
GA must be used in combination with some message passing library. In its original incarnation, GA utilized a minimalistic message passing library called TCGMSG. MPI support was added in 1995, and a transitional interface called TCGMSG-MPI was developed to provide TCGMSG functionality on top of MPI. While TCGMSG is still provided with the GA toolkit, the authors recommend using MPI directly and avoiding the use of TCGMSG in new HPC applications.

So, Howzit Work?

Like UPC, GA logically partitions shared data into “local” and “remote” portions, recognizing that the local portion can be accessed faster than any remote portion. The local portion can be accessed directly, while the remote portions must be accessed through GA library calls.
GA supports basic get, put, gather, and scatter shared memory operations, as well as atomic read-and-increment, accumulate (reduction combining local and shared memory), and lock operations on global arrays. These GA operations are one-sided and require no polling or remote-side library calls to complete. You have control over the distribution of global arrays, and both regular and irregular distributions are possible.
GA data transfers use an array index-based interface instead of addresses of remote shared data. It provides a global view of data structures; the library internally performs the mapping from indices to addresses and subsequently transfers the needed data between appropriate processes. Nevertheless, you may inquire where an array section is located and to which process it has affinity.
GA supports integer, real, double precision, and complex data types in Fortran and int, long, float, double, and struct double complex in C. The library actually represents all of these using C data types. Array dimensions range from one to seven, following the Fortran convention.

Installing GA

The source code for GA and its associated libraries is available from the Global Arrays homepage at PNNL (http://www.emsl.pnl.gov/docs/global/). This site requires that users register prior to downloading the software. After the registration form is submitted, the Web site automatically starts the transfer of the GA version 4.0 software to the browser.
To build GA, some version of MPI must already be installed. The example installation was performed on a cluster of dual AMD Opteron nodes running the x86_64 version of Fedora Core 4. MPICH2 version 1.0.3 was already installed, but you can refer to the August 2004 “Extreme Linux” column (available online at http://www.linux-mag.com/2004-08/extreme_01.html) for instructions on installing MPICH2.
Figure One shows the steps required to build and install GA 4.0 as the root user so that it is accessible to all users. After the gzipped tar file containing the source code, ga-4-0.tgz, has been downloaded, it should be extracted using tar. Change to the new ga-4-0/ directory and type make with appropriate settings for your system and MPI installation. In this case, the target platform is a 64-bit version of Linux (TARGET=LINUX64) and the desired C compiler is the mpicc front-end script created by the MPICH2 installation. Likewise, the desired Fortran compiler is the mpif77 front-end from MPICH2. Using these front-end scripts for the compilers avoids having to specify all the flags, include paths, and library paths that would otherwise be required for compiling MPI codes.
Figure One: Installation of the Global Array Toolkit built on top of MPICH2
Download the Global Array Toolkit from http://www.emsl.pnl.gov/docs/global/, then follow this procedure.
[src]# tar xvzf ga-4-0.tgz

[src]# cd ga-4-0
[ga-4-0]# make TARGET=LINUX64 CC=mpicc FC=mpif77 MSG_COMMS=MPI LIBMPI=-lmpich

[ga-4-0]# ./global/testing/test.x

All tests successful
[ga-4-0]# mkdir -p /usr/local/ga/ga-4-0-mpich2-gcc
[ga-4-0]# chmod 644 include/*
[ga-4-0]# mv include /usr/local/ga/ga-4-0-mpich2-gcc
[ga-4-0]# mv lib /usr/local/ga/ga-4-0-mpich2-gcc
[ga-4-0]# cp -pR ma/man /usr/local/ga/ga-4-0-mpich2-gcc/man
[ga-4-0]# echo "MANPATH /usr/local/ga/ga-4-0-mpich2-gcc/man" >> /etc/man.config
The two remaining parameters to make tell the build system to use MPI for message passing (MSG_COMMS=MPI) and to link with the MPI library called mpich (LIBMPI=-lmpich). While the installation instructions are a bit confusing regarding which variables to set to get MPI instead of TCGMSG for message passing, these settings seemed to work best. After the build is complete, a series of tests can be performed by running ./global/testing/test.x, which is built near the end of the make sequence.
If all tests are successful, the include files, libraries, and MA man pages should be manually placed in a local directory where all users can access them. In this build, some of the include files ended up with funny modes, so it was necessary to change them with chmod 644 include/*, so that all users could read them. In this example, all versions of GA are located under /usr/local/ga, and this particular build is installed under /usr/local/ga/ga-4-0-mpich2-gcc, signifying that the libraries below that directory come from GA version 4.0 and that they were built with MPICH2, which was built with and used the gcc compilers.
After the new include/ and lib/ directories are moved to the publicly accessible location, the MA man pages are copied to the same area and a MANPATH entry is added to /etc/man.config so that those manual pages can be seen by the man command.
Now that the installation is complete, it’s time to try out a simple program to see how the API works.

A Simple GA Program

To understand the basic requirements of using the GA toolkit, let’s modify the standard MPI “Hello, world!” program that has appeared here before. The GA version of this program will still initialize and finalize MPI, but it must also start GA and MA. Listing One contains the sample source code for hello-world-ga.c.
Listing One: The Global Arrays Toolkit version of hello-world-ga.c

#include <stdio.h>
#include "mpi.h"
#include "global.h"
#include "ga.h"
#include "macdecls.h"

int main(int argc, char** argv)
{
int rank, size, namelen, ga_rank, ga_size;
char processor_name[MPI_MAX_PROCESSOR_NAME];
int g_a, dims[] = { 2, 3 }, lo[2], hi[2];

MPI_Init(&argc, &argv); /* start MPI */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Get_processor_name(processor_name, &namelen);
printf("Hello world! I'm rank %d of %d on %s\n", rank,
size, processor_name);

GA_Initialize(); /* start global arrays */
ga_rank = GA_Nodeid(); /* get node id */
ga_size = GA_Nnodes(); /* get number of nodes */
printf("Hello world! I'm rank %d of %d according to GA\n",
ga_rank, ga_size);
printf("GA_Uses_ma() returns %d\n", GA_Uses_ma());
MA_init(C_DBL, 64, 64); /* initialize memory allocator */
/* create a global array called stuff */
g_a = NGA_Create(C_DBL, 2, dims, "stuff", NULL);
GA_Zero(g_a); /* set all array elements to zero */
/* discover data distribution */
NGA_Distribution(g_a, ga_rank, lo, hi);
printf("I have [%d, %d] through [%d, %d] of g_a\n",
lo[0], lo[1], hi[0], hi[1]);
GA_Destroy(g_a); /* destroy the global array */

GA_Terminate(); /* stop global arrays */
MPI_Finalize(); /* stop MPI */
return 0;
}

As usual, the MPI header file is included, and to use global arrays, three additional header files are required: global.h, ga.h, and macdecls.h. Inside main(), MPI is initialized first (with MPI_Init(&argc,&argv)) after variables are declared. Next, the process rank, communicator size, and processor name are obtained from calls to MPI routines, again as usual. Each participating process prints this information after printing “Hello world!”
After MPI is initialized, the global arrays package is initialized by calling GA_Initialize(). Next, the node ID and number of nodes are obtained from the GA package by calling GA_Nodeid() and GA_Nnodes(), respectively. These should correspond to the MPI rank and size. To verify this, the values will be printed out by each process. Then GA_Uses_ma() is called to determine whether local or shared memory is in use by GA. This routine returns false (0) when shared memory is in use; otherwise it returns true (1), signifying that MA is used.
In the next line, MA_init() is called to allocate local memory for 64 double precision values (C_DBL) in the stack and the same in the heap. (This is more memory than is required for this trivial example.) Then a global array is created by calling NGA_Create(). This array is a two-dimensional (2) double precision (C_DBL) array with dimensions of (2,3) (from dims) and labeled “stuff.” Having NULL as the last parameter tells GA to automatically evenly distribute the array among available processes. The NGA_Create() call returns an integer that’s used as a handle for manipulating the global array with other GA routines.
Next, GA_Zero() is called with g_a, the global array handle, passed in as the argument. This routine zeroes out all elements of the new global array. To discover how GA distributed the array among processes, NGA_Distribution() is called by each process with its own process number (ga_rank) and global array handle (g_a). This call returns the starting and ending indices for each dimension of the global array that are local to the calling process. These indices are then printed.
To clean up and end the program, the global array is eliminated by calling GA_Destroy() with the global array handle. Next, GA should be stopped (before finalizing MPI) by calling GA_Terminate(). Finally, MPI is finalized by calling MPI_Finalize(), and the program ends.
Figure Two shows the results of compiling and running this program. mpicc is used to compile the C program, and it must be pointed to the path containing the header files for GA and MA (-I/usr/local/ga/ga-4-0-mpich2-gcc/include). mpif77 must be used for the link step because some of the library code depends on Fortran routines. At this link step, mpif77 must be pointed at the location of the GA and associated libraries (-L/usr/local/ga/ga-4-0-mpich2-gcc/lib/LINUX64), and the desired libraries must be linked in by including each of them with the -l flag (-lglobal -llinalg -lma -larmci -lm).
FIGURE TWO: Compiling and running hello-world-ga.c
[ga]$ mpicc -c -I/usr/local/ga/ga-4-0-mpich2-gcc/include hello-world-ga.c
[ga]$ mpif77 -o hello-world-ga.x hello-world-ga.o \
-L/usr/local/ga/ga-4-0-mpich2-gcc/lib/LINUX64 -lglobal -llinalg -lma \
-larmci -lm
[ga]$ mpiexec -l -n 4 -path `pwd` hello-world-ga.x
0: Hello world! I’m rank 0 of 4 on thing1.cluster
2: Hello world! I’m rank 2 of 4 on thing2.cluster
1: Hello world! I’m rank 1 of 4 on thing1.cluster
3: Hello world! I’m rank 3 of 4 on thing2.cluster
0: ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'.
0: Hello world! I’m rank 0 of 4 according to GA
0: GA_Uses_ma() returns 0
2: Hello world! I’m rank 2 of 4 according to GA
1: Hello world! I’m rank 1 of 4 according to GA
2: GA_Uses_ma() returns 0
1: GA_Uses_ma() returns 0
3: Hello world! I’m rank 3 of 4 according to GA
3: GA_Uses_ma() returns 0
1: I have [0, 2] through [0, 2] of g_a
0: I have [0, 0] through [0, 1] of g_a
2: I have [1, 0] through [1, 1] of g_a
3: I have [1, 2] through [1, 2] of g_a
After linking the program, four processes are launched (one for each processor on two dual AMD Opteron nodes called thing1 and thing2). ARMCI is started to provide the communication layer between the two cluster nodes via TCP/IP. All four processes have ranks and sizes reported by GA that correspond to those reported by MPI. All calls to GA_Uses_ma() return false (0).
At the bottom of the output is the distribution of the 2×3 matrix across four processes. Evidently processes 0 and 2 were each assigned two elements while processes 1 and 3 were each assigned a single element. Since processes 0 and 1 are on thing1 and processes 2 and 3 are on thing2, this distribution is as even as it can be given the size of the array.

Tune in Next Month…

GA contains a host of other library calls for manipulating global arrays. Some of these will be highlighted in next month’s column, so stay tuned and be sure to keep that subscription renewed.
However, you should already be able to see the utility of the GA toolkit, especially for data intensive applications with dynamic and irregular communications patterns. When coding in MPI gets too cumbersome, GA may be better able to provide your application with the one-sided communications access to a large logical shared memory in a distributed computing environment like a Linux cluster.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov. More information about the Global Arrays Toolkit can be found in the article “Global Arrays: A Non-Uniform-Memory-Access Programming Model For High-Performance Computers,” by Jaroslaw Nieplocha, Robert J. Harrison, and Richard J. Littlefield, Journal of Supercomputing, 10:169-189 (1996).
