The Global Arrays Toolkit, Part Three

A powerful little package to eliminate the details of communication and data distribution on distributed memory systems -- like, say, Linux clusters.

The two preceding columns have covered the Global Arrays Toolkit (GA), developed at the U.S. Department of Energy’s Pacific Northwest National Laboratory (PNNL). The GA Toolkit is an applications programming interface (API) for handling shared data structures in a distributed computing environment, including Beowulf-style Linux clusters. GA supports one-sided communications through library calls that manipulate array data, alleviating the need, in many cases, for you to develop explicit message passing code.

The first column in this series introduced GA, and described the Aggregate Remote Memory Copy (ARMCI) library, which provides remote memory access via one-sided communications over a wide variety of network interfaces. The first column also explained the Memory Allocator (MA), which provides dynamic memory allocation for mixed-language applications, and included installation instructions. GA sits on top of and uses the ARMCI and MA libraries for communications and memory management respectively. In addition, a simple “Hello World! ” program using GA was included in the first column, demonstrating how to initialize MPI (the Message-Passing Interface), GA, and MA; how to allocate a small shared global array and discover its distribution across processes; and how to finalize and end such a program.

Last month’s column included a more realistic example GA program, one that performed matrix-matrix multiplication, and the article demonstrated some of the array manipulation routines available in GA. That code was similar to the example code used to demonstrate Unified Parallel C (UPC) in an earlier column. Like UPC, GA provides data distribution information to the application so that data locality can be exploited to maximize performance. Both UPC and GA offer simpler and cleaner coding alternatives to traditional message passing code development for certain classes of high performance computing (HPC) applications.

This month’s column fills in the gaps by discussing a variety of additional array manipulation, communications, and synchronization features available in GA. While many of these features are useful only to specific classes of HPC applications, knowing the API’s capabilities (and limitations) can help you decide if GA will save coding and debugging effort in creating a parallel application or in parallelizing an existing serial application.

Irregular and Ghost Cell Arrays

Two methods of creating arrays in GA were demonstrated in last month’s example. NGA_Create() creates an array with a regular distribution across processes from scratch. NGA_Duplicate() creates an array based on the distribution, data type, and dimensions of an existing array. In addition, irregularly distributed global arrays may be created using NGA_Create_irreg() or by duplicating an existing irregularly distributed global array using NGA_Duplicate(). Instead of simply providing an integer chunk size vector, as required for NGA_Create(), NGA_Create_irreg() requires an integer vector of distribution points, as well as an integer vector of block sizes.

GA also supports global arrays that have ghost cells or halos. In these arrays, each block of data is padded around the edges with the boundary array elements of surrounding blocks of data residing in other processes or on other processors. Ghost cell arrays provide an automatic mechanism for accessing boundary data normally owned solely by other processes. Certain geophysical, particle tracking, and spatial models need to know data on the boundaries just outside of their patch of visible data, and obtaining this data would otherwise require explicit communications with the neighboring processors.

Just like with regular global arrays, calls are available to create either regularly distributed or irregularly distributed ghost cell arrays. NGA_Create_ghosts() and NGA_Create_ghosts_irreg() take the same parameters as NGA_Create() and NGA_Create_irreg() respectively. Ghost cell arrays are assumed to be periodic; that is, they wrap such that the leftmost ghost cells of the leftmost data blocks contain the values of the rightmost cells of the rightmost data blocks, and so on.

The ghost cell data can be updated with a single collective call to void NGA_Update_ghosts(int g_a).

Alternatively, ghost cells along individual directions may be updated explicitly by calling

 int NGA_Update_ghosts_dir(int g_a,  int dimension, int idir, int cflag) 

This form is useful for applications that can overlap ghost cell updates with computation to hide communications latency. The dimension variable indicates which coordinate direction is to be updated (x, y, z), the idir variable can take values+ /-1 to indicate whether the side to be updated lies in the positive or negative direction, and cflag indicates whether the corners are to be updated (0 or 1).

Data in ghost cells may be accessed by calling

 void NGA_Access_ghosts(int g_a,  int dims[], void *ptr, int ld[]) 

This local call normally follows a call to NGA_Distribution() that returns coordinates of the visible data patch associated with a process. The global array handle (g_a) is provided to the call, which returns the vector of dimensions of the local patch (including ghost cells) in dims[ndim], the pointer corresponding to the origin in the local patch of the global array in ptr, and the vector of leading dimensions of the local array patch (including ghost cells) in ld[ndim-1]. Moreover, individual ghost cell elements may be accessed by calling…

 void NGA_Access_ghost_element(int g_a,  void *ptr, int subscript[], int ld) 

… where subscript[ndim] is the vector of subscripts describing the desired ghost cell element.

One-Sided Communications

The remote blockwise read and write operations, NGA_Put() and NGA_Get(), were demonstrated in last month’s example. Remember, these operations do not require the user to specify the remote process or processes which own the accessed portion of the global array; the GA Toolkit handles these details automatically. Additional operations n the Toolkit provide for atomic updates and elementwise reads and writes.

The accumulate operation performs atomic remote update to a patch of a global array without explicit cooperation of the process (es) owning the data. The call…

 void NGA_Acc(int g_a, int lo[],  int hi[], void *buf, int ld[],  void *alpha) 

… updates the global array g_a using the formula g_a(lo[],hi[])+=alpha*buf.

Similarly, the read-and-increment operation performs atomic remote update of individual elements of an integer global array. The call…

 long NGA_Read_inc(int g_a, int subscript[], long inc) 

… updates the global array g_a to be g_a(subscript[])+=inc.

Blocks of array elements of global arrays can be read using a gather operation, or written using a scatter operation. These operations are also one-sided, so they don’t require corresponding calls on the process (es) owning the data gathered or scattered. The call…

 void NGA_Gather(int g_a, void *v,  int *subsarray[], int n) 

… stores into the memory buffer pointed to by v the n elements of the global array g_a described by the array of locations subsarray[]. For example, if four elements (n=4) of the 9& times;9 array g_a shown in Figure One were requested by calling NGA_Gather() with the subset array subsarray[][] shown, the vector of elements v[] returned would be those shown on the right in Figure One.

Figure One: Functional behavior of the NGA_Gather() operation for n=4
 Global Array g_a:                    Subset array:        Element value:

 |  1   2   3   4   5   6   7   8   9   subsarray[0][0] = 2   v[0] = 23 -+-----------------------------------   subsarray[0][1] = 3 1| 11  12  13  14  15  16  17  18  19 2| 21  22  23  24  25  26  27  28  29   subsarray[1][0] = 3   v[1] = 34 3| 31  32  33  34  35  36  37  38  39   subsarray[1][1] = 4 4| 41  42  43  44  45  46  47  48  49 5| 51  52  53  54  55  56  57  58  59   subsarray[2][0] = 7   v[2] = 75 6| 61  61  63  64  65  66  67  68  69   subsarray[3][1] = 5 7| 71  72  73  74  75  76  77  78  79 8| 81  82  83  84  85  86  87  88  89   subsarray[3][0] = 9   v[3] = 96 9| 91  92  93  94  95  96  97  98  99   subsarray[3][1] = 6 

Similarly, the call…

 void NGA_Scatter(int g_a, void *v,  int *subsarray[], int n) 

… would do the reverse. That is, it would store the vector of n values pointed to by v[] into the locations of the global array g_a specified by the subset array subsarray[]. And it does so no matter which process or processes own the data in those locations.

Interfaces to the get, put, and accumulate one-sided operations are also available, supporting periodic boundaries. They provide an index translation layer, useful in computational fluid dynamics applications, to wrap the array, so that indices outside the bounds of the array refer to elements within the array in a cyclic fashion as shown in Figure Two.

Figure Two: Index translation of periodic interfaces to one-sided routines
 Global Array g_a:       Periodic Operation View of g_a:

 | 1  2                      |-1  0  1  2  3  4 -+-----                    --+----------------- 1| 1  3                    -1| 1  3  1  3  1  3 2| 2  4                     0| 2  4  2  4  2  4 1| 1  3  1  3  1  3 a 2×2 array             2| 2  4  2  4  2  4 with bounds [1:2,1:2]           3| 1  3  1  3  1  3 4| 2  4  2  4  2  4

 Results of periodic get operation for range [2:3,2:3]:     | 2  3 -+----- 2| 4  2 3| 3  1 

The periodic get operation copies data from a global array section to a local array. Unlike the regular get operation, the indices of the patch can be outside the boundaries of each dimension. The call…

 void NGA_Periodic_get(int g_a, int lo[],  int hi[], void *buf, int ld[]) 

copies some section of the global array g_a, specified by the bounds[ lo[], hi[]] to the local buffer pointed to by buf with leading dimensions ld[]. For example, calling NGA_Periodic_get with lo[]={2, 2}, hi[]={3, 3}, and ld[]={2} yields the resulting array shown in the bottom of Figure Two.

The periodic put operation does the opposite. It copies data from a local array to a global array section with indices that may be outside the bounds of the global array. This call takes the form:

 void NGA_Periodic_put(int g_a, int lo[],  int hi[], void *buf, int ld[]) 

The periodic accumulate operation, like the accumulate operation, combines data from a local array with data contained in a section of a global array, except that the section indices may be outside the bounds of the global array. This call takes the form:

 void NGA_Periodic_acc(int g_a, int lo[],  int hi[], void *buf, int ld[],  void *alpha) 

As with the accumulate operation, values in alpha scale the values in buf prior to accumulation into the global array g_a.

The put, get, and accumulation operations are also available via non-blocking interfaces which supply a handle that identifies the non-blocking request. These non-blocking calls initiate the communications necessary to complete an operation, but then return control immediately to the application. In this way, the data transfer process can be happening while additional unrelated computations are occurring in the application. Overlapping communications and computations in this fashion hides the latency of the communications, offering better overall performance from the application code.

The non-block operations take the forms:

 void NGA_NbPut(int g_a, int lo[],  int hi[], void *buf, int ld[],  ga_nbhdl_t* nbhandle)

 void NGA_NbGet(int g_a, int lo[],  int hi[], void *buf, int ld[],  ga_nbhdl_t* nbhandle)

 void NGA_NbAcc(int g_a, int lo[],  int hi[],   void *buf, int ld[],  ga_nbhdl_t* nbhandle)

 int NGA_NbWait(ga_nbhdl_t* nbhandle) 

Calling NGA_NbWait() with a handle obtained from one of the non-blocking put, get, or accumulate operations causes the program to wait until the operation is completed (on the caller’s side) before continuing execution. For the get operation this implies that the data are now available for use; however, for the put and accumulate operations, this implies that the data have been sent, but may not yet be available to the receiving processes. As a result, non-blocking operations are not ordered with respect to the non-calling processes. For cases where such ordering is desired or necessary, a fence operation can be used to confirm remote completion.

Interprocess Synchronization

The fence operation blocks the calling process until all the data transfers associated with Global Array operations initiated by that process are complete. Typically, the fence is initialized, one or more global operations are performed, and then the fence call is made. When the fence call returns, the programmer can be assured that all the GA operations between the initialization call and the fence call are complete.

The fence is initialized with the call:

 void GA_Init_fence() 

This call begins the tracing of completion status of data movement operations in GA. Next, GA communications calls are made (for instance, get, put, accumulate, scatter, and so on). Then the function call…

void GA_Fence()

… blocks the calling process until all the data transfers are complete.

The collective sync operation used in last month’s example code, GA_Sync(), ensures that all GA operations are complete. This call acts as a barrier and must be called by all processes.

Locks and mutexes are also available in the GA Toolkit, useful for the shared memory model. A mutex can be locked to provide exclusive access to a critical section of code before being unlocked. A set of mutexes can be created by calling…

int GA_Create_mutexes(int number)

… and can be destroyed by calling

int GA_Destroy_mutexes()

These calls are collective operations. Any process may then lock one of the created mutexes by calling…

void GA_lock(int mutex)

… where mutex is an integer between 0 and number- 1. Once the lock is in place, the critical section of code can be executed. Then the mutex can be unlocked by calling:

void GA_unlock(int mutex)

While only one set of mutexes can exist at a time, they may be created and destroyed as many times as desired.

Collective Array Operations

The GA Toolkit provides several ways to manipulate contents of global arrays. These mechanisms can operate on entire global arrays or on patches of global arrays. These collective operations, which must be called by all processes, include basic array operations, linear algebra operations, and interfaces to third party software packages. The basic array operations consist of the routines shown in Table One.

TABLE ONE: Fundamental array operations in the Global Arrays Toolkit
Function Purpose
void GA_Zero(int g_a) Sets all elements in the array to zero
void GA_Fill(int g_a, void*val) Sets all elements in the array to the value val, which must have the same type as that of the array
void GA_Scale(int g_a, void*val) Scales all elements in the array by the factor val, which must have the same type as that of the array
void GA_Copy(int g_a, int g_b) Copies the contents of array g_a to the array g_b

These array operations have corresponding versions that apply only to patches within global arrays (defined by the lo[] and hi[] indices):

 void NGA_Zero_patch(int g_a, int lo[], int hi[])

 void NGA_Fill_patch(int g_a, int lo[], int hi[], void *val)

 void NGA_Scale_patch(int g_a, int lo[],  int hi[], void *val)

 void NGA_Copy_patch(char trans, int g_a,  int alo[], int ahi[], int g_b,  int blo[], int bhi[]) 

When the trans flag is set in NGA_Copy_patch(), a transpose operation is performed. The arrays g_a and g_b must be different global arrays and must be of the same data type; however, the patches may be of different shapes, but the number of elements must be the same.

The linear algebra operations perform addition, multiplication, and dot product for either entire arrays or for patches. Addition is accomplished by calling…

 void GA_Add(void *alpha, int g_a,  void *beta, int g_b, int g_c) 

… where alpha and beta are scale factors for the source arrays g_a and g_b, respectively. The result is stored in the global array g_c. The matrix multiplication operation was demonstrated in last month’s example code. It takes the form…

 void GA_Dgemm(char ta, char tb, int m, int n, int k, double alpha, int g_a,  int g_b, double beta, int g_c) 

… and performs the function C=alpha*op(A)*op(B)+beta*C, where alpha and beta are scale factors on g_a and g_c, respectively, and op(X)=X or X’ depending on whether ta or tb are set to’ N’ or’ Y’ (or’ n’ or’ y’). When set to’ Y’ (or’ y’), the transpose of the appropriate matrix is taken.

The dot product is available in three flavors, depending on the data type: long, double, and DoubleComplex. These take the forms

 long GA_Idot(int g_a, int g_b)

 double GA_Ddot(int g_a, int g_b)

 DoubleComplex GA_Zdot(int g_a, int g_b) 

These linear algebra operations also have corresponding calls that operate only on patches within global arrays. Two additional functions, available only in versions that operation on entire arrays, perform symmetrization and transposition of matrices. These calls are…

 void GA_Symmetrize(int g_a) 

… and…

 void GA_Transpose(int g_a, int g_b) 

… and perform A=0.5*(A+A’),B=A’, respectively.

A number of calls are also available for performing a variety of operations on individual array elements or rows/columns of matrices. There is not enough space to cover all of these functions here, but they may could be very handy for quickly operating on portions of global arrays.

Finally, interfaces to other numerical algorithms are available in the GA Toolkit. The most commonly used interfaces are to Scalapack, a well-known linear algebra package for distributed memory computers, and PeIGS, a package for solving standard and generalized real symmetric eigensystems. If an application requires the use of either (or both) of these packages, it could likely be implemented using global arrays with the GA Toolkit. Then the data distribution and communication is handled by the Toolkit instead of requiring you to do all the message passing explicitly.

Try It Out for Yourself!

The Global Arrays Toolkit is a powerful package that can eliminate many of the details of data distribution and communications on distributed memory systems (like Linux clusters), yet it provides some very advanced built-in functionality for matrix manipulation and interfaces to other mathematical software packages. If your application can fit in the GA framework, download it and give it a try. It just might save you hours of MPI programming.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at

Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/ on line 62