This is the second column in the series about the Global Arrays (GA) Toolkit. The GA Toolkit, introduced in last month’s column, is an application programming interface (API) for handling shared data structures in a distributed computing environment like a Linux cluster. GA essentially provides one-sided communications for array data without requiring you to write explicit message passing code.
Like Unified Parallel C (UPC), described here in earlier columns, GA provides data distribution information to the application so that data locality can be exploited to maximize performance. While UPC offers a more familiar programming style (being implicitly parallel), the GA Toolkit works in conjunction with a traditional message passing API, like the Message Passing Interface (MPI), to provide both shared-memory and message-passing paradigms in the same program.
The GA Toolkit, developed at the U.S. Department of Energy’s Pacific Northwest National Laboratory (PNNL), uses the Aggregate Remote Memory Copy Interface (ARMCI) library, which provides general-purpose, efficient, and portable remote memory access (RMA) operations through one-sided communications. ARMCI takes advantage of the network interfaces available on clusters and supercomputers, including low-latency, high-bandwidth interconnects. GA also works in conjunction with the Memory Allocator (MA), a collection of library routines that perform dynamic memory allocation for C, Fortran, and mixed-language applications. GA uses MA to provide all of its dynamically allocated, local memory.
GA must be used in combination with some message passing library, and last month’s column included instructions for downloading and installing the toolkit on a Linux cluster on top of MPICH2, the popular MPI-2 library from Argonne National Laboratory (ANL). The GA Toolkit and associated libraries are available at http://www.emsl.pnl.gov/docs/global/. In addition, a simple “Hello, World!” program using the GA Toolkit was included in the column, demonstrating how to initialize MPI, GA, and MA; how to allocate a small shared global array and discover its distribution across processes; and how to finalize and end such a program.
A More Realistic Example
This month, let’s tackle a more complex and realistic GA program, one that performs matrix-matrix multiplication, like the example code used to demonstrate programming in UPC. This program multiplies two matrices, A (m x p) and B (p x n), and saves the result in a third matrix, C (m x n). In this example, however, the matrices are all global arrays and are distributed among MPI processes. The source code for this program is shown in Listing One.
#include <stdio.h>
#include "mpi.h"
#include "global.h"
#include "ga.h"
#include "macdecls.h"

#define M 16
#define P 8
#define N 4

int main(int argc, char** argv)
{
  int i, j, k, ga_rank, ga_size, rows;
  int g_a, a_dims[] = {P, M}, a_chunk[] = {P, -1}, a_lo[2], a_hi[2], a_ld[1] = {M},
      g_b, b_dims[] = {N, P}, b_chunk[] = {-1, P}, b_lo[2], b_hi[2], b_ld[1] = {P},
      g_c, c_dims[] = {N, M}, c_chunk[] = {N, -1}, c_lo[2], c_hi[2], c_ld[1] = {M},
      g_d;
  double a[P*M], b[N*P], c[N*M];

  MPI_Init(&argc, &argv);        /* start MPI */
  GA_Initialize();               /* start global arrays */
  ga_rank = GA_Nodeid();         /* get node id */
  ga_size = GA_Nnodes();         /* get number of nodes */
  MA_init(C_DBL, 5000, 5000);    /* initialize memory allocator */

  /* create a global array A distributed in strips */
  g_a = NGA_Create(C_DBL, 2, a_dims, "Array A", a_chunk);
  /* create a global array B distributed in columns */
  g_b = NGA_Create(C_DBL, 2, b_dims, "Array B", b_chunk);
  /* create a global array C distributed in strips */
  g_c = NGA_Create(C_DBL, 2, c_dims, "Array C", c_chunk);

  /* initialize data in A and B, lead process only */
  if (ga_rank == 0) {
    for (j = 0; j < M; j++)
      for (i = 0; i < P; i++)
        a[i*M+j] = (double)j*P+i+1;
    a_lo[0] = a_lo[1] = 0;
    a_hi[0] = P-1; a_hi[1] = M-1;
    NGA_Put(g_a, a_lo, a_hi, a, a_ld);
    for (j = 0; j < P; j++)
      for (i = 0; i < N; i++)
        b[j*N+i] = (double)j*N+i+1;
    b_lo[0] = b_lo[1] = 0;
    b_hi[0] = N-1; b_hi[1] = P-1;
    NGA_Put(g_b, b_lo, b_hi, b, b_ld);
  }

  /* set all elements of C to zero */
  GA_Zero(g_c);
  /* create a duplicate of Array C (with all zeros) for use later */
  g_d = GA_Duplicate(g_c, "Array D");

  GA_Sync();       /* synchronize data across all processes (a barrier) */
  GA_Print(g_a);   /* print out global array A */
  GA_Print(g_b);   /* print out global array B */

  /* discover data distribution */
  NGA_Distribution(g_a, ga_rank, a_lo, a_hi);
  printf("I have [%d, %d] through [%d, %d] of g_a\n",
         a_lo[0], a_lo[1], a_hi[0], a_hi[1]);
  NGA_Distribution(g_b, ga_rank, b_lo, b_hi);
  printf("I have [%d, %d] through [%d, %d] of g_b\n",
         b_lo[0], b_lo[1], b_hi[0], b_hi[1]);
  NGA_Distribution(g_c, ga_rank, c_lo, c_hi);
  printf("I have [%d, %d] through [%d, %d] of g_c\n",
         c_lo[0], c_lo[1], c_hi[0], c_hi[1]);

  /* perform multiplication */
  /* exploit locality by using only the local portion of A array */
  NGA_Get(g_a, a_lo, a_hi, a, a_ld);
  /* get the entire B array */
  b_lo[0] = b_lo[1] = 0;
  b_hi[0] = N-1; b_hi[1] = P-1;
  NGA_Get(g_b, b_lo, b_hi, b, b_ld);

  for (i = 0; i < N*M; i++) c[i] = 0.;
  /* compute contribution to C array from local portion of A array */
  rows = a_hi[1] - a_lo[1] + 1;
  for (k = 0; k < rows; k++) {
    for (j = 0; j < N; j++) {
      for (i = 0; i < P; i++) {
        c[j*M+k] += a[i*a_ld[0]+k] * b[j*P+i];
#ifdef DEBUG
        printf("c(%d,%d) += a(%d,%d) * b(%d,%d) = %lg * %lg\n", j, k, i,
               k, j, i, a[i*a_ld[0]+k], b[j*P+i]);
#endif /* DEBUG */
      }
    }
  }
  NGA_Put(g_c, c_lo, c_hi, c, c_ld);

  GA_Sync();       /* synchronize data across all processes (a barrier) */
  GA_Print(g_c);   /* print out global array C */

  GA_Dgemm('N', 'N', M, N, P, 1.0, g_a, g_b, 0.0, g_d);
  GA_Print(g_d);   /* print out global array D */

  GA_Destroy(g_a); /* destroy the global array A */
  GA_Destroy(g_b); /* destroy the global array B */
  GA_Destroy(g_c); /* destroy the global array C */
  GA_Destroy(g_d); /* destroy the global array D */
  GA_Terminate();  /* stop global arrays */
  MPI_Finalize();  /* stop MPI */
}
As discussed in the previous column, source code must include the MPI header file (mpi.h), as well as three others to support the GA Toolkit (global.h, ga.h, and macdecls.h). Below these include statements, the values of M, P, and N are defined. These set the sizes of the matrices declared later in the code. Inside main(), four integers are declared to store handles for four global arrays (g_a, g_b, g_c, and g_d) representing four matrices, A, B, C, and D. Other integer variables and vectors are declared and initialized for use in GA library calls, and three local double-precision vectors representing the first three matrices are also declared.
In the first executable statement, MPI is initialized as usual. It must be initialized before the GA Toolkit. Next, GA is initialized and the node ID and number of participating nodes are queried. Then MA is initialized, and space for 5,000 double-precision heap values and 5,000 stack values is requested for each process. This is more than adequate for the small example used here. Once MA is working, the A and B matrices are created, as is their product matrix, C.
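As a minimal sketch (not part of Listing One), the required initialization order, with an explicit check on the Boolean status returned by MA_init(), might look something like the following. The stack and heap sizes here are arbitrary illustrative values, and GA_Error() is used to abort the parallel job if the allocator cannot be set up:

#include <stdio.h>
#include "mpi.h"
#include "global.h"
#include "ga.h"
#include "macdecls.h"

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);   /* MPI must come up before GA */
  GA_Initialize();          /* GA runs on top of the MPI job */

  /* MA_init() returns a true/false status; checking it is good practice */
  if (!MA_init(C_DBL, 5000, 5000))
    GA_Error("MA_init failed", 0);   /* abort the parallel job */

  /* ... create and use global arrays here ... */

  GA_Terminate();           /* shut down GA before MPI */
  MPI_Finalize();
  return 0;
}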
The NGA_Create() call requires a number of parameters, specifically a type (C_DBL for double precision), a dimension (2 for two-dimensional arrays), the actual dimensions of the array (as a vector), an array label, and a chunk vector describing the distribution of the array among processes. The call returns a handle to the newly created global array. Since A will be multiplied by B, the chunk parameter for A (a_chunk) is set to {P, -1}, and the chunk parameter for B (b_chunk) is set to {-1, P}. The -1 chunk value requests an even distribution along that dimension. Therefore, the A matrix is distributed in horizontal blocks, while the B matrix is distributed in vertical blocks evenly across processes.
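To make the effect of the chunk parameter concrete, here is a hedged sketch (not part of Listing One) of three alternative distributions of the same P x M array. The handle names are purely illustrative:

/* Illustrative only: three ways to distribute the same P x M array.
 * A -1 entry asks GA to split that dimension evenly across processes;
 * setting an entry to the full dimension size keeps it on one block. */
int dims[] = {P, M};

int chunk_dim1[]    = {P, -1};   /* keep dim 0 whole, split dim 1 (as for A) */
int chunk_dim0[]    = {-1, M};   /* keep dim 1 whole, split dim 0            */
int chunk_default[] = {-1, -1};  /* let GA pick its own block decomposition  */

int g_strips = NGA_Create(C_DBL, 2, dims, "strips",  chunk_dim1);
int g_cols   = NGA_Create(C_DBL, 2, dims, "columns", chunk_dim0);
int g_blocks = NGA_Create(C_DBL, 2, dims, "blocks",  chunk_default);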
Since A is distributed in horizontal blocks, it makes sense for C to also be distributed in horizontal blocks to maximize opportunities for data locality. As in the earlier matrix multiplication examples, each process needs a local copy of B to compute its portion of the matrix-matrix product. Once the arrays have been successfully created, they must be initialized with data values.
For simplicity, this data initialization is performed on only one of the processes, the one with a rank of zero (ga_rank=0). This “master” process populates local arrays (a and b) and then stores them into their respective global arrays using NGA_Put(). Each element of the A matrix contains its index value starting at one in rows (left to right then top to bottom), while each element of the B matrix contains its index value starting at one in columns (top to bottom then left to right). This scheme makes it easy to debug problems when first writing code.
In C, it is often simplest to use vectors, since the language doesn’t actually have multi-dimensional arrays. The GA Toolkit supports multi-dimensional arrays (since it is most often used in Fortran), and it easily transforms vectors into arrays. The NGA_Put() routine requires the handle to the desired global array (g_a), the upper-left corner of the matrix (a_lo), the lower-right corner of the matrix (a_hi), the C pointer to the data buffer (a), and the leading dimension or data stride (a_ld). These corner indices in C start at zero and go to the dimension size minus one. In both NGA_Put() calls, the entire array is being overwritten with initial values from the master process.
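The same call works for any rectangular patch, not just the whole array. The fragment below is a hedged illustration (not part of Listing One) that writes a small 2 x 3 patch into the existing global array g_a; the particular corner indices are arbitrary example values:

/* Illustrative only: write a 2 x 3 patch into global array g_a.
 * lo/hi are inclusive, zero-based corner indices; ld[0] is the stride
 * (row length) of the local buffer holding the patch. */
double patch[2*3] = { 1., 2., 3.,
                      4., 5., 6. };
int lo[2] = {4, 8};    /* upper-left corner of the patch in g_a       */
int hi[2] = {5, 10};   /* lower-right corner (2 rows by 3 columns)    */
int ld[1] = {3};       /* leading dimension of the local buffer patch */

NGA_Put(g_a, lo, hi, patch, ld);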
Next, all of the elements of the C matrix are initialized to zero using a special routine, GA_Zero(). A fourth array representing a matrix D is created as a duplicate of the C array for use later. The GA_Duplicate() call takes the handle of the originating matrix and a label for the new array as parameters and returns a handle to the new (duplicate) array. At the end of all this initialization work, a call is made to GA_Sync(), which provides a barrier ensuring that data is synchronized across all processes. Then a handy routine in GA, called GA_Print(), is used to print out all of the elements of the g_a and g_b arrays.
In typical GA programs, the programmer attempts to exploit data locality to get the best performance by querying the library to find out which portions of the arrays are local to the process. As we saw in the last column, this is accomplished by calling NGA_Distribution(). That call is usually followed by a call to NGA_Get(), which retrieves the local portion of a global array to be used in computation.
This code is no different. For each global array (g_a, g_b, and g_c), a call is made to NGA_Distribution(), using the local rank (ga_rank) as the owning process, to retrieve the corners of the rectangular region of each array that is stored locally. These indices are stored in the a_lo, a_hi, b_lo, b_hi, c_lo, and c_hi vectors, and the program subsequently prints out those values to demonstrate that the desired decomposition has been performed.
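As a hedged aside (this helper is not part of Listing One), the inclusive corner indices returned by NGA_Distribution() can be turned into a local element count with a few lines of C. The guard for an empty range is an assumption about how GA reports processes that happen to own no data:

/* Assumed helper: count the elements of a 2-D global array owned locally,
 * given the inclusive lo/hi corners returned by NGA_Distribution(). */
static int local_elements(const int lo[2], const int hi[2])
{
  /* If a process owns nothing, hi can be less than lo; treat that as 0. */
  if (hi[0] < lo[0] || hi[1] < lo[1])
    return 0;
  return (hi[0] - lo[0] + 1) * (hi[1] - lo[1] + 1);
}

A call like printf("%d owns %d elements of g_a\n", ga_rank, local_elements(a_lo, a_hi)); would then report each process’s share.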
In order to perform the matrix multiplication, the local portion of the A array is retrieved, using NGA_Get(), and stored in a. Notice that the indices in a_lo and a_hi are those retrieved from the call to NGA_Distribution(). The leading dimension or stride (a_ld) is left as initialized. Next, the entire B matrix is retrieved by every process. Here the array indices returned by NGA_Distribution() are replaced with indices covering the entire g_b array. The local copy of g_b is stored in b.
Now it’s possible to compute the local portion of the C matrix. First, the local c array (actually a vector) is initialized to zero. Second, the number of rows of the A matrix (and hence the C matrix) stored in a is calculated using the returned indices. Third, a triple nested loop performs the multiplication of each local element of a by the appropriate element of b and accumulates those values into the local c array. To see each of these elemental multiplications, simply compile with DEBUG defined; a printf() call is provided.
Finally, a call to NGA_Put() is used to store the local portion of the C matrix (stored in c) into the global array g_c. After another call to GA_Sync() to ensure that communications have completed, GA_Print() is called to print out the elements of the global array g_c representing the solution matrix C. While those steps were somewhat cumbersome (and more so in C than in Fortran), the effort is worth it for very large matrices and complex computational problems.
However, as the next line of code shows, all of the get and put calls and the elemental multiplication can be avoided, because DGEMM (a generic matrix-matrix multiplication routine) is included in the GA Toolkit. The single call to GA_Dgemm() performs the matrix-matrix multiplication on g_a and g_b, and stores the result in g_d. The first two parameters (‘N’) tell the routine not to transpose either matrix, the next three parameters (M, N, P) are the dimensions of the arrays, the 1.0 is a scale factor (called alpha), g_a and g_b represent the two matrices to multiply, 0.0 is a scale factor (called beta), and g_d contains the result.
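The operation GA_Dgemm() performs corresponds to the usual BLAS-style update C = alpha*A*B + beta*C. The plain-C serial sketch below (illustrative only, not part of the GA Toolkit or Listing One) clarifies the roles of alpha and beta:

/* Serial reference for C = alpha*A*B + beta*C, with A (m x p), B (p x n),
 * and C (m x n) all stored row-major. Illustrative only. */
void ref_dgemm(int m, int n, int p, double alpha, const double *A,
               const double *B, double beta, double *C)
{
  int i, j, k;
  for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
      double sum = 0.0;
      for (k = 0; k < p; k++)
        sum += A[i*p + k] * B[k*n + j];
      C[i*n + j] = alpha * sum + beta * C[i*n + j];
    }
  }
}

With beta set to 0.0, as in the listing, any prior contents of g_d are simply overwritten by the product.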
Next, GA_Print() is called to dump out the resulting values of the global array g_d for comparison with the previous result in g_c. To end the program, all four global arrays are destroyed by separate calls to GA_Destroy(), GA is shut down by calling GA_Terminate(), and MPI is stopped by calling MPI_Finalize(). Then the program terminates.
Figure One shows the results of compiling and running the ga_mmult.c program. The GA_Print() calls format array output so that it’s readable, and it’s clear that g_a and g_b were initialized as desired. The data distribution is also as expected, with horizontal blocks for g_a and g_c and vertical blocks for g_b. Finally, it’s clear that g_c and g_d are equal, meaning that the call to GA_Dgemm() correctly performed the same matrix-matrix multiplication.
But Wait, There’s More!
In addition to what you’ve seen here, the GA Toolkit offers a variety of higher-level tools for computational problem solving. Additional linear algebra routines (like GA_Dgemm()) are included in the package, along with interfaces to simple solvers and connections to other mathematics packages for solving eigensystems, and more. While GA is not the answer to all parallel computational science needs, it offers powerful tools for some applications, and it can provide performance as good as direct message-passing code with less work.
Download a copy and try it out on your next parallel coding problem!
Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at
forrest@climate.ornl.gov.