Global Arrays Toolkit, Part Two

Tackle a more complex and realistic Global Arrays Toolkit program, one that performs matrix-matrix multiplication.
This is the second column in the series about the Global Arrays (GA) Toolkit. The GA Toolkit, introduced in last month’s column, is an application programming interface (API) for handling shared data structures in a distributed computing environment like a Linux cluster. GA essentially provides one-sided communications for array data without requiring you to write explicit message passing code.
Like Unified Parallel C (UPC), described here in earlier columns, GA provides data distribution information to the application so that data locality can be exploited to maximize performance. While UPC offers a more familiar programming style (being implicitly parallel), the GA Toolkit works in conjunction with a traditional message passing API, such as the Message Passing Interface (MPI), to provide both shared-memory and message-passing paradigms in the same program.
The GA Toolkit, developed at the U.S. Department of Energy’s Pacific Northwest National Laboratory (PNNL), uses the Aggregate Remote Memory Copy Interface (ARMCI) library, which provides general-purpose, efficient, and portable remote memory access (RMA) operations through one-sided communications. ARMCI takes advantage of the native network interfaces on clusters and supercomputers, including low-latency, high-bandwidth interconnects. GA also works in conjunction with the Memory Allocator (MA), a collection of library routines that perform dynamic memory allocation for C, Fortran, and mixed-language applications. GA uses MA to provide all of its dynamically allocated, local memory.
GA must be used in combination with some message passing library, and last month’s column included instructions for downloading and installing the toolkit on a Linux cluster on top of MPICH2, the popular MPI-2 library from Argonne National Laboratory (ANL). The GA Toolkit and associated libraries are available at http://www.emsl.pnl.gov/docs/global/. In addition, a simple “Hello, World!” program using the GA Toolkit was included in the column, demonstrating how to initialize MPI, GA, and MA; how to allocate a small shared global array and discover its distribution across processes; and how to finalize and end such a program.

A More Realistic Example

This month, let’s tackle a more complex and realistic GA program, one that performs matrix-matrix multiplication, like the example code used to demonstrate programming in UPC. This program multiplies two matrices, A (m x p) and B (p x n), and saves the result in a third matrix, C (m x n). In this example, however, the matrices are all global arrays and are distributed among MPI processes. The source code for this program is shown in Listing One.
Listing One: ga_mmult.c, a program that performs matrix-matrix multiplication using Global Arrays

#include <stdio.h>
#include "mpi.h"
#include "global.h"
#include "ga.h"
#include "macdecls.h"

#define M 16
#define P 8
#define N 4

int main(int argc, char** argv)
{
  int i, j, k, ga_rank, ga_size, rows;
  int g_a, a_dims[] = {P, M}, a_chunk[] = {P, -1}, a_lo[2], a_hi[2], a_ld[1] = {M},
      g_b, b_dims[] = {N, P}, b_chunk[] = {-1, P}, b_lo[2], b_hi[2], b_ld[1] = {P},
      g_c, c_dims[] = {N, M}, c_chunk[] = {N, -1}, c_lo[2], c_hi[2], c_ld[1] = {M},
      g_d;
  double a[P*M], b[N*P], c[N*M];

  MPI_Init(&argc, &argv);      /* start MPI */
  GA_Initialize();             /* start global arrays */
  ga_rank = GA_Nodeid();       /* get node id */
  ga_size = GA_Nnodes();       /* get number of nodes */
  MA_init(C_DBL, 5000, 5000);  /* initialize memory allocator */

  /* create a global array A distributed in strips */
  g_a = NGA_Create(C_DBL, 2, a_dims, "Array A", a_chunk);
  /* create a global array B distributed in columns */
  g_b = NGA_Create(C_DBL, 2, b_dims, "Array B", b_chunk);
  /* create a global array C distributed in strips */
  g_c = NGA_Create(C_DBL, 2, c_dims, "Array C", c_chunk);

  /* initialize data in A and B, lead process only */
  if (ga_rank == 0) {
    for (j = 0; j < M; j++)
      for (i = 0; i < P; i++)
        a[i*M+j] = (double)j*P+i+1;
    a_lo[0] = a_lo[1] = 0;
    a_hi[0] = P-1; a_hi[1] = M-1;
    NGA_Put(g_a, a_lo, a_hi, a, a_ld);
    for (j = 0; j < P; j++)
      for (i = 0; i < N; i++)
        b[j*N+i] = (double)j*N+i+1;
    b_lo[0] = b_lo[1] = 0;
    b_hi[0] = N-1; b_hi[1] = P-1;
    NGA_Put(g_b, b_lo, b_hi, b, b_ld);
  }

  /* set all elements of C to zero */
  GA_Zero(g_c);
  /* Create a duplicate of Array C (with all zeros) for use later */
  g_d = GA_Duplicate(g_c, "Array D");
  GA_Sync();      /* synchronize data across all processes (a barrier) */
  GA_Print(g_a);  /* print out global array A */
  GA_Print(g_b);  /* print out global array B */

  /* discover data distribution */
  NGA_Distribution(g_a, ga_rank, a_lo, a_hi);
  printf("I have [%d, %d] through [%d, %d] of g_a\n",
         a_lo[0], a_lo[1], a_hi[0], a_hi[1]);
  NGA_Distribution(g_b, ga_rank, b_lo, b_hi);
  printf("I have [%d, %d] through [%d, %d] of g_b\n",
         b_lo[0], b_lo[1], b_hi[0], b_hi[1]);
  NGA_Distribution(g_c, ga_rank, c_lo, c_hi);
  printf("I have [%d, %d] through [%d, %d] of g_c\n",
         c_lo[0], c_lo[1], c_hi[0], c_hi[1]);

  /* perform multiplication */

  /* exploit locality by using only the local portion of A array */
  NGA_Get(g_a, a_lo, a_hi, a, a_ld);
  /* get the entire B array */
  b_lo[0] = b_lo[1] = 0;
  b_hi[0] = N-1; b_hi[1] = P-1;
  NGA_Get(g_b, b_lo, b_hi, b, b_ld);
  for (i = 0; i < N*M; i++) c[i] = 0.;
  /* compute contribution to C array from local portion of A array */
  rows = a_hi[1] - a_lo[1] + 1;
  for (k = 0; k < rows; k++) {
    for (j = 0; j < N; j++) {
      for (i = 0; i < P; i++) {
        c[j*M+k] += a[i*a_ld[0]+k] * b[j*P+i];
#ifdef DEBUG
        printf("c(%d,%d) += a(%d,%d) * b(%d,%d) = %lg * %lg\n", j, k, i,
               k, j, i, a[i*a_ld[0]+k], b[j*P+i]);
#endif /* DEBUG */
      }
    }
  }
  NGA_Put(g_c, c_lo, c_hi, c, c_ld);
  GA_Sync();      /* synchronize data across all processes (a barrier) */
  GA_Print(g_c);  /* print out global array C */

  GA_Dgemm('N', 'N', M, N, P, 1.0, g_a, g_b, 0.0, g_d);
  GA_Print(g_d);  /* print out global array D */

  GA_Destroy(g_a); /* destroy the global array A */
  GA_Destroy(g_b); /* destroy the global array B */
  GA_Destroy(g_c); /* destroy the global array C */
  GA_Destroy(g_d); /* destroy the global array D */

  GA_Terminate();  /* stop global arrays */
  MPI_Finalize();  /* stop MPI */
}

As discussed in the previous column, source code must include the MPI header file (mpi.h), as well as three others to support the GA Toolkit (global.h, ga.h, and macdecls.h). Below these include statements, the values of M, P, and N are defined; they set the sizes of the matrices declared later in the code. Inside main(), four integers are declared to store handles for four global arrays (g_a, g_b, g_c, and g_d) representing four matrices, A, B, C, and D. Other integer variables and vectors are declared and initialized for use in GA library calls, and three local double-precision vectors representing the first three matrices are also declared.
In the first executable statement, MPI is initialized as usual. It must be initialized before the GA Toolkit. Next, GA is initialized and the node ID and number of participating nodes are queried. Then MA is initialized, and space for 5,000 double-precision values each for the heap and the stack is requested for each process. This is more than adequate for the small example used here. Once MA is working, the A and B matrices are created, as is their product matrix C.
The NGA_Create() call requires a number of parameters, specifically a type (C_DBL for double precision), a dimension (2 for two-dimensional arrays), the actual dimensions of the array (as a vector), an array label, and a chunk vector describing the distribution of the array among processes. The call returns a handle to the newly created global array. Since A will be multiplied by B, the chunk parameter for A (a_chunk) is set to {P, -1}, and the chunk parameter for B (b_chunk) is set to {-1, P}. A chunk value of -1 requests an even distribution along that dimension. Therefore, the A matrix is distributed in horizontal blocks, while the B matrix is distributed in vertical blocks evenly across processes.
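To make the chunk semantics concrete, here is a minimal sketch (not part of Listing One; the array labels and handle names are purely illustrative) showing how different chunk vectors change the distribution that NGA_Create() produces:

/* Hedged sketch: chunk vectors passed to NGA_Create(). An entry of -1
   asks GA to split that dimension evenly across processes, while an
   entry equal to the full dimension size keeps that dimension whole. */
int dims[]   = {P, M};
int strips[] = {P, -1};   /* dimension 0 kept whole, dimension 1 split: the A layout */
int blocks[] = {-1, -1};  /* let GA split both dimensions evenly */

int g_strips = NGA_Create(C_DBL, 2, dims, "Strips", strips);
int g_blocks = NGA_Create(C_DBL, 2, dims, "Blocks", blocks);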
Since A is distributed in horizontal blocks, it makes sense for C to also be distributed in horizontal blocks to maximize opportunities for data locality. As in previous versions of matrix multiplication examples, a local copy of B is needed on each process to complete each portion of the matrix-matrix multiplication. Once the arrays have been successfully created, they must be initialized with data values.
For simplicity, this data initialization is performed on only one of the processes, the one with a rank of zero (ga_rank=0). This “master” process populates local arrays (a and b) and then stores them into their respective global arrays using NGA_Put(). Each element of the A matrix contains its index value starting at one in rows (left to right then top to bottom), while each element of the B matrix contains its index value starting at one in columns (top to bottom then left to right). This scheme makes it easy to debug problems when first writing code.
In C, it is often simplest to use vectors, since the language doesn’t actually have multi-dimensional arrays. The GA Toolkit supports multi-dimensional arrays (since it is most often used in Fortran), and it easily transforms vectors into arrays. The NGA_Put() routine requires the handle to the desired global array (g_a), the upper-left corner of the matrix (a_lo), the lower-right corner of the matrix (a_hi), the C pointer to the data buffer (a), and the leading dimension or data stride (a_ld). These corner indices in C start at zero and go to the dimension size minus one. In both NGA_Put() calls, the entire array is being overwritten with initial values from the master process.
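To make that buffer convention concrete, the following sketch (not from Listing One; g is a hypothetical handle to a previously created two-dimensional global array) writes a 2 x 3 patch into the upper-left corner of the array. Element (i, j) of the patch lives at buf[i*ld[0] + j], where ld[0] is the extent of the buffer’s last, fastest-varying dimension:

/* Hedged sketch: store a small 2 x 3 patch into global array g. */
double buf[2*3];
int lo[] = {0, 0}, hi[] = {1, 2}, ld[] = {3};
int i, j;

for (i = 0; i < 2; i++)
  for (j = 0; j < 3; j++)
    buf[i*ld[0] + j] = 10.0*i + j;   /* patch element (i, j) */

NGA_Put(g, lo, hi, buf, ld);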
Next, all of the elements of the C matrix are initialized to zero using a special routine, GA_Zero(). A fourth array representing a matrix D is created as a duplicate of the C array for use later. The GA_Duplicate() call takes the handle of the originating matrix and a label for the new array as parameters and returns a handle to the new (duplicate) array. At the end of all this initialization work, a call is made to GA_Sync(), which provides a barrier ensuring that data is synchronized across all processes. Then a handy routine in GA, called GA_Print(), is used to print out all of the elements of the g_a and g_b arrays.
In typical GA programs, the programmer attempts to exploit data locality to get the best performance by querying the library to find out which portions of the arrays are local to the process. As we saw in the last column, this is accomplished by calling NGA_Distribution(). That call is usually followed by a call to NGA_Get(), which retrieves the local portion of a global array to be used in computation.
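In outline, that idiom looks like the following sketch (not taken from the listing; g is assumed to be a previously created two-dimensional global array and buf a local buffer large enough to hold one process’s patch):

/* Hedged sketch of the query/get/compute/put locality idiom. */
int lo[2], hi[2], ld[1];

NGA_Distribution(g, GA_Nodeid(), lo, hi); /* which patch do I own?      */
ld[0] = hi[1] - lo[1] + 1;                /* leading dimension of a tightly packed patch */
NGA_Get(g, lo, hi, buf, ld);              /* fetch my local patch       */
/* ... compute on buf ... */
NGA_Put(g, lo, hi, buf, ld);              /* store results back         */
GA_Sync();                                /* make the updates visible   */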
This code is no different. For each global array (g_a, g_b, and g_c), a call is made to NGA_Distribution, using the local rank (ga_rank) as the owning process, to retrieve the corners of the rectangular region of each array that is local. These indices are stored in the a_lo, a_hi, b_lo, b_hi, c_lo, and c_hi vectors, and this program subsequently prints out those values to demonstrate that the desired decomposition has been performed.
In order to perform the matrix multiplication, the local portion of the A array is retrieved, using NGA_Get(), and stored in a. Notice that the indices in a_lo and a_hi are those retrieved from the call to NGA_Distribution(). The leading dimension or stride (a_ld) is left as initialized. Next, the entire B matrix is retrieved by every process. Here the array indices returned by NGA_Distribution() are replaced with indices covering the entire g_b array. The local copy of g_b is stored in b.
Now it’s possible to compute the local portion of the C matrix. First, the local c array (actually a vector) is initialized to zero. Second, the number of rows of the A matrix (and hence the C matrix) stored in a is calculated using the returned indices. Third, a triple nested loop performs the multiplication of each local element of a by the appropriate element of b and accumulates those values into the local c array. To see each of these elemental multiplications, simply compile with DEBUG defined; a printf() call is provided.
Finally, a call to NGA_Put() is used to store the local portion of the C matrix (stored in c) into the global array g_c. After another call to GA_Sync() to ensure that communications have completed, GA_Print() is called to print out the elements of the global array g_c representing the solution matrix C. While those steps were somewhat cumbersome (and more so in C than in Fortran), the effort is worth it for very large matrices and complex computational problems.
However, as you see in the next line of code, all of the get and put calls and elemental multiplication can be avoided because DGEMM (a generic matrix-matrix multiplication routine) is included in the GA Toolkit. The single call to GA_Dgemm() performs the matrix-matrix multiplication on g_a and g_b, and stores the result in g_d. The first two parameters (‘N’) tell the routine not to transpose either matrix, the next three parameters (M, N, P) are the dimensions of the arrays, the 1.0 is a scale factor (called alpha), g_a and g_b represent the two matrices to multiply, 0.0 is a scale factor (called beta), and g_d contains the result.
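For readers who don’t work with BLAS routines regularly, the following serial sketch (mine, not part of the GA Toolkit or the listing) spells out what such a call computes under the conventional DGEMM definition, D = alpha*A*B + beta*D, when neither matrix is transposed:

/* Hedged serial sketch of D = alpha*A*B + beta*D for an m x p matrix A,
   a p x n matrix B, and an m x n matrix D, all stored row-major. */
void serial_dgemm(int m, int n, int p, double alpha, const double *A,
                  const double *B, double beta, double *D)
{
  int i, j, k;
  double sum;

  for (i = 0; i < m; i++)
    for (j = 0; j < n; j++) {
      sum = 0.0;
      for (k = 0; k < p; k++)
        sum += A[i*p + k] * B[k*n + j];
      D[i*n + j] = alpha*sum + beta*D[i*n + j];
    }
}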
Next, GA_Print() is called to dump out the resulting values of the global array g_d for comparison with the previous result in g_c. To end the program, all four global arrays are destroyed by separate calls to GA_Destroy(), the GA Toolkit is shut down by calling GA_Terminate(), and MPI is stopped by calling MPI_Finalize(). Then the program terminates.
Figure One shows the results of compiling and running the ga_mmult.c program. The GA_Print() calls format array output so that it’s readable, and it’s clear that g_a and g_b were initialized as desired. The data distribution is also as desired, with horizontal blocks for g_a and g_c and vertical blocks for g_b. Finally, it’s clear that g_c and g_d are equal, meaning that the call to GA_Dgemm() performed the same matrix-matrix multiplication correctly.
FIGURE ONE: The results of compiling and running the ga_mmult.c program
[ga]$ mpicc -c -I/usr/local/ga/ga-4-0-mpich2-gcc/include ga_mmult.c

[ga]$ mpif77 -o ga_mmult.x ga_mmult.o \
-L/usr/local/ga/ga-4-0-mpich2-gcc/lib/LINUX64 \
-lglobal -llinalg -lma -larmci -lm

[ga]$ mpiexec -l -n 4 -path `pwd` ga_mmult.x

0: ARMCI configured for 2 cluster nodes. Network protocol is 'TCP/IP Sockets'.
0:
0: global array: Array A[1:16,1:8], handle: -1000
0:
0: 1 2 3 4 5 6
0: ----------- ----------- ----------- ----------- ----------- -----------
0: 1 1.00000 2.00000 3.00000 4.00000 5.00000 6.00000
0: 2 9.00000 10.00000 11.00000 12.00000 13.00000 14.00000
0: 3 17.00000 18.00000 19.00000 20.00000 21.00000 22.00000
0: 4 25.00000 26.00000 27.00000 28.00000 29.00000 30.00000
0: 5 33.00000 34.00000 35.00000 36.00000 37.00000 38.00000
0: 6 41.00000 42.00000 43.00000 44.00000 45.00000 46.00000
0: 7 49.00000 50.00000 51.00000 52.00000 53.00000 54.00000
0: 8 57.00000 58.00000 59.00000 60.00000 61.00000 62.00000
0: 9 65.00000 66.00000 67.00000 68.00000 69.00000 70.00000
0: 10 73.00000 74.00000 75.00000 76.00000 77.00000 78.00000
0: 11 81.00000 82.00000 83.00000 84.00000 85.00000 86.00000
0: 12 89.00000 90.00000 91.00000 92.00000 93.00000 94.00000
0: 13 97.00000 98.00000 99.00000 100.00000 101.00000 102.00000
0: 14 105.00000 106.00000 107.00000 108.00000 109.00000 110.00000
0: 15 113.00000 114.00000 115.00000 116.00000 117.00000 118.00000
0: 16 121.00000 122.00000 123.00000 124.00000 125.00000 126.00000
0:
0: 7 8
0: ----------- -----------
0: 1 7.00000 8.00000
0: 2 15.00000 16.00000
0: 3 23.00000 24.00000
0: 4 31.00000 32.00000
0: 5 39.00000 40.00000
0: 6 47.00000 48.00000
0: 7 55.00000 56.00000
0: 8 63.00000 64.00000
0: 9 71.00000 72.00000
0: 10 79.00000 80.00000
0: 11 87.00000 88.00000
0: 12 95.00000 96.00000
0: 13 103.00000 104.00000
0: 14 111.00000 112.00000
0: 15 119.00000 120.00000
0: 16 127.00000 128.00000
0:
0: global array: Array B[1:8,1:4], handle: -999
0:
0: 1 2 3 4
0: ----------- ----------- ----------- -----------
0: 1 1.00000 9.00000 17.00000 25.00000
0: 2 2.00000 10.00000 18.00000 26.00000
0: 3 3.00000 11.00000 19.00000 27.00000
0: 4 4.00000 12.00000 20.00000 28.00000
0: 5 5.00000 13.00000 21.00000 29.00000
0: 6 6.00000 14.00000 22.00000 30.00000
0: 7 7.00000 15.00000 23.00000 31.00000
0: 8 8.00000 16.00000 24.00000 32.00000
0: I have [0, 0] through [7, 3] of g_a
0: I have [0, 0] through [0, 7] of g_b
0: I have [0, 0] through [3, 3] of g_c
2: I have [0, 8] through [7, 11] of g_a
1: I have [0, 4] through [7, 7] of g_a
2: I have [2, 0] through [2, 7] of g_b
1: I have [1, 0] through [1, 7] of g_b
2: I have [0, 8] through [3, 11] of g_c
1: I have [0, 4] through [3, 7] of g_c
3: I have [0, 12] through [7, 15] of g_a
3: I have [3, 0] through [3, 7] of g_b
3: I have [0, 12] through [3, 15] of g_c
0:
0: global array: Array C[1:16,1:4], handle: -998
0:
0: 1 2 3 4
0: ----------- ----------- ----------- -----------
0: 1 204.00000 492.00000 780.00000 1068.00000
0: 2 492.00000 1292.00000 2092.00000 2892.00000
0: 3 780.00000 2092.00000 3404.00000 4716.00000
0: 4 1068.00000 2892.00000 4716.00000 6540.00000
0: 5 1356.00000 3692.00000 6028.00000 8364.00000
0: 6 1644.00000 4492.00000 7340.00000 10188.00000
0: 7 1932.00000 5292.00000 8652.00000 12012.00000
0: 8 2220.00000 6092.00000 9964.00000 13836.00000
0: 9 2508.00000 6892.00000 11276.00000 15660.00000
0: 10 2796.00000 7692.00000 12588.00000 17484.00000
0: 11 3084.00000 8492.00000 13900.00000 19308.00000
0: 12 3372.00000 9292.00000 15212.00000 21132.00000
0: 13 3660.00000 10092.00000 16524.00000 22956.00000
0: 14 3948.00000 10892.00000 17836.00000 24780.00000
0: 15 4236.00000 11692.00000 19148.00000 26604.00000
0: 16 4524.00000 12492.00000 20460.00000 28428.00000
0:
0: global array: Array D[1:16,1:4], handle: -997
0:
0: 1 2 3 4
0: ----------- ----------- ----------- -----------
0: 1 204.00000 492.00000 780.00000 1068.00000
0: 2 492.00000 1292.00000 2092.00000 2892.00000
0: 3 780.00000 2092.00000 3404.00000 4716.00000
0: 4 1068.00000 2892.00000 4716.00000 6540.00000
0: 5 1356.00000 3692.00000 6028.00000 8364.00000
0: 6 1644.00000 4492.00000 7340.00000 10188.00000
0: 7 1932.00000 5292.00000 8652.00000 12012.00000
0: 8 2220.00000 6092.00000 9964.00000 13836.00000
0: 9 2508.00000 6892.00000 11276.00000 15660.00000
0: 10 2796.00000 7692.00000 12588.00000 17484.00000
0: 11 3084.00000 8492.00000 13900.00000 19308.00000
0: 12 3372.00000 9292.00000 15212.00000 21132.00000
0: 13 3660.00000 10092.00000 16524.00000 22956.00000
0: 14 3948.00000 10892.00000 17836.00000 24780.00000
0: 15 4236.00000 11692.00000 19148.00000 26604.00000
0: 16 4524.00000 12492.00000 20460.00000 28428.00000

But Wait, There’s More!

In addition to what you’ve seen here, the GA Toolkit offers a variety of higher-level tools for computational problem solving. Additional linear algebra routines (like GA_Dgemm()) are included in the package, along with interfaces to simple solvers and connections to other mathematics packages for solving eigensystems, and more. While GA is not the answer to all parallel computational science needs, for some applications it offers some powerful tools, and can provide performance as good as direct message-passing code with less work.
Download a copy and try it out on your next parallel coding problem!

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov.
