Wednesday, August 04, 2010

Using Parallel Reduction to Optimize Dijkstra

In a previous blog post I wrote about Dijkstra's algorithm. One optimisation to the naive implementation is to find the minimum-cost node in parallel, since for large graphs this is often the most time-consuming operation. It is a perfect candidate for a parallel reduction.
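For context, the bottleneck in a naive implementation is the linear scan for the closest unvisited node. A sketch (the dist, visited and numVertices names are mine, not from the earlier post):

// O(V) scan per iteration of Dijkstra's main loop - the part worth parallelising
int best = -1;
int bestDist = INT_MAX;   // from <limits.h>
for (int v = 0; v < numVertices; v++) {
    if (!visited[v] && dist[v] < bestDist) {
        bestDist = dist[v];
        best = v;
    }
}
// 'best' is the next node to settle

Finding a minimum has exactly the same structure as the summation example below; only the operation changes.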

In serial, if we wished to sum up an array (x) of numbers, the code would be:
int sum = 0;
for (int i = 0; i < N; ++i)
    sum += x[i];

However, in parallel we could speed this up by summing parts of the array and then adding those partial sums together. For example, to add four numbers a, b, c, d we could first compute a+b and c+d in parallel, then sum (a+b)+(c+d). This is called a 'reduction' operation, and with enough processors it lets us complete an otherwise O(N) operation in O(log N) steps.
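The simplest arrangement is a tree, where each round halves the number of active elements and the total ends up in x[0]. A quick sketch (written serially here, but the iterations of each inner loop are independent, so they can all run at once; n is assumed to be a power of two):

for (int s = n / 2; s > 0; s /= 2) {
    for (int i = 0; i < s; i++)
        x[i] = x[i] + x[i + s];   // the s additions in this round are independent
}
// x[0] now holds the sum of all n elements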

One particularly nifty pattern is the "Butterfly" pattern of adding elements. Best described with an animation:

[Animation: butterfly exchange pattern combining eight elements over three rounds]
In code this can be done as:
// n must be a power of two; t[] is a scratch array the same size as sum[]
for (int b = n / 2; b > 0; b /= 2) {
    for (int i = 0; i < n; i++)
        t[i] = OP(sum[i], sum[i ^ b]);   // combine element i with its butterfly partner
    for (int i = 0; i < n; i++)
        sum[i] = t[i];                   // copy back before the next round
}
(Where 'OP' can be any associative, commutative operation, e.g. add, multiply, min, max - for Dijkstra it would be min.)
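To convince yourself the loop works, here is a minimal serial harness (my own test scaffolding, with OP defined as plain addition):

#include <stdio.h>

#define OP(a, b) ((a) + (b))

int main(void) {
    const int n = 8;
    int sum[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int t[8];
    for (int b = n / 2; b > 0; b /= 2) {
        for (int i = 0; i < n; i++)
            t[i] = OP(sum[i], sum[i ^ b]);
        for (int i = 0; i < n; i++)
            sum[i] = t[i];
    }
    printf("%d\n", sum[0]);   // prints 36, and every sum[i] holds the same total
    return 0;
}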


To see how this works, assume we are adding 8 elements (n=8). For the first step:
b is 0100
and each element i is combined with its partner i^b:
0000^0100:0100
0001^0100:0101
0010^0100:0110
0011^0100:0111
0100^0100:0000
0101^0100:0001
0110^0100:0010
0111^0100:0011
or in decimal:
t[0] = [0] + [4]
t[1] = [1] + [5] 
t[2] = [2] + [6]
t[3] = [3] + [7]
t[4] = [4] + [0]
t[5] = [5] + [1]
t[6] = [6] + [2]
t[7] = [7] + [3]
Now, for the next step:
b is 0010
0000^0010:0010
0001^0010:0011
0010^0010:0000
0011^0010:0001
0100^0010:0110
0101^0010:0111
0110^0010:0100
0111^0010:0101
Again, in decimal:
t[0] = [0] + [2]
t[1] = [1] + [3]
t[2] = [2] + [0]
t[3] = [3] + [1]
t[4] = [4] + [6]
t[5] = [5] + [7]
t[6] = [6] + [4]
t[7] = [7] + [5]
Finally,
b is 0001
0000^0001:0001
0001^0001:0000
0010^0001:0011
0011^0001:0010
0100^0001:0101
0101^0001:0100
0110^0001:0111
0111^0001:0110
And in decimal:
t[0] = [0] + [1]
t[1] = [1] + [0]
t[2] = [2] + [3]
t[3] = [3] + [2]
t[4] = [4] + [5]
t[5] = [5] + [4]
t[6] = [6] + [7]
t[7] = [7] + [6]

Thus, in only 3 parallel steps we have added 8 elements together. We also get the bonus of having broadcast the result to all eight of the parallel adding elements - allowing each processor to use the result in future operations, or, on a SIMD system, to mask the appropriate processors.

Here is the code in CUDA:

#define blocksize 256             // threads per block; must be a power of two
#define OP(a, b) min((a), (b))    // e.g. min for Dijkstra; any associative, commutative op works

__global__ void reduce(const int *x) {
    int i = threadIdx.x;              // thread i holds value x[i], one thread per element
    __shared__ int sum[blocksize];
    sum[i] = x[i];                    // copy to shared memory
    __syncthreads();                  // synchronise
    for (int bit = blocksize / 2; bit > 0; bit /= 2) {
        int t = OP(sum[i], sum[i ^ bit]);
        __syncthreads();              // all reads must finish before any write
        sum[i] = t;
        __syncthreads();              // all writes must finish before the next read
    }
    // every sum[i] now holds the reduction of the entire block
}
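
One wrinkle when applying this to Dijkstra: we need to know which node achieved the minimum distance, not just the distance itself. One option (a sketch of my own, not something from the original code) is to pack the distance into the high bits of an integer and the node index into the low bits, so that min on the packed values also selects the node:

// Hypothetical packing: assumes distances and node indices each fit in 16 bits.
__device__ unsigned int pack(unsigned int dist, unsigned int node) {
    return (dist << 16) | (node & 0xFFFFu);   // distance sits in the high bits, so it dominates min()
}
__device__ unsigned int unpackNode(unsigned int packed) {
    return packed & 0xFFFFu;                  // recover the winning node index
}

Run the butterfly with OP = min over packed values, and every thread finishes holding both the minimum distance and the node to settle next.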

Nifty!
