## Wednesday, August 04, 2010

### Using Parallel Reduction to Optimize Dijkstra

In a previous blog post I wrote about Dijkstra's algorithm. One optimisation to the naive implementation is to find the minimum-cost node in parallel, since for large graphs this is often the most time-consuming step. It is a perfect candidate for a parallel reduction operation.
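To make the target concrete, here is a minimal sketch (with made-up distances and visited flags) of the serial step being optimised - the linear scan for the cheapest unvisited node that runs once per iteration of Dijkstra's main loop:

```
#include <stdio.h>

#define N 8

int main(void) {
    /* Hypothetical state part-way through a Dijkstra run:
       tentative distances and visited flags for 8 nodes. */
    int dist[N]    = {0, 7, 3, 9, 4, 12, 6, 15};
    int visited[N] = {1, 0, 0, 0, 1, 0,  0, 0};

    /* The O(N) step we want to parallelise: scan for the cheapest unvisited node. */
    int best = -1;
    for (int v = 0; v < N; ++v)
        if (!visited[v] && (best == -1 || dist[v] < dist[best]))
            best = v;

    printf("next node to settle: %d (dist %d)\n", best, dist[best]); /* node 2, dist 3 */
    return 0;
}
```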

In serial, if we wished to sum up an array x of N numbers, the code would be:
```
int sum = 0;
for (int i = 0; i < N; ++i)
    sum += x[i];
```

However, in parallel we could speed this up by summing parts of the array and then adding those parts together. For example, to add four numbers a, b, c, d we could first compute a+b and c+d in parallel, then sum (a+b)+(c+d). This is called a 'reduction' operation, and it allows us to complete an otherwise O(N) operation in O(log(N)) parallel steps.
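As a sketch, here is that pairwise scheme written out serially for eight illustrative values; each inner-loop iteration is independent of the others, which is exactly what would let them run in parallel:

```
#include <stdio.h>

int main(void) {
    int x[8] = {3, 1, 4, 1, 5, 9, 2, 6};   /* illustrative values */

    /* Each pass halves the number of partial sums: (a+b), (c+d), ... then
       ((a+b)+(c+d)), ... The iterations within a pass are independent. */
    for (int stride = 1; stride < 8; stride *= 2)
        for (int i = 0; i + stride < 8; i += 2 * stride)
            x[i] += x[i + stride];

    printf("sum = %d\n", x[0]);   /* 31, after log2(8) = 3 passes */
    return 0;
}
```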

One particularly nifty arrangement is the "butterfly" pattern: at each step every element is combined with the partner whose index differs in exactly one bit, so after log2(n) steps every element holds the full result. It is best described with an animation.

In code this can be done as:
```
/* n is a power of two; sum[] holds the inputs and t[] is a temporary of the same size */
for (int b = n/2; b > 0; b /= 2) {
    for (int i = 0; i < n; i++) {
        t[i] = OP(sum[i], sum[i^b]);   /* combine element i with its partner i^b */
    }
    for (int i = 0; i < n; i++) {
        sum[i] = t[i];                 /* copy back so the next pass sees the new values */
    }
}
```
(Where 'OP' can be any associative, commutative operation, e.g. add, min, max.)
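A self-contained version of the loop above, using the illustrative values 1 to 8 and printing the array after each pass (so the passes can be checked against the walk-through below), might look like this:

```
#include <stdio.h>

#define OP(a, b) ((a) + (b))   /* addition, for this walk-through */

int main(void) {
    int sum[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* illustrative values */
    int t[8];
    int n = 8;

    for (int b = n / 2; b > 0; b /= 2) {
        for (int i = 0; i < n; i++)
            t[i] = OP(sum[i], sum[i ^ b]);   /* pair element i with its partner i^b */
        for (int i = 0; i < n; i++)
            sum[i] = t[i];

        for (int i = 0; i < n; i++)
            printf("%d ", sum[i]);           /* state after this pass */
        printf("\n");
    }
    return 0;
}
```
After the final pass every slot holds 36, the sum of 1 to 8, reached in log2(8) = 3 passes.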

To see how this works, assume we are adding 8 elements (n=8). For the first step:
b is 0100
and each element i is combined with its partner i^b:
```
0000^0100:0100
0001^0100:0101
0010^0100:0110
0011^0100:0111
0100^0100:0000
0101^0100:0001
0110^0100:0010
0111^0100:0011
```
or in decimal:
```
t[0] = sum[0] + sum[4]
t[1] = sum[1] + sum[5]
t[2] = sum[2] + sum[6]
t[3] = sum[3] + sum[7]
t[4] = sum[4] + sum[0]
t[5] = sum[5] + sum[1]
t[6] = sum[6] + sum[2]
t[7] = sum[7] + sum[3]
```
Now, for the next step:
b is 0010
```
0000^0010:0010
0001^0010:0011
0010^0010:0000
0011^0010:0001
0100^0010:0110
0101^0010:0111
0110^0010:0100
0111^0010:0101
```
Again, in decimal (note that sum[] now holds the partial sums produced by the previous step):
```
t[0] = sum[0] + sum[2]
t[1] = sum[1] + sum[3]
t[2] = sum[2] + sum[0]
t[3] = sum[3] + sum[1]
t[4] = sum[4] + sum[6]
t[5] = sum[5] + sum[7]
t[6] = sum[6] + sum[4]
t[7] = sum[7] + sum[5]
```
Finally,
b is 0001
```
0000^0001:0001
0001^0001:0000
0010^0001:0011
0011^0001:0010
0100^0001:0101
0101^0001:0100
0110^0001:0111
0111^0001:0110
```
And in decimal:
```
t[0] = sum[0] + sum[1]
t[1] = sum[1] + sum[0]
t[2] = sum[2] + sum[3]
t[3] = sum[3] + sum[2]
t[4] = sum[4] + sum[5]
t[5] = sum[5] + sum[4]
t[6] = sum[6] + sum[7]
t[7] = sum[7] + sum[6]
```

Thus, in only 3 parallel steps we have added 8 elements together. As a bonus, the result has been broadcast to all eight parallel adding elements - allowing each processor to use the result in future operations or, on a SIMD system, to mask the appropriate processors.
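To connect this back to Dijkstra: with OP as min, the broadcast property means every element finishes holding the global minimum, so each processor can test locally whether it owns the minimum-cost node. A minimal sketch with hypothetical costs:

```
#include <stdio.h>

#define N 8
#define OP(a, b) ((a) < (b) ? (a) : (b))   /* min: what Dijkstra needs */

int main(void) {
    /* Illustrative tentative costs, one per "processor". */
    int cost[N] = {7, 3, 9, 1, 8, 5, 2, 6};
    int sum[N], t[N];
    for (int i = 0; i < N; i++)
        sum[i] = cost[i];

    /* Butterfly min-reduction: afterwards every slot holds the global minimum. */
    for (int b = N / 2; b > 0; b /= 2) {
        for (int i = 0; i < N; i++)
            t[i] = OP(sum[i], sum[i ^ b]);
        for (int i = 0; i < N; i++)
            sum[i] = t[i];
    }

    /* Because the result was broadcast, each processor can decide locally
       whether it holds the minimum-cost node (the SIMD "mask" test). */
    for (int i = 0; i < N; i++)
        if (cost[i] == sum[i])
            printf("element %d holds the minimum cost %d\n", i, cost[i]);
    return 0;
}
```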

Here is the code in CUDA:

```
int i = threadIdx.x;           // thread i holds value x_i, one thread per element
__shared__ int sum[blocksize]; // blocksize = threads per block (assumed compile-time constant)
sum[i] = x_i;                  // copy this thread's value into shared memory