## Wednesday, August 04, 2010

### Using Parallel Reduction to Optimize Dijkstra

In a previous blog post I wrote about Dijkstra's algorithm. One optimisation to the naive implementation is to calculate the minimum cost node in parallel - since this can often be the most time consuming operation for large graphs. This is a perfect candidate for a parallel reduction operation.

In serial, if we wished to sum up an array (x) of numbers, the code would be:
```int sum = 0;
for(int i=0; i < N; ++i)
sum += x[i];
```

However, in parallel we could speed this up by summing parts of the array and then adding those parts together. For example, to add four numbers a,b,c,d we could first add a+b and c+d in parallel, then sum ab+cd. This is called a 'reduction' operation, and it allows us to complete an otherwise O(N) operation in O(log(N)) steps.

One particularly nifty pattern is the "Butterfly" pattern of adding elements. Best described with an animation:

In code this can be done as:
```for (b=n/2;b>0;b/=2) {
for (i=0;i<n;i++) {
t[i] = OP(sum[i],sum[i^b]);
}
for (i=0;i<n;i++) {
sum[i]=t[i];
}
}```
(Where 'OP' can be any serial commutative operation, e.g. add, subtract, etc.)

To see how this works assume we are adding 8 elements (n=8) then for the first step:
b is 0100
and each subsequent operation will be operating on the following elements:
```0000^0100:0100
0001^0100:0101
0010^0100:0110
0011^0100:0111
0100^0100:0000
0101^0100:0001
0110^0100:0010
0111^0100:0011
```
or in decimal:
```t[0] = [0] + [4]
t[1] = [1] + [5]
t[2] = [2] + [6]
t[3] = [3] + [7]
t[4] = [4] + [0]
t[5] = [5] + [1]
t[6] = [6] + [2]
t[7] = [7] + [3]
```
Now, for the next step:
b is 0010
```0000^0010:0010
0001^0010:0011
0010^0010:0000
0011^0010:0001
0100^0010:0110
0101^0010:0111
0110^0010:0100
0111^0010:0101
```
Again, in decimal:
```t[0] = [0] + [2]
t[1] = [1] + [3]
t[2] = [2] + [0]
t[3] = [3] + [1]
t[4] = [4] + [6]
t[5] = [5] + [7]
t[6] = [6] + [4]
t[7] = [7] + [5]
```
Finally,
b is 0001
```0000^0001:0001
0001^0001:0000
0010^0001:0011
0011^0001:0010
0100^0001:0101
0101^0001:0100
0110^0001:0111
0111^0001:0110
```
And in decimal:
```t[0] = [0] + [1]
t[1] = [1] + [0]
t[2] = [2] + [3]
t[3] = [3] + [2]
t[4] = [4] + [5]
t[5] = [5] + [4]
t[6] = [6] + [7]
t[7] = [7] + [6]
```

Thus, in only 3 parallel operations we have added 8 elements together. We also have the additional bonus of having broadcast the result to all of the eight parallel adding elements - allowing future operations for each processor to use the result, or in the case of a SIMD system, to mask the appropriate processors.

Here is the code in CUDA:

```int i = threadIdx.x; // Thread i holds value x_i, one thread per element
__shared__ int sum[blocksize];
sum[i] = x_i; //copy to shared memory