In serial, if we wished to sum up an array x of N numbers, the code would be:
int sum = 0;
for (int i = 0; i < N; ++i)
    sum += x[i];
However, in parallel we could speed this up by summing parts of the array separately and then adding those partial sums together. For example, to add four numbers a, b, c, d we could first compute a+b and c+d in parallel, then add the two partial sums. This is called a 'reduction' operation, and it allows us to complete an otherwise O(N) operation in O(log N) steps.
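As a minimal serial sketch of that idea (the four values here are just made up for illustration):

#include <stdio.h>

int main(void)
{
    int a = 1, b = 2, c = 3, d = 4;

    // Step 1: these two additions are independent of each other,
    // so on parallel hardware they could run at the same time.
    int ab = a + b;
    int cd = c + d;

    // Step 2: combine the two partial sums.
    int sum = ab + cd;

    printf("%d\n", sum);   // prints 10: same result as a serial sum, in 2 steps instead of 3
    return 0;
}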
One particularly nifty pattern is the "butterfly" pattern of adding elements: at each step, every element is combined with the partner whose index differs in exactly one bit, starting from the highest bit and working down. (It is best described with an animation of the crossing data paths.)
In code this can be done as:
for (b = n/2; b > 0; b /= 2) {
    for (i = 0; i < n; i++) {
        t[i] = OP(sum[i], sum[i^b]);
    }
    for (i = 0; i < n; i++) {
        sum[i] = t[i];
    }
}

(Where 'OP' can be any associative, commutative operation, e.g. add, multiply, min, max.)
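To make the loop concrete, here is one way it could be dressed up as a compilable serial function, with OP taken to be addition; the function name, the malloc'd scratch array and the power-of-two requirement on n are my assumptions for this sketch:

#include <stdlib.h>

#define OP(a, b) ((a) + (b))   /* assumed: any associative, commutative op works */

/* Butterfly reduction over sum[0..n-1]; n must be a power of two.
   Afterwards every sum[i] holds the reduction of the whole array. */
void butterfly_reduce(int *sum, int n)
{
    int *t = (int *)malloc(n * sizeof(int));
    for (int b = n / 2; b > 0; b /= 2) {
        for (int i = 0; i < n; i++)
            t[i] = OP(sum[i], sum[i ^ b]);   /* pair each element with its partner */
        for (int i = 0; i < n; i++)
            sum[i] = t[i];                   /* commit this round's results */
    }
    free(t);
}

The walkthrough below traces exactly this loop for n = 8.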
To see how this works, assume we are adding 8 elements (n = 8). For the first step:
b is 0100
and each of the eight parallel operations will combine the following pair of elements:
0000 ^ 0100 : 0100
0001 ^ 0100 : 0101
0010 ^ 0100 : 0110
0011 ^ 0100 : 0111
0100 ^ 0100 : 0000
0101 ^ 0100 : 0001
0110 ^ 0100 : 0010
0111 ^ 0100 : 0011

or in decimal:

t[0] = [0] + [4]
t[1] = [1] + [5]
t[2] = [2] + [6]
t[3] = [3] + [7]
t[4] = [4] + [0]
t[5] = [5] + [1]
t[6] = [6] + [2]
t[7] = [7] + [3]

Now, for the next step:
b is 0010
0000 ^ 0010 : 0010
0001 ^ 0010 : 0011
0010 ^ 0010 : 0000
0011 ^ 0010 : 0001
0100 ^ 0010 : 0110
0101 ^ 0010 : 0111
0110 ^ 0010 : 0100
0111 ^ 0010 : 0101

Again, in decimal:

t[0] = [0] + [2]
t[1] = [1] + [3]
t[2] = [2] + [0]
t[3] = [3] + [1]
t[4] = [4] + [6]
t[5] = [5] + [7]
t[6] = [6] + [4]
t[7] = [7] + [5]

Finally,
b is 0001
0000 ^ 0001 : 0001
0001 ^ 0001 : 0000
0010 ^ 0001 : 0011
0011 ^ 0001 : 0010
0100 ^ 0001 : 0101
0101 ^ 0001 : 0100
0110 ^ 0001 : 0111
0111 ^ 0001 : 0110

And in decimal:

t[0] = [0] + [1]
t[1] = [1] + [0]
t[2] = [2] + [3]
t[3] = [3] + [2]
t[4] = [4] + [5]
t[5] = [5] + [4]
t[6] = [6] + [7]
t[7] = [7] + [6]
Thus, in only 3 parallel operations we have added 8 elements together. We also get the additional bonus of having broadcast the result to all eight of the parallel adding elements, allowing future operations on each processor to use the result or, in the case of a SIMD system, to mask the appropriate processors.
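As a quick sanity check of that broadcast property, here is a hypothetical driver using the butterfly_reduce sketch from above on the values 0..7; every slot should end up holding 28:

#include <stdio.h>

int main(void)
{
    int sum[8] = {0, 1, 2, 3, 4, 5, 6, 7};

    butterfly_reduce(sum, 8);   /* the sketch from above; 3 rounds for n = 8 */

    /* Every slot now holds the full total, 0+1+...+7 = 28. */
    for (int i = 0; i < 8; i++)
        printf("sum[%d] = %d\n", i, sum[i]);
    return 0;
}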
Here is the code in CUDA:
int i = threadIdx.x;   // Thread i holds value x_i, one thread per element
__shared__ int sum[blocksize];
sum[i] = x_i;          // copy to shared memory
__syncthreads();       // synchronise
for (int bit = blocksize/2; bit > 0; bit /= 2) {
    int t = OP(sum[i], sum[i^bit]);
    __syncthreads();
    sum[i] = t;
    __syncthreads();
}
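For completeness, here is a sketch of how that snippet might be wrapped into a full kernel and launched from the host. The kernel name, the fixed blocksize of 8, the choice of addition for OP and the single-block launch are all assumptions for illustration:

#include <cstdio>

const int blocksize = 8;          // one thread per element, single block (assumption)
#define OP(a, b) ((a) + (b))      // assumed: addition as the reduction operator

__global__ void butterfly_sum(const int *x, int *out)
{
    int i = threadIdx.x;                  // thread i holds value x[i]
    __shared__ int sum[blocksize];
    sum[i] = x[i];                        // copy to shared memory
    __syncthreads();                      // synchronise

    for (int bit = blocksize / 2; bit > 0; bit /= 2) {
        int t = OP(sum[i], sum[i ^ bit]);
        __syncthreads();                  // all reads done before anyone writes
        sum[i] = t;
        __syncthreads();                  // all writes done before the next round
    }
    out[i] = sum[i];                      // every thread writes the full total
}

int main()
{
    int h_x[blocksize] = {0, 1, 2, 3, 4, 5, 6, 7};
    int h_out[blocksize];
    int *d_x, *d_out;

    cudaMalloc((void **)&d_x, blocksize * sizeof(int));
    cudaMalloc((void **)&d_out, blocksize * sizeof(int));
    cudaMemcpy(d_x, h_x, blocksize * sizeof(int), cudaMemcpyHostToDevice);

    butterfly_sum<<<1, blocksize>>>(d_x, d_out);

    cudaMemcpy(h_out, d_out, blocksize * sizeof(int), cudaMemcpyDeviceToHost);
    printf("every element holds %d\n", h_out[0]);   // 28 for the inputs 0..7

    cudaFree(d_x);
    cudaFree(d_out);
    return 0;
}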
Nifty!