I am currently developing a proof-of-concept GPU program for groundwater flow. The algorithm works much like a finite-difference scheme, or like the kernel-based filters used in many image processing systems.
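To make "similar to finite differences" concrete, here is a minimal CPU sketch of one update step. It assumes a plain 5-point averaging stencil on a row-major grid with fixed boundary values; the function name stencilStep and the stencil itself are illustrative assumptions, not the actual groundwater code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 5-point stencil step on a row-major width x height grid.
// Each interior node becomes the average of its four neighbours.
// Boundary nodes are copied unchanged (a fixed-value boundary, assumed here).
std::vector<float> stencilStep(const std::vector<float>& in,
                               std::size_t width, std::size_t height) {
    assert(in.size() == width * height);
    std::vector<float> out(in);  // start from a copy so boundaries keep their values
    for (std::size_t y = 1; y + 1 < height; ++y) {
        for (std::size_t x = 1; x + 1 < width; ++x) {
            const std::size_t i = y * width + x;
            out[i] = 0.25f * (in[i - 1] + in[i + 1] + in[i - width] + in[i + width]);
        }
    }
    return out;
}
```

Every output node depends only on a small neighbourhood of the input, which is why this kind of update maps naturally onto one GPU thread per node.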
Initially I just ported the code over to CUDA. Having done this a few times before, I had written the CPU version so that it would be easy to transfer. A good time saver is to use macros for things like array indexing. This makes it easy to swap in __mul24, tex2D, etc. later.
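A minimal sketch of the macro idea, assuming a row-major grid. The names MUL, IDX, HEAD, USE_MUL24 and readNode are hypothetical; only __mul24 is a real CUDA device intrinsic, and it is only selected when compiling device code with USE_MUL24 defined.

```cpp
#include <cassert>

// MUL is the single place where the multiply used for indexing is chosen.
// __mul24 is a CUDA device intrinsic (24-bit integer multiply), so it is only
// used when compiling device code; USE_MUL24 is a hypothetical opt-in switch.
#if defined(__CUDA_ARCH__) && defined(USE_MUL24)
#define MUL(a, b) __mul24((a), (b))
#else
#define MUL(a, b) ((a) * (b))
#endif

// Row-major index of node (x, y) in a grid that is `width` nodes wide.
#define IDX(x, y, width) (MUL((y), (width)) + (x))

// All reads go through HEAD, so the access method can be changed in one place
// (for example, redefined as a texture fetch) without touching the callers.
#define HEAD(array, x, y, width) ((array)[IDX((x), (y), (width))])

// Small helper showing the macros in use; compiles as plain C++ on the host.
float readNode(const float* grid, int x, int y, int width) {
    assert(x >= 0 && x < width && y >= 0);
    return HEAD(grid, x, y, width);
}
```

Because the kernel body only ever mentions IDX and HEAD, trying a different multiply or memory path becomes a one-line change plus a rebuild.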
The next step was to use shared memory to buffer the input data, which gave the biggest performance boost. Finally, I did some arithmetic optimizations for another small gain.
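Shared memory itself only exists inside a kernel, but the access pattern can be sketched on the host: copy a tile plus a one-node border into a small local buffer once, then compute everything from that buffer. On the GPU the buffer would be a __shared__ array filled cooperatively by a thread block, with a barrier before it is read. The tile size, the 5-point stencil and the name processTile are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t TILE = 4;               // interior nodes per tile side (hypothetical)
constexpr std::size_t HALO = 1;               // border width a 5-point stencil needs
constexpr std::size_t BUF = TILE + 2 * HALO;  // buffer side including the border

// Processes the tile whose first interior node is (x0, y0).
// The tile plus its border must lie entirely inside the grid.
void processTile(const std::vector<float>& in, std::vector<float>& out,
                 std::size_t width, std::size_t x0, std::size_t y0) {
    assert(x0 >= HALO && y0 >= HALO);
    assert(x0 + TILE + HALO <= width);
    assert((y0 + TILE + HALO) * width <= in.size());
    assert(out.size() == in.size());

    // Stage: each input value is fetched from the large array exactly once.
    float buf[BUF][BUF];
    for (std::size_t by = 0; by < BUF; ++by)
        for (std::size_t bx = 0; bx < BUF; ++bx)
            buf[by][bx] = in[(y0 - HALO + by) * width + (x0 - HALO + bx)];

    // Compute: all neighbour reads now hit the small buffer.
    for (std::size_t ty = 0; ty < TILE; ++ty) {
        for (std::size_t tx = 0; tx < TILE; ++tx) {
            const std::size_t by = ty + HALO, bx = tx + HALO;
            out[(y0 + ty) * width + (x0 + tx)] =
                0.25f * (buf[by][bx - 1] + buf[by][bx + 1] +
                         buf[by - 1][bx] + buf[by + 1][bx]);
        }
    }
}
```

Without the buffer, each input node would be fetched from the large array several times, once for every neighbouring output that uses it; with it, the slow reads happen once per tile.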
On the advice of a friend I tried restructuring the GPU program to use 'aprons', similar to the 'Separable Convolution' SDK sample, and also tried reorganizing the reads and writes. All of this made almost no difference, so it seems the 'sweet spot' is reached quite quickly, as soon as you have done the obvious shared memory and arithmetic optimizations. Beyond that, the overall structure of the program seems to make little difference.
A common bit of advice is to leverage the texture units on the GPU, but a simple modification of the 'Texture-based Separable Convolution' sample program reveals that it is in fact almost twice as slow as the non-texture version. For this kind of workload at least, the texture unit speedups seem to be a bit of a myth.
Benchmarking the program has been a bit of a problem, since the result depends heavily on the type of GPU you have and on the problem size. If I process a few hundred thousand nodes, the speedup is around 15x over the CPU, but when I move to processing tens of millions of nodes, the speedup is over 20x. Processing the same data on a slightly older GPU (Compute 1.0) gives only a 4x speedup.
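For reference, a speedup figure like the ones above is just CPU time divided by GPU time for the same problem. Below is a minimal timing sketch using std::chrono; the helper names timeSeconds and speedup are hypothetical. It assumes the timed GPU callable blocks until the device has finished, since kernel launches are asynchronous. Whether host-device transfers are counted is a separate choice that changes the number.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
// For GPU work, `work` must wait for the device to finish before returning,
// otherwise only the launch overhead is measured.
template <typename Work>
double timeSeconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup of the GPU run over the CPU run for the same problem size.
double speedup(double cpuSeconds, double gpuSeconds) {
    assert(cpuSeconds >= 0.0 && gpuSeconds > 0.0);
    return cpuSeconds / gpuSeconds;
}
```

Repeating the measurement across several problem sizes, rather than quoting one figure, makes the size dependence visible.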
All in all I've found it extremely difficult to give a single overall figure for the performance gain of the GPU. It seems to be highly dependent on the problem size (the bigger the better - thank god!) and on the GPU technology (this is going to make porting the software to multiple GPUs a pain!).