I've been working with Neil Osbourne on a GPU version of my SPH code. He tried a number of things and managed to get a significant speedup, and he presented his results at the iVEC eResearch Forum 2009. The main lessons learnt were to keep the number of CUDA threads approximately equal to the number of hardware processors, not to do memory optimizations too early on, and, of course, to do shared memory optimizations later on.
We never managed to get a hashed grid going within the time we had, but it was still a cool project. With a bit more effort it should find its way into PAL or Bullet.
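
For anyone wondering what a hashed grid for SPH neighbour search roughly looks like, here is a minimal CPU-side sketch in C++. It is not code from the project (the GPU version never got that far), and every name in it (`Particle`, `cellKey`, `buildGrid`, `neighbours`) is hypothetical. It assumes a cell size equal to the smoothing length `h` and cell coordinates that fit in 21 bits per axis, so that the packed key is unique per cell.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical particle type: position only, for illustration.
struct Particle { float x, y, z; };

// Integer cell coordinate of a position along one axis (cell size = h).
inline int cellCoord(float p, float h) {
    return static_cast<int>(std::floor(p / h));
}

// Pack three cell coordinates into one 64-bit key, 21 bits per axis.
// Assumption: coordinates stay within [-2^20, 2^20), so keys never collide.
inline std::uint64_t cellKey(int cx, int cy, int cz) {
    const std::uint64_t mask = 0x1FFFFF;
    return ((static_cast<std::uint64_t>(cx) & mask) << 42) |
           ((static_cast<std::uint64_t>(cy) & mask) << 21) |
            (static_cast<std::uint64_t>(cz) & mask);
}

// The hashed grid: cell key -> indices of the particles in that cell.
using Grid = std::unordered_map<std::uint64_t, std::vector<std::size_t>>;

Grid buildGrid(const std::vector<Particle>& ps, float h) {
    Grid grid;
    for (std::size_t i = 0; i < ps.size(); ++i) {
        grid[cellKey(cellCoord(ps[i].x, h),
                     cellCoord(ps[i].y, h),
                     cellCoord(ps[i].z, h))].push_back(i);
    }
    return grid;
}

// Indices of all particles within distance h of particle i (excluding i).
// Only the 27 cells around particle i need to be visited.
std::vector<std::size_t> neighbours(const Grid& grid,
                                    const std::vector<Particle>& ps,
                                    std::size_t i, float h) {
    std::vector<std::size_t> out;
    const Particle& p = ps[i];
    const int cx = cellCoord(p.x, h);
    const int cy = cellCoord(p.y, h);
    const int cz = cellCoord(p.z, h);
    for (int dx = -1; dx <= 1; ++dx)
    for (int dy = -1; dy <= 1; ++dy)
    for (int dz = -1; dz <= 1; ++dz) {
        auto it = grid.find(cellKey(cx + dx, cy + dy, cz + dz));
        if (it == grid.end()) continue;  // empty cell, nothing stored
        for (std::size_t j : it->second) {
            if (j == i) continue;
            const float rx = ps[j].x - p.x;
            const float ry = ps[j].y - p.y;
            const float rz = ps[j].z - p.z;
            if (rx * rx + ry * ry + rz * rz <= h * h) out.push_back(j);
        }
    }
    return out;
}
```

The appeal is that only occupied cells take up memory and each particle checks 27 cells instead of every other particle. A GPU version would most likely swap the `std::unordered_map` for a sorted array of keys, but that is a guess at a design, not something we built.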