I attended the GPGPU presentation by Mark Harris, with lots of tips and tricks. It was really great value. I'm just going to write down what are essentially my notes on the event, so this won't be much use to you unless you already know a bit about GPU architecture and GPU programming.
First thing I wasn't aware of: all of nVidia's tech targets 'PTX', an abstract assembly language. So Cg compiles to PTX, CUDA compiles to PTX, OpenCL compiles to PTX, etc.
PTX can then be just-in-time compiled and optimized for a target platform, or you can combine it with a targeted compile into what is called a 'Fat' binary. Neato. Apparently RapidMind are building a full C++ GPU compiler, but judging from their history I guess it will be an all-round stream-processing compiler (SPU, etc. as well).
On the hardware front, the GPU is basically structured as a bunch of PEs grouped into multiprocessors (SMs), which in turn have either a 1-1 or 2-1 relationship to Cooperative Thread Arrays (CTAs). Instructions are scheduled in 'warps' of 32 threads at a time (ie, the instruction unit synchronously issues one instruction across all 32 threads at once, AFAIK). The instruction cache on current chips is ~2MB.
The hardware has special/super function units (SFUs), which are used for rsqrt, etc., and they have 1/4 the throughput of normal ALU instructions; that is, if all your threads do a sqrt at once, they will need to wait on each other to issue it. Likewise, double precision instructions get serialized, so you lose your 8x performance boost.
The texture units are also accessible under CUDA. These do the usual graphics texture unit things: let you interpolate between values, address them in wrap-around mode, etc., all for nearly free. They are also cached with 2D spatial coherency, which could be interesting for image processing algorithms.
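As a rough sketch of what that looks like with the texture-reference API of this CUDA generation (the `img`/`sample` names and the averaging kernel are just my own illustration, not anything from the talk):

```cuda
// Texture reference, bound to a cudaArray on the host side with
// cudaBindTextureToArray(img, array); the host also sets
// img.addressMode[0] = cudaAddressModeWrap for free wrap-around.
texture<float, 2, cudaReadModeElementType> img;

__global__ void sample(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // The -1/+1 neighbours wrap around the image edges for free thanks to
    // the address mode; +0.5f addresses the texel centre. Fetches go
    // through the 2D spatially-coherent texture cache.
    float l = tex2D(img, x - 1 + 0.5f, y + 0.5f);
    float r = tex2D(img, x + 1 + 0.5f, y + 0.5f);
    out[y * w + x] = 0.5f * (l + r);
}
```

Setting `img.filterMode = cudaFilterModeLinear` on the host would additionally give you the free bilinear interpolation between texels mentioned above.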
Memory is accessed in half-warps (16 threads) at a time, and if your arch type is 1.2 or above the hardware will coalesce those accesses into as few transactions as it can. (For 1.0 etc. this means that if you read memory in a non-contiguous, out-of-order way it will cause stalls; in nVidia talk, such reads are 'uncoalesced'.) 1.2-style hardware was very forgiving and would be smart enough to mask out different read types to optimize its read request. This was actually pretty damn impressive. Reading memory on aligned boundaries can be optimal (eg: prefer float4 over float3).
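A minimal sketch of the two access patterns (kernel names are mine, just for illustration):

```cuda
// Coalesced: each thread of a half-warp reads the element at its own
// index, so the 16 reads land in one contiguous, aligned segment. float4
// keeps each access 16-byte aligned, which the hardware likes.
__global__ void copy_good(float4 *dst, const float4 *src)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

// Uncoalesced (for stride > 1): consecutive threads hit addresses far
// apart, so on 1.0/1.1 hardware each read becomes its own transaction
// and the half-warp stalls.
__global__ void copy_bad(float *dst, const float *src, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i * stride];
}
```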
Speaking of memory, local thread memory doesn't really exist (it's just mapped to global memory, and you get no control over it - lame!); instead each thread has some allowed number of 32-bit registers it can use, as specified by a compile flag. So if you want some super-fast local storage, you might want to up the regs per thread. Of course, if you do that, then not so many threads can be run at once, limiting how much you can mask sync/mem/sqrt latencies etc.
Your next bit of fast memory is of course shared memory. (Shared between the threads of a block. Something special happens when working with shared memory if it's <32 threads; I can't recall exactly, but I remember thinking it was cool. Something like not needing syncs or, more likely, really fast syncs.) Global memory is persistent between CPU threads and applications, which could be a nifty feature for some larger scale applications. (Lots of GPU RAM could be very handy in a future iMagic style separated rendering engine/content loading system)
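A quick sketch of the standard shared-memory pattern, a per-block reduction (my own example, assuming 256 threads per block):

```cuda
// Each block stages its slice of the input in fast shared memory, then
// combines it pairwise into a single partial sum.
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float s[256];              // one slot per thread (blockDim.x == 256 assumed)
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      // wait until every slot is written

    for (int half = blockDim.x / 2; half > 0; half /= 2) {
        if (tid < half)
            s[tid] += s[tid + half];
        __syncthreads();                  // all threads reach this, even non-adders
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];           // one partial sum per block
}
```

I suspect the "<32 threads" remark was about warp-synchronous execution: once `half` drops below 32, the surviving threads all sit in one warp and run in lockstep, so people (carefully) drop the `__syncthreads()` calls for those last iterations.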
When copying memory to the GPU, you can malloc from DMA-bound (page-locked) memory to stop CUDA from needing to do an additional memcopy, and future hardware will have 'zero-copy' memory, which means the GPU can write directly into DMAed system memory. Nifty. There's also async mem copies and streaming mem stuff. Not really stuff I need right now, so I wasn't all that interested in it. Didn't look too complicated though.
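For reference, the page-locked allocation plus async copy looks roughly like this (a host-side sketch; the buffer names and size are mine):

```cuda
// cudaMallocHost gives page-locked (DMA-able) host memory, so the copy
// can skip the driver's internal staging memcopy, and cudaMemcpyAsync
// can overlap with other work on a stream.
float *h_buf, *d_buf;
size_t bytes = 1 << 20;

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMallocHost((void **)&h_buf, bytes);   // page-locked, not a pageable malloc
cudaMalloc((void **)&d_buf, bytes);

cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
// ... kernels launched on 'stream' here run after the copy, while the
// CPU is free to do other work ...
cudaStreamSynchronize(stream);

cudaFreeHost(h_buf);
cudaFree(d_buf);
cudaStreamDestroy(stream);
```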
A fair bit of time was spent talking about bank conflicts, but I didn't really get it. Basically, having multiple threads of a half-warp hit the same shared memory bank at once is bad. This happens more than you might think due to some magic modulo arithmetic on the GPU.
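My best reconstruction of the modulo magic, as a sketch (kernel name is mine; this assumes the 16-bank shared memory of this hardware generation, where bank = 32-bit word index mod 16):

```cuda
__global__ void bank_demo(const float *in, float *out)
{
    __shared__ float tile[16][16];
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * 16 + x];
    __syncthreads();

    // Row read: the 16 threads of a half-warp touch 16 consecutive words,
    // one per bank, so it is conflict-free.
    float row = tile[y][x];

    // Column read: the half-warp's addresses are 16 words apart, and
    // 16 mod 16 == 0, so every thread hits the SAME bank and the 16 reads
    // get serialized: a 16-way bank conflict.
    float col = tile[x][y];

    out[y * 16 + x] = row + col;
}
```

The classic fix, as I understand it, is padding the array to `tile[16][17]` so column accesses land in different banks.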
Some optimization rules of thumb: at least 64 threads per block, and you want 6 warps or more to hide instruction latency. CUDA intrinsics are spelled with a leading double underscore (eg: __sinf), or use the -use_fast_math compiler flag. At least 2 blocks per physical processor, so that sync calls will let the other block execute instead of just busy-waiting.
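The intrinsic point in one line of code (my example):

```cuda
// __sinf runs on the special function units: much faster than the
// accurate software sinf, at reduced precision. Building with
// -use_fast_math makes nvcc substitute the fast intrinsics everywhere.
__global__ void sines(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = __sinf(in[i]);
}
```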
Well, that's the end of my brain dump of the day. HTH
UPDATE: No sooner had I pressed 'publish' than:
CUDA 2.2 is out.
Visual Profiler works on Vista! Yay!