Anyway, a brief overview of how we can do square roots:
- Calculate it with the FPU (however that was implemented by the chip manufacturer).
- Calculate it with Newton-Raphson. This allows you to control the accuracy of the sqrt (or, typically, the rsqrt). This comes in two flavours:
  - Use an initial estimate, then refine, à la the Greg Walsh / John Carmack / Quake 3 approach.
  - Use a lookup table, then refine. This is probably an obvious approach, but I think AMD did a lot of pioneering work on it. (Well, at least, I learned these tricks from them.) See nVidia's lookup table sqrt code.
- Calculate it from the inverse square root (rsqrt). This comes in two flavours:
  - Calculate the reciprocal square root, then invert it (1/rsqrt(x)). This gives you correct results.
  - Multiply the reciprocal square root by the input value (x*rsqrt(x)). This gives you faulty results around 0, but saves you a costly divide.
Note:
1.0f / rsqrtf(0.0f) = 1.0f / infinity = 0.0f
0.0f * rsqrtf(0.0f) = 0.0f * infinity = NaN
I decided to test three routines for the GPU:
- native sqrt
- native rsqrt
- Carmack's rsqrt
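For reference, the third routine is usually written along these lines. This is a plain C sketch of the well-known Quake 3 style trick, not the CUDA kernel used in the benchmark. The names `fast_rsqrt` and `fast_sqrt` are picked for illustration, and the code assumes a 32-bit IEEE 754 float and positive, normal inputs.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for positive, normal x. */
static float fast_rsqrt(float x) {
    uint32_t bits;
    float y;

    /* Reinterpret the float's bits as an integer; memcpy avoids
       the aliasing problems of the original pointer cast. */
    memcpy(&bits, &x, sizeof bits);

    /* Initial estimate from the magic constant. */
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);

    /* One Newton-Raphson step; repeating it trades speed for accuracy. */
    y = y * (1.5f - 0.5f * x * y * y);
    return y;
}

/* sqrt(x) as x * rsqrt(x), which skips the divide. */
static float fast_sqrt(float x) { return x * fast_rsqrt(x); }
```

Adding a second Newton-Raphson line is how you would dial the accuracy up, which is the "control the accuracy" knob mentioned earlier.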
I did my best to generate reliable results by testing block sizes from 2 to 256 and performing 2.5 million sqrt operations. Here are the results from my nVidia 9800GX2:
Method | Total time | Max. ticks per float | Avg. ticks per float | Std. Dev. | Avg. Error |
---|---|---|---|---|---|
GPU intrinsic SQRT | 1.285ms | 5.99 | 3.99 | 0.00 | 0.00% |
GPU intrinsic RSQRT * x | 1.281ms | 5.99 | 3.99 | 0.00 | 0.00% |
Carmack RSQRT * x | 2.759ms | 6.28 | 4.26 | 0.01 | 0.09% |
The conclusions to take from these results are simple: Carmack's inverse and other trickery isn't going to help. Using the GPU RSQRT function as opposed to the built-in SQRT function saves you about a clock tick or two (probably because nVidia's SQRT is implemented as 1/RSQRT, as opposed to X*RSQRT).
I'm happy to say, low-level optimization tricks are still safely a thing of the past.
You can get the code for the CUDA benchmark here: GPU SQRT snippet.
Sorry, but I just had to say that it's not really Carmack's rsqrt; it was attributed to an unknown from SGI.