Anyway, a brief overview of how we can do square roots:

- Calculate it with the FPU (however that was implemented by the chip manufacturer).

- Calculate it with Newton-Raphson. This allows you to control the accuracy of the sqrt (or, typically, the rsqrt). This comes in two flavours:

  - Use an initial estimate, and refine, à la the Greg Walsh / John Carmack / Quake 3 approach.

  - Use a lookup table, then refine. This is probably an obvious approach, but I think AMD did a lot of pioneering work on it. (Well, at least, I learned these tricks from them.) See nVidia's lookup table sqrt code.

- Calculate it from the inverse (the reciprocal square root). This comes in two flavours:

  - Calculate the reciprocal square root, then invert it (1/rsqrt(x)). This gives you correct results.

  - Multiply the reciprocal square root by the input value (x*rsqrt(x)). This gives you faulty results at 0, but saves you a costly divide.

Note:

1.0f / rsqrtf(0.0f) = 1.0f / infinity = 0.0f

0.0f * rsqrtf(0.0f) = 0.0f * infinity = NaN


I decided to test three routines for the GPU:

- native sqrt
- native rsqrt
- Carmack's rsqrt
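For reference, the "Carmack" routine is usually written along the following lines. This is a host-side C sketch of the widely published Quake 3 technique, not the code from the benchmark. The function names `fast_rsqrt` and `fast_sqrt` are made up here, and the sketch assumes 32-bit IEEE 754 floats and a positive, finite, normal input.

```c
#include <stdint.h>
#include <string.h>

/* Reciprocal square root: bit-level initial estimate, then one
   Newton-Raphson step. Only meaningful for positive, finite, normal x. */
static float fast_rsqrt(float x)
{
    uint32_t bits;
    float y;

    /* Reinterpret the float's bits as an integer (memcpy avoids aliasing issues). */
    memcpy(&bits, &x, sizeof bits);

    /* Initial estimate using the constant from the Quake 3 source. */
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);

    /* One Newton-Raphson refinement of y ~ 1/sqrt(x). */
    y = y * (1.5f - 0.5f * x * y * y);
    return y;
}

/* sqrt(x) as x * rsqrt(x), which avoids a divide. */
static float fast_sqrt(float x)
{
    return x * fast_rsqrt(x);
}
```

A second Newton-Raphson step would tighten the result at the cost of a few more multiplies, which is the accuracy control mentioned earlier.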

I did my best at generating reliable results by testing block sizes from 2..256 and performing 2.5 million sqrt operations. Here are the results from my nVidia 9800GX2:

Method | Total time | Max. ticks per float | Avg. ticks per float | Std. Dev. | Avg. Error |
---|---|---|---|---|---|
GPU intrinsic SQRT | 1.285ms | 5.99 | 3.99 | 0.00 | 0.00% |
GPU intrinsic RSQRT * x | 1.281ms | 5.99 | 3.99 | 0.00 | 0.00% |
Carmack RSQRT * x | 2.759ms | 6.28 | 4.26 | 0.01 | 0.09% |

The conclusions to take from these results are simple: Carmack's inverse and other trickery isn't going to help on the GPU (it took roughly twice as long in total and was less accurate), and using the GPU RSQRT function as opposed to the built-in SQRT function saves you a clock tick or two at best. (Probably because nVidia's SQRT is implemented as 1/RSQRT, as opposed to X*RSQRT.)

I'm happy to say, low-level optimization tricks are still safely a thing of the past.

You can get the code for the CUDA benchmark here: GPU SQRT snippet.

## 1 comment:

Sorry, but I just had to say that it's not really Carmack's rsqrt; it was attributed to an unknown from SGI.
