What are the best settings for stuff like MXCSR? Which rounding mode is fastest? On what processors? Is it faster to enable signalling NaNs so I get informed when a computation results in a NaN, or does this cause slowdowns in non-NaN computations?
In summary, how do you get the maximum speed out of tight inner SSE loops?
Any related x87 floating-point speed advice also welcome.
Use Flush-to-zero and Denormals-are-zero modes: they are intended for speed at a precision cost that you probably won't notice.
I doubt that different rounding modes have different costs. Round-to-nearest is hardest in theory, but in a hardware implementation, I would guess the additional transistors to do it in the same number of cycles are probably there anyway, and are just unused for directed rounding.
Signaling NaNs do not slow down non-NaN computations.
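One way to "get informed" about NaNs, assuming that is what the question is after, is to unmask the SSE invalid-operation exception in the MXCSR. Here is a minimal sketch using the intrinsics from `<xmmintrin.h>`; the helper names `trap_on_invalid` and `restore_exception_mask` are made up for illustration. With the exception unmasked, an operation such as 0.0f / 0.0f raises a hardware trap (typically delivered as SIGFPE on Unix-like systems) instead of quietly returning NaN.

```c
#include <xmmintrin.h>  /* _MM_GET_EXCEPTION_MASK, _MM_SET_EXCEPTION_MASK, _MM_MASK_INVALID */

/* Hypothetical helper: unmask the SSE invalid-operation exception so that
   an invalid operation traps instead of silently producing NaN.
   Returns the previous exception mask so the caller can restore it. */
static unsigned int trap_on_invalid(void)
{
    unsigned int old_mask = _MM_GET_EXCEPTION_MASK();
    _MM_SET_EXCEPTION_MASK(old_mask & ~_MM_MASK_INVALID);
    return old_mask;
}

/* Hypothetical helper: put the exception mask back the way it was. */
static void restore_exception_mask(unsigned int old_mask)
{
    _MM_SET_EXCEPTION_MASK(old_mask);
}
```

Call `trap_on_invalid()` once before the hot loop and `restore_exception_mask()` after it; only SSE arithmetic is affected, not x87 code.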
Set the control word (MXCSR) only once, before your computation: the cost of changing it during the computation will dwarf any savings you achieve.
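As a rough sketch of that advice, assuming a compiler that provides the usual `<xmmintrin.h>` and `<pmmintrin.h>` headers and a CPU that supports DAZ: save the MXCSR, set FTZ and DAZ once, run the loop, then restore. The function `scale_with_ftz_daz` and its loop body are placeholders, not anything from a real library.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_FLUSH_ZERO_ON */
#include <pmmintrin.h>  /* _MM_DENORMALS_ZERO_ON */

/* Hypothetical kernel: scales an array in place. The MXCSR is written once
   before the loop and once after it, never inside it. */
static void scale_with_ftz_daz(float *data, size_t n, float factor)
{
    unsigned int saved = _mm_getcsr();                      /* remember caller's settings */
    _mm_setcsr(saved | _MM_FLUSH_ZERO_ON | _MM_DENORMALS_ZERO_ON);

    for (size_t i = 0; i < n; ++i)
        data[i] *= factor;                                  /* stand-in for the real inner loop */

    _mm_setcsr(saved);                                      /* restore for the rest of the program */
}
```

Restoring the saved value keeps the faster but less precise mode from leaking into code that expects IEEE behaviour for denormals.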
If your computation is likely to encounter denormals, and accuracy of very small values is not important to your computation, then by all means turn on FZ and DAZ (once, at the start of your computation; don't touch the MXCSR more than necessary). They won't make any difference if your computation doesn't involve denormal values, but if it does, the difference can be quite significant.
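To make the precision cost concrete, here is a small sketch of what the two bits do to a single multiplication. It assumes float arithmetic is compiled to SSE instructions (the default on x86-64) and that the underflow exception is masked, as it is by default; `mul_with_mxcsr_bits` is a made-up helper, and the `volatile` qualifiers are only there to keep the compiler from folding the multiply at compile time.

```c
#include <float.h>      /* FLT_MIN */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_FLUSH_ZERO_ON */
#include <pmmintrin.h>  /* _MM_DENORMALS_ZERO_ON */

/* Hypothetical helper: multiply a by b with extra MXCSR bits set,
   then restore the previous MXCSR value. */
static float mul_with_mxcsr_bits(float a, float b, unsigned int bits)
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | bits);
    volatile float x = a, y = b;   /* force the multiply to happen at run time */
    volatile float r = x * y;
    _mm_setcsr(saved);
    return r;
}

/* FTZ: a result that would be denormal is replaced by zero. */
static float ftz_demo(void)
{
    return mul_with_mxcsr_bits(FLT_MIN, 0.5f, _MM_FLUSH_ZERO_ON);
}

/* DAZ: a denormal input (1e-40f is below FLT_MIN) is read as zero. */
static float daz_demo(void)
{
    return mul_with_mxcsr_bits(1e-40f, 1e10f, _MM_DENORMALS_ZERO_ON);
}
```

With both bits clear, the same two products come out as a denormal and as roughly 1e-30 respectively; with the bits set they become zero, which is the accuracy being traded away.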
None of the other MXCSR bits have any effect on performance at all.
The only x87-related performance advice is: don't use the x87 unit. Do your computations in SSE instead whenever possible.