
How do you get maximal speed out of SSE?

What are the best settings for control registers like MXCSR? Which rounding mode is fastest, and on which processors? Is it faster to enable signalling NaNs so I get informed when a computation results in a NaN, or does this cause slowdowns in non-NaN computations?

In summary, how do you get maximum speed out of tight inner SSE loops?

Any related x87 floating-point speed advice also welcome.

FeepingCreature asked Jul 30 '11 13:07



2 Answers

Use Flush-to-zero and Denormals-are-zero modes: they are intended for speed at a precision cost that you probably won't notice.

I doubt that different rounding modes have different costs. Round-to-nearest is hardest in theory, but in a hardware implementation, I would guess the additional transistors to do it in the same number of cycles are probably there anyway, and are just unused for directed rounding.

Signaling NaNs do not slow down non-NaN computations.

Set the control word only once, before your computation: the cost of changing it during the computation will dwarf any savings you achieve.
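As a rough illustration of the "set it once" advice, here is a minimal sketch in C using the `_mm_getcsr` / `_mm_setcsr` intrinsics from `<xmmintrin.h>`. The helper names (`enable_ftz_daz`, `sum_of_squares`) and the constant names are hypothetical, chosen for this example. It assumes an x86 CPU that supports the DAZ bit (very early SSE implementations may not) and the default masked exceptions.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

/* MXCSR bit positions: FTZ is bit 15, DAZ is bit 6.
   The macro names are made up for this sketch. */
#define MXCSR_FTZ 0x8000u
#define MXCSR_DAZ 0x0040u

/* Hypothetical helper: turn on FTZ and DAZ with a single MXCSR write
   and return the previous MXCSR value so the caller can restore it. */
static unsigned int enable_ftz_daz(void)
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(saved | MXCSR_FTZ | MXCSR_DAZ);
    return saved;
}

/* Hypothetical kernel: MXCSR is written once before the loop and once
   after it, never inside the loop body. */
static float sum_of_squares(const float *x, int n)
{
    unsigned int saved = enable_ftz_daz();
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += x[i] * x[i];
    _mm_setcsr(saved);  /* restore the caller's settings */
    return acc;
}
```

In a real program you would probably set the mode once per thread at startup rather than per call; the save and restore here just keeps the sketch from leaking the changed mode to its caller.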

Pascal Cuoq answered Oct 21 '22 02:10



If your computation is likely to encounter denormals, and accuracy of very small values is not important to your computation, then by all means turn on FZ and DAZ (once, at the start of your computation; don't touch the MXCSR more than necessary). They won't make any difference if your computation doesn't involve denormal values, but if it does, the difference can be quite significant.
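To make the precision trade-off concrete, here is a small sketch of what FZ does to a result that would otherwise be denormal. The function name `halve_smallest_normal` is hypothetical. It assumes a build where `float` arithmetic is done in SSE (the default on x86-64) and the default masked underflow exception. It flips MXCSR on every call purely for demonstration, which is exactly what you would not do in a hot loop.

```c
#include <float.h>      /* FLT_MIN */
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_SET_FLUSH_ZERO_MODE */

/* Hypothetical demo: multiply the smallest normal float by 0.5, which
   gives a denormal result unless flush-to-zero is enabled. */
static float halve_smallest_normal(int flush_to_zero)
{
    unsigned int saved = _mm_getcsr();
    _MM_SET_FLUSH_ZERO_MODE(flush_to_zero ? _MM_FLUSH_ZERO_ON
                                          : _MM_FLUSH_ZERO_OFF);

    /* volatile keeps the compiler from folding the product at compile
       time or moving it across the MXCSR writes. */
    volatile float x = FLT_MIN;
    volatile float half = 0.5f;
    volatile float r = x * half;

    _mm_setcsr(saved);  /* restore the caller's settings */
    return r;
}
```

Under those assumptions, `halve_smallest_normal(0)` should return a tiny nonzero denormal, while `halve_smallest_normal(1)` should return zero. If your algorithm can tolerate that loss, FZ and DAZ are safe to enable.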

None of the other MXCSR bits have any effect on performance at all.

The only x87-related performance advice is: don't use the x87 unit. Do your computations in SSE instead whenever possible.

Stephen Canon answered Oct 21 '22 03:10
