Is this code
float a = ...;
__m128 b = _mm_broadcast_ss(&a);
always faster than this code
float a = ...;
__m128 b = _mm_set1_ps(a);
?
What if a is defined as static const float a = ... rather than float a = ...?
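_mm_broadcast_ss has weaknesses imposed by the architecture which are largely hidden by the _mm intrinsics API. The most important difference is that _mm_broadcast_ss can only load its source from memory: the intrinsic takes a pointer to the float, not the float itself.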
What this means is that if you use _mm_broadcast_ss explicitly in a situation where the source is not in memory, the result will likely be less efficient than using _mm_set1_ps. This sort of situation typically happens when loading immediate values (constants), or when using the result of a recent calculation. In those situations the value will be mapped to a register by the compiler. To use the value for a broadcast, the compiler must dump it back to memory. Alternatively, a pshufd could be used to splat directly from a register instead.
_mm_set1_ps is implementation-defined rather than being mapped to a specific underlying CPU instruction. That means the compiler may use any of several SSE instructions to perform the splat. A smart compiler with AVX support enabled should use vbroadcastss internally when appropriate, but that depends on how well the compiler's optimizer handles AVX.
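To make the register case concrete, here is a minimal sketch of two ways to splat a value that is already in a register. The function names (splat_set1, splat_shuffle, all_lanes_equal) are made up for illustration. The explicit version uses _mm_shuffle_ps (shufps) rather than pshufd, only because it stays in the float domain without casts; it is one sequence a compiler might pick for _mm_set1_ps, not necessarily the one yours will pick.

```c
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_set_ss, _mm_shuffle_ps */

/* Let the compiler choose whichever splat sequence it prefers. */
static __m128 splat_set1(float x)
{
    return _mm_set1_ps(x);
}

/* Spell out one possible register-to-register sequence:
   place x in lane 0, then copy lane 0 into all four lanes. */
static __m128 splat_shuffle(float x)
{
    __m128 v = _mm_set_ss(x);
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}

/* Hypothetical helper: returns 1 when all four lanes of v equal expected. */
static int all_lanes_equal(__m128 v, float expected)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] == expected && out[1] == expected &&
           out[2] == expected && out[3] == expected;
}
```

Neither function takes the address of x, so the compiler is free to keep the value in a register the whole time. Comparing the generated assembly of the two (for example with -S) is the reliable way to see what your compiler actually emits.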
If you're very confident you're loading from memory, such as when iterating over an array of data, then direct use of broadcast is fine. But if there's any doubt at all, I would recommend sticking with _mm_set1_ps.
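As a sketch of the "iterating over an array" case, the two functions below scale blocks of four floats by a per-block factor read from an array. All names (scale_blocks_broadcast, scale_blocks_set1) and the data layout are assumptions made for illustration. The target("avx") attribute is a GCC/Clang extension used here so the file compiles without -mavx; with -mavx, or on MSVC, it can be dropped. The AVX version must only be called on a CPU that supports AVX.

```c
#include <stddef.h>
#include <immintrin.h>  /* AVX: _mm_broadcast_ss (also pulls in the SSE intrinsics) */

/* dst[i..i+3] = src[i..i+3] * scale[i / 4].
   Each scale factor is read straight from the array, so a
   broadcast-from-memory is a natural fit.
   Assumption for brevity: n is a multiple of 4. */
__attribute__((target("avx")))
static void scale_blocks_broadcast(float *dst, const float *src,
                                   const float *scale, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 s = _mm_broadcast_ss(&scale[i / 4]);  /* source is in memory */
        __m128 v = _mm_loadu_ps(&src[i]);
        _mm_storeu_ps(&dst[i], _mm_mul_ps(v, s));
    }
}

/* Same computation, leaving the choice of splat instruction to the compiler. */
static void scale_blocks_set1(float *dst, const float *src,
                              const float *scale, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 s = _mm_set1_ps(scale[i / 4]);
        __m128 v = _mm_loadu_ps(&src[i]);
        _mm_storeu_ps(&dst[i], _mm_mul_ps(v, s));
    }
}
```

Both versions compute the same result. The set1 version is the safer default, since a compiler targeting AVX can still turn it into a memory broadcast when the source really is in memory.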
And in the specific case of a static const float, you absolutely want to avoid using _mm_broadcast_ss().