Is _mm_broadcast_ss faster than _mm_set1_ps?

Question

Is this code

float a = ...;
__m128 b = _mm_broadcast_ss(&a);

always faster than this code

float a = ...;
__m128 b = _mm_set1_ps(a);

?

What if a is defined as static const float a = ... rather than float a = ...?

jstine · Accepted Answer

_mm_broadcast_ss has weaknesses imposed by the architecture which are largely hidden by the mm SSE API. The most important difference is as follows:

_mm_broadcast_ss is limited to loading values from memory only.

What this means is that if you use _mm_broadcast_ss explicitly where the source is not already in memory, the result will likely be less efficient than using _mm_set1_ps. This typically happens when loading immediate values (constants) or when using the result of a recent calculation. In those situations the compiler will have mapped the value to a register, so to use it for a broadcast the compiler must first dump it back to memory. With _mm_set1_ps, a shuffle such as pshufd could be used to splat directly from the register instead.
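To make the register-versus-memory distinction concrete, here is a minimal sketch. The function names are hypothetical, and the `target("avx")` attribute is a GCC/Clang extension assumed here so that the AVX intrinsic compiles without a global `-mavx` flag. What the compiler actually emits for either function depends on the compiler and its flags.

```c
#include <immintrin.h>

/* The sum is the result of a recent calculation, so it most likely lives
   in a register. _mm_set1_ps leaves the compiler free to splat it from
   there without a round trip through memory. */
__m128 splat_computed(float x, float y)
{
    float sum = x + y;
    return _mm_set1_ps(sum);
}

/* The value is already in memory, so broadcasting from its address is a
   natural fit. _mm_broadcast_ss is an AVX intrinsic; the target attribute
   (GCC/Clang extension) enables AVX for this function only. */
__attribute__((target("avx")))
__m128 splat_from_memory(const float *p)
{
    return _mm_broadcast_ss(p);
}
```

Calling `_mm_broadcast_ss(&sum)` inside `splat_computed` would also be legal, but it takes the address of a value that otherwise never needed one, which is the case described above.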

_mm_set1_ps is implementation-defined rather than being mapped to a specific underlying CPU instruction. That means the compiler might use any of several SSE instructions to perform the splat. A smart compiler with AVX support enabled should use vbroadcastss internally when appropriate, but that depends on how mature the AVX support in the compiler's optimizer is.
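As an illustration of one possible lowering, the splat can be written by hand with SSE-only intrinsics: put the scalar in lane 0, then replicate lane 0 with a shuffle. This is only a sketch of one option a compiler might pick, not a statement of what any particular compiler does, and `splat_via_shuffle` is a hypothetical name.

```c
#include <xmmintrin.h>

/* Hand-written splat using only SSE: scalar into lane 0, then a shuffle
   that selects lane 0 for every output lane. */
__m128 splat_via_shuffle(float a)
{
    __m128 v = _mm_set_ss(a);                              /* [a, 0, 0, 0] */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  /* [a, a, a, a] */
}
```

The result is lane-for-lane identical to `_mm_set1_ps(a)`; writing `_mm_set1_ps` simply lets the compiler choose between this, a broadcast, or something else.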

If you're very confident you're loading from memory, such as when iterating over an array of data, then direct use of the broadcast is fine. But if there's any doubt at all, I would recommend sticking with _mm_set1_ps.
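The array case might look like the following sketch, where every coefficient is read straight from memory and broadcast before use. The function name and its signature are made up for illustration, and the `target("avx")` attribute is again a GCC/Clang extension assumed so the example builds without `-mavx`.

```c
#include <immintrin.h>
#include <stddef.h>

/* For each i, writes coeffs[i] * v to out[4*i .. 4*i+3].
   Each coefficient already sits in memory, so a direct broadcast fits. */
__attribute__((target("avx")))
void scale_by_each(const float *coeffs, size_t n, __m128 v, float *out)
{
    for (size_t i = 0; i < n; ++i) {
        __m128 c = _mm_broadcast_ss(&coeffs[i]);       /* [c, c, c, c] */
        _mm_storeu_ps(out + 4 * i, _mm_mul_ps(c, v));  /* unaligned store */
    }
}
```

The caller must provide room for `4 * n` floats in `out`.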

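And in the specific case of a static const float, you absolutely want to avoid using _mm_broadcast_ss(): such a value is a compile-time constant that the compiler is free to materialize however is cheapest, and taking its address for the broadcast takes that choice away.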

Tags: vectorization, avx

Yoav