When I need to do a vector operation where one operand is just a float broadcast to every component, should I precompute the __m256 or __m128 and load it when I need it, or broadcast the float into the register with _mm_set1_ps every time I need the vector?
I have been precomputing the vectors that are very important and highly used and generating on the fly the ones that are less important. But am I really gaining any speed with precomputing? Is it worth the trouble?
Is _mm_set1_ps implemented with a single instruction? That might answer my question.
I believe it's generally best to factor out your SSE vector from your code (e.g. loop), and use it whenever you need to, assuming you take care not to accidentally force it into memory. (For example, if you take its address or pass it by reference to another function, then it may be forced into memory and you may get odd behavior.)
The idea is that usually it is best to avoid transferring values into and out of SSE registers, and if it happens that this isn't the case in your particular situation, the compiler already knows how the value was constructed, and could rematerialize it if need be. I think this is a lot easier than loop-invariant code motion in general, which is the reverse optimization (i.e. where the compiler factors it out for you) and which requires the compiler to prove that the code is indeed loop-invariant.
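As a rough illustration of factoring the broadcast out of a loop, here is a minimal sketch. The function name scale_array and its parameters are hypothetical, and it assumes the element count is a multiple of 4:

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: multiply every element of data by scale, in place.
   Assumes n is a multiple of 4; uses unaligned loads and stores. */
void scale_array(float *data, size_t n, float scale)
{
    assert(n % 4 == 0);

    /* Broadcast once, outside the loop, into a named local.
       Its address is never taken, so it can stay in a register. */
    const __m128 scale4 = _mm_set1_ps(scale);

    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, _mm_mul_ps(v, scale4));
    }
}
```

Because scale4 is a plain local, the compiler is free to keep it in a register for the whole loop, or to rebuild it from scale if it runs short of registers.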
I was playing around with broadcasts for an answer to "fastest way to fill a vector (SSE2) with a certain value. Templates friendly". Have a look at some of the asm dumps of broadcasts there.
Calling set1 every time the vector is used shouldn't make much difference, as long as the compiler knows the value to be broadcast doesn't alias anything. (If the compiler can't assume it doesn't alias, it will have to redo the broadcast after every write to an array or pointer that might alias.)
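To make the aliasing point concrete, here is a small sketch. The name add_offset and its signature are made up for illustration, and n is assumed to be a multiple of 4. The scalar arrives through a pointer that might point into the destination array, so it is read and broadcast once before the loop:

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: add *offset to every element of dst.
   Assumes n is a multiple of 4. offset may point into dst. */
void add_offset(float *dst, size_t n, const float *offset)
{
    assert(n % 4 == 0);

    /* Read the scalar once into a local and broadcast it. Stores through
       dst cannot modify this local copy, so no re-broadcast is needed
       inside the loop, even when offset aliases dst. */
    const float off = *offset;
    const __m128 off4 = _mm_set1_ps(off);

    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(dst + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(v, off4));
    }
}
```

Had the loop body used _mm_set1_ps(*offset) instead, the compiler would have to reload and re-broadcast after each store unless it could prove that dst and offset do not overlap.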
It's usually good style to store the set1 result in a named variable. If the compiler runs out of vector registers, it might spill the vector to the stack and reload it later, or it might re-broadcast. I'm not sure if coding style will influence this decision.
I wouldn't use a static const variable to cache it between calls to a function. (That can lead to the compiler generating code to check, on every call, whether the variable was already initialized.)
Broadcasts of compile-time constants sometimes result in compile-time broadcasts, so your code just has 16B of const data sitting in memory.
An AVX1 broadcast of a value that is already in a register is the worst case. AVX1 only provides the memory-source vbroadcastps (which uses only the load port), so a register-source broadcast takes a shufps plus a vinsertf128. AVX2 is required for vbroadcastps ymm, xmm (which uses the shuffle port).
Naturally it's going to depend a lot on your code, but I've implemented two simple functions using both approaches. Here is the code:
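#include <xmmintrin.h>  /* SSE intrinsics */

/* Broadcast both scalars into vectors on the fly, then multiply. */
__m128 calc_set1(float num1, float num2)
{
    __m128 num1_4 = _mm_set1_ps(num1);
    __m128 num2_4 = _mm_set1_ps(num2);
    __m128 result4 = _mm_mul_ps(num1_4, num2_4);
    return result4;
}

/* Load two precomputed vectors from memory, then multiply.
   _mm_load_ps requires both pointers to be 16-byte aligned. */
__m128 calc_mov(float* num1_4_addr, float* num2_4_addr)
{
    __m128 num1_4 = _mm_load_ps(num1_4_addr);
    __m128 num2_4 = _mm_load_ps(num2_4_addr);
    __m128 result4 = _mm_mul_ps(num1_4, num2_4);
    return result4;
}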
and the resulting assembly:
calc_set1(float, float):
    shufps $0, %xmm0, %xmm0
    shufps $0, %xmm1, %xmm1
    mulps %xmm1, %xmm0
    ret
calc_mov(float*, float*):
    movaps (%rdi), %xmm0
    mulps (%rsi), %xmm0
    ret
You can see that calc_mov() does what you'd expect, and that calc_set1() uses a single shuffle instruction for each broadcast.
A movaps load can take approximately four cycles for the address generation, more if the load port of the L1 cache is busy, and more again in the rare event of a cache miss.
shufps will take a single cycle on any of the recent Intel microarchitectures. I believe this is true whether it's for 128-bit SSE or 256-bit AVX. Therefore I would suggest using the _mm_set1_ps approach.
Of course, a shuffle instruction assumes the float is already in an SSE/AVX register. If you're loading it from memory instead, then a broadcast load will be better, since it captures the best of movaps and shufps in a single instruction.
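For the case where the float starts out in memory, a minimal sketch could look like the following. The helper name broadcast_from_memory is hypothetical; _mm_load1_ps is the standard SSE intrinsic that loads one float and replicates it into all four lanes. Which instructions it compiles to depends on the compiler and target flags: with AVX enabled it is typically a single broadcast load, while plain SSE usually needs a scalar load followed by a shuffle.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: broadcast a float that lives in memory
   into all four lanes of an SSE register. */
__m128 broadcast_from_memory(const float *p)
{
    assert(p != NULL);
    /* Load one float and replicate it across the vector. */
    return _mm_load1_ps(p);
}
```

Checking the generated assembly for your own compiler and flags is the only reliable way to see which form you actually get.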