
For an SSE vector that has all the same components, generate on the fly or precompute?

Tags:

c++

avx

simd

sse

When I need to do a vector operation where one operand is just a float broadcast to every component, should I precompute the __m256 or __m128 and load it when I need it, or broadcast the float to the register using _mm_set1_ps every time I need the vector?

I have been precomputing the vectors that are very important and highly used and generating on the fly the ones that are less important. But am I really gaining any speed with precomputing? Is it worth the trouble?

Is _mm_set1_ps implemented with a single instruction? That might answer my question.

Thomas asked Aug 05 '15 21:08


3 Answers

I believe it's generally best to hoist your SSE vector out of the hot part of your code (e.g. a loop) and reuse it wherever you need it, as long as you take care not to accidentally force it into memory. (For example, if you take its address or pass it by reference to another function, then it may be forced into memory and you may get odd behavior.)
The idea is that usually it is best to avoid transferring values into and out of SSE registers, and if it happens that this isn't the case in your particular situation, the compiler already knows how the value was constructed, and could rematerialize it if need be. I think this is a lot easier than loop-invariant code motion in general, which is the reverse optimization (i.e. where the compiler factors it out for you) and which requires the compiler to prove that the code is indeed loop-invariant.
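
A minimal sketch of what hoisting the vector out of a loop can look like. The function name scale_in_place and the requirement that n is a multiple of 4 are assumptions made for illustration, not something from the question:

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical helper: multiply n floats in place by `factor`.
// Assumes n is a multiple of 4. Unaligned loads/stores are used so that
// any float pointer is acceptable.
void scale_in_place(float* data, std::size_t n, float factor)
{
    // Broadcast once, before the loop. The vector is a plain local whose
    // address is never taken, so the compiler is free to keep it in a register.
    const __m128 factor4 = _mm_set1_ps(factor);

    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        v = _mm_mul_ps(v, factor4);
        _mm_storeu_ps(data + i, v);
    }
}
```

Whether the compiler keeps factor4 in a register or rematerializes it is still its decision; the sketch only makes the invariant explicit in the source.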

user541686 answered Sep 21 '22 00:09


I was playing around with broadcasts for an answer to "Fastest way to fill a vector (SSE2) with a certain value. Templates friendly". Have a look at some of the asm dumps of broadcasts in that answer.

Calling set1 every time the vector is used shouldn't make much difference, as long as the compiler knows the value to be broadcast doesn't alias anything. (If the compiler can't assume it doesn't alias, it will have to redo the broadcast after every write to an array or pointer that might alias.)
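
One way to sidestep the aliasing concern is to copy the scalar into a local before broadcasting. This is only a sketch: scale_by_ptr is a hypothetical function, and it assumes n is a multiple of 4:

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical example: the scale factor arrives through a pointer that the
// compiler may be unable to prove distinct from `data`.
// Assumes n is a multiple of 4.
void scale_by_ptr(float* data, std::size_t n, const float* factor)
{
    // Read the scalar once into a local. Later stores to data[] cannot
    // change this copy, so the broadcast is loop-invariant by construction.
    const float f = *factor;
    const __m128 f4 = _mm_set1_ps(f);

    for (std::size_t i = 0; i < n; i += 4) {
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), f4));
    }
}
```

Note that this changes behavior if factor really does point into data: the loop keeps using the value read at entry, which is usually what was intended.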

It's usually good style to store the set1 result in a named variable. If the compiler runs out of vector registers, it might spill the vector to the stack, and reload later, or it might re-broadcast. I'm not sure if coding style will influence this decision.

I wouldn't use a static const variable to cache it between calls to a function. (That can lead to the compiler generating code to check if the variable was already initialized every call.)
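
To illustrate the pattern I mean, here is a sketch with two hypothetical helpers, add_bias_static and add_bias_local. The first uses a function-local static, the second a plain local:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper showing the pattern to avoid.
__m128 add_bias_static(__m128 v, float bias)
{
    // A function-local static with a run-time initializer is initialized on
    // the first call only. The compiler typically has to emit an
    // "already initialized?" check on every call, and `bias` is ignored on
    // every call after the first.
    static const __m128 bias4 = _mm_set1_ps(bias);
    return _mm_add_ps(v, bias4);
}

// Plain local: broadcast on each call, no guard check, no stale value.
__m128 add_bias_local(__m128 v, float bias)
{
    const __m128 bias4 = _mm_set1_ps(bias);
    return _mm_add_ps(v, bias4);
}
```

Besides the possible guard check, the static version silently keeps the first bias it ever saw, which is rarely what a caller expects.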

Broadcasts of compile-time constants sometimes result in compile-time broadcasts, so your code just has 16B of const data sitting in memory.
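
A small sketch of the compile-time-constant case. The helper name halve is hypothetical, and the code generation described in the comment is what compilers commonly do, not a guarantee:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: halve four floats.
__m128 halve(__m128 v)
{
    // 0.5f is known at compile time, so an optimizing compiler will commonly
    // place a 16-byte constant in read-only data and use it as a memory
    // operand, with no run-time broadcast (exact codegen depends on the
    // compiler and flags).
    const __m128 half = _mm_set1_ps(0.5f);
    return _mm_mul_ps(v, half);
}
```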

Broadcasting a value that is already in a register is the worst case with AVX1, which only provides the memory-source vbroadcastss (it uses the load port only). A register-source broadcast to a ymm register takes a shufps plus a vinsertf128.

AVX2 is required for vbroadcastss ymm, xmm (which uses the shuffle port).

Peter Cordes answered Sep 21 '22 00:09


Naturally it's going to depend a lot on your code, but I've implemented two simple functions using both approaches. Here's the code:

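#include <xmmintrin.h>  // SSE intrinsics

// Broadcast both scalars on the fly, then multiply.
__m128 calc_set1(float num1, float num2)
{
  __m128 num1_4 = _mm_set1_ps(num1);
  __m128 num2_4 = _mm_set1_ps(num2);
  __m128 result4 = _mm_mul_ps(num1_4, num2_4);

  return result4;
}

// Load two precomputed vectors from memory, then multiply.
// _mm_load_ps requires both pointers to be 16-byte aligned.
__m128 calc_mov(float* num1_4_addr, float* num2_4_addr)
{
  __m128 num1_4 = _mm_load_ps(num1_4_addr);
  __m128 num2_4 = _mm_load_ps(num2_4_addr);
  __m128 result4 = _mm_mul_ps(num1_4, num2_4);

  return result4;
}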

and assembly

calc_set1(float, float):
    shufps  $0, %xmm0, %xmm0
    shufps  $0, %xmm1, %xmm1
    mulps   %xmm1, %xmm0
    ret
calc_mov(float*, float*):
    movaps  (%rdi), %xmm0
    mulps   (%rsi), %xmm0
    ret

You can see that calc_mov() does what you'd expect, and calc_set1() uses a single shuffle instruction for each broadcast.
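
For reference, the shufps $0 in that listing corresponds to the following intrinsic form. This is a sketch with a hypothetical helper name, broadcast_lane0, assuming the float already sits in lane 0 of an SSE register:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: replicate lane 0 of v into all four lanes.
// _MM_SHUFFLE(0, 0, 0, 0) evaluates to the immediate 0, i.e. every output
// lane selects input lane 0, which matches `shufps $0, %xmm0, %xmm0`.
__m128 broadcast_lane0(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}
```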

A movaps load can take approximately four cycles including address generation, more if the load port of the L1 cache is busy, and more still in the rare event of a cache miss.

shufps has a latency of a single cycle on any of the recent Intel microarchitectures. I believe this is true whether it's for 128-bit SSE or 256-bit AVX. Therefore I would suggest using the _mm_set1_ps approach.

Of course, a shuffle instruction assumes the float is already in an SSE/AVX register. If you're loading it from memory, a broadcast load will be better, since it captures the best of movaps and shufps in a single instruction.
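
A sketch of the from-memory case using _mm_load1_ps. The helper name mul_by_mem_scalar is hypothetical, and the instruction selection described in the comment depends on the compiler and target flags:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: multiply four floats by a scalar that lives in memory.
__m128 mul_by_mem_scalar(__m128 v, const float* scalar)
{
    // _mm_load1_ps loads one float and replicates it to all four lanes.
    // With AVX enabled, compilers can typically emit a single memory-source
    // vbroadcastss here; with plain SSE it is usually a movss plus a shufps.
    const __m128 s4 = _mm_load1_ps(scalar);
    return _mm_mul_ps(v, s4);
}
```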

hayesti answered Sep 23 '22 00:09