
difference between load1 and broadcast intrinsics

What's the difference between _mm_broadcast_ss() and _mm_load_ps1()?

#include <immintrin.h>  // _mm_broadcast_ss (AVX), _mm_load_ps1, _mm_store_ps
#include <iostream>

void example(){
   alignas(32) const float num = 20;

   // AVX-only: broadcast one float to all four lanes
   __m128 a1 = _mm_broadcast_ss(&num);
   alignas(32) float f1[4];
   _mm_store_ps(f1, a1);
   std::cout << f1[0] << " " << f1[1] << " " << f1[2] << " " << f1[3] << "\n";

   // SSE: load one float and replicate it to all four lanes
   __m128 a2 = _mm_load_ps1(&num);
   alignas(32) float f2[4];
   _mm_store_ps(f2, a2);
   std::cout << f2[0] << " " << f2[1] << " " << f2[2] << " " << f2[3] << "\n";
}

I get the same output both ways, so why do both exist?

Stepan Loginov asked Mar 24 '16 01:03



1 Answer

_mm_broadcast_ss only compiles for AVX targets.

_mm_load1_ps / _mm_load_ps1 compile to multiple instructions (movss + shufps) for targets that don't support AVX. When you compile for an AVX target, any good compiler will implement them with a single vbroadcastss.
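To make the non-AVX case concrete, here is a minimal sketch of that two-instruction sequence written out with SSE intrinsics. The helper names load1_by_hand and all_lanes_equal are hypothetical, made up for this illustration; a compiler is free to emit something different for _mm_load1_ps.

```cpp
#include <xmmintrin.h>  // SSE: _mm_load_ss, _mm_shuffle_ps, _mm_load1_ps, _mm_storeu_ps

// Hypothetical helper: the movss + shufps broadcast spelled out by hand.
inline __m128 load1_by_hand(const float* p) {
    __m128 v = _mm_load_ss(p);  // movss: lane 0 = *p, lanes 1..3 = 0
    // shufps with immediate 0: copy lane 0 into all four lanes
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}

// Hypothetical helper: true if every lane of v equals x.
inline bool all_lanes_equal(__m128 v, float x) {
    float out[4];
    _mm_storeu_ps(out, v);  // unaligned store, so out needs no special alignment
    return out[0] == x && out[1] == x && out[2] == x && out[3] == x;
}
```

Called on the same pointer, load1_by_hand(&num) and _mm_load1_ps(&num) should fill all four lanes with the same value.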

load1 / set1 and other convenience functions were introduced early on, because it's often good to let the compiler pick the optimal strategy for moving data around.

_mm_broadcast_* intrinsics were introduced as direct wrappers around the vbroadcastss / vbroadcastsd instructions. (AVX2 has integer vpbroadcast..., and the reg-reg forms of vbroadcastss. AVX1 only has vbroadcastss x/ymm, [mem].)


AFAICT, there's no downside to just using _mm_load1_ps or _mm_set1_ps.

It makes no difference to the generated code for AVX targets, and lets the same source build for non-AVX targets.
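A rough sketch of what that looks like in source, assuming a compiler that defines __AVX__ when AVX is enabled (GCC and Clang with -mavx, MSVC with /arch:AVX). The wrapper names load1, bcast and same_lanes are only for illustration:

```cpp
#include <immintrin.h>  // SSE and AVX intrinsics

// Builds for any SSE-capable target; the compiler chooses the instructions.
inline __m128 load1(const float* p) { return _mm_load1_ps(p); }

#if defined(__AVX__)
// Only available when AVX is enabled at compile time.
inline __m128 bcast(const float* p) { return _mm_broadcast_ss(p); }
#endif

// Illustrative helper: true if all four lanes of a and b compare equal.
inline bool same_lanes(__m128 a, __m128 b) {
    // cmpeq sets each equal lane to all-ones; movemask collects the four sign bits
    return _mm_movemask_ps(_mm_cmpeq_ps(a, b)) == 0xF;
}
```

Code that calls only load1 needs no preprocessor guard at all, which is the argument for preferring it.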

The choice might make a difference to the asm output at -O0, but IDK. If you care about the asm output in an un-optimized build, then 1: that's weird, and 2: you'll have to see what your compiler does.


As you can see from the asm output on godbolt (for gcc):

Without AVX (-mno-avx)

bcast: compile error, so I #ifdef'd it out

__m128 load1(const float*p) {  return _mm_load1_ps(p); }
    movss   xmm0, DWORD PTR [rdi]
    shufps  xmm0, xmm0, 0
    ret

With AVX (-mavx)

__m128 bcast(const float*p) { return _mm_broadcast_ss(p); }        
    vbroadcastss    xmm0, DWORD PTR [rdi]
    ret
__m128 load1(const float*p) {  return _mm_load1_ps(p); }
    vbroadcastss    xmm0, DWORD PTR [rdi]
    ret
Peter Cordes answered Oct 05 '22 10:10
