Why are there 128bit load functions for SSE?

Question

I'm poking around in somebody else's code and currently trying to figure out why _mm_load_si128 exists.

Essentially, I tried replacing

_ra = _mm_load_si128(reinterpret_cast<__m128i*>(&cd->data[idx]));

with

_ra = *reinterpret_cast<__m128i*>(&cd->data[idx]);

and it works and performs exactly the same.

I figured that the load functions exist for smaller types just for the sake of convenience, so people wouldn't have to pack them into contiguous memory manually. But for data that is already in the correct order, why bother?

Is there something else that _mm_load_si128 does? Or is it essentially just a roundabout way of assigning a value?

plasmacel · Accepted Answer

There are explicit and implicit loads in SSE.

_mm_load_si128(reinterpret_cast<__m128i*>(&cd->data[idx])); is an explicit load
*reinterpret_cast<__m128i*>(&cd->data[idx]); is an implicit load

With an explicit load you explicitly instruct the compiler to load the data into an XMM register - this is the "official" Intel way to do it. You can also control whether the load is an aligned or unaligned load by using _mm_load_si128 or _mm_loadu_si128.
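As a minimal sketch of the aligned/unaligned choice described above (the helper names load_aligned, load_unaligned and sum_lanes are made up for illustration and are not part of the original code):

```cpp
#include <emmintrin.h>  // SSE2: __m128i, _mm_load_si128, _mm_loadu_si128, _mm_store_si128
#include <cassert>
#include <cstdint>

// Aligned explicit load: the caller promises that p is 16-byte aligned.
inline __m128i load_aligned(const std::int32_t* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    return _mm_load_si128(reinterpret_cast<const __m128i*>(p));
}

// Unaligned explicit load: valid for any address.
inline __m128i load_unaligned(const std::int32_t* p) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}

// Hypothetical helper: stores the vector to an aligned buffer and sums its four 32-bit lanes.
inline std::int32_t sum_lanes(__m128i v) {
    alignas(16) std::int32_t out[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), v);
    return out[0] + out[1] + out[2] + out[3];
}
```

With an array declared as alignas(16) std::int32_t data[8], load_aligned(data) is fine, while data + 1 is only 4-byte aligned and has to go through load_unaligned.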

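As an extension, most compilers can also generate XMM loads automatically when you dereference a type-punned pointer, but this way you cannot state whether the load should be aligned or unaligned. Since on modern CPUs an unaligned load instruction carries no performance penalty when the data is in fact aligned, compilers often emit unaligned loads in this case. Don't rely on that, though: the pointed-to type __m128i has 16-byte alignment, so the compiler is allowed to assume the address is aligned and may emit an aligned load, which faults on misaligned data.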

Another, more important aspect is that with implicit loads you violate strict aliasing rules, which can result in undefined behavior. It's worth mentioning, though, that as part of the extension, compilers which support Intel intrinsics don't tend to enforce strict aliasing rules on XMM placeholder types like __m128, __m128d and __m128i.

Nevertheless I think explicit loads are cleaner and more bulletproof.

Why don't compilers tend to enforce strict aliasing rules on SSE placeholder types?

The 1st reason lies in the design of the SSE intrinsics: there are obvious cases when you have to use type-punning, since there is no other way to use some of the intrinsics. Mysticial's answer summarizes it perfectly.

As Cody Gray pointed out in the comments, it's worth mentioning that historically the MMX intrinsics (which are now mostly superseded by SSE2) didn't even provide explicit loads or stores, so you had to use type-punning.

The 2nd reason (somewhat related to the 1st) lies in the type definitions of these types.

GCC's typedefs for the SSE/SSE2 placeholder types in <xmmintrin.h> and <emmintrin.h>:

/* The Intel API is flexible enough that we must allow aliasing with other
   vector types, and their scalar components.  */

typedef float __m128 __attribute__ ((__vector_size__ (16), __may_alias__));    
typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
typedef double __m128d __attribute__ ((__vector_size__ (16), __may_alias__));

The key here is the __may_alias__ attribute, which makes type-punning work on these types even when strict aliasing is enabled with the -fstrict-aliasing flag.
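To illustrate what the attribute permits, here is a small sketch of an implicit load and store through a type-punned __m128 pointer. It assumes GCC or clang (where __m128 carries __may_alias__) and a 16-byte aligned pointer; double_in_place is a made-up name for illustration:

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_mul_ps, _mm_set1_ps
#include <cassert>
#include <cstdint>

// Hypothetical helper: doubles four floats in place without any load/store intrinsic.
// Assumption: p is 16-byte aligned, because dereferencing a __m128* implies that alignment.
inline void double_in_place(float* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    __m128* v = reinterpret_cast<__m128*>(p);  // type-punning: float storage viewed as __m128
    *v = _mm_mul_ps(*v, _mm_set1_ps(2.0f));    // implicit load, multiply, implicit store
}
```

Without __may_alias__ on __m128, accessing float storage through a __m128 lvalue like this would be a strict aliasing violation; the explicit _mm_load_ps/_mm_store_ps (or _mm_loadu_ps/_mm_storeu_ps) versions avoid the question altogether.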

Now, since clang and ICC aim to be compatible with GCC, they should follow the same convention. So currently, in these three compilers, implicit loads and stores can be expected to work even with the -fstrict-aliasing flag. Finally, MSVC doesn't perform optimizations based on strict aliasing at all, so the issue cannot arise there.

Still, this doesn't mean that you should prefer implicit loads/stores over explicit ones.

Tags: c++, x86, simd, sse, intrinsics

user81993
