Suppose I want to add two buffers and store the result. Both buffers are already allocated 16-byte aligned. I found two examples of how to do that.
The first one uses _mm_load to read the data from the buffers into SSE registers, does the add operation, and stores the result back to the destination buffer. Until now I would have done it like that.
void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
    for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
    {
        __m128i _s = _mm_load_si128( (__m128i*) src );
        __m128i _d = _mm_load_si128( (__m128i*) dst );
        _d = _mm_add_epi16( _d, _s );
        _mm_store_si128( (__m128i*) dst, _d );
    }
}
The second example just does the add operation directly on the memory addresses, without load/store operations. Both seem to work fine.
void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
    for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
    {
        *(__m128i*) dst = _mm_add_epi16( *(__m128i*) dst, *(__m128i*) src );
    }
}
So the question is whether the second example is correct or may have side effects, and when using load/store is mandatory.
Thanks.
Both versions are fine - if you look at the generated code you will see that the second version still generates at least one load into a vector register, since PADDW (aka _mm_add_epi16) can take only its second operand directly from memory.
In practice most non-trivial SIMD code will do a lot more operations between loading and storing data than just a single add, so in general you probably want to load data initially into vector variables (registers) using _mm_load_XXX, perform all your SIMD operations on registers, then store the results back to memory via _mm_store_XXX.
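For example, a minimal sketch of that pattern (the function name and the extra shift are made up for illustration; it assumes both pointers are 16-byte aligned and n is a multiple of 8):

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>

void _add_and_halve( uint16_t * dst, uint16_t const * src, size_t n )
{
    for( size_t i = 0; i < n; i += 8 )
    {
        __m128i s = _mm_load_si128( (__m128i const*)(src + i) );
        __m128i d = _mm_load_si128( (__m128i const*)(dst + i) );
        d = _mm_add_epi16( d, s );     // all the work happens on registers...
        d = _mm_srli_epi16( d, 1 );    // ...e.g. also shift each 16-bit lane right by 1
        _mm_store_si128( (__m128i*)(dst + i), d );  // single store at the end
    }
}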
The main difference is that in the second version the compiler will generate unaligned loads (movdqu etc.) if it cannot prove the pointers to be 16-byte aligned. Depending on the surrounding code, it may not even be possible to write code where this property can be proven by the compiler.
Otherwise there is no difference: the compiler is smart enough to merge two loads and the add into one load and an add-from-memory if it deems that useful, or to split a load-and-add instruction into two.
If you are using C++, you can also write

void _add( __v8hi* dst, __v8hi const * src, size_t n )
{
    n /= 8;
    for( size_t i = 0; i < n; ++i )
        dst[i] += src[i];
}
__v8hi is an abbreviation for "vector of 8 half integers", i.e. typedef short __v8hi __attribute__ ((__vector_size__ (16)));. There are similar predefined types for each vector type, supported by both gcc and icc.
This will result in almost the same code, which may or may not be even faster. But one could argue that it is more readable and it can easily be extended to AVX, possibly even by the compiler.
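A hedged sketch of such an AVX extension (the typedef and function name are my own for illustration; a single 256-bit add on 16-bit lanes needs AVX2):

#include <stddef.h>
#include <stdint.h>

typedef uint16_t v16u16 __attribute__ ((__vector_size__ (32)));

void _add_avx( v16u16 * dst, v16u16 const * src, size_t n )
{
    n /= 16;                    // n is still a count of uint16_t elements
    for( size_t i = 0; i < n; ++i )
        dst[i] += src[i];       // compiles to vpaddw on ymm registers with AVX2
}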
With gcc/clang at least, foo = *dst; is exactly the same as foo = _mm_load_si128(dst);. The _mm_load_si128 way is usually preferred by convention, but plain C/C++ dereferencing of an aligned __m128i* is also safe.
The main purpose of the load/loadu intrinsics is to communicate alignment information to the compiler. For float/double, they also type-cast between (const) float* and __m128, or (const) double* <-> __m128d. For integer, you still have to cast yourself :(. But that's fixed with AVX512 intrinsics, where the integer load/store intrinsics take void* args.
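A quick sketch of that casting difference (function names are made up; the last one assumes AVX-512F is available):

#include <stdint.h>
#include <immintrin.h>

__m128  load_floats( float const * p )   { return _mm_loadu_ps( p ); }                      // no cast needed
__m128i load_words( int16_t const * p )  { return _mm_loadu_si128( (__m128i const*) p ); }  // cast required
#ifdef __AVX512F__
__m512i load_words_512( void const * p ) { return _mm512_loadu_si512( p ); }                // takes void*
#endif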
Compilers can still optimize away dead stores or reloads, and fold loads into memory operands for ALU instructions. But when they do actually emit stores or loads in their assembly output, they do it in a way that won't fault given the alignment guarantees (or lack thereof) in your source.
Using aligned intrinsics lets compilers fold loads into memory operands for ALU instructions with SSE or AVX. But unaligned load intrinsics can only fold with AVX, because SSE memory operands are like movdqa loads. e.g. _mm_add_epi16(xmm0, _mm_loadu_si128(rax)) could compile to vpaddw xmm0, xmm0, [rax] with AVX, but with SSE would have to compile to movdqu xmm1, [rax] / paddw xmm0, xmm1. A load instead of loadu could let it avoid a separate load instruction with SSE, too.
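As C source, that example might look like this minimal sketch (function name invented for illustration):

#include <stdint.h>
#include <emmintrin.h>

__m128i add_from_unaligned( __m128i v, uint16_t const * p )
{
    // With -mavx this loadu can fold into a vpaddw memory operand;
    // with plain SSE2 it must become a separate movdqu followed by paddw.
    return _mm_add_epi16( v, _mm_loadu_si128( (__m128i const*) p ) );
}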
As is normal for C, dereferencing a __m128i* is assumed to be an aligned access, like load_si128 or store_si128.
In gcc's emmintrin.h, the __m128i type is defined with __attribute__ ((__vector_size__ (16), __may_alias__)). If it had used __attribute__ ((__vector_size__ (16), __may_alias__, aligned(1))), gcc would treat a dereference as an unaligned access.
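A minimal sketch of such an aligned(1) type (typedef and function name are my own; recent gcc headers ship a similar __m128i_u typedef for this purpose):

#include <emmintrin.h>

typedef long long unaligned_m128i __attribute__ ((__vector_size__ (16), __may_alias__, aligned(1)));

__m128i load_maybe_unaligned( void const * p )
{
    // Dereferencing the aligned(1) type compiles to movdqu instead of
    // assuming 16-byte alignment.
    return (__m128i) *(unaligned_m128i const*) p;
}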