Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement "_mm_storeu_epi64" without aliasing problems?

(Note: Although this question is about "store", the "load" case has the same issues and is perfectly symmetric.)

The SSE intrinsics provide an _mm_storeu_pd function with the following signature:

void _mm_storeu_pd (double *p, __m128d a);

So if I have vector of two doubles, and I want to store it to an array of two doubles, I can just use this intrinsic.

However, my vector is not two doubles; it is two 64-bit integers, and I want to store it to an array of two 64-bit integers. That is, I want a function with the following signature:

void _mm_storeu_epi64 (int64_t *p, __m128i a);

But the intrinsics provide no such function. The closest they have is _mm_storeu_si128:

void _mm_storeu_si128 (__m128i *p, __m128i a);

The problem is that this function takes a pointer to __m128i, while my array is an array of int64_t. Writing to an object via the wrong type of pointer is a violation of strict aliasing and is definitely undefined behavior. I am concerned that my compiler, now or in the future, will reorder or otherwise optimize away the store thus breaking my program in strange ways.

To be clear, what I want is a function I can invoke like this:

__m128i v = _mm_set_epi64x(2,1);
int64_t ra[2];
_mm_storeu_epi64(&ra[0], v); // does not exist, so I want to implement it

Here are six attempts to create such a function.

Attempt #1

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p), a);
}

This appears to have the strict aliasing problem I am worried about.

Attempt #2

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    _mm_storeu_si128(static_cast<__m128i *>(static_cast<void *>(p)), a);
}

Possibly better in general, but I do not think it makes any difference in this case.

Attempt #3

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    union TypePun {
        int64_t a[2];
        __m128i v;
     };
    TypePun *p_u = reinterpret_cast<TypePun *>(p);
    p_u->v = a;
}

This generates incorrect code on my compiler (GCC 4.9.0), which emits an aligned movaps instruction instead of an unaligned movups. (The union is aligned, so the reinterpret_cast tricks GCC into assuming p_u is aligned, too.)

Attempt #4

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    union TypePun {
        int64_t a[2];
        __m128i v;
     };
    TypePun *p_u = reinterpret_cast<TypePun *>(p);
    _mm_storeu_si128(&p_u->v, a);
}

This appears to emit the code I want. The "type-punning via union" trick, although technically undefined in C++, is widely-supported. But is this example -- where I pass a pointer to an element of a union rather than access via the union itself -- really a valid way to use the union for type-punning?

Attempt #5

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    p[0] = _mm_extract_epi64(a, 0);
    p[1] = _mm_extract_epi64(a, 1);
}

This works and is perfectly valid, but it emits two instructions instead of one.

Attempt #6

void _mm_storeu_epi64(int64_t *p, __m128i a) {
    std::memcpy(p, &a, sizeof(a));
}

This works and is perfectly valid... I think. But it emits frankly terrible code on my system. GCC spills a to an aligned stack slot via an aligned store, then manually moves the component words to the destination. (Actually it spills it twice, once for each component. Very strange.)

...

Is there any way to write this function that will (a) generate optimal code on a typical modern compiler and (b) have minimal risk of running afoul of strict aliasing?

like image 946
Nemo Avatar asked Jul 16 '14 17:07

Nemo


1 Answers

SSE intrinsics is one of those niche corner cases where you have to push the rules a bit.

Since these intrinsics are compiler extensions (somewhat standardized by Intel), they are already outside the specification of the C and C++ language standards. So it's somewhat self-defeating to try to be "standard compliant" while using a feature that clearly is not.

Despite the fact that the SSE intrinsic libraries try to act like normal 3rd party libraries, underneath, they are all specially handled by the compiler.


The Intent:

The SSE intrinsics were likely designed from the beginning to allow aliasing between the vector and scalar types - since a vector really is just an aggregate of the scalar type.

But whoever designed the SSE intrinsics probably wasn't a language pedant.
(That's not too surprising. Hard-core low-level performance programmers and language lawyering enthusiasts tend to be very different groups of people who don't always get along.)

We can see evidence of this in the load/store intrinsics:

  • __m128i _mm_stream_load_si128(__m128i* mem_addr) - A load intrinsic that takes a non-const pointer?
  • void _mm_storeu_pd(double* mem_addr, __m128d a) - What if I want to store to __m128i*?

The strict aliasing problems are a direct result of these poor prototypes.

Starting from AVX512, the intrinsics have all been converted to void* to address this problem:

  • __m512d _mm512_load_pd(void const* mem_addr)
  • void _mm512_store_epi64 (void* mem_addr, __m512i a)

Compiler Specifics:

  • Visual Studio defines each of the SSE/AVX types as a union of the scalar types. This by itself allows strict-aliasing. Furthermore, Visual Studio doesn't do strict-aliasing so the point is moot:

  • The Intel Compiler has never failed me with all sorts of aliasing. It probably doesn't do strict-aliasing either - though I've never found any reliable source for this.

  • GCC does do strict-aliasing, but from my experience, not across function boundaries. It has never failed me to cast pointers which are passed in (on any type). GCC also declares SSE types as __may_alias__ thereby explicitly allowing it to alias other types.


My Recommendation:

  • For function parameters that are of the wrong pointer type, just cast it.
  • For variables declared and aliased on the stack, use a union. That union will already be aligned so you can read/write to them directly without intrinsics. (But be aware of store-forwarding issues that come with interleaving vector/scalar accesses.)
  • If you need to access a vector both as a whole and by its scalar components, consider using insert/extract intrinsics instead of aliasing.
  • When using GCC, turn on -Wall or -Wstrict-aliasing. It will tell you about strict-aliasing violations.
like image 51
Mysticial Avatar answered Nov 17 '22 01:11

Mysticial