Converting to and from __m256i and std::vector

Question

I want to convert to and from __m256i instances and std::vector<uint32_t> instances (containing exactly 8 elements).

So far I came up with this:

using vu32 = std::vector<uint32_t>;

__m256i v2v(const vu32& in) {
    assert(in.size() == 8);
    return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in.data()));
}

vu32 v2v(__m256i in) {
    vu32 out(8);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()), in);
    return out;
}

Is it safe?

Is there a more idiomatic way to do it?

Peter Cordes · Accepted Answer

Well first of all, SIMD vectors and std::vector have basically nothing to do with each other. I know you already know this, but future readers should think carefully about whether this is really something they want to do.

It's safe; .data() has to return a pointer that can be read or written at any valid index. It's certainly safe in practice given the implementation details of real std::vector libraries. And I'm pretty sure in the abstract as far as on-paper standards.

From comments, it seems you're worried about strict-aliasing UB.

Read/write of other objects via may_alias pointer types (including char* or __m256i*) is fine. memcpy(&a, &b, sizeof(a)) is a common example of modifying the object-representation of a via char*. There's nothing special about memcpy itself; that's well-defined because of the char* aliasing special case.

may_alias is a GNU C extension that lets you define types other than char which are allowed to alias the way char* can. GNU C's definition of __m128 / __m256i is in terms of GNU C native vectors like typedef long long __m256i __attribute((vector_size(32), may_alias)); Other C++ implementations (like MSVC) define __m256i differently, but the Intel intrinsics API guarantees that aliasing vector-pointers onto other types is legal in any case where char* / memcpy would be.

See also Is `reinterpret_cast`ing between hardware vector pointer and the corresponding type an undefined behavior?

Also: SSE: Difference between _mm_load/store vs. using direct pointer access - loadu / storeu are like casting a an aligned(1) version of the vector type before dereferencing. So all this reasoning about pointers and aliasing applies to passing a pointer to _mm_storeu, not just to to dereferencing directly.

Idiomatic; well sure, this looks like pretty idiomatic C++. I might still use C-style casts with intrinsics just because reinterpret is so long to read and the poorly-designed intrinsics API for integer vectors needs it all over the place. Maybe a templated wrapper function for si256 load/loadu and store/storeu would be appropriate, that casts to __m256i* or const __m256i* from any pointer type.

I might prefer something that passed the __m256i elements to the constructor of out, though, to stop stupid compilers from potentially zeroing the memory and then storing the vector. But hopefully that doesn't happen.

In practice gcc and clang do optimize away the dead stores to zero 8 elements before storing the vector. Any attempt to use the vector(begin, end) iterator constructor instead makes things worse, with extra code for exception handling on top of the store/reload of in to the stack (around new), then storing it into the newly-allocated memory.

See some attempts on the Godbolt compiler explorer, note that they save/restore r13 where @Bee's version doesn't, as well having extra code generated outside the normal path through the function. This goes away with -fno-exceptions, but then they're just equal, not better, than @Bee's version. So use the code in the question; it compiles at least as well as any of my attempts to be different.

I might also prefer doing something to get the new std::vector<uint32_t> allocated with 32-byte aligned memory, if that's possible without changing the template type. I'm not sure if that is possible.

Even if we could just make this initial allocation aligned in practice without changing the type to make that a compile-time guarantee for future use, that would potentially help. AVX code that leaves unaligned handling to HW would benefit from not having cache-line splits.

But I don't think that's possible either without hacking a custom constructor for std::vector that does the initial allocation with an aligned new, assuming that's compatible with regular delete.

If you can use a std::vector<uint32_t, some_aligned_allocator> everywhere in your code, that might be worth doing. But probably not worth the trouble if you have to pass it to code that uses normal vector<uint32_t>.

You could lie to your compiler because that type is binary-compatible (but not source-compatible) with regular std::vector<uint32_t>, on systems where aligned new/delete are compatible with plain new/delete. But I don't recommend that.

Converting to and from __m256i and std::vector<uint32_t>

Tags:

c++

intel

simd

intrinsics

avx2

BeeOnRope

1 Answers

Peter Cordes

Recent Activity

Donate For Us