I don't know if I'm missing something in my understanding of how AVX intrinsics work with std::array, but I'm having a strange issue with Clang when I combine the two.
Sample code:
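#include <immintrin.h>
#include <array>
#include <cstddef>
#include <iostream>

std::array<__m256, 1> gen_data()
{
    std::array<__m256, 1> res;
    res[0] = _mm256_set1_ps(1);   // broadcast 1.0f to all 8 lanes
    return res;
}

int main()
{
    auto v = gen_data();
    float a[8];
    _mm256_storeu_ps(a, v[0]);    // unaligned store of the 8 lanes
    for(std::size_t i = 0; i < 8; ++i)
    {
        std::cout << a[i] << std::endl;
    }
}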
Output from Clang 3.5.0 (upper 4 floats are garbage data):
1 1 1 1 8.82272e-39 0 5.88148e-39 0
Output from GCC 4.8.2/4.9.1 (expected):
1 1 1 1 1 1 1 1
If I instead pass v into gen_data as an output parameter, it works fine on both compilers. I'm willing to accept that this might be a bug in Clang, but I don't know whether it might instead be undefined behavior (UB). Testing with Clang 3.7 (svn build), Clang now appears to give the expected result. If I switch to 128-bit SSE intrinsics (__m128), all compilers give the same expected results.
So my questions are:
This looks like a Clang bug that has since been fixed, as shown by this bug report, which demonstrates a very similar problem using regular arrays.
Assuming std::array implements its storage similarly to this:
    T elems[N];
which both libc++ and libstdc++ seem to do, this case should be analogous. One of the comments on the bug report says:
However, libc++'s std::array<__m256i, 1> does not work at any optimization level.
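To make the storage assumption above concrete, here is a minimal sketch of an array wrapper with a single `T elems[N]` member. The names `simple_array`, `vec8f`, and `gen_data_like` are hypothetical and are not taken from libc++ or libstdc++. `vec8f` is a stand-in for `__m256`, used only so the sketch compiles without AVX flags. The size and alignment checks hold on mainstream ABIs but are not guaranteed by the standard.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a 256-bit vector type such as __m256 (assumption: same
// size and alignment, but no AVX flags are needed to compile it).
struct alignas(32) vec8f {
    float lanes[8];
};

// Hypothetical wrapper whose only data member is a raw array,
// mirroring the "T elems[N];" layout assumed in the text.
template <typename T, std::size_t N>
struct simple_array {
    T elems[N];

    T& operator[](std::size_t i) { assert(i < N); return elems[i]; }
    const T& operator[](std::size_t i) const { assert(i < N); return elems[i]; }
};

// The wrapper adds no members, so on mainstream ABIs it has the same size
// and alignment as the raw array it contains.
static_assert(sizeof(simple_array<vec8f, 1>) == sizeof(vec8f[1]),
              "wrapper should be the same size as the raw array");
static_assert(alignof(simple_array<vec8f, 1>) == alignof(vec8f),
              "wrapper should inherit the element alignment");

// Returns the wrapper by value, as gen_data() does in the question.
simple_array<vec8f, 1> gen_data_like() {
    simple_array<vec8f, 1> res{};
    for (float& lane : res[0].lanes) {
        lane = 1.0f;
    }
    return res;
}
```

Because the wrapper is layout-equivalent to the raw array, a code generation bug that affects returning raw arrays of 256-bit vectors inside a struct would plausibly affect a `std::array` of them as well.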
The bug report was actually based off of this SO question: Is this incorrect code generation with arrays of __m256 values a clang bug? which is very similar but deals with the regular array case.
The bug report contains one possible work-around, which the OP stated is sufficient:
In my actual code, num_vectors is calculated based on some C++ template parameters to the simd_pack type. In many cases, that comes out to be 1, but it also is often greater than 1. Your observation gives me an idea, though; I could try to introduce a template specialization that catches the case where num_vectors == 1. It could instead just use a single __m256 member instead of an array of size 1. I'll have to check to see how feasible that is.
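The specialization idea in that comment might look roughly like the sketch below. It is not the OP's actual code: `simd_pack_storage`, `at`, `make_filled`, and the stand-in type `vec8f` are hypothetical names introduced for illustration. `vec8f` replaces `__m256` only so the sketch compiles without AVX flags. In real code the `Vec` parameter would be `__m256`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-in for __m256 (assumption: used so no AVX flags are required).
struct alignas(32) vec8f {
    float lanes[8];
};

// Primary template: general case, vectors stored in an array.
template <typename Vec, std::size_t num_vectors>
struct simd_pack_storage {
    std::array<Vec, num_vectors> data;

    Vec& at(std::size_t i) { assert(i < num_vectors); return data[i]; }
    const Vec& at(std::size_t i) const { assert(i < num_vectors); return data[i]; }
    static constexpr std::size_t size() { return num_vectors; }
};

// Partial specialization for num_vectors == 1: a single plain member,
// so no array of size 1 is ever passed or returned by value.
template <typename Vec>
struct simd_pack_storage<Vec, 1> {
    Vec data;

    Vec& at(std::size_t i) { assert(i == 0); return data; }
    const Vec& at(std::size_t i) const { assert(i == 0); return data; }
    static constexpr std::size_t size() { return 1; }
};

// Hypothetical helper mirroring gen_data(): fills every lane with `value`.
template <std::size_t num_vectors>
simd_pack_storage<vec8f, num_vectors> make_filled(float value) {
    simd_pack_storage<vec8f, num_vectors> res{};
    for (std::size_t v = 0; v < num_vectors; ++v) {
        for (float& lane : res.at(v).lanes) {
            lane = value;
        }
    }
    return res;
}
```

Both variants expose the same `at` and `size` interface, so calling code does not need to know which storage form was selected. Whether this actually sidesteps the miscompilation would need to be checked against the affected Clang version.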