The code i want to optimize is basically a simple but large arithmetic formula, it should be fairly simple to analyze the code automatically to compute the independent multiplications/additions in parallel, but i read that autovectorization only works for loops.
I've read multiple times now that access of single elements in a vector via union or some other way should be avoided at all costs, instead should be replaced by a _mm_shuffle_pd (i'm working on doubles only)...
I don't seem to figure out how I can store the content of a __m128d vector as doubles without accessing it as a union. Also, does an operation like this give any performance gain when compared to scalar code?
union {
__m128d v;
double d[2];
} vec;
union {
__m128d v;
double d[2];
} vec2;
vec.v = index1;
vec2.v = index2;
temp1 = _mm_mul_pd(temp1, _mm_set_pd(bvec[vec.d[1]], bvec[vec2[1]]));
also, the two unions look ridiculously ugly, but when using
union dvec {
__m128d v;
double d[2];
} vec;
Trying to declare the indexX as dvec, the compiler complained dvec is undeclared.
Unfortunately if you look at MSDN it says the following:
You should not access the __m128d fields directly. You can, however, see these types in the debugger. A variable of type __m128 maps to the XMM[0-7] registers.
I'm no expert in SIMD, however this tells me that what you're doing won't work as it's just not designed to.
EDIT:
I've just found this, and it says:
Use __m128, __m128d, and __m128i only on the left-hand side of an assignment, as a return value, or as a parameter. Do not use it in other arithmetic expressions such as "+" and ">>".
It also says:
Use __m128, __m128d, and __m128i objects in aggregates, such as unions (for example, to access the float elements) and structures.
So maybe you can use them, but only in unions. Seems contradictory to what MSDN says, however.
EDIT2:
Here is another interesting resource that describes with examples on how to use these SIMD types
In the above link, you'll find this line:
#include <math.h>
#include <emmintrin.h>
double in1_min(__m128d x)
{
return x[0];
}
In the above we use a new extension in gcc 4.6 to access the high and low parts via indexing. Older versions of gcc require using a union and writing to an array of two doubles. This is cumbersome, and extra slow when optimization is turned off.
_mm_cvtsd_f64
+ _mm_unpackhi_pd
For doubles:
#include <assert.h>
#include <x86intrin.h>
int main(void) {
__m128d x = _mm_set_pd(1.5, 2.5);
/* _mm_cvtsd_f64 + _mm_unpackhi_pd */
assert(_mm_cvtsd_f64(x) == 2.5);
assert(_mm_cvtsd_f64(_mm_unpackhi_pd(x, x)) == 1.5);
}
For floats, I have posted the following examples at How to convert a hex float to a float in C/C++ using _mm_extract_ps SSE GCC instrinc function
_mm_cvtss_f32
+ _mm_shuffle_ps
_MM_EXTRACT_FLOAT
For ints you can use _mm_extract_epi32
:
#include <assert.h>
#include <x86intrin.h>
int main(void) {
__m128i x = _mm_set_epi32(1, 2, 3, 4);
assert(_mm_extract_epi32(x, 3) == 4);
assert(_mm_extract_epi32(x, 2) == 3);
assert(_mm_extract_epi32(x, 1) == 1);
assert(_mm_extract_epi32(x, 0) == 1);
}
GitHub upstream.
Compile and run examples with:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
./main.out
Tested on Ubuntu 19.04 amd64.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With