How do I write a portable GNU C builtin vectors version of this, which doesn't depend on the x86 set1 intrinsic?
typedef uint16_t v8su __attribute__((vector_size(16)));
v8su set1_u16_x86(uint16_t scalar) {
return (v8su)_mm_set1_epi16(scalar); // cast needed for gcc
}
Surely there must be a better way than
v8su set1_u16(uint16_t s) {
return (v8su){s,s,s,s, s,s,s,s};
}
I don't want to write an AVX2 version of that for broadcasting a single byte!
Even a gcc-only or clang-only answer to this part would be interesting, for cases where you want to assign to a variable instead of only using as an operand to a binary operator (which works well with gcc, see below).
If I want to use a broadcast-scalar as one operand of a binary operator, this works with gcc (as documented in the manual), but not with clang:
v8su vecdiv10(v8su v) { return v / 10; } // doesn't compile with clang
With clang, if I'm targeting only x86 and just using native vector syntax to get the compiler to generate modular multiplicative inverse constants and instructions for me, I can write:
v8su vecdiv_set1(v8su v) {
return v / (v8su)_mm_set1_epi16(10); // gcc needs the cast
}
But then I have to change the intrinsic if I widen the vector (to _mm256_set1_epi16), instead of converting the whole code to AVX2 by changing to vector_size(32) in one place (for pure-vertical SIMD that doesn't need shuffling). It also defeats part of the purpose of native vectors, since that won't compile for ARM or any non-x86 target.
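To illustrate the "one place" claim: with pure GNU C native vectors, widening to 256 bits only requires changing the vector_size; every vertical operation compiles unchanged. A minimal sketch (the typedef name v16su and the function add16 are my own, not from the original code):

```c
#include <stdint.h>

// Widening to AVX2 width: only the vector_size changed vs. the 128-bit typedef.
typedef uint16_t v16su __attribute__((vector_size(32)));  // 16 x uint16_t

// Same body as a 128-bit version would have; no intrinsic to swap out.
v16su add16(v16su a, v16su b) { return a + b; }
```

Even without -mavx2, gcc and clang will compile this (splitting into two 128-bit operations if necessary), so the source stays target-agnostic.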
The ugly cast is required because gcc, unlike clang, doesn't consider v8su {aka __vector(8) short unsigned int} compatible with __m128i {aka __vector(2) long long int}.
BTW, all of this compiles to good asm with gcc and clang (see it on Godbolt). This is just a question of how to write it elegantly, with readable syntax that doesn't repeat the scalar N times. e.g. v / 10 is compact enough that there's no need to even put it in its own function.
Compiling efficiently with ICC is a bonus, but not required. GNU C native vectors are clearly an afterthought for ICC, and even simple stuff like this doesn't compile efficiently. set1_u16 compiles to 8 scalar stores and a vector load, instead of MOVD / VPBROADCASTW (with -xHOST enabled, because ICC doesn't recognize -march=haswell, but Godbolt runs on a server with AVX2 support). Purely casting the results of _mm_ intrinsics is OK, but the division calls an SVML function!
A generic broadcast solution can be found for GCC and Clang using two observations: both compilers support scalar - vector operations (the scalar is implicitly broadcast to the vector type), and x - 0 = x for every x (whereas x + 0 does not work for floats due to signed zero: -0.0 + 0.0 is +0.0, not -0.0). Here is a solution for a vector of four floats.
#if defined (__clang__)
typedef float v4sf __attribute__((ext_vector_type(4)));
#else
typedef float v4sf __attribute__ ((vector_size (16)));
#endif
v4sf broadcast4f(float x) {
return x - (v4sf){};
}
https://godbolt.org/g/PXr3Xb
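As a quick usage sketch, the broadcast composes naturally with vertical arithmetic; the helper scale4f below is a hypothetical example, not part of the original answer:

```c
#if defined(__clang__)
typedef float v4sf __attribute__((ext_vector_type(4)));
#else
typedef float v4sf __attribute__((vector_size(16)));
#endif

// Zero-vector subtraction trick: {} is all-zeros, so every lane becomes x.
static v4sf broadcast4f(float x) { return x - (v4sf){}; }

// Hypothetical usage: scale every element by a runtime scalar,
// portably across gcc and clang.
v4sf scale4f(v4sf v, float s) { return v * broadcast4f(s); }
```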
The same generic solution can be used for different vectors. Here is an example for a vector of eight unsigned shorts.
#if defined (__clang__)
typedef unsigned short v8su __attribute__((ext_vector_type(8)));
#else
typedef unsigned short v8su __attribute__((vector_size(16)));
#endif
v8su broadcast8us(unsigned short x) {
return x - (v8su){};
}
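This covers the question's division-by-10 case without any x86 intrinsic: the broadcast helper makes the division compile under both gcc and clang. A minimal sketch (the name vecdiv10_portable is my own):

```c
#if defined(__clang__)
typedef unsigned short v8su __attribute__((ext_vector_type(8)));
#else
typedef unsigned short v8su __attribute__((vector_size(16)));
#endif

// Broadcast a scalar to all 8 lanes via the zero-vector subtraction trick.
static v8su broadcast8us(unsigned short x) { return x - (v8su){}; }

// Hypothetical name: a portable replacement for the question's vecdiv10,
// with no _mm_set1_epi16 and no cast.
v8su vecdiv10_portable(v8su v) { return v / broadcast8us(10); }
```

The compiler is still free to turn the division by a constant vector into a multiplicative-inverse sequence, exactly as with the intrinsic version.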
ICC (17) supports a subset of the GCC vector extensions but does not support either vector + scalar or vector * scalar yet, so intrinsics are still necessary for broadcasts. MSVC does not support any vector extensions.