Background: GCC's built-in vector extensions for C give a fairly natural representation of SIMD vectors as C "types." According to the documentation, many built-in operators are supported (+, -, etc.). However, the ternary operator and the logical operators (&&, ||) for some reason only work on vectors in C++. This is an issue for an all-C codebase.
The question: In GCC C, how would one implement SIMD-compatible [branchless] conditionals of the form:
v4si a = {2,-1,3,4}, b, indicesLessThan0;
indicesLessThan0 = a < 0;
b = indicesLessThan0 ? a : 0;
And, more generally, how to perform an arbitrary independent block of statements based on that same result:
v4si c = {9,8,7,6}, d;
for (int i = 0; i < 4; i++) {
if (indicesLessThan0[i]) { // consider tests one by one
b[i] = a[i]; // as the ternary operator does above
d[i] = c[i] + 1; // some other independent operation
}
else {
b[i] = 0; // as the ternary operator does above
d[i] = c[i] - 1; // another independent operation
}
}
If doing a block of statements is harder (SIMD branching is bad), it would be fine to perform the ternary test again for any additional statements at the cost (supposedly) of some efficiency:
d = indicesLessThan0 ? c + 1 : c - 1; // the other operation in the loop
But the ternary operator doesn't work in C for some reason the manual doesn't explain. Is there another easy way? Some way of using if statements?
I have found 3 solutions as a result of hitting the code with the kitchen sink.
First, switch to g++. This is not too hard: it turns out most of the code can be ported just by putting a (type *) cast before every malloc-family call. Then I can just do:
v16s8 condStor = test ? a : b;
Second, and even better, I discovered you can just bit-bash using various mixes of & and |, the same way everyone does with bits inside integers. The trick is that vector comparisons set every true lane to all ones (11111111..., which is -1 as a signed integer), which makes values stick when using bitwise operators.
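As a rough sketch of that idea in plain C, the select and the loop from the question could look something like the following. The names select_v4si, split_by_sign and check_example are made up for illustration, and the code assumes every mask lane is exactly 0 or -1, which is what GCC's vector comparison operators produce.

```c
#include <assert.h>

/* Four signed int lanes, matching the v4si used in the question. */
typedef int v4si __attribute__((vector_size(16)));

/* Lane-wise select: take x where the mask lane is all ones, y where it is 0.
   Assumes each mask lane is exactly 0 or -1. */
static inline v4si select_v4si(v4si mask, v4si x, v4si y)
{
    return (mask & x) | (~mask & y);
}

/* Branchless version of the loop from the question (hypothetical helper). */
static inline void split_by_sign(v4si a, v4si c, v4si *b, v4si *d)
{
    const v4si zero = {0, 0, 0, 0};
    const v4si one = {1, 1, 1, 1};
    v4si neg = a < zero;             /* -1 where a[i] < 0, else 0 */

    *b = neg & a;                    /* select(neg, a, 0) reduces to an AND */
    *d = select_v4si(neg, c + one, c - one);
}

/* Self-check against the scalar loop, using the vectors from the question. */
void check_example(void)
{
    v4si a = {2, -1, 3, 4}, c = {9, 8, 7, 6}, b, d;
    split_by_sign(a, c, &b, &d);
    for (int i = 0; i < 4; i++) {
        assert(b[i] == (a[i] < 0 ? a[i] : 0));
        assert(d[i] == (a[i] < 0 ? c[i] + 1 : c[i] - 1));
    }
}
```

Both sides of the select are computed for every lane and the mask only picks which result survives, so this suits cheap, side-effect-free expressions like the ones in the question.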
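Third, call the SSE2 masked-store builtin directly. It writes only the bytes of a whose mask byte in test has its top bit set, so condStor starts out as a copy of b:
v16s8 condStor = b; __builtin_ia32_maskmovdqu (a, test, (char *)(&condStor));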
Not convinced? Check the assembly:
# 1. ternary operator (g++)
pxor %xmm1, %xmm1
movdqa -64(%rbp), %xmm0
pcmpeqb %xmm1, %xmm0
pcmpeqd %xmm1, %xmm1
pandn %xmm1, %xmm0
pxor %xmm1, %xmm1
pcmpgtb %xmm0, %xmm1
movdqa %xmm1, %xmm0
movdqa -32(%rbp), %xmm2
movdqa -16(%rbp), %xmm1
pand %xmm0, %xmm1
pandn %xmm2, %xmm0
por %xmm1, %xmm0
movaps %xmm0, -80(%rbp)

# 2. bitwise & and |
movdqa -64(%rbp), %xmm0
movdqa %xmm0, %xmm1
pand -16(%rbp), %xmm1
pcmpeqd %xmm0, %xmm0
pxor -64(%rbp), %xmm0
pand -32(%rbp), %xmm0
por %xmm1, %xmm0
movaps %xmm0, -80(%rbp)

# 3. maskmovdqu builtin
movdqa -32(%rbp), %xmm0
movaps %xmm0, -80(%rbp)
leaq -80(%rbp), %rax
movdqa -16(%rbp), %xmm0
movdqa -64(%rbp), %xmm1
movq %rax, %rdi
maskmovdqu %xmm1, %xmm0
Judging by how convoluted 1 appears to be, followed by 2, followed by 3, I now see the cost of the C++ abstraction. Maybe this is what Linus was ranting about back in the day. (No, probably not.) Anyway, hope this helps someone!