This question relates to gcc (4.6.3 Ubuntu) and its behavior in unrolling loops for SSE intrinsics with immediate operands.
An example of an intrinsic with an immediate operand is _mm_blend_ps. It expects a 4-bit immediate integer, which must be a compile-time constant. However, with the -O3 option, the compiler apparently unrolls loops automatically (if the loop counter values can be determined at compile time) and produces multiple instances of the corresponding blend instruction with different immediate values.
This is a simple test code (blendsimple.c) which runs through the 16 possible values of the immediate operand of blend:
#include <stdio.h>
#include <x86intrin.h>

#define PRINT(V) \
    printf("%s: ", #V); \
    for (i = 3; i >= 0; i--) printf("%3g ", V[i]); \
    printf("\n");

int
main()
{
    __m128 a = _mm_set_ps(1, 2, 3, 4);
    __m128 b = _mm_set_ps(5, 6, 7, 8);
    int i;
    PRINT(a);
    PRINT(b);
    unsigned mask;
    __m128 r;
    for (mask = 0; mask < 16; mask++) {
        r = _mm_blend_ps(a, b, mask);
        PRINT(r);
    }
    return 0;
}
It is possible to compile this code with
gcc -Wall -march=native -O3 -o blendsimple blendsimple.c
and the code works. Obviously the compiler unrolls the loop and inserts constants for the immediate operand.
However, if you compile the code with
gcc -Wall -march=native -O2 -o blendsimple blendsimple.c
you get the following error for the blend intrinsic:
error: the last argument must be a 4-bit immediate
Now I tried to find out which specific compiler flag is active in -O3 but not in -O2 which allows the compiler to unroll the loop, but failed. Following the gcc online docs at
https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/Overall-Options.html
I executed the following commands:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
which lists all options enabled by -O3 but not by -O2. When I add all of the 7 listed flags in addition to -O2
gcc -Wall -march=native -O2 -fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops -o blendsimple blendsimple.c
I would expect that the behavior is exactly the same as with -O3. However, the compiler complains that "the last argument must be a 4-bit immediate".
Does anyone have an idea what the problem is? I think it would be good to know which flag is required to enable this type of loop unrolling so that it can be activated selectively using #pragma GCC optimize or by a function attribute.
(I was also surprised that -O3 obviously doesn't even enable the unroll-loops option).
I would be grateful for any help. This is for a lecture on SSE programming I give.
Edit: Thanks a lot for your comments. jtaylor seems to be right. I got my hands on two newer versions of gcc (4.7.3, 4.8.2), and 4.8.2 complains about the immediate operand regardless of the optimization level. Moreover, I later noticed that gcc 4.6.3 compiles the code with -O2 -funroll-loops, but this also fails in 4.8.2. So apparently one cannot rely on this feature and should always unroll "manually" using cpp or templates, as Jason R pointed out.
Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff.
The UNROLL pragma specifies to the compiler how many times a loop should be unrolled. The UNROLL pragma is useful for helping the compiler utilize SIMD instructions. It is also useful in cases where better utilization of software pipeline resources is needed than a non-unrolled loop provides.
I am not sure if this applies to your situation, since I am not familiar with SSE intrinsics. But generally, you can tell the compiler to specifically optimize a section of code with:
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
/* functions defined here are compiled with unroll-loops enabled */
#pragma GCC pop_options
Source: Tell gcc to specifically unroll a loop