What gcc option enables loop unrolling for SSE intrinsics with immediate operands?

Tags: c, gcc, sse

This question relates to gcc (4.6.3 Ubuntu) and its behavior in unrolling loops for SSE intrinsics with immediate operands.

An example of an intrinsic with immediate operand is _mm_blend_ps. It expects a 4-bit immediate integer which can only be a constant. However, using the -O3 option, the compiler apparently automatically unrolls loops (if the loop counter values can be determined at compile time) and produces multiple instances of the corresponding blend instruction with different immediate values.

This is a simple test code (blendsimple.c) which runs through the 16 possible values of the immediate operand of blend:

#include <stdio.h>
#include <x86intrin.h>

#define PRINT(V)                \
  printf("%s: ", #V);               \
  for (i = 3; i >= 0; i--) printf("%3g ", V[i]);    \
  printf("\n");

int
main()
{
  __m128 a = _mm_set_ps(1, 2, 3, 4);
  __m128 b = _mm_set_ps(5, 6, 7, 8);
  int i;
  PRINT(a);
  PRINT(b);
  unsigned mask;
  __m128 r;
  for (mask = 0; mask < 16; mask++) {
    r = _mm_blend_ps(a, b, mask);
    PRINT(r);
  }
  return 0;
}

It is possible to compile this code with

gcc -Wall -march=native -O3 -o blendsimple blendsimple.c

and the code works. Obviously the compiler unrolls the loop and inserts constants for the immediate operand.

However, if you compile the code with

gcc -Wall -march=native -O2 -o blendsimple blendsimple.c

you get the following error for the blend intrinsic:

error: the last argument must be a 4-bit immediate

I tried to find out which specific compiler flag is active at -O3 but not at -O2 and allows the compiler to unroll the loop, but without success. Following the gcc online docs at

https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/Overall-Options.html

I executed the following commands:

gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled

which lists all options enabled by -O3 but not by -O2. When I add all of the 7 listed flags in addition to -O2

gcc -Wall -march=native -O2 -fgcse-after-reload -finline-functions -fipa-cp-clone -fpredictive-commoning -ftree-loop-distribute-patterns -ftree-vectorize -funswitch-loops -o blendsimple blendsimple.c

I would expect that the behavior is exactly the same as with -O3. However, the compiler complains that "the last argument must be a 4-bit immediate".

Does anyone have an idea what the problem is? I think it would be good to know which flag is required to enable this type of loop unrolling so that it can be activated selectively using #pragma GCC optimize or by a function attribute.

(I was also surprised that -O3 apparently doesn't even enable the unroll-loops option.)

I would be grateful for any help. This is for a lecture on SSE programming I give.

Edit: Thanks a lot for your comments. jtaylor seems to be right. I got my hands on two newer versions of gcc (4.7.3, 4.8.2), and 4.8.2 complains about the immediate operand regardless of the optimization level. Moreover, I later noticed that gcc 4.6.3 compiles the code with -O2 -funroll-loops, but this also fails in 4.8.2. So apparently one cannot trust this feature and should always unroll "manually" using cpp or templates, as Jason R pointed out.
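The manual unrolling mentioned in the edit could look roughly like the following sketch. It uses the C preprocessor to generate one case per mask value, so every call to _mm_blend_ps receives a literal constant and no unrolling by the compiler is needed. The helper name blend_ps_runtime and the macro BLEND_CASE are hypothetical. The target attribute is an assumption that lets recent GCC and Clang compile the function without -msse4.1; older releases such as the 4.6.3 used above may still need -msse4.1 or -march=native on the command line.

```c
#include <assert.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_blend_ps */

/* Expands to one switch case in which N is a literal constant,
   so the immediate operand is known at compile time. */
#define BLEND_CASE(N) case N: return _mm_blend_ps(a, b, N)

/* Hypothetical helper: maps a runtime mask (0..15) to the blend
   instruction with the matching compile-time immediate. */
__attribute__((target("sse4.1")))
__m128 blend_ps_runtime(__m128 a, __m128 b, unsigned mask)
{
    assert(mask < 16);  /* the immediate has only 4 bits */
    switch (mask) {
        BLEND_CASE(0);  BLEND_CASE(1);  BLEND_CASE(2);  BLEND_CASE(3);
        BLEND_CASE(4);  BLEND_CASE(5);  BLEND_CASE(6);  BLEND_CASE(7);
        BLEND_CASE(8);  BLEND_CASE(9);  BLEND_CASE(10); BLEND_CASE(11);
        BLEND_CASE(12); BLEND_CASE(13); BLEND_CASE(14); BLEND_CASE(15);
        default: return a;  /* not reached when mask < 16 */
    }
}
```

With such a helper, the loop in the test program would call r = blend_ps_runtime(a, b, mask); instead of passing the loop counter to _mm_blend_ps directly, and the result would no longer depend on the optimization level.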


asked Jul 18 '14 11:07

Ralf

1 Answer

I am not sure if this applies to your situation, since I am not familiar with SSE intrinsics. But generally, you can tell the compiler to optimize a specific section of code with:

 #pragma GCC push_options
 #pragma GCC optimize ("unroll-loops")

 /* functions to be compiled with loop unrolling go here */

 #pragma GCC pop_options

Source: Tell gcc to specifically unroll a loop
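As a minimal sketch of how these pragmas are placed: they go at file scope and apply to the functions defined between them. The function sum16 is a hypothetical example, not taken from the question. Note that this only requests unrolling; as the question's edit reports, it does not reliably turn a loop counter into an immediate operand on newer gcc versions.

```c
#include <assert.h>
#include <stddef.h>

#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")

/* Hypothetical example: sums 16 floats. The fixed trip count makes
   the loop a candidate for unrolling under the pragma above. */
float sum16(const float *v)
{
    assert(v != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < 16; i++)
        s += v[i];
    return s;
}

#pragma GCC pop_options
/* Functions defined from here on use the command-line options again. */
```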


answered Sep 28 '22 20:09

pAndrei
