
Loop unrolling in inlined functions in C

Tags: c, optimization, inline, icc

I have a question about C compiler optimization and when/how loops in inline functions are unrolled.

I am developing a numerical code which does something like the example below. Basically, my_for() would compute some kind of stencil and call op() to do something with the data in my_type *arg for each i. Here, my_func() wraps my_for(), creating the argument and passing it a function pointer to my_op(), whose job it is to modify the ith double of each of the (arg->n) double arrays arg->dest[j].

typedef struct my_type {
  int const n;
  double *dest[16];
  double const *src[16];
} my_type;

static inline void my_for( void (*op)(my_type *,int), my_type *arg, int N ) {
  int i;

  for( i=0; i<N; ++i )
    op( arg, i );
}

static inline void my_op( my_type *arg, int i ) {
  int j;
  int const n = arg->n;

  for( j=0; j<n; ++j )
    arg->dest[j][i] += arg->src[j][i];
}

void my_func( double *dest0, double *dest1, double const *src0, double const *src1, int N ) {
  my_type Arg = {
    .n = 2,
    .dest = { dest0, dest1 },
    .src = { src0, src1 }
  };

  my_for( &my_op, &Arg, N );
}

This works fine. The functions are inlined as they should be, and the code is (almost) as efficient as writing everything in a single function with the j loop unrolled by hand and no my_type Arg at all.

Here’s the confusion: if I set int const n = 2; rather than int const n = arg->n; in my_op(), then the code becomes as fast as the unrolled single-function version. So, the question is: why? If everything is being inlined into my_func(), why doesn’t the compiler see that I am literally defining Arg.n = 2? Furthermore, there is no improvement when I explicitly make the bound on the j loop arg->n, which should look just like the speedier int const n = 2; after inlining. I also tried using my_type const everywhere to really signal this const-ness to the compiler, but it just doesn't want to unroll the loop.

In my numerical code, this amounts to about a 15% performance hit. If it matters, there, n=4 and these j loops appear in a couple of conditional branches in an op().

I am compiling with icc (ICC) 12.1.5 20120612. I tried #pragma unroll. Here are my compiler options (did I miss any good ones?):

-O3 -ipo -static -unroll-aggressive -fp-model precise -fp-model source -openmp -std=gnu99 -Wall -Wextra -Wno-unused -Winline -pedantic

Thanks!

asked Jun 12 '15 11:06

FiniteElement

2 Answers

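Well, evidently the compiler isn't 'smart' enough to propagate the constant n and unroll the j loop. It plays it safe: n is read through the pointer arg, and the compiler apparently cannot prove that the value stored in Arg.n at initialization is still the value loaded when the loop runs.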
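One way to take the load through arg out of the picture is to pass the bound by value, so that after inlining it is just the literal from the call site. This is a minimal sketch under assumptions, not something I have checked against icc 12.1: the extra n parameter, the changed op signature, and dropping n from the struct are all my own changes to the code in the question.

```c
#include <assert.h>

typedef struct my_type {
  double *dest[16];
  double const *src[16];
} my_type;

/* Assumption: the bound arrives as a by-value parameter instead of arg->n. */
static inline void my_op( my_type *arg, int i, int const n ) {
  int j;

  for( j=0; j<n; ++j )
    arg->dest[j][i] += arg->src[j][i];
}

static inline void my_for( void (*op)(my_type *,int,int), my_type *arg, int N, int const n ) {
  int i;

  assert( n >= 0 && n <= 16 );  /* dest/src hold at most 16 arrays */

  for( i=0; i<N; ++i )
    op( arg, i, n );
}

void my_func( double *dest0, double *dest1, double const *src0, double const *src1, int N ) {
  my_type Arg = {
    .dest = { dest0, dest1 },
    .src = { src0, src1 }
  };

  my_for( &my_op, &Arg, N, 2 );  /* literal bound at the call site */
}
```

Whether a given compiler then unrolls the j loop still depends on its inlining and constant propagation, so check the generated assembly or time it.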

In order to have consistent performance across compiler generations and squeeze the maximum out of your code, do the unrolling by hand.

What people like myself do in these situations (performance is king) is rely on macros.

Macros will 'inline' in debug builds (useful) and can be templated (to a point) using macro parameters. Macro parameters which are compile time constants are guaranteed to remain this way.
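A minimal sketch of what that could look like for the code in the question. The macro name DEFINE_MY_OP and the generated name my_op_2 are hypothetical, picked only for illustration; the struct and my_for() are copied from the question.

```c
#include <assert.h>

typedef struct my_type {
  int const n;
  double *dest[16];
  double const *src[16];
} my_type;

/* Hypothetical macro: generates my_op_<NCONST>() whose j loop bound is a
   literal, so unrolling does not depend on the optimizer tracking arg->n. */
#define DEFINE_MY_OP( NCONST )                                       \
  static inline void my_op_##NCONST( my_type *arg, int i ) {         \
    int j;                                                           \
    assert( arg->n == (NCONST) );  /* compiled out with -DNDEBUG */  \
    for( j=0; j<(NCONST); ++j )                                      \
      arg->dest[j][i] += arg->src[j][i];                             \
  }

DEFINE_MY_OP( 2 )  /* defines my_op_2() */

static inline void my_for( void (*op)(my_type *,int), my_type *arg, int N ) {
  int i;

  for( i=0; i<N; ++i )
    op( arg, i );
}

void my_func( double *dest0, double *dest1, double const *src0, double const *src1, int N ) {
  my_type Arg = {
    .n = 2,
    .dest = { dest0, dest1 },
    .src = { src0, src1 }
  };

  my_for( &my_op_2, &Arg, N );
}
```

For the n=4 case mentioned in the question, DEFINE_MY_OP( 4 ) would generate my_op_4() the same way. The assert only guards against calling a specialization with a struct whose n does not match.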

answered Oct 12 '22 11:10

egur

It's faster because the program does not have to allocate memory for the variable.

If a const value never depends on anything unknown at compile time, it is treated as if it were #define constant 2, but with type checking. The value is simply substituted during compilation.

Could you please choose one of the two tags (I mean C or C++)? It's confusing, because the languages treat const values differently: C treats them like normal variables whose value just can't be changed, while in C++ they may or may not have memory assigned depending on the context (if you need their address, or if they have to be computed while the program is running, then memory is assigned).

Source: "Thinking in C++". No exact quote.
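A small sketch of the language-level difference in C. The names N_FIELDS, n_fields and scratch are made up for the illustration. It only shows what the language guarantees; an optimizer is still free to fold a const int when it can see the initializer.

```c
#include <stddef.h>

enum { N_FIELDS = 2 };          /* an integer constant expression in C */
static int const n_fields = 2;  /* a read-only object, not a constant expression in C99/C11 */

static double scratch[N_FIELDS];  /* fine: enum constant as a file-scope array size */

/* Not valid standard C99/C11 (it is fine in C++); a compiler may still
   accept it as an extension:
   static double scratch_bad[n_fields];
*/

size_t scratch_len( void ) {
  return sizeof scratch / sizeof scratch[0];
}

int runtime_bound( void ) {
  return n_fields;  /* formally a read of an object, even if it gets folded */
}
```

So in C, an enum constant or a #define is the way to get a bound that is a compile-time constant by the rules of the language rather than by the goodwill of the optimizer.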

answered Oct 12 '22 12:10

Adrian Jałoszewski
