Consider the following code: <pre class="prettyprint"><code>typedef void (*Fn)(); volatile long sum = 0; inline void accu() { sum+=4; } static const Fn map[4] = {&accu, &accu, &accu, &accu}; int main(int argc, char** argv) { static const long N = 10000000L; if (argc == 1) { for (long i = 0; i < N; i++) { accu(); accu(); accu(); accu(); } } else { for (long i = 0; i < N; i++) { for (int j = 0; j < 4; j++) (*map[j])(); } } } </code></pre> When I compiled it with: <pre class="prettyprint"><code>g++ -O3 test.cpp </code></pre> I'm expecting the first branch to run faster because the compiler could inline the function call to accu. And the second branch cannot be inlined because accu is called through function pointer stored in an array. But the results surprised me: <pre class="prettyprint"><code>time ./a.out real 0m0.108s user 0m0.104s sys 0m0.000s time ./a.out 1 real 0m0.095s user 0m0.088s sys 0m0.004s </code></pre> I don't understand why, so I did an objdump: <pre class="prettyprint"><code>objdump -DStTrR a.out > a.s </code></pre> and the disassembly doesn't seem to explain the performance result I got: <pre class="prettyprint"><code>8048300 <main>: 8048300: 55 push %ebp 8048301: 89 e5 mov %esp,%ebp 8048303: 53 push %ebx 8048304: bb 80 96 98 00 mov $0x989680,%ebx 8048309: 83 e4 f0 and $0xfffffff0,%esp 804830c: 83 7d 08 01 cmpl $0x1,0x8(%ebp) 8048310: 74 27 je 8048339 <main+0x39> 8048312: 8d b6 00 00 00 00 lea 0x0(%esi),%esi 8048318: e8 23 01 00 00 call 8048440 <_Z4accuv> 804831d: e8 1e 01 00 00 call 8048440 <_Z4accuv> 8048322: e8 19 01 00 00 call 8048440 <_Z4accuv> 8048327: e8 14 01 00 00 call 8048440 <_Z4accuv> 804832c: 83 eb 01 sub $0x1,%ebx 804832f: 90 nop 8048330: 75 e6 jne 8048318 <main+0x18> 8048332: 31 c0 xor %eax,%eax 8048334: 8b 5d fc mov -0x4(%ebp),%ebx 8048337: c9 leave 8048338: c3 ret 8048339: b8 80 96 98 00 mov $0x989680,%eax 804833e: 66 90 xchg %ax,%ax 8048340: 8b 15 18 a0 04 08 mov 0x804a018,%edx 8048346: 83 c2 04 add $0x4,%edx 8048349: 89 15 18 a0 04 08 mov %edx,0x804a018 804834f: 8b 15 18 a0 04 08 mov 0x804a018,%edx 8048355: 83 c2 04 add $0x4,%edx 8048358: 89 15 18 a0 04 08 mov %edx,0x804a018 804835e: 8b 15 18 a0 04 08 mov 0x804a018,%edx 8048364: 83 c2 04 add $0x4,%edx 8048367: 89 15 18 a0 04 08 mov %edx,0x804a018 804836d: 8b 15 18 a0 04 08 mov 0x804a018,%edx 8048373: 83 c2 04 add $0x4,%edx 8048376: 83 e8 01 sub $0x1,%eax 8048379: 89 15 18 a0 04 08 mov %edx,0x804a018 804837f: 75 bf jne 8048340 <main+0x40> 8048381: eb af jmp 8048332 <main+0x32> 8048383: 90 nop ... 8048440 <_Z4accuv>: 8048440: a1 18 a0 04 08 mov 0x804a018,%eax 8048445: 83 c0 04 add $0x4,%eax 8048448: a3 18 a0 04 08 mov %eax,0x804a018 804844d: c3 ret 804844e: 90 nop 804844f: 90 nop </code></pre> It seems the direct call branch is definitely doing less than the function pointer branch. But why does the function pointer branch run faster than the direct call? And note that I only used "time" for measuring the time. I've used clock_gettime to do the measurement and got similar results.

It is not completely true that the second branch cannot be inlined. In fact, all the function pointers stored in the array are seen at compile time. So compiler can substitute indirect function calls by direct calls (and it does so). In theory it can go further and inline them (and in this case we have two identical branches). But this particular compiler is not smart enough to do so. As a result, the first branch is optimized "better". But with one exception. Compiler is not allowed to optimize volatile variable <code>sum</code>. As you can see from disassembled code, this produces store instructions immediately followed by load instructions (depending on these store instructions): <pre class="prettyprint"><code>mov %edx,0x804a018 mov 0x804a018,%edx </code></pre> Intel's Software Optimization Manual (section 3.6.5.2) does not recommend arranging instructions like this: <blockquote> ... if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty. </blockquote> The second branch avoids this problem because of additional call/return instructions between store and load. So it performs better. Similar improvements may be done for the first branch if we add some (not very expensive) calculations in-between: <pre class="prettyprint"><code>long x1 = 0; for (long i = 0; i < N; i++) { x1 ^= i<<8; accu(); x1 ^= i<<1; accu(); x1 ^= i<<2; accu(); x1 ^= i<<4; accu(); } sum += x1; </code></pre>

why is inlined function slower than function pointer?

Tags:

c++

performance

g++

inline

disassembly

Consider the following code:

typedef void (*Fn)();

volatile long sum = 0;

inline void accu() {
    sum+=4;
}

static const Fn map[4] = {&accu, &accu, &accu, &accu};

int main(int argc, char** argv) {
    static const long N = 10000000L;
    if (argc == 1)
    {
            for (long i = 0; i < N; i++)
            {
                    accu();
                    accu();
                    accu();
                    accu();
            }
    }
    else
    {
            for (long i = 0; i < N; i++)
            {
                    for (int j = 0; j < 4; j++)
                            (*map[j])();
            }
    }
}

When I compiled it with:

g++ -O3 test.cpp

I'm expecting the first branch to run faster because the compiler could inline the function call to accu. And the second branch cannot be inlined because accu is called through function pointer stored in an array.

But the results surprised me:

time ./a.out 

real    0m0.108s
user    0m0.104s
sys 0m0.000s

time ./a.out 1

real    0m0.095s
user    0m0.088s
sys 0m0.004s

I don't understand why, so I did an objdump:

objdump -DStTrR a.out > a.s

and the disassembly doesn't seem to explain the performance result I got:

8048300 <main>:
8048300:       55                      push   %ebp
8048301:       89 e5                   mov    %esp,%ebp
8048303:       53                      push   %ebx
8048304:       bb 80 96 98 00          mov    $0x989680,%ebx
8048309:       83 e4 f0                and    $0xfffffff0,%esp
804830c:       83 7d 08 01             cmpl   $0x1,0x8(%ebp)
8048310:       74 27                   je     8048339 <main+0x39>
8048312:       8d b6 00 00 00 00       lea    0x0(%esi),%esi
8048318:       e8 23 01 00 00          call   8048440 <_Z4accuv>
804831d:       e8 1e 01 00 00          call   8048440 <_Z4accuv>
8048322:       e8 19 01 00 00          call   8048440 <_Z4accuv>
8048327:       e8 14 01 00 00          call   8048440 <_Z4accuv>
804832c:       83 eb 01                sub    $0x1,%ebx
804832f:       90                      nop
8048330:       75 e6                   jne    8048318 <main+0x18>
8048332:       31 c0                   xor    %eax,%eax
8048334:       8b 5d fc                mov    -0x4(%ebp),%ebx
8048337:       c9                      leave
8048338:       c3                      ret
8048339:       b8 80 96 98 00          mov    $0x989680,%eax
804833e:       66 90                   xchg   %ax,%ax
8048340:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048346:       83 c2 04                add    $0x4,%edx
8048349:       89 15 18 a0 04 08       mov    %edx,0x804a018
804834f:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048355:       83 c2 04                add    $0x4,%edx
8048358:       89 15 18 a0 04 08       mov    %edx,0x804a018
804835e:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048364:       83 c2 04                add    $0x4,%edx
8048367:       89 15 18 a0 04 08       mov    %edx,0x804a018
804836d:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048373:       83 c2 04                add    $0x4,%edx
8048376:       83 e8 01                sub    $0x1,%eax
8048379:       89 15 18 a0 04 08       mov    %edx,0x804a018
804837f:       75 bf                   jne    8048340 <main+0x40>
8048381:       eb af                   jmp    8048332 <main+0x32>
8048383:       90                      nop
...
8048440 <_Z4accuv>:
8048440:       a1 18 a0 04 08          mov    0x804a018,%eax
8048445:       83 c0 04                add    $0x4,%eax
8048448:       a3 18 a0 04 08          mov    %eax,0x804a018
804844d:       c3                      ret
804844e:       90                      nop
804844f:       90                      nop

It seems the direct call branch is definitely doing less than the function pointer branch. But why does the function pointer branch run faster than the direct call?

And note that I only used "time" for measuring the time. I've used clock_gettime to do the measurement and got similar results.

231

asked May 11 '13 02:05

Ming

1 Answers

It is not completely true that the second branch cannot be inlined. In fact, all the function pointers stored in the array are seen at compile time. So compiler can substitute indirect function calls by direct calls (and it does so). In theory it can go further and inline them (and in this case we have two identical branches). But this particular compiler is not smart enough to do so.

As a result, the first branch is optimized "better". But with one exception. Compiler is not allowed to optimize volatile variable sum. As you can see from disassembled code, this produces store instructions immediately followed by load instructions (depending on these store instructions):

mov    %edx,0x804a018
mov    0x804a018,%edx

Intel's Software Optimization Manual (section 3.6.5.2) does not recommend arranging instructions like this:

... if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.

The second branch avoids this problem because of additional call/return instructions between store and load. So it performs better.

Similar improvements may be done for the first branch if we add some (not very expensive) calculations in-between:

long x1 = 0;
for (long i = 0; i < N; i++)
{
  x1 ^= i<<8;
  accu();
  x1 ^= i<<1;
  accu();
  x1 ^= i<<2;
  accu();
  x1 ^= i<<4;
  accu();
}
sum += x1;

187

answered Sep 17 '22 15:09

Evgeny Kluev

Related questions
                            
                                How to invoke aligned new/delete properly?
                            
                                Can a variable template be passed as a template template argument?
                            
                                Why did expression types change in C++ between versions?
                            
                                Declare a C++ function that has C calling convention but internal linkage [duplicate]
                            
                                Excluding certain functions from gcov/lcov coverage results
                            
                                Is anyone using Maven/NAR for any large scale C/C++ projects? [closed]
                            
                                Deterministically Generating cryptographically secure keys and IVEC's
                            
                                C++ - Iterating over a tuple & resolution of type vs constant parameters
                            
                                Eclipse - C++ hello world project's error
                            
                                How can I maintain a weak reference on a COM object in C++?
                            
                                David Lowe's SIFT -- Question about scale space and image coordinates (weird offset problem)
                            
                                Program structure for bi-directional TCP communication using Boost::Asio
                            
                                How to recursively dereference pointer (C++03)?
                            
                                C++ code for CPU load and CPU temperature
                            
                                Perfect pass-through
                            
                                Static code analysis for detecting passing a wchar_t* to BSTR
                            
                                Library ABI compatibility between versions of Visual Studio
                            
                                Android: build native GUI app with NDK? [closed]
                            
                                Best practices for storing settings
                            
                                How to access network file?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With