Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

why is inlined function slower than function pointer?

Consider the following code:

typedef void (*Fn)();

volatile long sum = 0;

inline void accu() {
    sum+=4;
}

static const Fn map[4] = {&accu, &accu, &accu, &accu};

int main(int argc, char** argv) {
    static const long N = 10000000L;
    if (argc == 1)
    {
            for (long i = 0; i < N; i++)
            {
                    accu();
                    accu();
                    accu();
                    accu();
            }
    }
    else
    {
            for (long i = 0; i < N; i++)
            {
                    for (int j = 0; j < 4; j++)
                            (*map[j])();
            }
    }
}

When I compiled it with:

g++ -O3 test.cpp

I'm expecting the first branch to run faster because the compiler could inline the function call to accu. And the second branch cannot be inlined because accu is called through function pointer stored in an array.

But the results surprised me:

time ./a.out 

real    0m0.108s
user    0m0.104s
sys 0m0.000s

time ./a.out 1

real    0m0.095s
user    0m0.088s
sys 0m0.004s

I don't understand why, so I did an objdump:

objdump -DStTrR a.out > a.s

and the disassembly doesn't seem to explain the performance result I got:

8048300 <main>:
8048300:       55                      push   %ebp
8048301:       89 e5                   mov    %esp,%ebp
8048303:       53                      push   %ebx
8048304:       bb 80 96 98 00          mov    $0x989680,%ebx
8048309:       83 e4 f0                and    $0xfffffff0,%esp
804830c:       83 7d 08 01             cmpl   $0x1,0x8(%ebp)
8048310:       74 27                   je     8048339 <main+0x39>
8048312:       8d b6 00 00 00 00       lea    0x0(%esi),%esi
8048318:       e8 23 01 00 00          call   8048440 <_Z4accuv>
804831d:       e8 1e 01 00 00          call   8048440 <_Z4accuv>
8048322:       e8 19 01 00 00          call   8048440 <_Z4accuv>
8048327:       e8 14 01 00 00          call   8048440 <_Z4accuv>
804832c:       83 eb 01                sub    $0x1,%ebx
804832f:       90                      nop
8048330:       75 e6                   jne    8048318 <main+0x18>
8048332:       31 c0                   xor    %eax,%eax
8048334:       8b 5d fc                mov    -0x4(%ebp),%ebx
8048337:       c9                      leave
8048338:       c3                      ret
8048339:       b8 80 96 98 00          mov    $0x989680,%eax
804833e:       66 90                   xchg   %ax,%ax
8048340:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048346:       83 c2 04                add    $0x4,%edx
8048349:       89 15 18 a0 04 08       mov    %edx,0x804a018
804834f:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048355:       83 c2 04                add    $0x4,%edx
8048358:       89 15 18 a0 04 08       mov    %edx,0x804a018
804835e:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048364:       83 c2 04                add    $0x4,%edx
8048367:       89 15 18 a0 04 08       mov    %edx,0x804a018
804836d:       8b 15 18 a0 04 08       mov    0x804a018,%edx
8048373:       83 c2 04                add    $0x4,%edx
8048376:       83 e8 01                sub    $0x1,%eax
8048379:       89 15 18 a0 04 08       mov    %edx,0x804a018
804837f:       75 bf                   jne    8048340 <main+0x40>
8048381:       eb af                   jmp    8048332 <main+0x32>
8048383:       90                      nop
...
8048440 <_Z4accuv>:
8048440:       a1 18 a0 04 08          mov    0x804a018,%eax
8048445:       83 c0 04                add    $0x4,%eax
8048448:       a3 18 a0 04 08          mov    %eax,0x804a018
804844d:       c3                      ret
804844e:       90                      nop
804844f:       90                      nop

It seems the direct call branch is definitely doing less than the function pointer branch. But why does the function pointer branch run faster than the direct call?

And note that I only used "time" for measuring the time. I've used clock_gettime to do the measurement and got similar results.

like image 231
Ming Avatar asked May 11 '13 02:05

Ming


People also ask

Are inline functions slower?

Inline functions are faster because you don't need to push and pop things on/off the stack like parameters and the return address; however, it does make your binary slightly larger.

Is inline function is faster than normal function?

yes. Inline functions execute faster then a normal one. DecodesWorm an inline function means that the compiler replaces all the calls of the function with the code inside the function while it compiles.

What are the disadvantages of an inline function?

5) Inline functions may not be useful for many embedded systems. Because in embedded systems code size is more important than speed. 6) Inline functions might cause thrashing because inlining might increase size of the binary executable file. Thrashing in memory causes performance of computer to degrade.

Which is faster inline or macro?

By declaring a function inline, you can direct GCC to integrate that function's code into the code for its callers.


1 Answers

It is not completely true that the second branch cannot be inlined. In fact, all the function pointers stored in the array are seen at compile time. So compiler can substitute indirect function calls by direct calls (and it does so). In theory it can go further and inline them (and in this case we have two identical branches). But this particular compiler is not smart enough to do so.

As a result, the first branch is optimized "better". But with one exception. Compiler is not allowed to optimize volatile variable sum. As you can see from disassembled code, this produces store instructions immediately followed by load instructions (depending on these store instructions):

mov    %edx,0x804a018
mov    0x804a018,%edx

Intel's Software Optimization Manual (section 3.6.5.2) does not recommend arranging instructions like this:

... if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.

The second branch avoids this problem because of additional call/return instructions between store and load. So it performs better.

Similar improvements may be done for the first branch if we add some (not very expensive) calculations in-between:

long x1 = 0;
for (long i = 0; i < N; i++)
{
  x1 ^= i<<8;
  accu();
  x1 ^= i<<1;
  accu();
  x1 ^= i<<2;
  accu();
  x1 ^= i<<4;
  accu();
}
sum += x1;
like image 187
Evgeny Kluev Avatar answered Sep 17 '22 15:09

Evgeny Kluev