Function pointer runs faster than inline function. Why?

Question

I ran a benchmark of mine on my computer (Intel i3-3220 @ 3.3GHz, Fedora 18), and got very unexpected results. A function pointer was actually a bit faster than an inline function.

Code:

#include <iostream>
#include <chrono>
inline short toBigEndian(short i)
{
    return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
int main()
{  
    std::chrono::duration<double> t(0); // start at zero; a default-constructed duration is left uninitialized
    int total=0;
    for(int i=0;i<10000000;i++)
    {
        auto begin=std::chrono::high_resolution_clock::now();
        short a=toBigEndian((short)i);//toBigEndianPtr((short)i);
        total+=a;
        auto end=std::chrono::high_resolution_clock::now();
        t+=std::chrono::duration_cast<std::chrono::duration<double>>(end-begin);
    }
    std::cout<<t.count()<<", "<<total<<std::endl;
    return 0;
}

compiled with

g++ test.cpp -std=c++0x -O0

The 'toBigEndian' loop always finishes in around 0.26-0.27 seconds, while 'toBigEndianPtr' takes 0.21-0.22 seconds.

What makes this even more odd is that when I remove 'total', the function pointer becomes the slower one at 0.35-0.37 seconds, while the inline function is at about 0.27-0.28 seconds.

My question is:

Why is the function pointer faster than the inline function when 'total' exists?

Arne Mertz · Accepted Answer

Short answer: it isn't.

You compile with -O0, which does not optimize (much). Without optimization, there is no point in talking about "fast", because unoptimized code is not as fast as it can be.
You take the address of toBigEndian, which forces the compiler to emit an out-of-line copy, and calls made through the pointer are generally not inlined. The inline keyword is only a hint to the compiler anyway, which it may or may not follow. You did your best to make sure it does not follow that hint.

So, to give your measurements any meaning,

optimize your code
use two functions that do the same thing: one that gets inlined, and another whose address is taken
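A minimal sketch of that setup, under a few assumptions of mine: the names swapInlined, swapTarget, swapPtr, Result and compareCalls are hypothetical, an unsigned 16-bit type replaces short to avoid shifting negative values, and the volatile qualifier on the pointer is my addition to keep an optimizing compiler from resolving the indirect call at compile time. Timing is taken once around each loop rather than per iteration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Inlining candidate: its address is never taken anywhere.
inline std::uint16_t swapInlined(std::uint16_t v)
{
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Identical body, but only ever reached through the pointer below.
std::uint16_t swapTarget(std::uint16_t v)
{
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Assumption: volatile makes the compiler reload the pointer on every call,
// so it cannot prove the target and inline the call anyway.
std::uint16_t (*volatile swapPtr)(std::uint16_t) = swapTarget;

struct Result
{
    double inlinedSeconds;
    double pointerSeconds;
    std::uint32_t inlinedTotal;
    std::uint32_t pointerTotal;
};

// Runs both variants for the same number of iterations and times each loop
// as a whole.
Result compareCalls(std::uint32_t iterations)
{
    using Clock = std::chrono::steady_clock;
    Result r{};

    auto begin = Clock::now();
    for (std::uint32_t i = 0; i < iterations; ++i)
        r.inlinedTotal += swapInlined(static_cast<std::uint16_t>(i));
    auto mid = Clock::now();
    for (std::uint32_t i = 0; i < iterations; ++i)
        r.pointerTotal += swapPtr(static_cast<std::uint16_t>(i));
    auto end = Clock::now();

    r.inlinedSeconds = std::chrono::duration<double>(mid - begin).count();
    r.pointerSeconds = std::chrono::duration<double>(end - mid).count();

    // Sanity check: both variants must compute the same sum.
    assert(r.inlinedTotal == r.pointerTotal);
    return r;
}
```

Calling compareCalls(100000000) from main and printing both timings and totals, in a build with optimization enabled (for example -O2), gives a comparison where only the call mechanism differs. Printing the totals keeps the loops from being discarded as dead code.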

Brian · Answer

A common mistake in measuring performance (besides forgetting to optimize) is to use the wrong tool to measure. Using std::chrono would be fine if you were measuring the performance of the entire loop of 10000000 or 500000000 iterations. Instead, you are asking it to measure a single call (or inlined copy) of toBigEndian, a function that is all of 6 instructions. So I switched to rdtsc (read time stamp counter, i.e. clock cycles).
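To get a feel for why per-call timing is the wrong tool here, one can estimate what a begin/end pair of clock reads costs with nothing in between. This is only a rough sketch: the helper name clockPairOverheadNs is hypothetical, and I use std::chrono::steady_clock rather than the high_resolution_clock from the question so that the difference cannot go negative.

```cpp
#include <cassert>
#include <chrono>

// Average cost, in nanoseconds, of one back-to-back pair of now() calls
// with no work timed in between.
double clockPairOverheadNs(int samples)
{
    assert(samples > 0);
    using Clock = std::chrono::steady_clock;

    std::chrono::duration<double, std::nano> sum{0};
    for (int i = 0; i < samples; ++i)
    {
        auto begin = Clock::now();
        auto end = Clock::now();
        sum += end - begin; // converts the clock's native duration to double nanoseconds
    }
    return sum.count() / samples;
}
```

If the value returned for, say, a million samples is of the same order as the per-iteration time in the original benchmark, then that benchmark was mostly measuring the clock reads rather than the function.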

Allowing the compiler to really optimize everything in the loop, not cluttering it with recording the time on every tiny iteration, we have a different code sequence. Now, after compiling with g++ -O3 fp_test.cpp -o fp_test -std=c++11, I observe the desired effect. The inlined version averages around 2.15 cycles per iteration, while the function pointer takes around 7.0 cycles per iteration.

Even without using rdtsc, the difference is still quite observable. The wall clock time was 360ms for the inlined code and 1.17s for the function pointer. So one could use std::chrono in place of rdtsc in this code.

Modified code follows:

#include <cstdint>   // uint32_t, uint64_t
#include <iostream>
static inline uint64_t rdtsc(void)
{
  uint32_t hi, lo;
  asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
  return ( (uint64_t)lo)|( ((uint64_t)hi)<<32 );
}
inline short toBigEndian(short i)
{
    return (i<<8)|(i>>8);
}
short (*toBigEndianPtr)(short i)=toBigEndian;
#define LOOP_COUNT 500000000
int main()
{
    uint64_t t = 0, begin=0, end=0;
    int total=0;
    begin=rdtsc();
    for(int i=0;i<LOOP_COUNT;i++)
    {
        short a=0;
        a=toBigEndianPtr((short)i);
        //a=toBigEndian((short)i);
        total+=a;   
    }
    end=rdtsc();
    t+=(end-begin);
    std::cout<<((double)t/LOOP_COUNT)<<", "<<total<<std::endl;
    return 0;
}

Hassedev · Answer

Oh s**t (do I need to censor swearing here?), I figured it out. It was somehow related to the timing being inside the loop. When I moved it outside, as follows,

#include <iostream>
#include <chrono>
inline short toBigEndian(short i)
{
    return (i<<8)|(i>>8);
}

short (*toBigEndianPtr)(short i)=toBigEndian;
int main()
{  
    int total=0;
    auto begin=std::chrono::high_resolution_clock::now();
    for(int i=0;i<100000000;i++)
    {
        short a=toBigEndianPtr((short)i);
        total+=a;
    }
    auto end=std::chrono::high_resolution_clock::now();
    std::cout<<std::chrono::duration_cast<std::chrono::duration<double>>(end-begin).count()<<", "<<total<<std::endl;
    return 0;
}

the results are just as they should be: 0.08 seconds for inline, 0.20 seconds for the pointer. Sorry for bothering you guys.
