Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++ : difference of execution time between two call of a virtual function

Consider this code under gcc 4.5.1 (Ubuntu 10.04, intel core2duo 3.0 Ghz) It's just 2 tests, in the first one I make a direct call on virtual fucnion and in the second I call it via a Wrapper class :

test.cpp

#define ITER 100000000

class Print{

public:

typedef Print* Ptr;

virtual void print(int p1, float p2, float p3, float p4){/*DOES NOTHING */}

};

class PrintWrapper
{

    public:

      typedef PrintWrapper* Ptr;

      PrintWrapper(Print::Ptr print, int p1, float p2, float p3, float p4) :
      m_print(print), _p1(p1),_p2(p2),_p3(p3),_p4(p4){}

      ~PrintWrapper(){}

      void execute()
      { 
        m_print->print(_p1,_p2,_p3,_p4); 
      }

    private:

      Print::Ptr m_print;
      int _p1;
      float _p2,_p3,_p4;

};

 Print::Ptr p = new Print();
 PrintWrapper::Ptr pw = new PrintWrapper(p, 1, 2.f,3.0f,4.0f);

void test1()
{

 //-------------test 1-------------------------

 for (auto var = 0; var < ITER; ++var) 
 {
   p->print(1, 2.f,3.0f,4.0f);
 }

 }

 void test2()
 {

  //-------------test 2-------------------------

 for (auto var = 0; var < ITER; ++var) 
 {
   pw->execute();
 }

}

int main() 
{ 
  test1(); 
  test2();
}

I profiled it with gprof and objdump :

g++ -c -std=c++0x -pg -g -O2 test.cpp
objdump -d -M intel -S test.o > objdump.txt
g++ -pg test.o -o test
./test
gprof test > gprof.output

in gprof.output I observed that test2() takes more time than test1() but I can't explain it

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 49.40      0.41     0.41        1   410.00   540.00  test2()
 31.33      0.67     0.26 200000000     0.00     0.00  Print::print(int, float, float, float)
 19.28      0.83     0.16        1   160.00   290.00  test1()
  0.00      0.83     0.00        1     0.00     0.00  global constructors keyed to p

The assembly code in objdump.txt doesn't help me either :

 //-------------test 1-------------------------
 for (auto var = 0; var < ITER; ++var) 
  15:   83 c3 01                add    ebx,0x1
 {
   p->print(1, 2.f,3.0f,4.0f);
  18:   8b 10                   mov    edx,DWORD PTR [eax]
  1a:   c7 44 24 10 00 00 80    mov    DWORD PTR [esp+0x10],0x40800000
  21:   40 
  22:   c7 44 24 0c 00 00 40    mov    DWORD PTR [esp+0xc],0x40400000
  29:   40 
  2a:   c7 44 24 08 00 00 00    mov    DWORD PTR [esp+0x8],0x40000000
  31:   40 
  32:   c7 44 24 04 01 00 00    mov    DWORD PTR [esp+0x4],0x1
  39:   00 
  3a:   89 04 24                mov    DWORD PTR [esp],eax
  3d:   ff 12                   call   DWORD PTR [edx]

  //-------------test 2-------------------------
 for (auto var = 0; var < ITER; ++var) 
  65:   83 c3 01                add    ebx,0x1

      ~PrintWrapper(){}

      void execute()
      { 
        m_print->print(_p1,_p2,_p3,_p4); 
  68:   8b 10                   mov    edx,DWORD PTR [eax]
  6a:   8b 70 10                mov    esi,DWORD PTR [eax+0x10]
  6d:   8b 0a                   mov    ecx,DWORD PTR [edx]
  6f:   89 74 24 10             mov    DWORD PTR [esp+0x10],esi
  73:   8b 70 0c                mov    esi,DWORD PTR [eax+0xc]
  76:   89 74 24 0c             mov    DWORD PTR [esp+0xc],esi
  7a:   8b 70 08                mov    esi,DWORD PTR [eax+0x8]
  7d:   89 74 24 08             mov    DWORD PTR [esp+0x8],esi
  81:   8b 40 04                mov    eax,DWORD PTR [eax+0x4]
  84:   89 14 24                mov    DWORD PTR [esp],edx
  87:   89 44 24 04             mov    DWORD PTR [esp+0x4],eax
  8b:   ff 11                   call   DWORD PTR [ecx]

How can we explain such a difference ?

like image 811
codablank1 Avatar asked Dec 06 '11 19:12

codablank1


Video Answer


1 Answers

In test2(), the program must first load pw from the heap, then call pw->execute() (which incurs call overhead), then load pw->m_print as well as the _p1 through _p4 arguments, then load the vtable pointer for pw, then load the vtable slot for pw->Print, then call pw->Print. Because the compiler can't see through the virtual call, it then must assume all of these values have changed for the next iteration, and reload them all.

In test(), the arguments are inline in the code segment, and we only have to load p, the vtable pointer, and the vtable slot. We've saved five loads this way. This could easily account for the time difference.

In short - the loads of pw->m_print and pw->_p1 through pw->_p4 are the culprit here.

like image 197
bdonlan Avatar answered Oct 14 '22 13:10

bdonlan