I have the following two files :-
single.cpp :-
#include <iostream>
#include <stdlib.h>

using namespace std;

unsigned long a = 0;

class A {
public:
    virtual int f() __attribute__ ((noinline)) { return a; }
};

class B : public A {
public:
    virtual int f() __attribute__ ((noinline)) { return a; }
    void g() __attribute__ ((noinline)) { return; }
};

int main() {
    cin >> a;
    A* obj;
    if (a > 3)
        obj = new B();
    else
        obj = new A();

    unsigned long result = 0;
    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }
    cout << result << "\n";
}
And
multiple.cpp :-
#include <iostream>
#include <stdlib.h>

using namespace std;

unsigned long a = 0;

class A {
public:
    virtual int f() __attribute__ ((noinline)) { return a; }
};

class dummy {
public:
    virtual void g() __attribute__ ((noinline)) { return; }
};

class B : public A, public dummy {
public:
    virtual int f() __attribute__ ((noinline)) { return a; }
    virtual void g() __attribute__ ((noinline)) { return; }
};

int main() {
    cin >> a;
    A* obj;
    if (a > 3)
        obj = new B();
    else
        obj = new A();

    unsigned long result = 0;
    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }
    cout << result << "\n";
}
I am using gcc version 3.4.6 with the -O2 flag.
And these are the timing results I get :-
multiple :-
real    0m8.635s
user    0m8.608s
sys     0m0.003s
single :-
real    0m10.072s
user    0m10.045s
sys     0m0.001s
On the other hand, if in multiple.cpp I invert the order of class derivation thus :-
class B : public dummy, public A {
Then I get the following timings (slightly slower than the single-inheritance case, which is what one might expect given the 'thunk' adjustment to the this pointer that the call has to perform) :-
real    0m11.516s
user    0m11.479s
sys     0m0.002s
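(Side note: the this-pointer adjustment mentioned above can be made visible with a small standalone sketch like the one below; it is only an illustration and not part of the timed code.)

#include <iostream>

class A     { public: virtual int  f() { return 0; } };
class dummy { public: virtual void g() { } };
class B : public dummy, public A {
public:
    virtual int  f() { return 0; }
    virtual void g() { }
};

int main() {
    B b;
    A* pa = &b;
    std::cout << "B object at    " << (void*)&b << "\n"
              << "A subobject at " << (void*)pa << "\n";
    // With GCC on x86-64 the A subobject sits 8 bytes into B, so a virtual
    // call through pa lands on a thunk that subtracts that offset before
    // jumping to B::f.
}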
Any idea why this may be happening? There doesn't seem to be any difference in the assembly generated for all three cases as far as the loop is concerned. Is there some other place that I need to look at?
Also, I have bound the process to a specific CPU core and I am running it at real-time priority with SCHED_RR.
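A sketch of one way to do that pinning and SCHED_RR setup from inside the program (the post doesn't say how it was done; taskset/chrt from the shell work just as well). Linux-specific; SCHED_RR needs sufficient privileges:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for CPU_ZERO/CPU_SET (g++ usually defines this)
#endif
#include <sched.h>
#include <cstdio>

static void pin_and_make_realtime(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)    // 0 = calling process
        std::perror("sched_setaffinity");

    sched_param param = {};
    param.sched_priority = 1;                            // any valid RR priority
    if (sched_setscheduler(0, SCHED_RR, &param) != 0)
        std::perror("sched_setscheduler");
}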
EDIT:- This was noticed by Mysticial and reproduced by me. Doing a
cout << "vtable: " << *(void**)obj << endl;
just before the loop in single.cpp leads to single also being as fast as multiple, clocking in at 8.4 s, just like the public A, public dummy case.
Note, this answer is highly speculative.
Unlike some of my other answers to questions of the type "Why is X slower than Y", I've been unable to provide solid evidence to back up this answer.
After tinkering with this for about an hour now, I think it's due to the address alignment of three things:
obj
A
f()
(owagh's answer also hints at the possibility of instruction alignment.)
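One way to inspect those addresses is a probe like the one below. This is my own sketch, specific to GCC's Itanium C++ ABI (vptr at offset 0, f() in the first vtable slot), and was not part of the original measurements:

#include <cstdint>
#include <cstdio>

unsigned long a = 0;

class A { public: virtual int f() { return a; } };

// Print an address together with its offset inside a 64-byte cache line.
static void report(const char* name, const void* p) {
    std::printf("%-8s %p  (mod 64 = %2zu)\n",
                name, (void*)p, (size_t)((uintptr_t)p % 64));
}

int main() {
    A* obj = new A();
    void** vtable = *(void***)obj;   // the vptr stored at offset 0 (Itanium ABI)
    report("obj",    obj);
    report("a",      &a);
    report("vtable", (void*)vtable);
    report("f()",    vtable[0]);     // slot 0 of the vtable is A::f here
}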
The reason why multiple inheritance is faster than single inheritance is not because it is "magically" fast, but because the single-inheritance case is running into either a compiler or a hardware "hiccup".
If you dump out the assembly for the single and multiple inheritance cases, they are identical (register names and everything) within the nested loop.
Here's the code I compiled:
#include <iostream>
#include <stdlib.h>
#include <time.h>

using namespace std;

unsigned long a = 0;

#ifdef SINGLE
class A {
public:
    virtual int f() { return a; }
};
class B : public A {
public:
    virtual int f() { return a; }
    void g() { return; }
};
#endif

#ifdef MULTIPLE
class A {
public:
    virtual int f() { return a; }
};
class dummy {
public:
    virtual void g() { return; }
};
class B : public A, public dummy {
public:
    virtual int f() { return a; }
    virtual void g() { return; }
};
#endif

int main() {
    cin >> a;

    A* obj;
    if (a > 3)
        obj = new B();
    else
        obj = new A();

    unsigned long result = 0;

    clock_t time0 = clock();
    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }
    clock_t time1 = clock();

    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
    cout << result << "\n";

    system("pause");   // This is useless in Linux, but I left it here for a reason.
}
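(The post doesn't show the exact build commands; presumably each variant was compiled separately, e.g. something along the lines of g++ -O2 -DSINGLE test.cpp -o single and g++ -O2 -DMULTIPLE test.cpp -o multiple - the file name and the -O2 flag here are my assumption, carried over from the question.)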
The assembly for the nested loop is identical in both single and multiple inheritance cases:
.L5:
        call    clock
        movl    $65535, %r13d
        movq    %rax, %r14
        xorl    %r12d, %r12d
        .p2align 4,,10
        .p2align 3
.L6:
        movl    $65535, %ebx
        .p2align 4,,10
        .p2align 3
.L7:
        movq    0(%rbp), %rax
        movq    %rbp, %rdi
        call    *(%rax)
        cltq
        addq    %rax, %r12
        subl    $1, %ebx
        jne     .L7
        subl    $1, %r13d
        jne     .L6
        call    clock
Yet I still see a performance difference between the two (Xeon X5482, Ubuntu, GCC 4.6.1 x64).
This leads me to the conclusion that the difference must be data dependent.
If you look at that assembly, you'll notice that the only instructions that could have variable latency are the loads:
        movq    0(%rbp), %rax    ; %rbp = obj, load the vtable pointer
        movq    %rbp, %rdi       ; pass 'this'
        call    *(%rax)          ; dereference the function pointer from the vtable and call f()
followed by a few more memory accesses inside the call to f().
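For completeness, the body of f() under -O2 is essentially just a read of the global plus the return, roughly like the following (my reconstruction, not taken from the actual disassembly in the post), plus the implicit push/pop of the return address by the call/ret pair:

        movl    a(%rip), %eax    ; read the global 'a' (f() returns int, so only the low 32 bits)
        ret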
It just happens that in the single-inheritance example, the offsets of the aforementioned values are not favorable to the processor. I have no idea why. But if I had to suspect something, it would be cache-bank conflicts, in a similar manner to region 2 in the diagram of this question.
By rearranging the code and adding dummy functions, I can change these offsets - which in many cases eliminates this slowdown and makes single inheritance as fast as the multiple-inheritance case.
For example, removing the system("pause") inverts the times:
#ifdef SINGLE
class A {
public:
    virtual int f() { return a; }
};
class B : public A {
public:
    virtual int f() { return a; }
    void g() { return; }
};
#endif

#ifdef MULTIPLE
class A {
public:
    virtual int f() { return a; }
};
class dummy {
public:
    virtual void g() { return; }
};
class B : public A, public dummy {
public:
    virtual int f() { return a; }
    virtual void g() { return; }
};
#endif

int main() {
    cin >> a;

    A* obj;
    if (a > 3)
        obj = new B();
    else
        obj = new A();

    unsigned long result = 0;

    clock_t time0 = clock();
    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }
    clock_t time1 = clock();

    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
    cout << result << "\n";

    // system("pause");
}
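Another perturbation of the same kind would be to pad A with an extra do-nothing virtual function, which shifts the vtable and code addresses. The pad() name below is made up for illustration and was not in the measured code:

#ifdef SINGLE
class A {
public:
    virtual int f() { return a; }
    virtual void pad() { }   // hypothetical extra slot: shifts vtable and code layout
};
class B : public A {
public:
    virtual int f() { return a; }
    void g() { return; }
};
#endif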