In an AI application I am writing in C++, are there any optimization techniques I should know about? I'm not going to optimize the application just yet, but one reason for selecting C++ over Java for the project was to have more leverage to optimize and to be able to use non-object-oriented techniques (templates, procedures, overloading).
In particular, what optimization techniques relate to virtual functions? Virtual functions are implemented through virtual tables (vtables) in memory. Is there some way to prefetch these vtables into the L2 cache, given that the cost of fetching from main memory into the L2 cache keeps growing?
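Prefetching the vtable directly is awkward, because you would first have to load the vptr, which is exactly the load you are trying to hide. A more practical pattern is to prefetch the objects themselves a few iterations ahead, since the vptr is the first word of each object. A minimal sketch, assuming GCC/Clang's __builtin_prefetch; the Shape class, area() function, and prefetch distance of 4 are invented for illustration:

#include <cstddef>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

double total_area(const std::vector<Shape*>& shapes) {
    double sum = 0.0;
    const std::size_t n = shapes.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Hint the hardware to start loading an object a few iterations ahead;
        // its first word is the vptr, so this also warms the dispatch path.
        if (i + 4 < n)
            __builtin_prefetch(shapes[i + 4], /*rw=*/0, /*locality=*/1);
        sum += shapes[i]->area();
    }
    return sum;
}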
Apart from this, are there good references for data-locality techniques in C++? These techniques would reduce the time spent waiting for data to be fetched into the L2 cache before computation can proceed.
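One classic data-locality technique such references cover is the array-of-structs to struct-of-arrays transformation, sketched below; the Particle names and field sizes are invented:

#include <vector>

// Array of structs: a loop over x drags the unused payload through the cache.
struct ParticleAoS {
    double x, y;
    char   payload[48];   // rarely used data
};

// Struct of arrays: the hot fields are packed densely.
struct ParticlesSoA {
    std::vector<double> x, y;
    std::vector<char>   payload;   // touched only when actually needed
};

double sum_x(const ParticlesSoA& p) {
    double s = 0.0;
    for (double v : p.x) s += v;   // streams through contiguous doubles
    return s;
}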
Update: Also see the following related forums: Performance Penalty for Interface, Several Levels of Base Classes
Virtual functions are by definition slower than their non-virtual counterparts; we wanted to measure the performance gains from inlining, and using virtual functions would make the difference even more pronounced. In this particular example, the Clang 10 compiler inlined the functions and unrolled the test loop twice.
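The original benchmark is not reproduced here, but a loop of the following shape (names invented) illustrates the point: the non-virtual call can be inlined and the loop collapsed, while the call through a base reference generally cannot unless the compiler can prove the dynamic type.

struct Counter {
    virtual ~Counter() = default;
    virtual long step(long x) const { return x + 1; }
};

struct FastCounter {
    long step(long x) const { return x + 1; }    // trivially inlinable
};

long run_virtual(const Counter& c, long n) {
    long v = 0;
    for (long i = 0; i < n; ++i) v = c.step(v);  // indirect call each iteration
    return v;
}

long run_inlined(const FastCounter& c, long n) {
    long v = 0;
    for (long i = 0; i < n; ++i) v = c.step(v);  // typically folded to v = n
    return v;
}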
Virtual functions are slow when you take a cache miss looking them up. As we'll see through benchmarks, they can be very slow. They can also be very fast when used carefully, to the point where it's impossible to measure the overhead.
Virtual functions are very efficient. Assuming 32-bit pointers, the memory layout is approximately:
classptr -> [vtable:4][classdata:x]
vtable -> [first:4][second:4][third:4][fourth:4][...]
first -> [code:x]
second -> [code:x]
...
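As a rough illustration of that layout, the hidden vtable pointer shows up directly in an object's size. The struct names below are invented and the exact numbers are ABI-specific:

#include <cstdio>

struct Plain   { int data; };                    // no vtable pointer
struct Dynamic { virtual void f(); int data; };  // compiler adds a hidden vptr

void Dynamic::f() {}

int main() {
    // On a typical 32-bit ABI this prints 4 and 8; on a 64-bit ABI, 4 and 16
    // (pointer-sized vptr plus padding).
    std::printf("%zu %zu\n", sizeof(Plain), sizeof(Dynamic));
}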
The classptr points to memory that is typically on the heap, occasionally on the stack, and begins with a four-byte pointer to the vtable for that class. The important thing to remember is that the vtable itself is never allocated per object: it is a static resource, and all objects of the same class type point to exactly the same memory location for their vtable array. Calling through different instances won't pull different memory locations into the L2 cache.
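One informal way to see this is to read the first pointer-sized word of two objects of the same class; with the usual layout it is the vptr, and it is identical for both. This relies on an implementation detail rather than anything the standard guarantees, so treat it purely as an experiment:

#include <cstdio>
#include <cstring>

struct Base { virtual void f() {} };

// Read the first pointer-sized word of an object. With the common layout this
// is the vptr; the standard does not promise it, so this is strictly a demo.
static void* first_word(const void* obj) {
    void* p = nullptr;
    std::memcpy(&p, obj, sizeof(p));
    return p;
}

int main() {
    Base a, b;
    // Typically prints the same address twice: both objects share one vtable.
    std::printf("%p %p\n", first_word(&a), first_word(&b));
}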
This example from MSDN shows the vtable for a class A with virtual func1, func2, and func3: nothing more than 12 bytes. There is a good chance the vtables of different classes will also be physically adjacent in the compiled library (you'll want to verify this if you're especially concerned), which could increase cache efficiency microscopically.
CONST SEGMENT
??_7A@@6B@
DD FLAT:?func1@A@@UAEXXZ
DD FLAT:?func2@A@@UAEXXZ
DD FLAT:?func3@A@@UAEXXZ
CONST ENDS
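Reconstructing from the mangled names (?func1@A@@UAEXXZ and so on), the class behind that dump is roughly the following; the empty bodies are placeholders:

class A {
public:
    virtual void func1() {}
    virtual void func2() {}
    virtual void func3() {}
};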
The other performance concern is the instruction overhead of calling through a vtable. This too is very efficient, nearly identical to calling a non-virtual function. Again from the MSDN example:
; A* pa;
; pa->func3();
mov eax, DWORD PTR _pa$[ebp]
mov edx, DWORD PTR [eax]
mov ecx, DWORD PTR _pa$[ebp]
call DWORD PTR [edx+8]
In this example ebp, the stack frame base pointer, addresses the variable A* pa (at the offset _pa$). The register eax is loaded with the value at that location, so it holds the A*; edx is loaded with the value at [eax], so it holds class A's vtable. Then ecx is loaded with the same A* (ecx carries the "this" pointer), and finally the call is made to the value at [edx+8], which is the third function address in the vtable.
If this function call were not virtual, the mov eax and mov edx would not be needed, but the difference in performance would be immeasurably small.
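A related technique worth knowing: when the compiler can prove the dynamic type at the call site, it can devirtualize the call and drop those two loads entirely. Marking a class or member function final (C++11) is one way to make that provable. The Animal/Dog names below are invented for the sketch:

struct Animal {
    virtual ~Animal() = default;
    virtual int legs() const { return 4; }
};

// 'final' guarantees no further override exists, so a call through a Dog
// reference or pointer can become a direct (even inlined) call.
struct Dog final : Animal {
    int legs() const override { return 4; }
};

int count_legs(const Dog& d) {
    return d.legs();   // typically compiled without any vtable load
}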
Section 5.3.3 of the draft Technical Report on C++ Performance is entirely devoted to the overhead of virtual functions.