Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Avoiding repeated C++ virtual table lookup

I have C++ program that reads a config file when the binary is executed, creates a number of child class instances based on the config file, and then periodically iterates over these instances and calls their respective virtual functions.

Gprof is telling me that these function calls are taking up a lot of time (the aforementioned iteration happens very frequently), so I want to try to avoid the repeated virtual function calls somehow.

The code is similar to the following. Once the program populates vector v at the start of the program, this vector won't change anymore for the rest of the program, so it seems inefficient to repeatedly have to do a virtual table lookup every time I want to call f(). I would think there must be a way to cache or save the function pointers somehow, but I'm not sure how.

Would love any suggestions you have on speeding things up. Thank you!

Edit: Sorry, I forgot to mention that the function calls f() for the vector of Child instances has to be in order from 0 to v.size() - 1, so I can't group together the elements of v that have the same derived type.

Also, this was built with -O3 -std=c++14

class Parent {
public:
    virtual void f() { } 
};

class Child1 : public Parent {
public:
    void f() { /* do stuff for child1 */ }
};

//...

class Child9 : public Parent {
public:
    void f() { /* do stuff for child9 */ }
};

int main() {
    vector<Parent*> v;

    // read config file and add Child instances to v based on the file contents

    while (true) {
        // do other stuff
        for (size_t i = 0; i != v.size(); ++i) {
             v[i]->f(); // expensive to do the same virtual table lookups every loop!
        }
    }
};
like image 748
galpo Avatar asked Mar 21 '20 00:03

galpo


People also ask

Is it necessary to override every virtual function?

It is not mandatory for the derived class to override (or re-define the virtual function), in that case, the base class version of the function is used.

Is vtable slow?

Virtual functions are slow when you have a cache miss looking them up. As we'll see through benchmarks, they can be very slow. They can also be very fast when used carefully — to the point where it's impossible to measure the overhead.

What is a vtable lookup?

The virtual table is a lookup table of functions used to resolve function calls in a dynamic/late binding manner. The virtual table sometimes goes by other names, such as “vtable”, “virtual function table”, “virtual method table”, or “dispatch table”.

Can virtual methods be inherited?

Base classes can't inherit what the child has (such as a new function or variable). Virtual functions are simply functions that can be overridden by the child class if the that child class changes the implementation of the virtual function so that the base virtual function isn't called. A is the base class for B,C,D.


2 Answers

Based on some of the questions and your answers in the comments, here are a couple of considerations.

1) Your problem (if there is one, your solution might already be close to optimal, depending on details you have not mentioned) is most likely somewhere else, not in the overhead of a virtual function call.

If you really run this in a tight loop, and there's not much going on in the implementations of f() that touches a lot of memory, your vtables probably remain in the L1 cache, and the virtual function call overhead will be absolutely minimal, if any, on modern hardware.

2) You say "the functions f() themselves are very simple, for example one of them just multiplies the values at two memory addresses and stores the product in a third address" - this might not be as innocent as you expect. For reference, going to L1 cache will cost you about 3 cycles, going to RAM may cost as much as 60-200, depedning on your hardware.

If you have enough of these objects (so that keeping all of the memory they reference in L1 cache is not possible), and the memory locations they reference are basically random (so that prefetching is ineffective), and/or you touch enough things in the rest of your program (so that all the relevant data gets vacated from cache between the loops over your vector), the cost of fetching and storing the values from and to memory/lower levels of cache will outweigh the cost of the virtual function calls by orders of magnitude in the worst case.

3) You iterate over a vector of pointers to objects - not the objects themselves.

Depending on how you allocate the objects and how big they are, this might not be an issue - prefetching will do wonders for you if you allocate them in a tight loop and your allocator packs them nicely. If, however, you allocate/free a lot of other things and mix in the allocations of these objects in between, they may end up located sparsely and in basically random locations in memory; then iterating over them in the order of creation will involve a lot random reads from memory, which will again be far slower than any virtual function overhead.

4) You say "calls to f() for the vector of children has to be in order" - do they?

If they do, then you are out of luck in some ways. If, however, you can re-architect your system so that they can be called ordered by type, then there is a lot of speed to be gained in various aspects - you could probably allocate an array of each type of object (nice, dense packing in memory), iterate over them in order (prefetcher friendly), and call your f()'s in groups for a single, well known type (inlining friendly, instruction cache friendly).

5) And finally - if none of the above applies and your problem is really in virtual function calls (unlikely), then, yes, you can try storing a pointer to the exact function you need to call for each object in some fashion - either manually or by using one of the type erasure / duck typing methods others have suggested.

My main point is this - there a lot of performance benefits to be had from changing the architecture of your system in some ways.

Remember: accessing things that are already in L1/L2 cache is good, having to go to L3/RAM for data is worse; accessing memory in a sequential order is good, jumping all over memory is bad; calling the same method in a tight loop, potentially inlining it, is good, calling a lot of different methods in a tight loop is worse.

If this is a part of your program the performance of which really matters, you should consider changing the architecture of your system to allow for some of the previously mentioned optimizations. I know this may seem daunting, but that is the game we are playing. Sometimes you need to sacrifice "clean" OOP and abstractions for performance, if the problem you are solving allows for it.

like image 62
DeducibleSteak Avatar answered Sep 28 '22 05:09

DeducibleSteak


Edit: For vector of arbitrary child types mixed in together, I recommend going with the virtual call.


If, depending on config, there were a vector of only one child type - or if you can separate the different types into separate containers, then this could be a case where compile time polymorphism might be an option instead of runtime one. For example:

template<class Child, class Range>
void f_for(Range& r) {
    for (Parent* p : r) {
        Child* c = static_cast<Child*>(p);
        c->Child::f(); // use static dispatch to avoid virtual lookup
    }
}

...

if (config)
    f_for<Child1>(v);
else
    f_for<Child2>(v);

Alternative to explicit static dispatch would be to mark the child class or the member function final.

You might even expand the static portion of the program so that you get to use vector<Child1> or vector<Child2> directly, avoiding the extra indirection. At this point the inheritance is not even necessary.

like image 23
eerorika Avatar answered Sep 28 '22 04:09

eerorika