
Performance hit of vtable lookup in C++

I'm evaluating a rewrite of a piece of real-time software from C/assembly language to C++/assembly language (for reasons not relevant to the question, parts of the code absolutely have to be done in assembly).

An interrupt comes with a 3 kHz frequency, and for each interrupt around 200 different things are to be done in a sequence. The processor runs with 300 MHz, giving us 100,000 cycles to do the job. This has been solved in C with an array of function pointers:

// Each function does a different thing, all take one parameter being a pointer
// to a struct, each struct also being different.
void (*todolist[200])(void *parameters);

// Array of pointers to structs containing each function's parameters.
void *paramlist[200];

void realtime(void)
{
  int i;
  for (i = 0; i < 200; i++)
    (*todolist[i])(paramlist[i]);
}

Speed is important. The above 200 iterations are done 3,000 times per second, so practically we do 600,000 iterations per second. The above for loop compiles to five cycles per iteration, yielding a total cost of 3,000,000 cycles per second, i.e. 1% CPU load. Assembler optimization might bring that down to four instructions; however, I fear we might get some extra delay due to memory accesses close to each other, etc. In short, I believe those five cycles are pretty optimal.

Now to the C++ rewrite. Those 200 things we do are sort of related to each other. There is a subset of parameters that they all need and use, and have in their respective structs. In a C++ implementation they could thus neatly be regarded as inheriting from a common base class:

class Base {
public:
  virtual void Execute();
  int something_all_things_need;
};

class Derived1 : public Base {
public:
  void Execute() { /* Do something */ }
  int own_parameter;
  // Other own parameters
};

class Derived2 : public Base { /* Etc. */ };

Base *todolist[200];

void realtime(void)
{
  for (int i = 0; i < 200; i++)
    todolist[i]->Execute(); // vtable look-up! 20+ cycles.
}

My problem is the vtable lookup. I cannot afford 600,000 lookups per second; this would account for more than 4% of wasted CPU load. Moreover, the todolist never changes during run-time; it is only set up once at start-up, so the effort of looking up which function to call is truly wasted. Upon asking myself the question "what is the most optimal end result possible?", I look at the assembler code given by the C solution, and find once again an array of function pointers...

What is the clean and proper way to do this in C++? Making a nice base class, derived classes and so on feels pretty pointless when in the end one again picks out function pointers for performance reasons.

Update (including correction of where the loop starts):

The processor is an ADSP-214xx, and the compiler is VisualDSP++ 5.0. When enabling #pragma optimize_for_speed, the C loop is 9 cycles. Assembly-optimizing it in my mind yields 4 cycles; however, I didn't test it, so it's not guaranteed. The C++ loop is 14 cycles. I'm aware that the compiler could do a better job, but I did not want to dismiss this as a compiler issue: getting by without polymorphism is still preferable in an embedded context, and the design choice still interests me. For reference, here is the resulting assembly:

C:

i3=0xb27ba;
i5=0xb28e6;
r15=0xc8;

Here's the actual loop:

r4=dm(i5,m6);
i12=dm(i3,m6);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;
r15=r15-1;
if ne jump (pc, 0xfffffff2);

C++ :

i5=0xb279a;
r15=0xc8;

Here's the actual loop:

i5=modify(i5,m6);
i4=dm(m7,i5);
r2=i4;
i4=dm(m6,i4);
r1=dm(0x3,i4);
r4=r2+r1;
i12=dm(0x5,i4);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279e2;
r15=r15-1;
if ne jump (pc, 0xffffffe7);

In the meantime, I think I have found sort of an answer. The lowest number of cycles is achieved by doing the very least possible. I have to fetch a data pointer, fetch a function pointer, and call the function with the data pointer as parameter. When fetching a pointer, the index register is automatically modified by a constant, and one can just as well let this constant equal 1. So once again one finds oneself with an array of function pointers, and an array of data pointers.

Naturally, the limit is what can be done in assembly, and that has now been explored. Having this in mind, I now understand that even though it comes naturally to introduce a base class, it was not really what fit the bill. So I guess the answer is that if one wants an array of function pointers, one should make oneself an array of function pointers...

Asked Aug 23 '13 by user2711077



1 Answer

What makes you think vtable lookup overhead is 20 cycles? If that's really true, you need a better C++ compiler.

I tried this on an Intel box, not knowing anything about the processor you're using, and as expected the difference between the C despatch code and the C++ vtable despatch is one instruction, down to the fact that the vtable call involves an extra indirection.

C code (based on OP):

void (*todolist[200])(void *parameters);
void *paramlist[200];

void realtime(void) {
  int i;
  for (i = 0; i < 200; i++)
    (*todolist[i])(paramlist[i]);
}

C++ code:

class Base {
  public:
    Base(void* unsafe_pointer) : unsafe_pointer_(unsafe_pointer) {}
    virtual void operator()() = 0;
  protected:
    void* unsafe_pointer_;
};

Base* todolist[200];

void realtime() {
  for (int i = 0; i < 200; ++i)
    (*todolist[i])();
}

Both compiled with gcc 4.8, -O3:

realtime:                             |_Z8realtimev:
.LFB0:                                |.LFB3:
        .cfi_startproc                |        .cfi_startproc
        pushq   %rbx                  |        pushq   %rbx
        .cfi_def_cfa_offset 16        |        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16            |        .cfi_offset 3, -16
        xorl    %ebx, %ebx            |        movl    $todolist, %ebx
        .p2align 4,,10                |        .p2align 4,,10
        .p2align 3                    |        .p2align 3
.L3:                                  |.L3:
        movq    paramlist(%rbx), %rdi |        movq    (%rbx), %rdi
        call    *todolist(%rbx)       |        addq    $8, %rbx
        addq    $8, %rbx              |        movq    (%rdi), %rax
                                      |        call    *(%rax)
        cmpq    $1600, %rbx           |        cmpq    $todolist+1600, %rbx
        jne     .L3                   |        jne     .L3
        popq    %rbx                  |        popq    %rbx
        .cfi_def_cfa_offset 8         |        .cfi_def_cfa_offset 8
        ret                           |        ret

In the C++ code, the first movq gets the address of the vtable, and the call then indexes through that. So that's one instruction overhead.

According to OP, the DSP's C++ compiler produces the following code. I've inserted comments based on my understanding of what's going on (which might be wrong). Note that (IMO) the loop starts one location earlier than OP indicates; otherwise, it makes no sense (to me).

# Initialization.
# i3=todolist; i5=paramlist           | # i5=todolist holds paramlist
i3=0xb27ba;                           | # No paramlist in C++
i5=0xb28e6;                           | i5=0xb279a;
# r15=count
r15=0xc8;                             | r15=0xc8;

# Loop. We need to set up r4 (first parameter) and figure out the branch address.
# In C++ by convention, the first parameter is 'this'
# Note 1:
r4=dm(i5,m6); # r4 = *paramlist++;    | i5=modify(i5,m6); # i4 = *todolist++
                                      | i4=dm(m7,i5);     # ..
# Note 2:
                                      | r2=i4;            # r2 = obj
                                      | i4=dm(m6,i4);     # vtable = *(obj + 1)
                                      | r1=dm(0x3,i4);    # r1 = vtable[3]
                                      | r4=r2+r1;         # param = obj + r1
i12=dm(i3,m6); # i12 = *todolist++;   | i12=dm(0x5,i4);   # i12 = vtable[5]

# Boilerplate call. Set frame pointer, push return address and old frame pointer.
# The two (push) instructions after jump are actually executed before the jump.
r2=i6;                                | r2=i6;
i6=i7;                                | i6=i7;
jump (m13,i12) (db);                  | jump (m13,i12) (db);
dm(i7,m7)=r2;                         | dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;                   | dm(i7,m7)=0x1279e2;

# if (count--) loop
r15=r15-1;                            | r15=r15-1;
if ne jump (pc, 0xfffffff2);          | if ne jump (pc, 0xffffffe7);

Notes:

  1. In the C++ version, it seems that the compiler has decided to do the post-increment in two steps, presumably because it wants the result in an i register rather than in r4. This is undoubtedly related to the issue below.

  2. The compiler has decided to compute the base address of the object's real class, using the object's vtable. This occupies three instructions, and presumably also requires the use of i4 as a temporary in step 1. The vtable lookup itself occupies one instruction.

So: the issue is not vtable lookup, which could have been done in a single extra instruction (but actually requires two). The problem is that the compiler feels the need to "find" the object. But why doesn't gcc/i86 need to do that?

The answer is: it used to, but it doesn't any more. In many cases (where there is no multiple inheritance, for example), the cast of a pointer to a derived class to a pointer of a base class does not require modifying the pointer. Consequently, when we call a method of the derived class, we can just give it the base class pointer as its this parameter. But in other cases, that doesn't work, and we have to adjust the pointer when we do the cast, and consequently adjust it back when we do the call.

There are (at least) two ways to perform the second adjustment. One is the way shown by the generated DSP code, where the adjustment is stored in the vtable -- even if it is 0 -- and then applied during the call. The other way, (called vtable-thunks) is to create a thunk -- a little bit of executable code -- which adjusts the this pointer and then jumps to the method's entry point, and put a pointer to this thunk into the vtable. (This can all be done at compile time.) The advantage of the thunk solution is that in the common case where no adjustment needs to be done, we can optimize away the thunk and there is no adjustment code left. (The disadvantage is that if we do need an adjustment, we've generated an extra branch.)

As I understand it, VisualDSP++ is based on gcc, and it might have the -fvtable-thunks and -fno-vtable-thunks options. So you might be able to compile with -fvtable-thunks. But if you do that, you would need to compile all the C++ libraries you use with that option, because you cannot mix the two calling styles. Also, there were (15 years ago) various bugs in gcc's vtable-thunks implementation, so if the version of gcc used by VisualDSP++ is old enough, you might run into those problems too (IIRC, they all involved multiple inheritance, so they might not apply to your use case.)


(Original test, before update):

I tried the following simple case (no multiple inheritance, which can slow things down):

class Base {
  public:
    Base(int val) : val_(val) {}
    virtual int binary(int a, int b) = 0;
    virtual int unary(int a) = 0;
    virtual int nullary() = 0;
  protected:
    int val_;
};

int binary(Base* begin, Base* end, int a, int b) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->binary(a, b); }
  return accum;
}

int unary(Base* begin, Base* end, int a) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->unary(a); }
  return accum;
}

int nullary(Base* begin, Base* end) {
  int accum = 0;
  for (; begin != end; ++begin) { accum += begin->nullary(); }
  return accum;
}

And compiled it with gcc (4.8) using -O3. As I expected, it produced exactly the same assembly code as your C despatch would have done. Here's the for loop in the case of the unary function, for example:

.L9:
        movq    (%rbx), %rax
        movq    %rbx, %rdi
        addq    $16, %rbx
        movl    %r13d, %esi
        call    *8(%rax)
        addl    %eax, %ebp
        cmpq    %rbx, %r12
        jne     .L9
Answered Oct 21 '22 by rici