I have a tree whose nodes store either -1 or a non-negative integer that is the name of a vertex. Each vertex appears at most once within the tree. The following function is a bottleneck in my code:
Version A:
void node_vertex_members(node *A, vector<int> *vertexList){
    if(A->contents != -1){
        vertexList->push_back(A->contents);
    }
    else{
        for(int i = 0; i < A->children.size(); i++){
            node_vertex_members(A->children[i], vertexList);
        }
    }
}
Version B:
void node_vertex_members(node *A, vector<int> *vertexList){
    stack<node*> q;
    q.push(A);
    while(!q.empty()){
        int x = q.top()->contents;
        if(x != -1){
            vertexList->push_back(x);
            q.pop();
        }
        else{
            node *temp = q.top();
            q.pop();
            for(int i = temp->children.size() - 1; i >= 0; --i){
                q.push(temp->children[i]);
            }
        }
    }
}
For some reason, version B takes significantly longer to run than version A, which I did not expect. What might the compiler be doing that's so much more clever than my code? Put another way, what am I doing that's so inefficient? Also perplexing to me is that if I try anything such as checking in version B whether the children's contents are -1 before putting them on the stack, it slows down dramatically (almost 3x). For reference, I am using g++ in Cygwin with the -O3 option.
Update:
I was able to match the recursive version using the following code (version C):
node *node_list[65536];
void node_vertex_members(node *A, vector<int> *vertex_list){
    int top = 0;
    node_list[top] = A;
    while(top >= 0){
        int x = node_list[top]->contents;
        if(x != -1){
            vertex_list->push_back(x);
            --top;
        }
        else{
            node *temp = node_list[top];
            --top;
            for(int i = temp->children.size() - 1; i >= 0; --i){
                ++top;
                node_list[top] = temp->children[i];
            }
        }
    }
}
Obvious downsides are the code length and the magic number (and associated hard limit). And, as I said, this only matches the version A performance. I will of course be sticking with the recursive version, but I am satisfied now that it was basically STL overhead biting me.
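For what it's worth, the magic number and hard limit can be avoided by using a std::vector<node*> as the stack, which keeps a single contiguous allocation while growing on demand. This is only a sketch: the node struct here is a guess at the layout implied by the question, and reserve(64) is an arbitrary starting capacity, not a measured value.

```cpp
#include <vector>
using std::vector;

// Assumed node layout, matching the usage in the question.
struct node {
    int contents;               // -1 for internal nodes, vertex name otherwise
    vector<node*> children;
};

void node_vertex_members(node *A, vector<int> *vertex_list){
    vector<node*> node_list;    // contiguous storage, no fixed limit
    node_list.reserve(64);      // arbitrary: skips the first few reallocations
    node_list.push_back(A);
    while(!node_list.empty()){
        node *temp = node_list.back();
        node_list.pop_back();
        if(temp->contents != -1){
            vertex_list->push_back(temp->contents);
        }
        else{
            // Push children in reverse so they pop in original order.
            for(int i = (int)temp->children.size() - 1; i >= 0; --i){
                node_list.push_back(temp->children[i]);
            }
        }
    }
}
```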
Recursion doesn't necessarily have to use more memory, either. Consider a depth-first traversal of a wide tree: the explicit-stack approach holds every pending, not-yet-visited sibling along the current path, while the recursive approach keeps only one call frame per level.
As a simple example of replacing recursion with a stack, consider the following non-recursive version of the factorial function. Here, we simply push successively smaller values of n onto the stack until the base case is reached, then repeatedly pop off the stored values and multiply them into the result.
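A minimal sketch of that idea (the function name and types are illustrative):

```cpp
#include <stack>

// Non-recursive factorial: the explicit stack plays the role of the call stack.
unsigned long long factorial(unsigned int n) {
    std::stack<unsigned int> values;
    // "Recurse": push successively smaller arguments until the base case.
    while (n > 1) {
        values.push(n--);
    }
    // Base case reached: 0! == 1! == 1.
    unsigned long long result = 1;
    // "Unwind": pop the stored values and multiply them into the result.
    while (!values.empty()) {
        result *= values.top();
        values.pop();
    }
    return result;
}
```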
Version A has one significant advantage: far smaller code size.
Version B has one significant disadvantage: memory allocation for the stack elements. The stack starts out empty and has elements pushed onto it one by one, so every so often the underlying deque must make a new allocation. That is an expensive operation, and it may be repeated several times on each call of your function.
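One way to mitigate this: std::stack is a container adaptor, so the default std::deque can be swapped for a std::vector that is reserved up front, turning many small allocations into one. A sketch, where 1024 is a hypothetical depth bound rather than anything measured from the question:

```cpp
#include <stack>
#include <vector>

// Build a stack backed by a pre-reserved vector instead of a deque.
// The capacity (1024) is an assumed bound on the maximum stack depth.
std::stack<int, std::vector<int>> make_preallocated_stack() {
    std::vector<int> storage;
    storage.reserve(1024);
    return std::stack<int, std::vector<int>>(std::move(storage));
}
```

Subsequent pushes reuse the reserved capacity, so the hot loop performs no allocations until the depth bound is exceeded.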
Edit: here's the assembly generated by g++ -O2 -S with GCC 4.7.3 on Mac OS, run through c++filt and annotated by me:
versionA(node*, std::vector<int, std::allocator<int> >*):
LFB609:
        pushq   %r12
LCFI5:
        movq    %rsi, %r12
        pushq   %rbp
LCFI6:
        movq    %rdi, %rbp
        pushq   %rbx
LCFI7:
        movl    (%rdi), %eax
        cmpl    $-1, %eax          ; if(A->contents != -1)
        jne     L36                ; vertexList->push_back(A->contents)
        movq    8(%rdi), %rcx
        xorl    %r8d, %r8d
        movl    $1, %ebx
        movq    16(%rdi), %rax
        subq    %rcx, %rax
        sarq    $3, %rax
        testq   %rax, %rax
        jne     L46                ; i < A->children.size()
        jmp     L35
L43:                               ; for(int i=0;i<A->children.size();i++)
        movq    %rdx, %rbx
L46:
        movq    (%rcx,%r8,8), %rdi
        movq    %r12, %rsi
        call    versionA(node*, std::vector<int, std::allocator<int> >*)
        movq    8(%rbp), %rcx
        leaq    1(%rbx), %rdx
        movq    16(%rbp), %rax
        movq    %rbx, %r8
        subq    %rcx, %rax
        sarq    $3, %rax
        cmpq    %rbx, %rax
        ja      L43                ; continue
L35:
        popq    %rbx
LCFI8:
        popq    %rbp
LCFI9:
        popq    %r12
LCFI10:
        ret
L36:                               ; vertexList->push_back(A->contents)
LCFI11:
        movq    8(%rsi), %rsi
        cmpq    16(%r12), %rsi     ; vector::size == vector::capacity
        je      L39
        testq   %rsi, %rsi
        je      L40
        movl    %eax, (%rsi)
L40:
        popq    %rbx
LCFI12:
        addq    $4, %rsi
        movq    %rsi, 8(%r12)
        popq    %rbp
LCFI13:
        popq    %r12
LCFI14:
        ret
L39:                               ; slow path for vector to expand capacity
LCFI15:
        movq    %rdx, %rdx
        movq    %r12, %rdi
        call    std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, int const&)
        jmp     L35
That's succinct and, at a glance, fairly free of "speed bumps." When I compile with -O3 I get an unholy mess, with unrolled loops and other fun stuff. I don't have time to annotate version B right now, but suffice it to say it is more complex, due to the many deque functions and scribbling on a lot more memory. No surprise it's slower.
The maximum size of q in version B is significantly greater than the maximum recursion depth in version A, which could make your cache performance quite a bit less efficient. (Version A: depth is log(N)/log(b); version B: stack length hits b*log(N)/log(b).)
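That gap is easy to check with a small simulation. The sketch below is independent of the node type above: nodes are represented only by their depth, and the explicit-stack DFS of a complete b-ary tree of height h is replayed while tracking peak stack occupancy, which comes out to (b-1)*h + 1, versus h + 1 frames for the recursive version.

```cpp
#include <algorithm>
#include <vector>

// Replay the explicit-stack DFS on a complete b-ary tree of height h,
// representing each node only by its depth, and return the peak number
// of entries the stack holds at any point during the traversal.
int peak_stack_size(int b, int h) {
    std::vector<int> st{0};    // stack of node depths; root at depth 0
    int peak = 1;
    while (!st.empty()) {
        int d = st.back();
        st.pop_back();
        if (d < h) {           // internal node: push all b children
            for (int i = 0; i < b; ++i) {
                st.push_back(d + 1);
            }
        }
        peak = std::max(peak, (int)st.size());
    }
    return peak;               // equals (b - 1) * h + 1
}
```

Each internal node popped on the way down replaces one entry with b children, a net gain of b - 1, so the peak grows linearly in both the branching factor and the height, while the recursion depth grows only with the height.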