I have a tree whose nodes store either -1 or a non-negative integer that is the name of a vertex. Each vertex appears at most once within the tree. The following function is a bottleneck in my code:
Version A:
void node_vertex_members(node *A, vector<int> *vertexList){
    if(A->contents != -1){
        vertexList->push_back(A->contents);
    }
    else{
        for(int i = 0; i < A->children.size(); i++){
            node_vertex_members(A->children[i], vertexList);
        }
    }
}
Version B:
void node_vertex_members(node *A, vector<int> *vertexList){
    stack<node*> q;
    q.push(A);
    while(!q.empty()){
        int x = q.top()->contents;
        if(x != -1){
            vertexList->push_back(x);
            q.pop();
        }
        else{
            node *temp = q.top();
            q.pop();
            for(int i = temp->children.size() - 1; i >= 0; --i){
                q.push(temp->children[i]);
            }
        }
    }
}
For some reason, version B takes significantly longer to run than version A, which I did not expect. What might the compiler be doing that's so much more clever than my code? Put another way, what am I doing that's so inefficient? Also perplexing to me is that if I try anything such as checking in version B whether the children's contents are -1 before putting them on the stack, it slows down dramatically (almost 3x). For reference, I am using g++ in Cygwin with the -O3 option.
Update:
I was able to match the recursive version using the following code (version C):
node *node_list[65536];
void node_vertex_members(node *A, vector<int> *vertex_list){
    int top = 0;
    node_list[top] = A;
    while(top >= 0){
        int x = node_list[top]->contents;
        if(x != -1){
            vertex_list->push_back(x);
            --top;
        }
        else{
            node *temp = node_list[top];
            --top;
            for(int i = temp->children.size() - 1; i >= 0; --i){
                ++top;
                node_list[top] = temp->children[i];
            }
        }
    }
}
Obvious downsides are the code length and the magic number (and associated hard limit). And, as I said, this only matches the version A performance. I will of course be sticking with the recursive version, but I am satisfied now that it was basically STL overhead biting me.
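For what it's worth, the magic number and hard limit can be avoided by using a std::vector<node*> as the stack, which keeps a single contiguous allocation while growing on demand. This is only a sketch: the node struct here is a guess at the layout implied by the question, and reserve(64) is an arbitrary starting capacity, not a measured value.

```cpp
#include <vector>
using std::vector;

// Assumed node layout, matching the usage in the question.
struct node {
    int contents;               // -1 for internal nodes, vertex name otherwise
    vector<node*> children;
};

void node_vertex_members(node *A, vector<int> *vertex_list){
    vector<node*> node_list;    // contiguous storage, no fixed limit
    node_list.reserve(64);      // arbitrary: skips the first few reallocations
    node_list.push_back(A);
    while(!node_list.empty()){
        node *temp = node_list.back();
        node_list.pop_back();
        if(temp->contents != -1){
            vertex_list->push_back(temp->contents);
        }
        else{
            // Push children in reverse so they pop in original order.
            for(int i = (int)temp->children.size() - 1; i >= 0; --i){
                node_list.push_back(temp->children[i]);
            }
        }
    }
}
```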
Recursion doesn't necessarily have to use more memory, either. Consider a depth-first traversal of a wide tree: the explicit-stack approach holds every pending, not-yet-visited sibling along the current path, while the recursive approach keeps only one call frame per level.
As a simple example of replacing recursion with a stack, consider the following non-recursive version of the factorial function. Here, we simply push successively smaller values of n onto the stack until the base case is reached, then repeatedly pop off the stored values and multiply them into the result.
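A minimal sketch of that idea (the function name and types are illustrative):

```cpp
#include <stack>

// Non-recursive factorial: the explicit stack plays the role of the call stack.
unsigned long long factorial(unsigned int n) {
    std::stack<unsigned int> values;
    // "Recurse": push successively smaller arguments until the base case.
    while (n > 1) {
        values.push(n--);
    }
    // Base case reached: 0! == 1! == 1.
    unsigned long long result = 1;
    // "Unwind": pop the stored values and multiply them into the result.
    while (!values.empty()) {
        result *= values.top();
        values.pop();
    }
    return result;
}
```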
Version A has one significant advantage: far smaller code size.
Version B has one significant disadvantage: memory allocation for the stack elements. The stack starts out empty and has elements pushed onto it one by one, so every so often the underlying deque must make a new allocation. That is an expensive operation, and it may be repeated several times on each call of your function.
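One way to mitigate this: std::stack is a container adaptor, so the default std::deque can be swapped for a std::vector that is reserved up front, turning many small allocations into one. A sketch, where 1024 is a hypothetical depth bound rather than anything measured from the question:

```cpp
#include <stack>
#include <vector>

// Build a stack backed by a pre-reserved vector instead of a deque.
// The capacity (1024) is an assumed bound on the maximum stack depth.
std::stack<int, std::vector<int>> make_preallocated_stack() {
    std::vector<int> storage;
    storage.reserve(1024);
    return std::stack<int, std::vector<int>>(std::move(storage));
}
```

Subsequent pushes reuse the reserved capacity, so the hot loop performs no allocations until the depth bound is exceeded.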
Edit: here's the assembly generated by g++ -O2 -S with GCC 4.7.3 on Mac OS, run through c++filt and annotated by me:
versionA(node*, std::vector<int, std::allocator<int> >*):
LFB609:
        pushq   %r12
LCFI5:
        movq    %rsi, %r12
        pushq   %rbp
LCFI6:
        movq    %rdi, %rbp
        pushq   %rbx
LCFI7:
        movl    (%rdi), %eax
        cmpl    $-1, %eax          ; if(A->contents != -1)
        jne     L36                ; vertexList->push_back(A->contents)
        movq    8(%rdi), %rcx
        xorl    %r8d, %r8d
        movl    $1, %ebx
        movq    16(%rdi), %rax
        subq    %rcx, %rax
        sarq    $3, %rax
        testq   %rax, %rax
        jne     L46                ; i < A->children.size()
        jmp     L35
L43:                               ; for(int i=0;i<A->children.size();i++)
        movq    %rdx, %rbx
L46:
        movq    (%rcx,%r8,8), %rdi
        movq    %r12, %rsi
        call    versionA(node*, std::vector<int, std::allocator<int> >*)
        movq    8(%rbp), %rcx
        leaq    1(%rbx), %rdx
        movq    16(%rbp), %rax
        movq    %rbx, %r8
        subq    %rcx, %rax
        sarq    $3, %rax
        cmpq    %rbx, %rax
        ja      L43                ; continue
L35:
        popq    %rbx
LCFI8:
        popq    %rbp
LCFI9:
        popq    %r12
LCFI10:
        ret
L36:                               ; vertexList->push_back(A->contents)
LCFI11:
        movq    8(%rsi), %rsi
        cmpq    16(%r12), %rsi     ; vector::size == vector::capacity
        je      L39
        testq   %rsi, %rsi
        je      L40
        movl    %eax, (%rsi)
L40:
        popq    %rbx
LCFI12:
        addq    $4, %rsi
        movq    %rsi, 8(%r12)
        popq    %rbp
LCFI13:
        popq    %r12
LCFI14:
        ret
L39:                               ; slow path for vector to expand capacity
LCFI15:
        movq    %rdx, %rdx
        movq    %r12, %rdi
        call    std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, int const&)
        jmp     L35
That's succinct and, at a glance, fairly free of "speed bumps." When I compile with -O3 I get an unholy mess, with unrolled loops and other fun stuff. I don't have time to annotate version B right now, but suffice it to say it is more complex, due to the many deque functions and scribbling on a lot more memory. No surprise it's slower.
The maximum size of q in version B is significantly greater than the maximum recursion depth in version A, which could make your cache performance quite a bit less efficient. (Version A: depth is log(N)/log(b); version B: stack length hits b*log(N)/log(b).)
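That gap is easy to check with a small simulation. The sketch below is independent of the node type above: nodes are represented only by their depth, and the explicit-stack DFS of a complete b-ary tree of height h is replayed while tracking peak stack occupancy, which comes out to (b-1)*h + 1, versus h + 1 frames for the recursive version.

```cpp
#include <algorithm>
#include <vector>

// Replay the explicit-stack DFS on a complete b-ary tree of height h,
// representing each node only by its depth, and return the peak number
// of entries the stack holds at any point during the traversal.
int peak_stack_size(int b, int h) {
    std::vector<int> st{0};    // stack of node depths; root at depth 0
    int peak = 1;
    while (!st.empty()) {
        int d = st.back();
        st.pop_back();
        if (d < h) {           // internal node: push all b children
            for (int i = 0; i < b; ++i) {
                st.push_back(d + 1);
            }
        }
        peak = std::max(peak, (int)st.size());
    }
    return peak;               // equals (b - 1) * h + 1
}
```

Each internal node popped on the way down replaces one entry with b children, a net gain of b - 1, so the peak grows linearly in both the branching factor and the height, while the recursion depth grows only with the height.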