Iterating over linked list in C++ is slower than in Go with analogous memory access

Tags:

In a variety of contexts I've observed that linked list iteration is consistently slower in C++ than in Go by 10-15%. My first attempt at resolving this mystery on Stack Overflow is here. The example I coded up was problematic because:

1) memory access was unpredictable because of heap allocations, and

2) because there was no actual work being done, some people's compilers were optimizing away the main loop.

To resolve these issues I have a new program with implementations in C++ and Go. The C++ version takes 1.75 secs compared to 1.48 secs for the Go version. This time, I do one large heap allocation before timing begins and use it to operate an object pool from which I release and acquire nodes for the linked list. This way the memory access should be completely analogous between the two implementations.

Hopefully this makes the mystery more reproducible!

C++:

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/timer.hpp>

using namespace std;

struct Node {
    Node *next; // 8 bytes
    int age;   // 4 bytes
};

// Object pool, where every free slot points to the previous free slot
template<typename T, int n>
struct ObjPool
{
    typedef T*       pointer;
    typedef pointer* metapointer;

    ObjPool() :
        _top(NULL),
        _size(0)
    {
        pointer chunks = new T[n];
        for (int i=0; i < n; i++) {
            release(&chunks[i]);
        }
    }

    // Giver an available pointer to the object pool
    void release(pointer ptr)
    {
        // Store the current pointer at the given address
        *(reinterpret_cast<metapointer>(ptr)) = _top;

        // Advance the pointer
        _top = ptr;

        // Increment the size
        ++_size;
    }

    // Pop an available pointer off the object pool for program use
    pointer acquire(void)
    {
        if(_size == 0){throw std::out_of_range("");}

        // Pop the top of the stack
        pointer retval = _top;

        // Step back to the previous address
        _top = *(reinterpret_cast<metapointer>(_top));

        // Decrement the size
        --_size;

        // Return the next free address
        return retval;
    }

    unsigned int size(void) const {return _size;}

protected:
    pointer _top;

    // Number of free slots available
    unsigned int _size;
};

Node *nodes = nullptr;
ObjPool<Node, 1000> p;

void processAge(int age) {
    // If the object pool is full, pop off the head of the linked list and release
    // it from the pool
    if (p.size() == 0) {
        Node *head = nodes;
        nodes = nodes->next;
        p.release(head);
    }

    // Insert the new Node with given age in global linked list. The linked list is sorted by age, so this requires iterating through the nodes.
    Node *node = nodes;
    Node *prev = nullptr;
    while (true) {
        if (node == nullptr || age < node->age) {
            Node *newNode = p.acquire();
            newNode->age = age;
            newNode->next = node;

            if (prev == nullptr) {
                nodes = newNode;
            } else {
                prev->next = newNode;
            }

            return;
        }

        prev = node;
        node = node->next;
    }
}

int main() {
    Node x = {};
    std::cout << "Size of struct: " << sizeof(x) << "\n"; // 16 bytes

    boost::timer t;
    for (int i=0; i<1000000; i++) {
        processAge(i);
    }

    std::cout << t.elapsed() << "\n";
}

Go:

package main

import (
    "time"
    "fmt"
    "unsafe"
)

type Node struct {
    next *Node // 8 bytes
    age int32 // 4 bytes
}

// Every free slot points to the previous free slot
type NodePool struct {
    top *Node
    size int
}

func NewPool(n int) NodePool {
    p := NodePool{nil, 0}
    slots := make([]Node, n, n)
    for i := 0; i < n; i++ {
        p.Release(&slots[i])
    }

    return p
}

func (p *NodePool) Release(l *Node) {
    // Store the current top at the given address
    *((**Node)(unsafe.Pointer(l))) = p.top
    p.top = l
    p.size++
}

func (p *NodePool) Acquire() *Node {
    if p.size == 0 {
        fmt.Printf("Attempting to pop from empty pool!\n")
    }
    retval := p.top

    // Step back to the previous address in stack of addresses
    p.top = *((**Node)(unsafe.Pointer(p.top)))
    p.size--
    return retval
}

func processAge(age int32) {
    // If the object pool is full, pop off the head of the linked list and release
    // it from the pool
    if p.size == 0 {
        head := nodes
        nodes = nodes.next
        p.Release(head)
    }

    // Insert the new Node with given age in global linked list. The linked list is sorted by age, so this requires iterating through the nodes.
    node := nodes
    var prev *Node = nil
    for true {
        if node == nil || age < node.age {
            newNode := p.Acquire()
            newNode.age = age
            newNode.next = node

            if prev == nil {
                nodes = newNode
            } else {
                prev.next = newNode
            }
            return
        }

        prev = node
        node = node.next
    }
}

// Linked list of nodes, in ascending order by age
var nodes *Node = nil
var p NodePool = NewPool(1000)

func main() {
    x := Node{};
    fmt.Printf("Size of struct: %d\n", unsafe.Sizeof(x)) // 16 bytes

    start := time.Now()
    for i := 0; i < 1000000; i++ {
        processAge(int32(i))
    }

    fmt.Printf("Time elapsed: %s\n", time.Since(start))
}

Output:

clang++ -std=c++11 -stdlib=libc++ minimalPool.cpp -O3; ./a.out
Size of struct: 16
1.7548

go run minimalPool.go
Size of struct: 16
Time elapsed: 1.487930629s

386

asked May 10 '18 22:05

rampatowl

Video Answer

1 Answers

The big difference between your two programs is that your Go code ignores errors (and will panic or segfault, if you're lucky, if you empty the pool), while your C++ code propagates errors via exception. Compare:

if p.size == 0 {
    fmt.Printf("Attempting to pop from empty pool!\n")
}

vs.

if(_size == 0){throw std::out_of_range("");}

There are at least three ways¹ to make the comparison fair:

Can change the C++ code to ignore the error, as you do in Go,
Change both versions to panic/abort on error.
Change the Go version to handle errors idiomatically,² as you do in C++.

So, let's do all of them and compare the results³:

C++ ignoring error: 1.059329s wall, 1.050000s user + 0.000000s system = 1.050000s CPU (99.1%)
C++ abort on error: 1.081585s wall, 1.060000s user + 0.000000s system = 1.060000s CPU (98.0%)
Go panic on error: Time elapsed: 1.152942427s
Go ignoring error: Time elapsed: 1.196426068s
Go idiomatic error handling: Time elapsed: 1.322005119s
C++ exception: 1.373458s wall, 1.360000s user + 0.000000s system = 1.360000s CPU (99.0%)

So:

Without error handling, C++ is faster than Go.
With panicking, Go gets faster,⁴ but still not as fast as C++.
With idiomatic error handling, C++ slows down a lot more than Go.

Why? This exception never actually happens in your test run, so the actual error-handling code never runs in either language. But clang can't prove that it doesn't happen. And, since you never catch the exception anywhere, that means it has to emit exception handlers and stack unwinders for every non-elided frame all the way up the stack. So it's doing more work on each function call and return—not much more work, but then your function is doing so little real work that the unnecessary extra work adds up.

_{1. You could also change the C++ version to do C-style error handling, or to use an Option type, and probably other possibilities.}

_{2. This, of course, requires a lot more changes: you need to import errors, change the return type of Acquire to (*Node, error), change the return type of processAge to error, change all your return statements, and add at least two if err != nil { … } checks. But that's supposed to be a good thing about Go, right?}

_{3. While I was at it, I replaced your legacy boost::timer with boost::auto_cpu_timer, so we're now seeing wall clock time (as with Go) as well as CPU time.}

_{4. I won't attempt to explain why, because I don't understand it. From a quick glance at the assembly, it's clearly optimized out some checks, but I can't see why it couldn't optimize out those same checks without the panic.}

109

answered Sep 30 '22 14:09

abarnert

Related questions
                            
                                #elseif vs #elif (C/C++ preprocessor)
                            
                                Nesting Lambda functions - Performance implications
                            
                                C++: rvalue reference memory
                            
                                Typedef of anonymous class
                            
                                What is a "virtual thunk" to a virtual function that inherits from a virtual base class?
                            
                                What is the purpose of 61 in tm_sec field from the tm structure
                            
                                Reducing the footprint of debug symbols (executable is bloated to 4 GB)
                            
                                How to make [std::operator""s] visible in a namespace?
                            
                                How to setup Visual Studio Code with OpenGL?
                            
                                Does adding an arguments with default values to functions break ABI?
                            
                                What's are practical example where acquire release memory order differs from sequential consistency?
                            
                                Overloading cast operator for enum class
                            
                                Difference between ClientSession and Session in TensorFlow C++ API
                            
                                Why is stack unwinding guaranteed only for handled exceptions?
                            
                                C++ struct declared in function visible in main
                            
                                Pass a c++ object as a pointer an reuse in another function in Rcpp
                            
                                Does shared_ptr still own its object when calling the deleter?
                            
                                Initializing an std::array of non-default-constructible elements?
                            
                                constexpr expression and variable lifetime, an example where g++ and clang disagree
                            
                                stm32 hal library warning with C++14 & above

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With