I've written this implementation of a double buffer:
// ping_pong_buffer.hpp
#include <vector>
#include <mutex>
#include <condition_variable>

template <typename T>
class ping_pong_buffer {
public:
    using single_buffer_type = std::vector<T>;
    using pointer = typename single_buffer_type::pointer;
    using const_pointer = typename single_buffer_type::const_pointer;

    ping_pong_buffer(std::size_t size)
        : _read_buffer{ size }
        , _read_valid{ false }
        , _write_buffer{ size }
        , _write_valid{ false } {}

    // Blocks until the read buffer holds fresh data.
    const_pointer get_buffer_read() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return _read_valid; });
        }
        return _read_buffer.data();
    }

    // Marks the read buffer as consumed and wakes the writer.
    void end_reading() {
        {
            std::lock_guard<std::mutex> lk(_mtx);
            _read_valid = false;
        }
        _cv.notify_one();
    }

    pointer get_buffer_write() {
        _write_valid = true;
        return _write_buffer.data();
    }

    // Blocks until the reader is done, then swaps the two buffers.
    void end_writing() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return !_read_valid; });
            std::swap(_read_buffer, _write_buffer);
            std::swap(_read_valid, _write_valid);
        }
        _cv.notify_one();
    }

private:
    single_buffer_type _read_buffer;
    bool _read_valid;
    single_buffer_type _write_buffer;
    bool _write_valid;
    mutable std::mutex _mtx;
    mutable std::condition_variable _cv;
};
Using this dummy test that performs just swaps, its performance is about 20 times worse on Linux than on Windows:
#include <thread>
#include <iostream>
#include <chrono>
#include "ping_pong_buffer.hpp"

constexpr std::size_t n = 100000;

int main() {
    ping_pong_buffer<std::size_t> ppb(1);

    std::thread producer([&ppb] {
        for (std::size_t i = 0; i < n; ++i) {
            auto p = ppb.get_buffer_write();
            p[0] = i;
            ppb.end_writing();
        }
    });

    const auto t_begin = std::chrono::steady_clock::now();
    for (;;) {
        auto p = ppb.get_buffer_read();
        if (p[0] == n - 1)
            break;
        ppb.end_reading();
    }
    const auto t_end = std::chrono::steady_clock::now();

    producer.join();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_begin).count() << '\n';
    return 0;
}
The test environments are:

GCC (Linux): -O3 -pthread
MSVC (Windows): /O2
You can find the code here on godbolt, with ASM output for both GCC and VS2019 and the compiler flags actually used.
This huge gap has also been found on other machines and seems to be due to the OS.
What could be the reason for this surprising difference?
UPDATE:
The test has also been performed on Linux on the same 10700K, compiled with -O3 -pthread, and it is still a factor of 8 slower than Windows.
If the number of iterations is increased by a factor of 10, I get 2900 ms.
As Mike Robinson answered, this is likely due to the different locking implementations on Windows and Linux. We can get a quick idea of the overhead by profiling how often each implementation switches contexts. I can do the Linux profile; I'm curious whether anyone else can try to profile this on Windows.
I'm running Ubuntu 18.04 on an Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz.
I compiled with g++ -O3 -pthread -g test.cpp -o ping_pong, and I recorded context switches with this command: sudo perf record -s -e sched:sched_switch -g --call-graph dwarf -- ./ping_pong
I extracted a report from the perf counts with this command: sudo perf report -n --header --stdio > linux_ping_pong_report.sched
The report is large, but I'm only interested in this section that shows that about 200,000 context switches were recorded:
# Total Lost Samples: 0
#
# Samples: 198K of event 'sched:sched_switch'
# Event count (approx.): 198860
#
I think that indicates really bad performance: the test pushes and pops n = 100000 items through the double buffer, so there is a context switch almost every time we call end_reading() or end_writing(), which is what I'd expect from using std::condition_variable.