I've written this implementation of a double buffer:
// ping_pong_buffer.hpp
#include <vector>
#include <mutex>
#include <condition_variable>

template <typename T>
class ping_pong_buffer {
public:
    using single_buffer_type = std::vector<T>;
    using pointer = typename single_buffer_type::pointer;
    using const_pointer = typename single_buffer_type::const_pointer;

    ping_pong_buffer(std::size_t size)
        : _read_buffer{ size }
        , _read_valid{ false }
        , _write_buffer{ size }
        , _write_valid{ false } {}

    // Blocks until the read buffer holds fresh data.
    const_pointer get_buffer_read() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return _read_valid; });
        }
        return _read_buffer.data();
    }

    // Marks the read buffer as consumed and wakes the writer.
    void end_reading() {
        {
            std::lock_guard<std::mutex> lk(_mtx);
            _read_valid = false;
        }
        _cv.notify_one();
    }

    pointer get_buffer_write() {
        _write_valid = true;
        return _write_buffer.data();
    }

    // Blocks until the reader is done, then swaps the two buffers.
    void end_writing() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return !_read_valid; });
            std::swap(_read_buffer, _write_buffer);
            std::swap(_read_valid, _write_valid);
        }
        _cv.notify_one();
    }

private:
    single_buffer_type _read_buffer;
    bool _read_valid;
    single_buffer_type _write_buffer;
    bool _write_valid;
    mutable std::mutex _mtx;
    mutable std::condition_variable _cv;
};
Using this dummy test that performs just swaps, its performance is about 20 times worse on Linux than on Windows:
#include <thread>
#include <iostream>
#include <chrono>
#include "ping_pong_buffer.hpp"

constexpr std::size_t n = 100000;

int main() {
    ping_pong_buffer<std::size_t> ppb(1);

    std::thread producer([&ppb] {
        for (std::size_t i = 0; i < n; ++i) {
            auto p = ppb.get_buffer_write();
            p[0] = i;
            ppb.end_writing();
        }
    });

    const auto t_begin = std::chrono::steady_clock::now();
    for (;;) {
        auto p = ppb.get_buffer_read();
        if (p[0] == n - 1)
            break;
        ppb.end_reading();
    }
    const auto t_end = std::chrono::steady_clock::now();

    producer.join();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_begin).count() << '\n';
    return 0;
}
The test environments are:

GCC (Linux): -O3 -pthread
MSVC (Windows): /O2
You can find the code here on godbolt, with ASM output for both GCC and VS2019 and the compiler flags actually used.
This huge gap has also been found on other machines and seems to be due to the OS.
What could be the reason for this surprising difference?
UPDATE:
The test has also been performed on Linux on the same 10700K, compiled with -O3 -pthread, and it is still a factor of 8 slower than Windows.
If the number of iterations is increased by a factor of 10, I get 2900 ms.
As Mike Robinson answered, this is likely due to the different locking implementations on Windows and Linux. We can get a quick idea of the overhead by profiling how often each implementation switches contexts. I can do the Linux profile; I'm curious whether anyone else can try to profile this on Windows.
I'm running Ubuntu 18.04 on an Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz.
I compiled with g++ -O3 -pthread -g test.cpp -o ping_pong, and I recorded context switches with this command: sudo perf record -s -e sched:sched_switch -g --call-graph dwarf -- ./ping_pong
I extracted a report from the perf counts with this command: sudo perf report -n --header --stdio > linux_ping_pong_report.sched
The report is large, but I'm only interested in this section that shows that about 200,000 context switches were recorded:
# Total Lost Samples: 0
#
# Samples: 198K of event 'sched:sched_switch'
# Event count (approx.): 198860
#
I think that indicates really bad performance: the test pushes and pops n = 100000 items through the double buffer, so there is a context switch almost every time we call end_reading() or end_writing(), which is what I'd expect from using std::condition_variable.