The code below demonstrates a curiosity of multi-threaded programming: the performance of a std::memory_order_relaxed
increment versus a regular increment in a single thread. What I do not understand is why fetch_add(relaxed) is about twice as slow as a regular increment, even when single-threaded.
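static void BM_IncrementCounterLocal(benchmark::State& state) {
    // Atomic counter local to each benchmark thread, initialized explicitly.
    volatile std::atomic_int val2(0);
    while (state.KeepRunning()) {
        for (int i = 0; i < 10; ++i) {
            // Relaxed atomic read-modify-write; the result is kept alive for the benchmark.
            benchmark::DoNotOptimize(val2.fetch_add(1, std::memory_order_relaxed));
        }
    }
}
BENCHMARK(BM_IncrementCounterLocal)->ThreadRange(1, 8);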
static void BM_IncrementCounterLocalInt(benchmark::State& state) {
    // Plain (non-atomic) counter; volatile forces a real load and store per increment.
    volatile int val3 = 0;
    while (state.KeepRunning()) {
        for (int i = 0; i < 10; ++i) {
            benchmark::DoNotOptimize(++val3);
        }
    }
}
BENCHMARK(BM_IncrementCounterLocalInt)->ThreadRange(1, 8);
Output:
Benchmark                              Time(ns)  CPU(ns)  Iterations
--------------------------------------------------------------------
BM_IncrementCounterLocal/threads:1           59       60    11402509
BM_IncrementCounterLocal/threads:2           30       61    11284498
BM_IncrementCounterLocal/threads:4           19       62    11373100
BM_IncrementCounterLocal/threads:8           17       62    10491608
BM_IncrementCounterLocalInt/threads:1        31       31    22592452
BM_IncrementCounterLocalInt/threads:2        15       31    22170842
BM_IncrementCounterLocalInt/threads:4         8       31    22214640
BM_IncrementCounterLocalInt/threads:8         9       31    21889704
With the volatile int, the compiler must ensure that it does not optimize away or reorder any reads or writes of the variable.
With the fetch_add, the CPU must take precautions so that the read-modify-write operation is atomic.
These are two completely different requirements. The atomicity requirement means that the CPU has to coordinate with the other CPUs on your machine, ensuring that they don't read or write the given memory location between its own read and write. If the compiler implements the fetch_add using a compare-and-swap instruction, it will actually emit a short loop to catch the case that some other CPU modified the value in between.
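To make the compare-and-swap case concrete, here is a minimal sketch of what such a retry loop looks like when written by hand in C++. The helper name fetch_add_via_cas is made up for illustration; it is not what any particular compiler emits, only the same idea expressed with std::atomic's compare_exchange_weak.

```cpp
#include <atomic>

// Hypothetical helper (name invented for this sketch): emulates fetch_add
// with a compare-and-swap retry loop. Returns the value held before the
// addition, like std::atomic<int>::fetch_add does.
inline int fetch_add_via_cas(std::atomic<int>& target, int delta) {
    int observed = target.load(std::memory_order_relaxed);
    // compare_exchange_weak stores observed + delta only if target still
    // equals observed. On failure it refreshes observed with the current
    // value, so the loop retries with up-to-date data.
    while (!target.compare_exchange_weak(observed, observed + delta,
                                         std::memory_order_relaxed,
                                         std::memory_order_relaxed)) {
        // Another thread changed the value (or the weak CAS failed
        // spuriously); try again.
    }
    return observed;
}
```

Each failed exchange means another writer got in between the read and the write, which is exactly the case the loop exists to catch. On hardware with a native atomic add instruction, the compiler will typically use that instead of a loop like this.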
For the volatile int, no such communication is necessary. On the contrary, volatile requires that the compiler does not invent any reads: volatile was designed for single-threaded communication with hardware registers, where the mere act of reading the value may have side effects.
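What the extra cost of fetch_add buys only becomes visible once the counter is actually shared. The following is a small sketch, not part of the original benchmark: the function name count_with_relaxed_fetch_add and its parameters are invented for illustration. Several threads increment one std::atomic counter with relaxed ordering, and no increment is lost.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper (name invented for this sketch): num_threads threads
// each add 1 to a shared counter increments_per_thread times using relaxed
// fetch_add, and the final total is returned.
inline long count_with_relaxed_fetch_add(int num_threads, int increments_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Atomic read-modify-write: no other thread can slip in
                // between the read and the write of this increment.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() synchronizes with the end of each thread, so the load below
    // observes every increment.
    for (std::thread& worker : workers) {
        worker.join();
    }
    return counter.load(std::memory_order_relaxed);
}
```

The result is always num_threads * increments_per_thread. Replacing the counter with a shared volatile long and ++ would be a data race, which is undefined behavior in C++, and in practice tends to lose updates.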