int i = 0;
if(i == 10) {...} // [1]
std::atomic<int> ai{0};
if(ai == 10) {...} // [2]
if(ai.load(std::memory_order_relaxed) == 10) {...} // [3]
Is the statement [1] any faster than the statements [2] & [3] in a multithreaded environment?
Assume that ai may or may not be written by another thread while [2] and [3] are executing.
Add-on: Provided that an accurate value of the underlying integer is not a necessity, which is the fastest way to read an atomic variable?
It depends on the architecture, but in general loads are cheap; pairing one with a store that uses a strict memory ordering can be expensive, though.
On x86_64, loads and stores of up to 64-bits are atomic on their own (but read-modify-write is decidedly not).
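For example, a minimal sketch of that last point, using hypothetical counters:
#include <atomic>

int plain_counter = 0;                 // hypothetical, for illustration only
std::atomic<int> atomic_counter{0};

// A plain increment is a separate load, add, and store, so concurrent
// increments can be lost (and the data race is undefined behavior).
void bump_plain()  { ++plain_counter; }

// fetch_add performs the whole read-modify-write indivisibly; on x86_64
// it typically compiles to a single lock add / lock xadd instruction.
void bump_atomic() { atomic_counter.fetch_add(1); }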
As you have it, the default memory ordering in C++ is std::memory_order_seq_cst, which gives you sequential consistency, i.e. there is some single order in which all threads see the loads/stores occur. To accomplish this on x86 (and indeed all multi-core systems) requires a memory fence on stores to ensure that loads occurring after the store read the new value.
Reading in this case does not require a memory fence on strongly-ordered x86, but writing does. On most weakly-ordered ISAs, even seq_cst reading would require some barrier instructions, but not a full barrier. If we look at this code:
#include <atomic>
#include <stdlib.h>
int main(int argc, const char* argv[]) {
    std::atomic<int> num;
    num = 12;
    if (num == 10) {
        return 0;
    }
    return 1;
}
compiled with -O3:
0x0000000000000560 <+0>: sub $0x18,%rsp
0x0000000000000564 <+4>: mov %fs:0x28,%rax
0x000000000000056d <+13>: mov %rax,0x8(%rsp)
0x0000000000000572 <+18>: xor %eax,%eax
0x0000000000000574 <+20>: movl $0xc,0x4(%rsp)
0x000000000000057c <+28>: mfence
0x000000000000057f <+31>: mov 0x4(%rsp),%eax
0x0000000000000583 <+35>: cmp $0xa,%eax
0x0000000000000586 <+38>: setne %al
0x0000000000000589 <+41>: mov 0x8(%rsp),%rdx
0x000000000000058e <+46>: xor %fs:0x28,%rdx
0x0000000000000597 <+55>: jne 0x5a1 <main+65>
0x0000000000000599 <+57>: movzbl %al,%eax
0x000000000000059c <+60>: add $0x18,%rsp
0x00000000000005a0 <+64>: retq
We can see that the read from the atomic variable at +31 doesn't require anything special, but because we wrote to the atomic at +20, the compiler had to insert an mfence instruction afterwards, which ensures that this thread waits for its store to become visible before doing any later loads. This is expensive, stalling this core until the store buffer drains. (Out-of-order execution of later non-memory instructions is still possible on some x86 CPUs.)
If we instead use a weaker ordering (such as std::memory_order_release) on the write:
#include <atomic>
#include <stdlib.h>
int main(int argc, const char* argv[]) {
    std::atomic<int> num;
    num.store(12, std::memory_order_release);
    if (num == 10) {
        return 0;
    }
    return 1;
}
Then on x86 we don't need the fence:
0x0000000000000560 <+0>: sub $0x18,%rsp
0x0000000000000564 <+4>: mov %fs:0x28,%rax
0x000000000000056d <+13>: mov %rax,0x8(%rsp)
0x0000000000000572 <+18>: xor %eax,%eax
0x0000000000000574 <+20>: movl $0xc,0x4(%rsp)
0x000000000000057c <+28>: mov 0x4(%rsp),%eax
0x0000000000000580 <+32>: cmp $0xa,%eax
0x0000000000000583 <+35>: setne %al
0x0000000000000586 <+38>: mov 0x8(%rsp),%rdx
0x000000000000058b <+43>: xor %fs:0x28,%rdx
0x0000000000000594 <+52>: jne 0x59e <main+62>
0x0000000000000596 <+54>: movzbl %al,%eax
0x0000000000000599 <+57>: add $0x18,%rsp
0x000000000000059d <+61>: retq
Note though, if we compile this same code for AArch64:
0x0000000000400530 <+0>: stp x29, x30, [sp,#-32]!
0x0000000000400534 <+4>: adrp x0, 0x411000
0x0000000000400538 <+8>: add x0, x0, #0x30
0x000000000040053c <+12>: mov x2, #0xc
0x0000000000400540 <+16>: mov x29, sp
0x0000000000400544 <+20>: ldr x1, [x0]
0x0000000000400548 <+24>: str x1, [x29,#24]
0x000000000040054c <+28>: mov x1, #0x0
0x0000000000400550 <+32>: add x1, x29, #0x10
0x0000000000400554 <+36>: stlr x2, [x1]
0x0000000000400558 <+40>: ldar x2, [x1]
0x000000000040055c <+44>: ldr x3, [x29,#24]
0x0000000000400560 <+48>: ldr x1, [x0]
0x0000000000400564 <+52>: eor x1, x3, x1
0x0000000000400568 <+56>: cbnz x1, 0x40057c <main+76>
0x000000000040056c <+60>: cmp x2, #0xa
0x0000000000400570 <+64>: cset w0, ne
0x0000000000400574 <+68>: ldp x29, x30, [sp],#32
0x0000000000400578 <+72>: ret
When we write to the variable at +36, we use a Store-Release instruction (stlr), and loading at +40 uses a Load-Acquire (ldar). These each provide a partial memory fence (and together form a full fence).
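As a rough sketch of how that store-release/load-acquire pairing is normally used (the names here are illustrative, not taken from the code above), assuming one producer and one consumer thread:
#include <atomic>

std::atomic<bool> ready{false};  // illustrative flag
int payload = 0;                 // plain data published through the flag

void producer() {
    payload = 42;
    // Release store: typically stlr on AArch64. Everything written before
    // it is visible to a thread that acquires `ready` and sees true.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: typically ldar on AArch64. Once it observes true,
    // the earlier write to `payload` is guaranteed to be visible.
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return payload;  // sees 42
}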
You should only use atomic when you have to reason about access ordering on the variable. To answer your add-on question: read the atomic with std::memory_order_relaxed. That gives no guarantee of synchronizing with writes; only atomicity is guaranteed.
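For instance, a sketch of a typical case (the flag and function names are illustrative, not from the question):
#include <atomic>

// A stop flag where an exactly up-to-date value is not required, so a
// relaxed load is the cheapest correct way to read it.
std::atomic<bool> stop_requested{false};

void worker() {
    while (!stop_requested.load(std::memory_order_relaxed)) {
        // ... do a chunk of work ...
        // The load is atomic but imposes no ordering, so the worker may
        // run a few extra iterations after the flag is set.
    }
}

void request_stop() {
    stop_requested.store(true, std::memory_order_relaxed);
}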
The 3 cases presented have different semantics, so it may be pointless to reason about their relative performance, unless the value is never written after the threads have started.
Case 1:
int i = 0;
if(i == 10) {...} // may actually be optimized away since `i` is clearly 0 now
If i is accessed by more than one thread, and at least one of those accesses is a write, the behavior is undefined.
In the absence of synchronization, the compiler is free to assume no other thread can modify i, and may reorder/optimize accesses to it. For example, it may load i into a register once and never re-read it from memory, or it may hoist writes out of a loop and only write once at the end.
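A hypothetical sketch of the kind of optimization this permits:
// With a plain int, the compiler may load `done` once and never re-read
// it, so this loop can spin forever even after another thread writes 1
// to it (and the unsynchronized access is undefined behavior anyway).
int done = 0;

void wait_for_done() {
    while (done != 1) {
        // often compiled as a single load followed by an infinite loop
    }
}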
Case 2:
std::atomic<int> ai{0};
if(ai == 10) {...} // [2]
By default, reads and writes to an atomic use std::memory_order_seq_cst (sequentially consistent) memory ordering. This means that not only are reads/writes to ai atomic, but they also become visible to other threads in a timely manner, along with any other variable's reads/writes performed before/after them.
So reading/writing an atomic acts as a memory fence. This, however, is much slower, since (1) an SMP system must synchronize caches between processors and (2) the compiler has much less freedom to optimize code around the atomic access.
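A small sketch of the extra guarantee this buys, using the classic store-buffering pattern (variable names are illustrative):
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t1() { x.store(1); r1 = y.load(); }  // both default to seq_cst
void t2() { y.store(1); r2 = x.load(); }

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // Under seq_cst at least one thread must observe the other's store,
    // so r1 == 0 && r2 == 0 is impossible. With relaxed (or even
    // acquire/release) ordering, both could legitimately read 0.
    assert(r1 == 1 || r2 == 1);
    return 0;
}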
Case 3:
std::atomic<int> ai{0};
if(ai.load(std::memory_order_relaxed) == 10) {...} // [3]
This mode guarantees only the atomicity of ai reads/writes. So the compiler is again free to reorder accesses to it, and only has to ensure that writes become visible to other threads in a reasonable amount of time.
Its applicability is very limited, as it makes it very hard to reason about the order of events in a program. For example:
std::atomic<int> ai{0}, aj{0};
// thread 1
aj.store(1, std::memory_order_relaxed);
ai.store(10, std::memory_order_relaxed);
// thread 2
if(ai.load(std::memory_order_relaxed) == 10) {
    aj.fetch_add(1, std::memory_order_relaxed);
    // is aj 1 or 2 now??? no way to tell.
}
This mode is potentially (and often) slower than case 1 since the compiler must ensure each read/write actually goes out to cache/RAM, but is faster than case 2, since it's still possible to optimize other variables around it.
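A sketch of a case where relaxed is typically sufficient: a statistics counter that publishes no other data, so only atomicity matters (names are illustrative):
#include <atomic>

std::atomic<long> hits{0};  // illustrative counter

void record_hit() {
    hits.fetch_add(1, std::memory_order_relaxed);  // atomic RMW, no ordering needed
}

long total_hits() {
    // Every completed increment is counted; the exact instantaneous value
    // is not meaningful anyway, so a relaxed load is enough.
    return hits.load(std::memory_order_relaxed);
}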
For more details about atomics and memory ordering, see Herb Sutter's excellent atomic<> weapons talk.