 

Is there any performance difference in just reading an atomic variable compared to a normal variable?

int i = 0;
if(i == 10)  {...}  // [1]

std::atomic<int> ai{0};
if(ai == 10) {...}  // [2]
if(ai.load(std::memory_order_relaxed) == 10) {...}  // [3]

Is the statement [1] any faster than the statements [2] & [3] in a multithreaded environment?
Assume that ai may or may not be written by another thread while [2] & [3] are executing.

Add-on: Provided that an accurate value of the underlying integer is not a necessity, what is the fastest way to read an atomic variable?

asked Jan 03 '20 by iammilind


2 Answers

It depends on the architecture, but in general loads are cheap; pair them with a store under a strict memory ordering, though, and things can get expensive.

On x86_64, naturally aligned loads and stores of up to 64 bits are atomic on their own (but read-modify-write operations decidedly are not).

As you have it, the default memory ordering in C++ is std::memory_order_seq_cst, which gives you sequential consistency, i.e. there is a single order in which all threads see loads/stores occurring. Accomplishing this on x86 (and indeed on most multi-core systems) requires a memory fence after stores, to ensure that loads occurring after the store don't run before the store becomes visible.

Reading in this case does not require a memory fence on strongly-ordered x86, but writing does. On most weakly-ordered ISAs, even seq_cst reading would require some barrier instructions, but not a full barrier. If we look at this code:

#include <atomic>
#include <stdlib.h>

int main(int argc, const char* argv[]) {
    std::atomic<int> num;

    num = 12;
    if (num == 10) {
        return 0;
    }
    return 1;
}

compiled with -O3:

   0x0000000000000560 <+0>:     sub    $0x18,%rsp
   0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
   0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
   0x0000000000000572 <+18>:    xor    %eax,%eax
   0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
   0x000000000000057c <+28>:    mfence 
   0x000000000000057f <+31>:    mov    0x4(%rsp),%eax
   0x0000000000000583 <+35>:    cmp    $0xa,%eax
   0x0000000000000586 <+38>:    setne  %al
   0x0000000000000589 <+41>:    mov    0x8(%rsp),%rdx
   0x000000000000058e <+46>:    xor    %fs:0x28,%rdx
   0x0000000000000597 <+55>:    jne    0x5a1 <main+65>
   0x0000000000000599 <+57>:    movzbl %al,%eax
   0x000000000000059c <+60>:    add    $0x18,%rsp
   0x00000000000005a0 <+64>:    retq

We can see that the read from the atomic variable at +31 doesn't require anything special, but because we wrote to the atomic at +20, the compiler had to insert an mfence instruction afterwards which ensures that this thread waits for its store to become visible before doing any later loads. This is expensive, stalling this core until the store buffer drains. (Out-of-order exec of later non-memory instructions is still possible on some x86 CPUs.)

If we instead use a weaker ordering (such as std::memory_order_release) on the write:

#include <atomic>
#include <stdlib.h>

int main(int argc, const char* argv[]) {
    std::atomic<int> num;

    num.store(12, std::memory_order_release);
    if (num == 10) {
        return 0;
    }
    return 1;
}

Then on x86 we don't need the fence:

   0x0000000000000560 <+0>:     sub    $0x18,%rsp
   0x0000000000000564 <+4>:     mov    %fs:0x28,%rax
   0x000000000000056d <+13>:    mov    %rax,0x8(%rsp)
   0x0000000000000572 <+18>:    xor    %eax,%eax
   0x0000000000000574 <+20>:    movl   $0xc,0x4(%rsp)
   0x000000000000057c <+28>:    mov    0x4(%rsp),%eax
   0x0000000000000580 <+32>:    cmp    $0xa,%eax
   0x0000000000000583 <+35>:    setne  %al
   0x0000000000000586 <+38>:    mov    0x8(%rsp),%rdx
   0x000000000000058b <+43>:    xor    %fs:0x28,%rdx
   0x0000000000000594 <+52>:    jne    0x59e <main+62>
   0x0000000000000596 <+54>:    movzbl %al,%eax
   0x0000000000000599 <+57>:    add    $0x18,%rsp
   0x000000000000059d <+61>:    retq   

Note though, if we compile this same code for AArch64:

   0x0000000000400530 <+0>:     stp  x29, x30, [sp,#-32]!
   0x0000000000400534 <+4>:     adrp x0, 0x411000
   0x0000000000400538 <+8>:     add  x0, x0, #0x30
   0x000000000040053c <+12>:    mov  x2, #0xc
   0x0000000000400540 <+16>:    mov  x29, sp
   0x0000000000400544 <+20>:    ldr  x1, [x0]
   0x0000000000400548 <+24>:    str  x1, [x29,#24]
   0x000000000040054c <+28>:    mov  x1, #0x0
   0x0000000000400550 <+32>:    add  x1, x29, #0x10
   0x0000000000400554 <+36>:    stlr x2, [x1]
   0x0000000000400558 <+40>:    ldar x2, [x1]
   0x000000000040055c <+44>:    ldr  x3, [x29,#24]
   0x0000000000400560 <+48>:    ldr  x1, [x0]
   0x0000000000400564 <+52>:    eor  x1, x3, x1
   0x0000000000400568 <+56>:    cbnz x1, 0x40057c <main+76>
   0x000000000040056c <+60>:    cmp  x2, #0xa
   0x0000000000400570 <+64>:    cset w0, ne
   0x0000000000400574 <+68>:    ldp  x29, x30, [sp],#32
   0x0000000000400578 <+72>:    ret

When we write to the variable at +36, we use a Store-Release instruction (stlr), and loading at +40 uses a Load-Acquire (ldar). These each provide a partial memory fence (and together form a full fence).

You should only use an atomic when you have to reason about the ordering of accesses to the variable. To answer your add-on question: use std::memory_order_relaxed to read the atomic, which gives no guarantees about synchronizing with writes; only atomicity is guaranteed (see the sketch below).
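A minimal sketch of that relaxed read (ai and reached_ten are illustrative names, not from the question):

#include <atomic>

std::atomic<int> ai{0};

// The cheapest way to read the atomic: a relaxed load.
// On x86_64 and AArch64 this compiles to a plain load instruction.
// It only guarantees the value is read atomically (no torn reads);
// it says nothing about when a store from another thread becomes visible.
bool reached_ten() {
    return ai.load(std::memory_order_relaxed) == 10;
}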

answered Oct 06 '22 by gct


The 3 cases presented have different semantics, so it may be pointless to reason about their relative performance, unless the value is never written after the threads have started.

Case 1:

int i = 0;
if(i == 10)  {...}  // may actually be optimized away since `i` is clearly 0 now

If i is accessed by more than one thread and at least one of those accesses is a write, the behavior is undefined (a data race).

In the absence of synchronization, the compiler is free to assume no other thread can modify i, and may reorder or optimize accesses to it. For example, it may load i into a register once and never re-read it from memory, or it may hoist writes out of a loop and write only once at the end (see the sketch below).
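A sketch of the kind of code this breaks (the wait_for_flag functions are illustrative, not from the question):

#include <atomic>

int done = 0;
std::atomic<int> done_atomic{0};

// With a plain int the compiler may load `done` once and spin forever,
// because it is allowed to assume no other thread modifies it; a
// concurrent write would be a data race (undefined behavior).
void wait_for_flag_plain() {
    while (done != 1) { }
}

// With an atomic, each iteration performs a real load, so a store
// made by another thread will eventually be observed.
void wait_for_flag_atomic() {
    while (done_atomic.load(std::memory_order_relaxed) != 1) { }
}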

Case 2:

std::atomic<int> ai{0};
if(ai == 10) {...}  // [2]

By default, reads and writes to an atomic use std::memory_order_seq_cst (sequentially consistent) memory order. This means that reads/writes to ai are not only atomic, but every thread observes them in a single total order, and reads/writes to other variables before/after them cannot be reordered across them.

So reading/writing a seq_cst atomic acts as a memory fence. This, however, is much slower, since (1) an SMP system must synchronize caches between processors and (2) the compiler has much less freedom to optimize code around the atomic access.
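A small sketch of what sequential consistency buys you (x, y, r1, r2 are illustrative names): under the default seq_cst ordering, at least one of the two loads below must see the other thread's store, so both r1 and r2 ending up 0 is impossible; with relaxed ordering that outcome is allowed.

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread_1() {
    x.store(1);     // seq_cst store (default ordering)
    r1 = y.load();  // seq_cst load
}

void thread_2() {
    y.store(1);     // seq_cst store
    r2 = x.load();  // seq_cst load
}

// If thread_1 and thread_2 run concurrently and both finish,
// r1 == 0 && r2 == 0 cannot happen under seq_cst, but would be
// permitted if every operation used std::memory_order_relaxed.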

Case 3:

std::atomic<int> ai{0};
if(ai.load(std::memory_order_relaxed) == 10) {...}  // [3]

This mode guarantees only the atomicity of reads/writes to ai. The compiler is again free to reorder accesses around it, and the only visibility guarantee is that writes become visible to other threads in a reasonable amount of time.

Its applicability is very limited, as it makes it very hard to reason about the order of events in a program. For example:

std::atomic<int> ai{0}, aj{0};

// thread 1
aj.store(1, std::memory_order_relaxed);
ai.store(10, std::memory_order_relaxed);

// thread 2
if(ai.load(std::memory_order_relaxed) == 10) {
  aj.fetch_add(1, std::memory_order_relaxed);
  // is aj 1 or 2 now??? no way to tell.
}

This mode is potentially (and often) slower than case 1, since the compiler must ensure each read/write actually reaches cache/RAM, but it is faster than case 2, since accesses to other variables can still be optimized around it.
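One common case where the relaxed mode is both sufficient and fast is a simple event counter, where only the final total matters (hit_count and the helper functions are illustrative names):

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> hit_count{0};

// Many threads can call this; the increment itself is atomic,
// but no ordering with surrounding code is implied.
void record_hit() {
    hit_count.fetch_add(1, std::memory_order_relaxed);
}

// Read the total after the worker threads have been joined,
// so no further synchronization is needed.
std::uint64_t total_hits() {
    return hit_count.load(std::memory_order_relaxed);
}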

For more details about atomics and memory ordering, see Herb Sutter's excellent atomic<> weapons talk.

answered Oct 06 '22 by rustyx