Atomic load and store functions produce same assembly code as non-atomic load and store

Question

Why is the assembly output of store_idx_x86() the same as store_idx() and load_idx_x86() the same as load_idx()?

It was my understanding that __atomic_load_n() would flush the core's invalidation queue, and __atomic_store_n() would flush the core's store buffer.

Note: I compiled with gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)

Update: I understand that x86 never reorders stores with other stores or loads with other loads. So is gcc smart enough to emit sfence and lfence only when they are needed, or should using the __atomic_ builtins always result in a fence (assuming a memory order stricter than __ATOMIC_RELAXED)?

Code

#include <stdint.h>

inline void store_idx_x86(uint64_t* dest, uint64_t idx)
{
    *dest = idx;
}

inline void store_idx(uint64_t* dest, uint64_t idx)
{
    __atomic_store_n(dest, idx, __ATOMIC_RELEASE);
}

inline uint64_t load_idx_x86(uint64_t* source)
{
    return *source;
}

inline uint64_t load_idx(uint64_t* source)
{
    return __atomic_load_n(source, __ATOMIC_ACQUIRE);
}

Assembly:

    .file   "util.c"
    .text
    .globl  store_idx_x86
    .type   store_idx_x86, @function
store_idx_x86:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    %rdi, -8(%rbp)
    movq    %rsi, -16(%rbp)
    movq    -8(%rbp), %rax
    movq    -16(%rbp), %rdx
    movq    %rdx, (%rax)
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   store_idx_x86, .-store_idx_x86
    .globl  store_idx
    .type   store_idx, @function
store_idx:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    %rdi, -8(%rbp)
    movq    %rsi, -16(%rbp)
    movq    -8(%rbp), %rax
    movq    -16(%rbp), %rdx
    movq    %rdx, (%rax)
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE1:
    .size   store_idx, .-store_idx
    .globl  load_idx_x86
    .type   load_idx_x86, @function
load_idx_x86:
.LFB2:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    %rdi, -8(%rbp)
    movq    -8(%rbp), %rax
    movq    (%rax), %rax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE2:
    .size   load_idx_x86, .-load_idx_x86
    .globl  load_idx
    .type   load_idx, @function
load_idx:
.LFB3:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    %rdi, -8(%rbp)
    movq    -8(%rbp), %rax
    movq    (%rax), %rax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE3:
    .size   load_idx, .-load_idx
    .ident  "GCC: (GNU) 4.8.2 20140120 (Red Hat 4.8.2-16)"
    .section    .note.GNU-stack,"",@progbits

Myles Hathcock · Accepted Answer

Why is the assembly output of store_idx_x86() the same as store_idx() and load_idx_x86() the same as load_idx()?

On x86, assuming compiler-enforced alignment, they are the same operations. Loads and stores to aligned addresses of the native size or smaller are guaranteed to be atomic. See the Intel manual, vol. 3A, section 8.1.1:

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically: Reading or writing a quadword aligned on a 64-bit boundary [...]
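To make the alignment precondition concrete, here is a minimal sketch in C. The helper names is_aligned_8 and store_idx_checked are made up for illustration and are not from the question; the sketch only checks the "aligned on a 64-bit boundary" condition before doing the same release store as store_idx().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when p sits on a 64-bit (8-byte) boundary. */
static bool is_aligned_8(const void *p)
{
    return ((uintptr_t)p % 8u) == 0;
}

/* Hypothetical helper: assert the alignment precondition from the
 * Intel quote, then perform the same release store as store_idx(). */
static void store_idx_checked(uint64_t *dest, uint64_t idx)
{
    assert(is_aligned_8(dest));
    __atomic_store_n(dest, idx, __ATOMIC_RELEASE);
}
```

A uint64_t that the compiler lays out itself on x86-64 normally satisfies this check already; it matters mostly for pointers you compute by hand, such as offsets into a byte buffer.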

Furthermore, x86 enforces a strongly ordered memory model, meaning every store and load has implicit release and acquire semantics, respectively. That is why gcc can emit a plain mov for an __ATOMIC_RELEASE store and an __ATOMIC_ACQUIRE load.
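As a rough illustration of what release and acquire are for, here is a sketch of a one-shot handoff between two threads. The names payload, ready, producer and run_handoff are hypothetical, and I am assuming pthreads plus gcc's __atomic builtins.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t payload;   /* plain data, published through `ready` */
static uint64_t ready;     /* flag, only touched via __atomic builtins */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                   /* ordinary store */
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);  /* publish it */
    return NULL;
}

/* Starts the producer, waits for the flag, returns the payload seen. */
uint64_t run_handoff(void)
{
    pthread_t t;

    payload = 0;
    __atomic_store_n(&ready, 0, __ATOMIC_RELAXED);

    int rc = pthread_create(&t, NULL, producer, NULL);
    assert(rc == 0);

    /* The acquire load pairs with the release store above, so once the
     * flag reads 1 the write to payload is visible as well. */
    while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
        ;  /* spin */

    uint64_t seen = payload;
    pthread_join(t, NULL);
    return seen;
}
```

Build with -pthread. On x86 the release store and the acquire load should each still be a single mov. The builtins are not redundant, though: they also keep the compiler from reordering the accesses or hoisting the flag load out of the loop.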

Lastly, the fencing instructions you mention (sfence and lfence) are only required when using Intel's non-temporal SSE instructions (great reference here). The other case is a store-load fence (article here), and that one actually needs mfence or a lock-prefixed instruction.
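For contrast, here is a sketch of where a store-load fence does show up. The function names are mine. With __ATOMIC_SEQ_CST, gcc on x86-64 typically emits a mov followed by mfence, or a single xchg, depending on the version. I have not checked every version, so compare the -S output on yours.

```c
#include <stdint.h>

/* Release store: on x86-64 this is typically just a mov. */
void store_release(uint64_t *dest, uint64_t v)
{
    __atomic_store_n(dest, v, __ATOMIC_RELEASE);
}

/* Sequentially consistent store: later loads must not pass it, so the
 * compiler has to add a store-load barrier (mfence or a locked op). */
void store_seq_cst(uint64_t *dest, uint64_t v)
{
    __atomic_store_n(dest, v, __ATOMIC_SEQ_CST);
}

/* Standalone full fence, for when the barrier is not tied to one store. */
void full_fence(void)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}
```

If you swap __ATOMIC_RELEASE for __ATOMIC_SEQ_CST in store_idx() from the question, its assembly should stop matching store_idx_x86().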

Aside: I was curious about that statement in Intel's manuals, so I wrote a test program. Frustratingly, on my computer (a 2-core i3-4030U), I get this output from it:

unaligned
4265292 / 303932066 | 1.40337%
unaligned, but in same cache line
2373 / 246957659 | 0.000960893%
aligned (8 byte)
0 / 247097496 | 0%

This seems to violate what Intel says. I will investigate. In the meantime, you can clone that demo program and see what it gives you. You just need -std=c++11 ... -pthread on Linux.

Tags:

c

gcc

assembly

atomic

Bigtree
