Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The cost of atomic counters and spinlocks on x86(_64)

Tags:

Preface

I recently came across some synchronization problems, which led me to spinlocks and atomic counters. Then I was searching a bit more, how these work and found std::memory_order and memory barriers (mfence, lfence and sfence).

So now, it seems that I should use acquire/release for the spinlocks and relaxed for the counters.

Some reference

x86 MFENCE - Memory Fence
x86 LOCK - Assert LOCK# Signal

Question

What is the machine code (edit: see below) for those three operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with default (seq_cst) memory order and with acquire/release/relaxed (in that order for those three operations). What is the difference (which memory barriers where) and the cost (how many CPU cycles)?

Purpose

I was just wondering how bad my old code (not specifying memory order = seq_cst used) really is and if I should create some class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as good spinlock with acquire/release instead of mutexes on some places ...or to use something from boost library - I have avoided boost so far).

My Knowledge

So far I do understand that spinlocks protect more than itself (but some shared resource/memory as well), so, there must be something that makes some memory view coherent for multiple threads/cores (that would be those acquire/release and memory fences). Atomic counter just lives for itself and only need that atomic increment (no other memory involved and I do not really care about the value when I read it, it is informative and can be few cycles old, no problem). There is some LOCK prefix and some instructions like xchg implicitly have it. Here my knowledge ends, I don't know how the cache and buses really work and what is behind (but I know that modern CPUs can reorder instructions, execute them in parallel and use memory cache and some synchronization). Thank you for explanation.

P.S.: I have old 32bit PC now, can only see lock addl and simple xchg, nothing else - all versions look the same (except unlock), memory_order makes no difference on my old PC (except unlock, release uses move instead of xchg). Will that be true for 64bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much, release on unlock saves few cycles, that's all.)

The Code:

#include <atomic>
using namespace std;

atomic_flag spinlock;
atomic<int> counter;

void inc1() {
    counter++;
}
void inc2() {
    counter.fetch_add(1, memory_order_relaxed);
}
void lock1() {
    while(spinlock.test_and_set()) ;
}
void lock2() {
    while(spinlock.test_and_set(memory_order_acquire)) ;
}
void unlock1() {
    spinlock.clear();
}
void unlock2() {
    spinlock.clear(memory_order_release);
}

int main() {
    inc1();
    inc2();
    lock1();
    unlock1();
    lock2();
    unlock2();
}

g++ -std=c++11 -O1 -S (32bit Cygwin, shortened output)

__Z4inc1v:
__Z4inc2v:
    lock addl   $1, _counter    ; both seq_cst and relaxed
    ret
__Z5lock1v:
__Z5lock2v:
    movl    $1, %edx
L5:
    movl    %edx, %eax
    xchgb   _spinlock, %al      ; both seq_cst and acquire
    testb   %al, %al
    jne L5
    rep ret
__Z7unlock1v:
    movl    $0, %eax
    xchgb   _spinlock, %al      ; seq_cst
    ret
__Z7unlock2v:
    movb    $0, _spinlock       ; release
    ret

UPDATE for x86_64bit: (see mfence in unlock1)

_Z4inc1v:
_Z4inc2v:
    lock addl   $1, counter(%rip)   ; both seq_cst and relaxed
    ret
_Z5lock1v:
_Z5lock2v:
    movl    $1, %edx
.L5:
    movl    %edx, %eax
    xchgb   spinlock(%rip), %al     ; both seq_cst and acquire
    testb   %al, %al
    jne .L5
    ret
_Z7unlock1v:
    movb    $0, spinlock(%rip)
    mfence                          ; seq_cst
    ret
_Z7unlock2v:
    movb    $0, spinlock(%rip)      ; release
    ret