
How to have atomic integers on machines that lack stdatomic.h?

I have developed a multithreaded program that depends on the availability of atomic_int, atomic_store and atomic_load from stdatomic.h. The program is compiled with GCC.

I then tried, unsuccessfully, to compile the program on several old operating system versions that lack stdatomic.h. Unfortunately, it is a requirement that I can compile the program on old machines as well, so it is not enough to compile it on a new operating system version and run the binary on an old one.

Is there a way to emulate stdatomic.h on older machines, perhaps with some GCC-specific built-in function?

While installing a newer version of GCC on an old operating system might be the solution, the current build system has calls hardcoded to "gcc" all over it, and also the new GCC would have to be compiled from source as old operating systems don't have it in the package management system. So, ideally an answer would be something that works on old GCC versions.

asked Mar 05 '17 12:03 by juhist




1 Answer

While this is not a completely drop-in solution for all applications, I found a way that supports the required basic functionality and passes at least some rudimentary multi-threading tests:

#define _Atomic(T) struct { volatile __typeof__(T) __val; }

typedef _Atomic(int) atomic_int;

#define atomic_load(object) \
    __sync_fetch_and_add(&(object)->__val, 0)

#define atomic_store(object, desired) do { \
    __sync_synchronize(); \
    (object)->__val = (desired); \
    __sync_synchronize(); \
} while (0)

The __sync_synchronize and __sync_fetch_and_add calls are necessary, or else communication between threads fails. (I only tested removing both at once, not each one individually.)

I'm not very confident, however, that this solution works in all cases. I found it at https://gist.github.com/nhatminhle/5181506, where the author doesn't recommend it for old GCC versions.
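Since the shim is only meant for machines without stdatomic.h, it may make sense to use it only there and pick up the real header wherever it exists. A minimal sketch follows. The header name compat_atomic.h, the COMPAT_HAVE_STDATOMIC macro and the compat_roundtrip helper are my own assumptions for illustration, and the GCC 4.9 cutoff reflects the release in which GCC started shipping stdatomic.h:

```c
/* compat_atomic.h: hypothetical wrapper header (the name is an assumption). */
#if defined(__has_include)
#  if __has_include(<stdatomic.h>)
#    define COMPAT_HAVE_STDATOMIC 1
#  endif
#elif defined(__GNUC__) && \
      (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9))
#  define COMPAT_HAVE_STDATOMIC 1
#endif

#ifdef COMPAT_HAVE_STDATOMIC
#include <stdatomic.h>
#else
/* Fallback for old GCC: the __sync-based shim from this answer. */
#define _Atomic(T) struct { volatile __typeof__(T) __val; }
typedef _Atomic(int) atomic_int;
#define atomic_load(object) \
    __sync_fetch_and_add(&(object)->__val, 0)
#define atomic_store(object, desired) do { \
    __sync_synchronize(); \
    (object)->__val = (desired); \
    __sync_synchronize(); \
} while (0)
#endif

/* Hypothetical helper: compiles unchanged against either branch. */
static int compat_roundtrip(int value)
{
    atomic_int v;
    atomic_store(&v, value);
    return atomic_load(&v);
}
```

The rest of the program would include this wrapper instead of stdatomic.h, so the hardcoded "gcc" calls in the build system would not need to change.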

In theory, you could also use a mutex. However, mutexes have poorer performance than atomics.
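For comparison, a mutex-based version could look roughly like the sketch below. It assumes POSIX threads are available, and the mutex_atomic_* names are hypothetical (deliberately not the stdatomic.h names, because the object needs explicit initialization of its mutex):

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical mutex-protected integer; names are assumptions. */
typedef struct {
    pthread_mutex_t lock;
    int val;
} mutex_atomic_int;

static void mutex_atomic_init(mutex_atomic_int *a, int initial)
{
    pthread_mutex_init(&a->lock, NULL);
    a->val = initial;
}

static int mutex_atomic_load(mutex_atomic_int *a)
{
    int v;
    pthread_mutex_lock(&a->lock);   /* lock/unlock also order memory accesses */
    v = a->val;
    pthread_mutex_unlock(&a->lock);
    return v;
}

static void mutex_atomic_store(mutex_atomic_int *a, int desired)
{
    pthread_mutex_lock(&a->lock);
    a->val = desired;
    pthread_mutex_unlock(&a->lock);
}
```

Every load and store pays for a lock and an unlock here, which is where the performance penalty compared to real atomics comes from.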

Edit:

It is also possible to implement atomic_store in the following way:

#define atomic_store(object, desired) do { \
    for (;;) \
    { \
        __typeof__((object)->__val) oldval = atomic_load(object); \
        if (__sync_bool_compare_and_swap(&(object)->__val, oldval, (desired))) \
        { \
            break; \
        } \
    } \
} while (0)

However, that consistently reduced performance from 139280.5 ops/second (standard deviation 1799.6 ops/second) to 131805.6 ops/second (standard deviation 986.03 ops/second). So, the reduced performance is statistically significant.
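Another variant I have not benchmarked is a store through __sync_lock_test_and_set, which performs an atomic exchange without a retry loop. This is a different technique from the two above, so treat it as a sketch only: the GCC documentation describes this builtin as an acquire barrier rather than a full barrier, and notes that some targets only support storing the constant 1, so the sketch keeps the explicit __sync_synchronize() calls. The names atomic_store_xchg and xchg_roundtrip are hypothetical:

```c
#define _Atomic(T) struct { volatile __typeof__(T) __val; }

typedef _Atomic(int) atomic_int;

#define atomic_load(object) \
    __sync_fetch_and_add(&(object)->__val, 0)

/* Hypothetical alternative store: atomic exchange, old value discarded. */
#define atomic_store_xchg(object, desired) do { \
    __sync_synchronize(); \
    (void)__sync_lock_test_and_set(&(object)->__val, (desired)); \
    __sync_synchronize(); \
} while (0)

/* Hypothetical helper that stores a value and reads it back. */
static int xchg_roundtrip(int value)
{
    atomic_int v;
    atomic_store_xchg(&v, value);
    return atomic_load(&v);
}
```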

Edit 2:

The loop approach has the following assembly code:

.globl signal_completion
        .type   signal_completion, @function
signal_completion:
.LFB18:
        leaq    4(%rdi), %rcx
.L42:
        xorl    %eax, %eax
        lock
        xaddl   %eax, (%rcx)
        movl    $1, %edx
        movl    %eax, -4(%rsp)
        movl    -4(%rsp), %eax
        lock
        cmpxchgl        %edx, (%rcx)
        jne     .L42
        rep ; ret
.LFE18:
        .size   signal_completion, .-signal_completion
        .p2align 4,,15

Whereas the __sync_synchronize approach has the following code:

.globl signal_completion
        .type   signal_completion, @function
signal_completion:
.LFB18:
        movl    $1, 4(%rdi)
        ret
.LFE18:
        .size   signal_completion, .-signal_completion
        .p2align 4,,15

...and on a machine that has stdatomic.h it compiles to this:

        .globl  signal_completion
        .type   signal_completion, @function
signal_completion:
.LFB43:
        .cfi_startproc
        movl    $1, 4(%rdi)
        mfence
        ret
        .cfi_endproc
.LFE43:
        .size   signal_completion, .-signal_completion

So the only thing my version is really lacking is the mfence instruction. I guess it could be added with simple inline assembly, for example:

asm volatile ("mfence" ::: "memory");

...placed after the second __sync_synchronize() in the atomic_store definition.
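Put together, the store with the extra fence might look like the sketch below. The COMPAT_FULL_FENCE macro and the store_then_load helper are names I made up for illustration. The inline assembly is x86-specific, so the sketch falls back to __sync_synchronize() on other architectures, and on 32-bit x86 it assumes a CPU with SSE2, since mfence does not exist on older ones:

```c
#define _Atomic(T) struct { volatile __typeof__(T) __val; }

typedef _Atomic(int) atomic_int;

#if defined(__x86_64__) || defined(__i386__)
/* x86 only: emit mfence explicitly; "memory" also stops compiler reordering. */
#define COMPAT_FULL_FENCE() __asm__ __volatile__ ("mfence" ::: "memory")
#else
/* Other targets: rely on the builtin barrier. */
#define COMPAT_FULL_FENCE() __sync_synchronize()
#endif

#define atomic_load(object) \
    __sync_fetch_and_add(&(object)->__val, 0)

#define atomic_store(object, desired) do { \
    __sync_synchronize(); \
    (object)->__val = (desired); \
    __sync_synchronize(); \
    COMPAT_FULL_FENCE(); \
} while (0)

/* Hypothetical helper that stores a value and reads it back. */
static int store_then_load(int value)
{
    atomic_int flag;
    atomic_store(&flag, value);
    return atomic_load(&flag);
}
```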

Edit 3:

Apparently, the __sync_fetch_and_add is not optimized away, as a loop that polls a variable has this assembly output:

.L29:
        xorl    %eax, %eax
        lock
        xaddl   %eax, (%rdi)
        testl   %eax, %eax
        je      .L29

If atomic_load is instead defined as a plain read:

#define atomic_load(object) ((object)->__val)

the polling loop becomes:

.L29:
        movl    (%rdi), %eax
        testl   %eax, %eax
        je      .L29

which is equivalent to the assembly on a stdatomic.h-supporting machine:

.L38:
        movl    (%rdi), %eax
        testl   %eax, %eax
        je      .L38

Strangely enough, the __sync_fetch_and_add variant seems to run faster on my machine in my benchmark, even though it produces more complex code. Strange world, isn't it?

answered Sep 22 '22 13:09 by juhist