I have developed a multithreaded program that depends on the availability of atomic_int, atomic_store and atomic_load from stdatomic.h. The program is compiled with GCC.
Now, I tried, unsuccessfully, to compile the program on several old operating system versions that lack stdatomic.h. Unfortunately, it is a requirement that I am able to compile the program on old machines as well, so it is not enough to compile the program on a new operating system version and run the binary on an old version.
Is there a way to emulate stdatomic.h on older machines, perhaps with some GCC-specific built-in function?
While installing a newer version of GCC on an old operating system might be the solution, the current build system has calls hardcoded to "gcc" all over it, and also the new GCC would have to be compiled from source as old operating systems don't have it in the package management system. So, ideally an answer would be something that works on old GCC versions.
While this is not a completely drop-in solution for all applications, I found a way that supports the required basic functionality and passes at least some rudimentary multi-threading tests:
#define _Atomic(T) struct { volatile __typeof__(T) __val; }
typedef _Atomic(int) atomic_int;
#define atomic_load(object) \
    __sync_fetch_and_add(&(object)->__val, 0)
#define atomic_store(object, desired) do { \
    __sync_synchronize(); \
    (object)->__val = (desired); \
    __sync_synchronize(); \
} while (0)
The __sync_synchronize and __sync_fetch_and_add calls are necessary: without them, communication between threads fails. (I only tested removing both together, not each one individually.)
I'm not very confident, however, that this solution works in all cases. I found it at https://gist.github.com/nhatminhle/5181506, where the author doesn't recommend it for old GCC versions.
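To keep the rest of the code base unchanged, the choice between the real header and the emulation could be hidden behind a single compatibility header. The following is only a minimal sketch under assumptions: the file name compat_atomic.h, the macro HAVE_STDATOMIC and the helper set_and_read_flag are made up for illustration, the version check assumes that GCC ships stdatomic.h from 4.9 onwards, and the fallback covers nothing beyond atomic_int, atomic_load and atomic_store.

```c
/* compat_atomic.h (hypothetical file name): use <stdatomic.h> when present,
 * otherwise emulate the three names this program needs with __sync builtins. */
#if defined(__has_include)
#  if __has_include(<stdatomic.h>)
#    define HAVE_STDATOMIC 1
#  endif
#elif defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9))
#  define HAVE_STDATOMIC 1   /* assumption: GCC 4.9 and later ship stdatomic.h */
#endif

#ifdef HAVE_STDATOMIC
#  include <stdatomic.h>
#else
typedef struct { volatile int __val; } atomic_int;

/* Full-barrier read: an atomic add of zero returns the current value. */
#  define atomic_load(object) __sync_fetch_and_add(&(object)->__val, 0)

/* Plain store bracketed by full barriers. */
#  define atomic_store(object, desired) do { \
        __sync_synchronize();                \
        (object)->__val = (desired);         \
        __sync_synchronize();                \
    } while (0)
#endif

/* Usage: the calling code is identical on both paths. */
static atomic_int done_flag;

static int set_and_read_flag(void)
{
    atomic_store(&done_flag, 1);
    return atomic_load(&done_flag);
}
```

With a header like this, the build system can keep calling plain "gcc" everywhere, and only the include line in the sources changes.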
In theory, you could also use a mutex. However, mutexes have poorer performance than atomics.
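For comparison, a mutex-based replacement might look roughly like the sketch below. The names locked_int, LOCKED_INT_INIT, locked_load, locked_store and mark_complete are hypothetical and not part of any library; only the pthread_mutex_* calls are real POSIX API. Every access takes the lock, which is where the extra cost comes from.

```c
#include <pthread.h>

/* Hypothetical mutex-backed stand-in for atomic_int. */
typedef struct {
    pthread_mutex_t lock;
    int val;
} locked_int;

#define LOCKED_INT_INIT(v) { PTHREAD_MUTEX_INITIALIZER, (v) }

/* Read the value while holding the lock. */
static int locked_load(locked_int *object)
{
    int v;
    pthread_mutex_lock(&object->lock);
    v = object->val;
    pthread_mutex_unlock(&object->lock);
    return v;
}

/* Write the value while holding the lock. */
static void locked_store(locked_int *object, int desired)
{
    pthread_mutex_lock(&object->lock);
    object->val = desired;
    pthread_mutex_unlock(&object->lock);
}

/* Usage: set a completion flag and read it back. */
static locked_int completion = LOCKED_INT_INIT(0);

static int mark_complete(void)
{
    locked_store(&completion, 1);
    return locked_load(&completion);
}
```

Unlike the macro-based emulation, this changes the type's size and requires static or explicit initialization of the mutex, so it is not a drop-in replacement for atomic_int.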
Edit:
It is also possible to implement atomic_store in the following way:
#define atomic_store(object, desired) do { \
    for (;;) \
    { \
        __typeof__((object)->__val) oldval = atomic_load(object); \
        if (__sync_bool_compare_and_swap(&(object)->__val, oldval, (desired))) \
        { \
            break; \
        } \
    } \
} while (0)
However, that consistently reduced performance from 139280.5 ops/second (standard deviation 1799.6 ops/second) to 131805.6 ops/second (standard deviation 986.03 ops/second). So, the reduced performance is statistically significant.
Edit 2:
The loop approach has the following assembly code:
.globl signal_completion
.type signal_completion, @function
signal_completion:
.LFB18:
leaq 4(%rdi), %rcx
.L42:
xorl %eax, %eax
lock
xaddl %eax, (%rcx)
movl $1, %edx
movl %eax, -4(%rsp)
movl -4(%rsp), %eax
lock
cmpxchgl %edx, (%rcx)
jne .L42
rep ; ret
.LFE18:
.size signal_completion, .-signal_completion
.p2align 4,,15
Whereas the __sync_synchronize approach has the following code:
.globl signal_completion
.type signal_completion, @function
signal_completion:
.LFB18:
movl $1, 4(%rdi)
ret
.LFE18:
.size signal_completion, .-signal_completion
.p2align 4,,15
...and on a machine that has stdatomic.h it compiles to this:
.globl signal_completion
.type signal_completion, @function
signal_completion:
.LFB43:
.cfi_startproc
movl $1, 4(%rdi)
mfence
ret
.cfi_endproc
.LFE43:
.size signal_completion, .-signal_completion
So the only thing really missing is the mfence. I guess it could be added with simple inline assembly, for example:
asm volatile ("mfence" ::: "memory");
...placed after the second __sync_synchronize() in the atomic_store definition.
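Put together, the store with the extra fence might look like the sketch below. It is an assumption-laden illustration, not a verified fix: FULL_FENCE and struct job are hypothetical names, the mfence path is enabled only on x86-64 (32-bit x86 would need an SSE2 check first), and other architectures simply rely on the builtin.

```c
typedef struct { volatile int __val; } atomic_int;

#if defined(__x86_64__)
/* mfence exists on every x86-64 CPU; the "memory" clobber also stops the
 * compiler from moving memory accesses across this point. */
#  define FULL_FENCE() __asm__ __volatile__ ("mfence" ::: "memory")
#else
/* Elsewhere, rely on the builtin alone. */
#  define FULL_FENCE() __sync_synchronize()
#endif

#define atomic_store(object, desired) do { \
        __sync_synchronize();              \
        (object)->__val = (desired);       \
        __sync_synchronize();              \
        FULL_FENCE();                      \
    } while (0)

#define atomic_load(object) __sync_fetch_and_add(&(object)->__val, 0)

/* Hypothetical structure with the flag at offset 4, as in the listings. */
struct job { int id; atomic_int completed; };

static void signal_completion(struct job *j)
{
    atomic_store(&j->completed, 1);
}
```

On a compiler where __sync_synchronize() already emits a fence, this produces a redundant second one, so the extra macro is only worth keeping for the old GCC versions that emit nothing.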
Edit 3:
Apparently, the __sync_fetch_and_add is not optimized away, as a loop that polls a variable has this assembly output:
.L29:
xorl %eax, %eax
lock
xaddl %eax, (%rdi)
testl %eax, %eax
je .L29
With the following definition instead:
#define atomic_load(object) ((object)->__val)
the polling loop compiles to:
.L29:
movl (%rdi), %eax
testl %eax, %eax
je .L29
which is equivalent to the assembly on a stdatomic.h-supporting machine:
.L38:
movl (%rdi), %eax
testl %eax, %eax
je .L38
Strangely enough, the __sync_fetch_and_add variant seems to run faster on my machine and in my benchmark, even though its code is more complex. Strange world, isn't it?
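One caveat with the plain-read atomic_load: volatile only orders volatile accesses against each other, so the compiler may still move ordinary memory accesses across the read. A possible variant, sketched here under the assumption that the target is x86 (where the hardware already handles a plain load the way the stdatomic.h listing shows), adds an empty asm statement as a compiler-only barrier. The helper wait_for_flag is a hypothetical name, and the statement expression is a GCC extension.

```c
typedef struct { volatile int __val; } atomic_int;

#if defined(__x86_64__) || defined(__i386__)
/* x86 only: plain load plus a compiler barrier. The empty asm emits no
 * instruction; it only stops the compiler from moving later memory
 * accesses above the load. "+ 0" drops the volatile qualifier from the
 * temporary so it can stay in a register. */
#  define atomic_load(object) __extension__ ({                  \
        __typeof__((object)->__val + 0) v_ = (object)->__val;   \
        __asm__ __volatile__ ("" ::: "memory");                 \
        v_;                                                     \
    })
#else
/* Other targets: keep the full-barrier read. */
#  define atomic_load(object) __sync_fetch_and_add(&(object)->__val, 0)
#endif

/* Hypothetical polling helper: spins until the flag becomes nonzero. */
static int wait_for_flag(atomic_int *flag)
{
    int v;
    while ((v = atomic_load(flag)) == 0) {
        /* spin */
    }
    return v;
}
```

On x86 this should still compile the loop to a plain movl/testl/je sequence, since the barrier generates no code of its own.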