Consider the following condensed code:
/* Compile: gcc -pthread -m32 -ansi x.c */
#include <stdio.h>
#include <inttypes.h>
#include <pthread.h>
static volatile uint64_t v = 0;
void *func (void *x) {
    __sync_add_and_fetch (&v, 1);
    return x;
}

int main (void) {
    pthread_t t;
    pthread_create (&t, NULL, func, NULL);
    pthread_join (t, NULL);
    printf ("v = %"PRIu64"\n", v);
    return 0;
}
I have a uint64_t variable that I want to increment atomically, because the variable is a counter in a multi-threaded program. To achieve the atomicity I use GCC's atomic builtins.

If I compile for an amd64 system (-m64), the produced assembly code is easy to understand. By using a lock addq, the processor guarantees that the increment is atomic:
400660: f0 48 83 05 d7 09 20 00 01    lock addq $0x1,0x2009d7(%rip)
But the same C code produces very complicated assembly on an ia32 system (-m32):
804855a: a1 28 a0 04 08 mov 0x804a028,%eax
804855f: 8b 15 2c a0 04 08 mov 0x804a02c,%edx
8048565: 89 c1 mov %eax,%ecx
8048567: 89 d3 mov %edx,%ebx
8048569: 83 c1 01 add $0x1,%ecx
804856c: 83 d3 00 adc $0x0,%ebx
804856f: 89 ce mov %ecx,%esi
8048571: 89 d9 mov %ebx,%ecx
8048573: 89 f3 mov %esi,%ebx
8048575: f0 0f c7 0d 28 a0 04 08    lock cmpxchg8b 0x804a028
804857d: 75 e6 jne 8048565 <func+0x15>
Here is what I don't understand: lock cmpxchg8b does guarantee that the changed variable is only written if the expected value still resides in the target address, and the compare-and-swap is guaranteed to happen atomically. But the initial read of the two halves (the plain mov pair) is not. Probably it does not matter if there was a "dirty read", but could someone please outline a short proof that there is no problem?

Further: why does the generated code jump back to 0x8048565 and not 0x804855a? I am positive that this is only correct if other writers also only increment the variable. Is this an implied requirement for the __sync_add_and_fetch function?
The initial read with 2 separate mov instructions is not atomic, but it's not in the loop. @interjay's answer explains why this is fine.

Fun fact: the read done by cmpxchg8b would be atomic even without a lock prefix. (But this code does use a lock prefix to make the entire RMW operation atomic, rather than a separate atomic load and atomic store.)
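That same property is sometimes used to build a plain atomic 64-bit load on i386. A minimal sketch, assuming GCC's legacy __sync builtins; the helper name atomic_load_u64 is ours, not a library function:

#include <stdint.h>

/* Hypothetical helper: atomically read a 64-bit value on i386.  The CAS
   stores only if *p happens to equal 0, and then it stores 0 again, so
   the "write" is harmless; either way the builtin returns the current
   contents of *p, read atomically by lock cmpxchg8b. */
static uint64_t atomic_load_u64 (volatile uint64_t *p)
{
    return __sync_val_compare_and_swap (p, 0, 0);
}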
It's guaranteed to be atomic because the variable is correctly aligned (and fits within one cache line) and because Intel specified it that way. See the Intel Architecture manual, Vol. 1, 4.4.1:
A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access.
Vol 3A 8.1.1:
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Thus, being aligned, it can be read in one bus cycle, and since it fits within a single cache line, cmpxchg8b's read is atomic.
If the data had been misaligned, the lock prefix would still make it atomic, but the performance cost would be very high, because a simple cache lock (delaying responses to MESI Invalidate requests for that one cache line) would no longer be sufficient.
The code jumps back to 0x8048565 (just past the two mov loads; the retry re-runs the register copies and the add/adc) because v has already been loaded: there is no need to load it again, as cmpxchg8b sets EDX:EAX to the current value in the destination when it fails.

The CMPXCHG8B description from the Intel ISA manual, Vol. 2A:

Compare EDX:EAX with m64. If equal, set ZF and load ECX:EBX into m64. Else, clear ZF and load m64 into EDX:EAX.
Thus the code needs only to increment the newly returned value and try again. If we look at this in C code it becomes easier:
value = dest;  // initial read; non-atomic but usually won't tear
while (!CAS8B (&dest, value, value + 1))
{
    value = dest;  // atomic: this reload is part of lock cmpxchg8b
}
The value = dest in the loop body actually comes from the same read that cmpxchg8b used for the compare part; there isn't a separate reload inside the loop.
In fact, C11 atomic_compare_exchange_weak / atomic_compare_exchange_strong have this behaviour built in: they update the "expected" operand. So does GCC's modern builtin __atomic_compare_exchange_n (type *ptr, type *expected, type desired, bool weak, int success_memorder, int failure_memorder), which takes the expected value by reference.

With GCC's older, obsolete __sync builtins, __sync_val_compare_and_swap returns the old value (instead of the boolean swapped / didn't-swap result of __sync_bool_compare_and_swap).
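For illustration, the whole increment could be written with the modern builtin like this. A minimal sketch: the wrapper name add_fetch_u64 is ours, and in real code __atomic_add_fetch would do the same job in one call:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical wrapper: 64-bit atomic increment via a CAS-retry loop.
   On failure, __atomic_compare_exchange_n writes the value it actually
   found in *p back into `expected`, so no explicit reload is needed. */
static uint64_t add_fetch_u64 (uint64_t *p)
{
    uint64_t expected = __atomic_load_n (p, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n (p, &expected, expected + 1,
                                         true,              /* weak CAS: fine inside a loop */
                                         __ATOMIC_SEQ_CST,  /* success ordering */
                                         __ATOMIC_SEQ_CST)) /* failure ordering */
        ;  /* `expected` now holds the fresh value of *p; retry */
    return expected + 1;
}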
The reading of the variable at 0x804855a and 0x804855f does not need to be atomic. Using the compare-and-swap instruction to increment looks like this in pseudocode:

oldValue = *dest;  // non-atomic: tearing between the halves is unlikely but possible
do {
    newValue = oldValue + 1;
} while (!compare_and_swap (dest, &oldValue, newValue));
Since the compare-and-swap checks that *dest == oldValue before swapping, it acts as a safeguard: if the value in oldValue is incorrect, the loop is simply retried, so a non-atomic read that produced a stale or torn value causes no harm. For example, if another thread increments v from 0x00000000FFFFFFFF to 0x0000000100000000 between the two movs, the torn value 0x00000001FFFFFFFF in EDX:EAX matches neither the old nor the new contents of *dest, so cmpxchg8b fails, refreshes EDX:EAX atomically, and the loop retries.
The 64-bit access to *dest done by lock cmpxchg8b is atomic (as part of an atomic RMW of *dest), so any tearing in loading the two halves separately will be caught here. The same goes for a write from another core that lands after the initial read but before the lock cmpxchg8b: this is possible even with single-register-width cmpxchg retry loops, e.g. those used to implement an atomic fetch_mul, an atomic float, or other RMW operations that x86's lock prefix doesn't let us do directly.
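A minimal C11 sketch of that kind of retry loop, here for a fetch-multiply that the lock prefix can't express directly (the name fetch_mul_u64 is ours):

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical example: atomic fetch-multiply built from a CAS-retry
   loop.  On failure, atomic_compare_exchange_weak stores the value it
   found in *p into `old`, just as a failed cmpxchg8b updates EDX:EAX. */
static uint64_t fetch_mul_u64 (_Atomic uint64_t *p, uint64_t factor)
{
    uint64_t old = atomic_load_explicit (p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak (p, &old, old * factor))
        ;  /* `old` was refreshed atomically; recompute and retry */
    return old;  /* value before the multiply, like the fetch_* family */
}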
Your second question was why the line oldValue = *dest is not inside the loop. This is because the compare_and_swap function will always replace the value of oldValue with the actual value of *dest; it essentially performs the line oldValue = *dest for you, so there's no point in doing it again. In the case of the cmpxchg8b instruction, it puts the contents of the memory operand in edx:eax when the comparison fails.
The pseudocode for compare_and_swap is:
bool compare_and_swap (int *dest, int *oldVal, int newVal)
{
    do atomically {
        if ( *oldVal == *dest ) {
            *dest = newVal;
            return true;
        } else {
            *oldVal = *dest;
            return false;
        }
    }
}
By the way, in your code you need to ensure that v is aligned to 64 bits; otherwise it could be split between two cache lines, and the lock cmpxchg8b would need a far more expensive bus-wide lock instead of a simple cache lock to stay atomic (as noted above). You can use GCC's __attribute__((aligned(8))) for this.
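A minimal sketch of the declaration with that attribute applied (GCC usually gives a standalone uint64_t 8-byte alignment on x86 anyway, so this mostly makes the requirement explicit):

/* Force 8-byte alignment so the counter can never straddle a
   cache-line boundary: */
static volatile uint64_t v __attribute__ ((aligned (8))) = 0;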