look at this code
extern "C" long _InterlockedCompareExchange(long volatile * _Destination, long _Exchange, long _Comparand);
#define MAGIC 1
// Unlike InterlockedIncrement this function not increment from 0 to 1, but return FALSE
bool TryLock(long* pLock)
{
long Value = *pLock, NewValue;
for ( ; Value; Value = NewValue)
{
NewValue = _InterlockedCompareExchange(pLock, Value + 1, Value);
if (
#if MAGIC
NewValue == Value
#else
Value == NewValue
#endif
) return true;
}
return false;
}
are if set #define MAGIC 0
something changed ? by idea must not be. but if using CL.EXE
64bit compiler if we change NewValue == Value
to Value == NewValue
(simply long values) - generated code serious changed!
I try this with two version of CL
- latest 19.00.24210.0
and with 14.00.50727.762
(more than 10 years old - December 2006) I got absolute equal code from both in all tests. compile with flags cl /c /FA /O1
- so /O1
optimization (same results with /Oxs
)
with MAGIC 1
(NewValue == Value
)
TryLock PROC
mov eax, [pLock]
jmp @@
@@loop:
lea edx, [rax+1]
lock cmpxchg [pLock], edx
je @@exit
@@:
test eax, eax
jne @@loop
ret
@@exit:
mov al, 1
ret
TryLock ENDP
but with MAGIC 0
(Value == NewValue
)
TryLock PROC
mov r8d, [pLock]
test r8d, r8d
je @@0
@@loop:
lea edx, [r8+1]
mov eax, r8d
lock cmpxchg [pLock], edx
cmp r8d, eax ; !!!!!!!!
je @@exit
test eax, eax
mov r8d, eax
jne @@loop
@@0:
xor al, al
ret
@@exit:
mov al, 1
ret
TryLock ENDP
code become large, but main notable difference in instruction
cmp Value, NewValue
after lock cmpxchg
in second variant. really lock cmpxchg [p], NewValue
yourself set or reset ZF
flag and additional cmp Value, NewValue
become excess. we can omit it if we write in assembly, but on c/c++
we have no way use ZF
for condition branch.
no statements like ifzf { /* if ZF == 1 */ } else { /* if ZF == 0 */ }
as result we need write if (NewValue == Value) {} else {}
and as result must be cmp NewValue, Value
in generated assembly. but how i discovered for CL
x64
(but not for x86
!) already 10+ years (think all versions) do next
this code
NewValue = _InterlockedCompareExchange(p, fn(OldValue), OldValue);
if (OldValue == NewValue) ...
converted to
mov eax, OldValue
lock cmpxchg [p], fn(OldValue)
mov NewValue, eax
cmp OldValue, eax ; !!!!
jne @@
....
but this code
NewValue = _InterlockedCompareExchange(p, fn(OldValue), OldValue);
if (NewValue == OldValue) ...
converted to
mov eax, OldValue
lock cmpxchg [p], fn(OldValue)
mov NewValue, eax
jne @@
...
so CL
understand cmpxchg
semantic and can do optimization, but only in some case.
i test this feature in several test functions and everywhere got the same result for both (very old and new CL
)
extern "C" long _InterlockedCompareExchange(long volatile * _Destination, long _Exchange, long _Comparand);
typedef long (*FN)(long* pLock, long Value);
#define MAGIC 1
void TestZF1(long* pLock)
{
long Value = *pLock, NewValue;
do
{
Value++;
NewValue = _InterlockedCompareExchange(pLock, Value ^ 1, Value);
} while (
#if MAGIC
NewValue != Value
#else
Value != NewValue
#endif
);
}
long TestZF2(long* pLock, FN fn1, FN fn2)
{
long Value = *pLock, NewValue;
NewValue = _InterlockedCompareExchange(pLock, Value ^ 1, Value);
return (
#if MAGIC
NewValue == Value
#else
Value == NewValue
#endif
? fn1 : fn2) (pLock, NewValue);
}
and generated assembly:
TestZF1 PROC
mov r8d, DWORD PTR [rcx]
@@loop:
add r8d, 1
mov edx, r8d
mov eax, r8d
xor edx, 1
lock cmpxchg [rcx], edx
IF !MAGIC
cmp r8d,eax ; ! in TestZF1 different exactly in this instruction
ENDIF
jne @@loop
ret 0
TestZF1 ENDP
IF MAGIC
TestZF2 PROC
mov r9d, [rcx]
mov eax, [rcx]
xor r9d, 1
lock cmpxchg [rcx], r9d
cmove r8, rdx
mov edx, eax
jmp r8
TestZF2 ENDP
ELSE
TestZF2 PROC
mov r10d, [rcx]
mov r9d, r10d
xor r9d, 1
mov eax, r10d
lock cmpxchg [rcx], r9d
cmp r10d, eax ; !!!!!!!!
cmove r8, rdx
mov edx, eax
jmp r8
TestZF2 ENDP
ENDIF
several questions:
CL
x64
optimize case if (NewValue == Value)
but not
optimize if (Value == NewValue)
?CL
x86
not do this optimization ? how minimum in all my tests
cmp Value,NewValue
instruction existc/c++
,without assembler, for implement
this on x86 with CL
?c/c++
compilers have this kind of
optimization for _InterlockedCompareExchange[Pointer]
?
- why CL x64 optimize case if (NewValue == Value) but not optimize if (Value == NewValue) ?
- this is consciously, specially designed, or it was suddenly and unknown ?
I'm convinced it is a bug, so I've reported it.
If they respond, we will know if this is a bug or not.
- why CL x86 not do this optimization ? how minimum in all my tests cmp Value,NewValue instruction exist
x86 performance may be not optimized to the same level as x86-64, as it is of secondary importance. Though may be reported as another missed optimization bug.
- are possible write code on c/c++ ,without assembler, for implement this on x86 with CL ?
Apparently not. But clang-cl
that ships with Visual Studio 2019 and supposed to emulate CL very closely seem to do better. It is also affected by MAGIC
, but when it is enabled, it produces better code on x86.
- interesting - are another c/c++ compilers have this kind of optimization for _InterlockedCompareExchange[Pointer] ?
Other compilers have separate __sync_bool_compare_and_swap
and __sync_val_compare_and_swap
, and they implement this optimization for bool
version https://godbolt.org/z/j97aEG5GY
Note that _InterlockedCompareExchange
as well as __sync_bool_compare_and_swap
are non-standard, and there are C standard and C++ standard alternatives.
The corresponding standard function returns both boolean directly, and observed value indirectly:
atomic_compare_exchange_strong
, returns _Bool
, and observed value is returned by pointer where you pass expectedstd::atomic<T>::compare_exchange_strong
, returns bool
, observed value is returned by reference instead of expectedThese standard alternatives are likely to have _InterlockedCompareExchange
or __sync_bool_compare_and_swap
under the hood though.
Regarding optimizations:
<stdatomic.h>
C header yetIf you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With