Parallel programming on x86 can be a hard job, especially on a multi-core CPU. Let's say we have a multi-core x86 CPU and several different combinations of communication between threads.
So which model is better (more efficient) for locking a shared memory region, Test&Set or Test&Test&Set, and when should each one be used?
Here are two simple test procedures (with no time limit on the spin), written in x86 assembler in the Delphi IDE:
procedure TestAndSet(const oldValue, newValue: cardinal; var destination);
asm
  //eax = oldValue
  //edx = newValue
  //ecx = destination = 32-bit pointer to the lock variable, 4-byte aligned
@RepeatSpinLoop:
  push eax                           //Save oldValue (the comparand); cmpxchg overwrites eax on failure
  pause                              //CPU spin-loop hint
  lock cmpxchg dword ptr [ecx], edx
  pop eax                            //Restore eax as oldValue (pop leaves the flags untouched)
  jnz @RepeatSpinLoop                //Repeat if cmpxchg wasn't successful
end;
procedure TestAndTestAndSet(const oldValue, newValue: cardinal; var destination);
asm
  //eax = oldValue
  //edx = newValue
  //ecx = destination = 32-bit pointer to the lock variable, 4-byte aligned
@RepeatSpinLoop:
  push eax                           //Save oldValue (the comparand); cmpxchg overwrites eax on failure
@SpinLoop:
  pause                              //CPU spin-loop hint
  cmp dword ptr [ecx], eax           //Test before test&set (plain read, no lock prefix)
  jnz @SpinLoop
  lock cmpxchg dword ptr [ecx], edx
  pop eax                            //Restore eax as oldValue (pop leaves the flags untouched)
  jnz @RepeatSpinLoop                //Repeat if cmpxchg wasn't successful
end;
EDIT:
Intel's documentation mentions two approaches: Test&Set and Test&Test&Set. I want to establish in which cases each approach is better, i.e. when to use which one. Check: Intel
Surely the first (TestAndSet) is better, because the second does not achieve much by repeating the test with cmp and jnz in between. While you are doing this, the destination value may change anyway, as it is not locked.
TTAS (#2) is good practice. "Lurking" and waiting for the "opportunity" before doing CAS is common practice in both Java and .NET concurrent classes. With that said, cmpxchg received quite a lot of optimizations in the last few years, so it might be possible that you'd get nearly identical results on the latest crop of processors.
What you should try in both cases, however, is to employ some exponential backoff when you spin.
Update
@GJ: You should find some more up-to-date documentation on Intel's site. Note the paragraph about not locking the bus since 486 and the comparison chart of xchg and cmpxchg that shows that they are practically identical.
Spinning on a plain read rather than on a locked instruction is still a good idea, because it avoids some contention on acquiring the cache line in exclusive mode. (So TTAS.)
However, this will provide a useful gain only if you also implement, e.g., an exponential back-off, possibly even yielding the CPU after a while.
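To make the back-off idea concrete, here is a minimal sketch in C++ (used instead of Delphi assembler only for brevity). It spins on a plain read, doubles the pause count after each failed look, and yields to the scheduler once the delay hits a cap. The names BackoffSpinLock, cpu_relax and contended_increment, and the delay limits 1 and 1024, are made up for illustration and are not tuned values or part of any library.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: issue the x86 PAUSE hint where the compiler
// supports it, otherwise fall back to yielding the thread.
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
inline void cpu_relax() { __builtin_ia32_pause(); }
#else
inline void cpu_relax() { std::this_thread::yield(); }
#endif

// Illustrative TTAS spinlock with capped exponential back-off.
class BackoffSpinLock {
public:
    void lock() {
        std::uint32_t delay = kMinDelay;
        for (;;) {
            // "Test": spin on a plain load so the cache line can stay shared.
            while (state_.load(std::memory_order_relaxed) != kFree) {
                backoff(delay);
            }
            // "Test and set": a single locked compare-and-swap attempt.
            std::uint32_t expected = kFree;
            if (state_.compare_exchange_strong(expected, kHeld,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed)) {
                return;
            }
            backoff(delay);  // lost the race, wait before looking again
        }
    }

    void unlock() { state_.store(kFree, std::memory_order_release); }

private:
    static constexpr std::uint32_t kFree = 0;
    static constexpr std::uint32_t kHeld = 1;
    static constexpr std::uint32_t kMinDelay = 1;     // assumed starting delay
    static constexpr std::uint32_t kMaxDelay = 1024;  // assumed cap

    // Pause `delay` times and double the delay; once the cap is reached,
    // give up the time slice instead of spinning further.
    static void backoff(std::uint32_t& delay) {
        if (delay >= kMaxDelay) {
            std::this_thread::yield();
            return;
        }
        for (std::uint32_t i = 0; i < delay; ++i) cpu_relax();
        delay *= 2;
    }

    std::atomic<std::uint32_t> state_{kFree};
};

// Small driver: `threads` workers each increment a shared counter
// `iterations` times under the lock; returns the final counter value.
std::uint64_t contended_increment(unsigned threads, std::uint64_t iterations) {
    assert(threads > 0);
    BackoffSpinLock lock;
    std::uint64_t counter = 0;  // protected by `lock`
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

If the lock is correct, contended_increment(n, k) returns n * k. Whether the back-off actually pays off depends on the machine and on how long the lock is held, so the limits are worth measuring rather than guessing.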
The differences between TTAS and TAS, with or without backoff, would be smaller if you are using a single modern multi-core CPU with an L3 cache shared between the cores, and would become more visible on a multi-socket machine (e.g. a server) or on a multi-core CPU that has no shared cache between the cores. They would also vary with the amount of contention (i.e. under light load the difference between TTAS and TAS would be smaller).
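Since the outcome depends so much on the hardware and on the load, the simplest way to find out is to measure. Below is a rough timing harness, again in C++ rather than Delphi assembler. TasLock and TtasLock follow the two procedures in the question (cmpxchg in a loop, with and without a plain read first). The names TasLock, TtasLock, RunResult, run_contended and cpu_relax are hypothetical, and the numbers it produces are only meaningful for the machine it runs on.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: x86 PAUSE hint where available, else yield.
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
inline void cpu_relax() { __builtin_ia32_pause(); }
#else
inline void cpu_relax() { std::this_thread::yield(); }
#endif

// Test&Set: every attempt is a locked compare-and-swap.
struct TasLock {
    std::atomic<std::uint32_t> state{0};
    void lock() {
        std::uint32_t expected = 0;
        while (!state.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
            expected = 0;  // a failed CAS overwrote it with the current value
            cpu_relax();
        }
    }
    void unlock() { state.store(0, std::memory_order_release); }
};

// Test&Test&Set: spin on a plain load, CAS only when the lock looks free.
struct TtasLock {
    std::atomic<std::uint32_t> state{0};
    void lock() {
        for (;;) {
            while (state.load(std::memory_order_relaxed) != 0) cpu_relax();
            std::uint32_t expected = 0;
            if (state.compare_exchange_strong(expected, 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
                return;
            }
        }
    }
    void unlock() { state.store(0, std::memory_order_release); }
};

struct RunResult {
    std::uint64_t counter;  // final value of the shared counter
    double milliseconds;    // wall-clock time for the whole run
};

// Run `threads` workers, each doing `iterations` lock/increment/unlock
// rounds on one shared counter, and time the run.
template <class Lock>
RunResult run_contended(unsigned threads, std::uint64_t iterations) {
    assert(threads > 0);
    Lock lock;
    std::uint64_t counter = 0;  // protected by `lock`
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return {counter, std::chrono::duration<double, std::milli>(elapsed).count()};
}
```

Calling run_contended<TasLock>(n, k) and run_contended<TtasLock>(n, k) for n = 1, 2, 4, ... up to the number of cores, and comparing the milliseconds fields, gives a first impression of how the gap changes with contention. A real critical section does more work than one increment, so treat the result as a worst case for contention rather than a prediction.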