I just read an MSDN article, "Synchronization and Multiprocessor Issues", that addresses memory cache consistency issues on multiprocessor machines. This was really eye-opening to me, because I would not have thought there could be a race condition in the example they provide. The article explains that writes to memory might not actually occur (from the perspective of the other CPU) in the order written in my code. This is a new concept to me!
This article provides 2 solutions:
The article also mentions that "The following synchronization functions use the appropriate barriers to ensure memory ordering: •Functions that enter or leave critical sections".
This is the part I don't understand. Does this mean that any writes to memory that are confined to code protected by critical sections are immune from cache consistency and memory ordering issues? I have nothing against the Interlocked*() functions, but another tool in my tool belt would be good to have!
This MSDN article is just the first step of multi-thread application development: in short, it means "protect your shared variables with locks (aka critical sections), because you are not sure that the data you read/write is the same for all threads".
The per-core CPU cache is just one of the possible issues that can lead to reading wrong values. Another issue that may lead to a race condition is two threads writing to a resource at the same time: it's impossible to know which value will be stored afterward.
Since code expects the data to be coherent, some multi-thread programs may behave wrongly. With multi-threading, you are not sure that the code you write, via individual instructions, is executed as expected, when it deals with shared variables.
The InterlockedExchange/InterlockedIncrement functions are low-level asm opcodes with a LOCK prefix (or locked by design, like the XCHG EDX,[EAX] opcode), which will indeed force the cache coherency for all CPU cores, and therefore make the asm opcode execution thread-safe.
For instance, here is how a string reference count is implemented when you assign a string value (see _LStrAsg in System.pas - this is from our optimized version of the RTL for Delphi 7/2002 - since Delphi original code is copyrighted):

    MOV ECX,[EDX-skew].StrRec.refCnt
    INC ECX { thread-unsafe increment ECX = reference count }
    JG @@1 { ECX=-1 -> literal string -> jump not taken }
    .....
    @@1: LOCK INC [EDX-skew].StrRec.refCnt { ATOMIC increment of reference count }
    MOV ECX,[EAX]
    ...
There is a difference between the first INC ECX and LOCK INC [EDX-skew].StrRec.refCnt: not only does the first increment ECX rather than the reference count variable, but it is also not thread-safe, whereas the second is prefixed by LOCK and is therefore thread-safe.
By the way, this LOCK prefix is one of the problems of multi-thread scaling in the RTL - it's better with newer CPUs, but still not perfect.
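The role LOCK INC plays for the reference count can be sketched in portable C++. This is only an illustration and not the Delphi RTL code: it assumes std::atomic<int>::fetch_add as a stand-in for the LOCK-prefixed instruction, and the names RefCounted, add_ref and hammer are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for the header of a reference-counted string.
struct RefCounted {
    std::atomic<int> refCnt{1};
};

// Atomic increment: the read-modify-write is one indivisible step,
// so concurrent callers can never lose an update.
inline void add_ref(RefCounted& s) {
    s.refCnt.fetch_add(1, std::memory_order_relaxed);
}

// Start `threads` threads that each take `perThread` references,
// wait for all of them, and return the final count.
inline int hammer(RefCounted& s, int threads, int perThread) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&s, perThread] {
            for (int i = 0; i < perThread; ++i) add_ref(s);
        });
    }
    for (auto& th : pool) th.join();
    return s.refCnt.load();
}
```

Calling hammer on a fresh RefCounted with 4 threads and 10000 increments each should always yield 40001. With a plain int and an unsynchronized increment the result would be unpredictable, which is the difference the two instructions illustrate.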
So using critical sections is the easiest way of making code thread-safe:

    var GlobalVariable: string;
        GlobalSection: TRTLCriticalSection;

    procedure TThreadOne.Execute;
    var LocalVariable: string;
    begin
      ...
      EnterCriticalSection(GlobalSection);
      LocalVariable := GlobalVariable+'a'; { modify GlobalVariable }
      GlobalVariable := LocalVariable;
      LeaveCriticalSection(GlobalSection);
      ....
    end;

    procedure TThreadTwo.Execute;
    var LocalVariable: string;
    begin
      ...
      EnterCriticalSection(GlobalSection);
      LocalVariable := GlobalVariable; { thread-safe read GlobalVariable }
      LeaveCriticalSection(GlobalSection);
      ....
    end;
Using a local variable makes the critical section shorter, so your application will scale better and make use of the full power of your CPU cores. Between EnterCriticalSection and LeaveCriticalSection, only one thread will be running: other threads will wait in the EnterCriticalSection call... So the shorter the critical section is, the faster your application is. Some wrongly designed multi-threaded applications can actually be slower than mono-threaded apps!
And do not forget that if your code inside the critical section may raise an exception, you should always write an explicit try ... finally LeaveCriticalSection() end; block to protect the lock release, and prevent any deadlock of your application.
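For readers more familiar with C++, the same two ideas (keep the locked region short by copying to a local, and release the lock even when an exception is raised) can be sketched as follows. This is an illustration under assumptions: std::mutex stands in for the critical section, std::lock_guard plays the role of try ... finally, and the names g_lock, g_shared, append_shared and read_shared are hypothetical.

```cpp
#include <mutex>
#include <stdexcept>
#include <string>

std::mutex  g_lock;    // plays the role of GlobalSection
std::string g_shared;  // plays the role of GlobalVariable

// Modify the shared string under the lock. The guard's destructor
// unlocks the mutex on every exit path, including a thrown exception.
void append_shared(const std::string& suffix, bool fail = false) {
    std::lock_guard<std::mutex> guard(g_lock);
    if (fail) throw std::runtime_error("simulated failure inside the lock");
    g_shared += suffix;
}

// Take a private copy under the lock; the caller then works on the
// copy without holding the lock, which keeps the locked region short.
std::string read_shared() {
    std::lock_guard<std::mutex> guard(g_lock);
    return g_shared;
}
```

If append_shared throws, a later call can still acquire the lock, which is exactly what the explicit try ... finally guarantees in the Delphi version.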
Delphi is perfectly thread-safe if you protect your shared data with a lock, i.e. a critical section. Be aware that even reference-counted variables (like strings) should be protected, even if there is a LOCK inside their RTL functions: this LOCK is there to ensure correct reference counting and avoid memory leaks, but it does not make access to the variable itself thread-safe. To make it as fast as possible, see this SO question.
The purpose of InterlockedExchange and InterlockedCompareExchange is to change a shared pointer variable value. You can see it as a "light" version of the critical section to access a pointer value.
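As a rough illustration of what exchanging and compare-exchanging a shared pointer looks like, here is a sketch using std::atomic from standard C++ rather than the Win32 functions themselves. The Config type and the names g_current, replace and try_replace are hypothetical.

```cpp
#include <atomic>

struct Config { int version; };  // hypothetical payload type

std::atomic<Config*> g_current{nullptr};  // the shared pointer variable

// Unconditional swap: install `fresh` and return the previous pointer.
Config* replace(Config* fresh) {
    return g_current.exchange(fresh);
}

// Conditional swap: install `fresh` only if the shared pointer still
// equals `expected`. Returns true when the swap happened; otherwise the
// shared pointer is left untouched.
bool try_replace(Config* expected, Config* fresh) {
    return g_current.compare_exchange_strong(expected, fresh);
}
```

The conditional form is what lets a thread detect that another thread changed the pointer in the meantime, without ever taking a lock.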
In all cases, writing working multi-threaded code is not easy - it's even hard, as a Delphi expert just wrote in his blog.
You should either write simple threads with no shared data at all (make a private copy of the data before the thread starts, or use read-only shared data - which is thread-safe by essence), or call some well designed and proven libraries - like http://otl.17slon.com - which will save you a lot of debugging time.
First of all, according to the language standards, volatile doesn't do what the article says it does. The acquire and release semantics of volatile are MSVC specific. This can be a problem if you compile with other compilers or on other platforms. C++11 introduces language supported atomic variables which will hopefully, in due course, finally put an end to the (mis-)use of volatile as a threading construct.
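A minimal sketch of the portable alternative, assuming a C++11 compiler: the flag is a std::atomic<bool> written with release ordering and read with acquire ordering, so the ordinary write that precedes the store is guaranteed to be visible once the flag is seen. The names g_payload, g_ready, producer, consumer and run_once are hypothetical.

```cpp
#include <atomic>
#include <thread>

int g_payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> g_ready{false};  // flag that publishes the data

void producer() {
    g_payload = 42;                                  // 1. write the data
    g_ready.store(true, std::memory_order_release);  // 2. publish it
}

int consumer() {
    // Spin until the flag is observed with acquire ordering.
    while (!g_ready.load(std::memory_order_acquire))
        std::this_thread::yield();
    // The acquire load synchronizes with the release store, so the
    // write to g_payload made before the store is visible here.
    return g_payload;
}

// Run one producer and one consumer and return what the consumer saw.
int run_once() {
    g_payload = 0;
    g_ready.store(false);
    int seen = -1;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

Unlike volatile, these ordering guarantees come from the language standard, so they do not depend on a particular compiler.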
Critical sections and mutexes are indeed implemented so that reads and writes of protected variables will be seen correctly from all threads.
I think the best way to think of critical sections and mutexes (locks) is as devices to bring about serialization. That is, blocks of code protected by such locks are executed serially, one after another without overlap. The serialization applies to memory access also. There can be no problems due to cache coherence or read/write reordering.
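To make the serialization argument concrete, here is a small sketch in which two plain variables (no volatile, no atomics) are only ever touched while holding a lock. It assumes std::mutex as a portable stand-in for a Win32 critical section, and the names g_mtx, g_value, g_done, writer, try_read and run_locked_demo are hypothetical.

```cpp
#include <mutex>
#include <thread>

std::mutex g_mtx;
int  g_value = 0;     // plain variables: every access happens under g_mtx
bool g_done  = false;

void writer() {
    std::lock_guard<std::mutex> guard(g_mtx);
    g_value = 7;
    g_done  = true;
}

// Returns true and fills `out` once the writer has finished.
// Because both functions hold the same lock, a reader that sees
// g_done == true also sees the matching g_value.
bool try_read(int& out) {
    std::lock_guard<std::mutex> guard(g_mtx);
    if (!g_done) return false;
    out = g_value;
    return true;
}

// Run one reader and one writer and return the value the reader saw.
int run_locked_demo() {
    int seen = -1;
    std::thread r([&seen] {
        while (!try_read(seen)) std::this_thread::yield();
    });
    std::thread w(writer);
    w.join();
    r.join();
    return seen;
}
```

The reader can never observe the flag set while the value is still stale, because unlocking and locking the same mutex orders the two blocks one after the other.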
Interlocked functions are implemented using hardware based locks on the memory bus. These functions are used by lock free algorithms. What this means is that they don't use heavy weight locks like critical sections, but rather these light weight hardware locks.
Lock free algorithms can be more efficient than those based on locks, but lock free algorithms can be very much harder to write correctly. Prefer critical sections over lock free unless the performance implications are discernible.
Another article well worth reading is The "Double-Checked Locking is Broken" Declaration.