After having learnt the hard way that shared
variables are currently not guarded by memory barriers, I have now encountered another issue. Either I am doing something wrong, or the existing compiler optimization in dmd can break multi-threaded code by re-ordering reads of shared
variables.
As an example, when I compile an executable with dmd -O
(full optimization), the compiler happily optimizes away the local variable o
in this code (where cas
is the compare-and-swap function from core.atomic
)
shared uint cnt;
void atomicInc ( ) { uint o; do { o = cnt; } while ( !cas( &cnt, o, o + 1 ) );}
to something like this (see dis-assembly below):
shared uint cnt;
void atomicInc ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }
In the "optimized" code cnt
is read twice from memory, thereby running the risk that another thread has modified cnt
in between. The optimization basically destroys the compare-and-swap algorithm.
Is this a bug, or is there a correct way to achieve the desired result? The only work-around I have found so far is to implement the code using assembler.
Full test code and additional details
For completeness, here is a full test code that shows both issues (no memory-barriers, and the optimization problem). It produces the following output on three different Windows machines for both dmd 2.049 and dmd 2.050 (assuming that Dekker's algorithm doesn't deadlock, which might happen):
dmd -O -run optbug.d
CAS : failed
Dekker: failed
And the loop inside atomicInc
gets compiled to this with full optimization:
; cnt is stored at 447C10h
; while ( !cas( &cnt, o, o + 1 ) ) o = cnt;
; 1) prepare call cas( &cnt, o, o + 1 ): &cnt and o go to stack, o+1 to eax
402027: mov ecx,447C10h ; ecx = &cnt
40202C: mov eax,[447C10h] ; eax = o1 = cnt
402031: inc eax ; eax = o1 + 1 (third parameter)
402032: push ecx ; push &cnt (first parameter)
; next instruction pushes current value of cnt onto stack
; as second parameter o instead of re-using o1
402033: push [447C10h]
402039: call 4020BC ; 2) call cas
40203E: xor al,1 ; 3) test success
402040: jne 402027 ; no success try again
; end of main loop
Here is the test code:
import core.atomic;
import core.thread;
import std.stdio;
enum loops = 0xFFFF;
shared uint cnt;
/* *****************************************************************************
Implement atomicOp!("+=")(cnt, 1U); with CAS. The code below doesn't work with
the "-O" compiler flag because cnt is read twice while calling cas and another
thread can modify cnt in between.
*/
enum threads = 8;
void atomicInc ( ) { uint o; do { o = cnt; } while ( !cas( &cnt, o, o + 1 ) );}
void threadFunc ( ) { foreach (i; 0..loops) atomicInc; }
void testCas ( ) {
cnt = 0;
auto tgCas = new ThreadGroup;
foreach (i; 0..threads) tgCas.create(&threadFunc);
tgCas.joinAll;
writeln( "CAS : ", cnt == loops * threads ? "passed" : "failed" );
}
/* *****************************************************************************
Dekker's algorithm. Fails on ia32 (other than atom) because ia32 can re-order
read before write. Most likely fails on many other architectures.
*/
shared bool flag1 = false;
shared bool flag2 = false;
shared bool turn2 = false; // avoids starvation by executing 1 and 2 in turns
void dekkerInc ( ) {
flag1 = true;
while ( flag2 ) if ( turn2 ) {
flag1 = false; while ( turn2 ) { /* wait until my turn */ }
flag1 = true;
}
cnt++; // shouldn't work without a cast
turn2 = true; flag1 = false;
}
void dekkerDec ( ) {
flag2 = true;
while ( flag1 ) if ( !turn2 ) {
flag2 = false; while ( !turn2 ) { /* wait until my turn */ }
flag2 = true;
}
cnt--; // shouldn't work without a cast
turn2 = false; flag2 = false;
}
void threadDekkerInc ( ) { foreach (i; 0..loops) dekkerInc; }
void threadDekkerDec ( ) { foreach (i; 0..loops) dekkerDec; }
void testDekker ( ) {
cnt = 0;
auto tgDekker = new ThreadGroup;
tgDekker.create( &threadDekkerInc );
tgDekker.create( &threadDekkerDec );
tgDekker.joinAll;
writeln( "Dekker: ", cnt == 0 ? "passed" : "failed" );
}
/* ************************************************************************** */
void main() {
testCas;
testDekker;
}
While the issues still seem to exist, core.atomic
now exposes atomicLoad
which enables a relative straightforward workaround. To make the cas
example work, it suffices to load cnt
atomically:
void atomicInc ( ) {
uint o;
do {
o = atomicLoad(cnt);
} while ( !cas( &cnt, o, o + 1 ) );
}
Similarly, to make Dekker's algorithm work:
// ...
while ( atomicLoad(flag2) ) if ( turn2 ) {
// ...
while ( atomicLoad(flag1) ) if ( !turn2 ) {
// ...
For architectures other than ia32 (ignoring string operations and SSE) that can also re-order
additional memory barriers would be required.
Yes, code it in assembler. If you skip using the cas() function and just write your entire atomicInt function in assembly, it's only going to be a few lines of code. Until you do so, you're probably going to be fighting against the compiler's optimizations.
On top of all that, you can use the x86 LOCK INC instruction instead of CAS and you should be able to reduce the function to just a line or two of assembly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With