I have a problem with the MS C compiler reordering certain statements, critical in a multithreading context, at high levels of optimization. I want to know how to force ordering in specific places while still using high levels of optimization. (At low levels of optimization, this compiler does not reorder statements)
The following code:
ChunkT* plog2sizeChunk=...
SET_BUSY(plog2sizeChunk->pPoolAndBusyFlag); // set "busy" bit on this chunk of storage
x = plog2sizeChunk->pNext;
produces this:
0040130F 8B 5A 08 mov ebx,dword ptr [edx+8]
00401312 83 22 FE and dword ptr [edx],0FFFFFFFEh
in which the write to pPoolAndBusyFlag is reordered by the compiler to occur after the pNext fetch.
SET_BUSY is essentially
plog2sizeChunk->pPoolAndBusyFlag&=0xFFFFFFFeh;
I think the compiler has rightfully decided it was OK to reorder these accesses because they are to two separate members of the same struct, and such reordering has no affect on the results of single-threaded execution:
typedef struct chunk_tag{
unsigned pPoolAndBusyFlag; // Contains pointer to owning pool and a busy flag
natural log2size; // holds log2size of the chunk if Busy==false
struct chunk_tag* pNext; // holds pointer to next block of same size
struct chunk_tag* pPrev; // holds pointer to previous block of same size
} ChunkT, *pChunkT;
For my purposes, the pPoolAndBusyFlag has to be set before other accesses to this structure are valid in a multithreaded/multicore context. I don't think this particular access is problematic for me, but the fact the compiler can reorder this means that other parts of my code may have the same kind of reordering but it may be critical in those places. (Imagine the two statements are updates to the two members rather than one write/one read). I want to be able force the order of the actions.
Ideally, I'd write something like:
plog2sizeChunk->pPoolAndBusyFlag&=0xFFFFFFFeh;
#pragma no-reordering // no such directive appears to exist
pNext = plog2sizeChunk->pNext;
I have experimentally verified I can get this effect in this ugly way:
plog2sizeChunk->pPoolAndBusyFlag&=0xFFFFFFFeh;
asm { xor eax, eax } // compiler won't optimize past asm block
pNext = plog2sizeChunk->pNext;
gives
0040130F 83 22 FE and dword ptr [edx],0FFFFFFFEh
00401312 33 C0 xor eax,eax
00401314 8B 5A 08 mov ebx,dword ptr [edx+8]
I note that the x86 hardware may reorder these particular instructions anyway since they don't refer to the same memory location, and reads may pass writes; to really fix this example, I'd need some type of memory barrier. Back to my earlier remark, if they were both writes, the x86 will not reorder them, and the write order will be seen in that order by other threads. So in that case I don't think I need a memory barrier, just a forced ordering.
I have not seen the compiler re-order two writes (yet) but I haven't been looking very hard (yet); I just tripped over this. And of course with optimizations just because you don't see it in this compilation doesn't mean it won't appear in the next.
So, how do I force the compiler to order these?
I understand I can declare the memory slots in the struct to be volatile. They are still independent storage locations, so I don't see how this prevents an optimization. Maybe I'm mis-interpreting what volatile means?
EDIT (Oct 20): Thanks to all the responders. My current implementation uses volatile (used as the initial solution), _ReadWriteBarrier (to mark the code where reordering shouldn't occur by the compiler), and a few MemoryBarriers (where reads and writes occur), and that seems to have solved the problem.
EDIT: (Nov 2): To be clean, I ended up defining sets of macros for ReadBarrier, WriteBarrier, and ReadWriteBarrier. There are sets for pre and post locking, pre and post unlocking, and general usage. Some of these are empty, some contain _ReadWriteBarrier and MemoryBarrier, as appropriate for the x86 and typical spin locks based on XCHG [XCHG includes an implicit MemoryBarrier thus obviating that need in lock pre-/post- sets). I then parked these in the code at appropriate places documenting the essential (non)reordering requirements.
So as I understand it the pNext = plog2sizeChunk->pNext
publishes the block so that it can be seen by other threads and you have to make sure they see the correct busy flag.
That means you need a uni-directional memory barrier before publishing it (also one before reading it in another thread, although if your code runs on x86 you get those for free) to make sure that threads actually see the change. You also need one before the write to avoid reordering writes after it. No just inserting assembly or using a standard compliant volatile (MSVC volatile gives extra guarantees though that make a difference here) is not enough - yes this stops the compiler from shifting reads and writes around, but the CPU is not bound by it and can do the same reordering internally.
Both MSVC and gcc have intrinsics/macros to create memory barriers (see eg here). MSVC also gives stronger guarantees to volatiles that are good enough for your problem. Finally C++11 atomics would work as well, but I'm not sure if C itself has any portable way to guarantee memory barriers.
See _ReadWriteBarrier. That's a compiler intrinsic dedicated to what you are looking for. Be sure to check the documentation against your precise version od MSVC ("deprecated" on VS2012...). Beware cpu reordering (then see MemoryBarrier
The documentation states that the _ReadBarrier, _WriteBarrier, and _ReadWriteBarrier compiler intrinsics (compiler reordering) and that the MemoryBarrier macro (CPU reordering) are all "deprecated" starting with VS2012. But i think they will continue to work fine for some time...
New code may use the new C++11 facilities (links in the MSDN page)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With