Regarding this question, I'm interested only in x86 and x86-64.
For MSVC 2005, the documentation for __faststorefence says: "Guarantees that every preceding store is globally visible before any subsequent store."
For MSVC 2008 and 2010, it changed to: "Guarantees that every previous memory reference, including both load and store memory references, is globally visible before any subsequent memory reference."
The way the latter is written, it implies (in my opinion) that the intrinsic would also block the CPU from reordering loads with older stores. That differs from the first definition, which suggests the intrinsic only deals with blocking the reordering of non-temporal stores with older stores (the only other reordering x86(-64) performs).
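For illustration, here is a minimal sketch (the function and variable names are mine) of the kind of pattern the first definition seems aimed at: ordering a weakly ordered non-temporal store before a later ordinary store:

    #include <emmintrin.h>

    /* Sketch: publish data with a non-temporal store, then set a flag.
       Without the sfence, the weakly ordered NT store could become
       globally visible after the flag store. */
    void publish(int *slot, int value, volatile int *ready)
    {
        _mm_stream_si32(slot, value);  /* non-temporal store */
        _mm_sfence();                  /* order it before the flag store */
        *ready = 1;
    }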
However, the documentation then appears to contradict itself: "On the x64 platform, this routine generates an instruction that is a faster store fence than the sfence instruction. Use this intrinsic instead of _mm_sfence on the x64 platform."
This implies that it still has sfence-like functionality, and thus loads can still be reordered with older stores. So which is it? Can someone clear up my confusion?
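To make concrete why the distinction matters to me, here is a Dekker-style store-load sketch (the names are mine); it is only correct if the intrinsic has full mfence-like semantics:

    #include <intrin.h>

    volatile long flag0, flag1;

    /* Thread 0's half of a Dekker-style handshake. If __faststorefence
       has full (mfence-like) semantics, the load of flag1 cannot be
       performed before the store to flag0 becomes visible; with only
       sfence semantics, the load could be reordered earlier. */
    int thread0_try_enter(void)
    {
        flag0 = 1;
        __faststorefence();
        return flag1 == 0;
    }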
PS: Looking for a GCC version of this function, I came across:

    long local;
    __asm__ __volatile__("lock; orl $0, %0;" : : "m"(local));

but I think it's from 32-bit code; what would be the 64-bit analog?
The GCC version you quote is equivalent to the code that MSVC generates. It relies on the fact that the x86/x86-64 processor architecture docs specify that loads and stores are not reordered with LOCKed instructions.
I am not clear whether this applies to non-temporal stores, since in general the memory model restrictions do not apply to those instructions.
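As for the 64-bit analog, here is a sketch (the function name is mine): a LOCKed no-op read-modify-write on the stack works the same way, and the "memory" clobber also stops the compiler itself from reordering memory accesses across it:

    /* Sketch: full fence via a LOCKed no-op RMW on a stack location. */
    static inline void full_fence(void)
    {
    #if defined(__x86_64__)
        __asm__ __volatile__("lock; orq $0, (%%rsp)" ::: "memory");
    #else
        __asm__ __volatile__("lock; orl $0, (%%esp)" ::: "memory");
    #endif
    }

Alternatively, GCC's __sync_synchronize() builtin gives you a full barrier without writing any inline assembly.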