The Memory Order Machine Clear performance event is described by the vTune documentation as:
The memory ordering (MO) machine clear happens when a snoop request from another processor matches a source for a data operation in the pipeline. In this situation the pipeline is cleared before the loads and stores in progress are retired.
However, I don't see why that should be the case. There is no synchronisation order between loads and stores on different logical processors.
The processor could simply pretend the snoop happened after all the current in-flight data operations had committed.
The issue is also described here
A memory ordering machine clear gets triggered whenever the CPU core detects a “memory ordering conflict”. Basically, this means that some of the currently pending instructions tried to access memory that we just found out some other CPU core wrote to in the meantime. Since these instructions are still flagged as pending while the “this memory just got written” event means some other core successfully finished a write, the pending instructions – and everything that depends on their results – are, retroactively, incorrect: when we started executing these instructions, we were using a version of the memory contents that is now out of date. So we need to throw all that work out and do it over. That’s the machine clear.
But that makes no sense to me: the CPU shouldn't need to re-execute the loads in the load queue, as there is no total order for non-locked loads/stores.
I could see a problem if loads were allowed to be reordered:
;foo is 0
mov eax, [foo] ;inst 1
mov ebx, [foo] ;inst 2
mov ecx, [foo] ;inst 3
If the execution order were 1, 3, 2, then a store such as mov [foo], 1 from another core, landing between 3 and 2, would cause
eax = 0
ebx = 1
ecx = 0
which would indeed violate the memory ordering rules.
But loads cannot be reordered with other loads, so why do Intel's CPUs flush the pipeline when a snoop request from another core matches the source of an in-flight load?
What erroneous situations is this behaviour preventing?
Although the x86 memory ordering model does not allow loads to any memory type other than WC to be globally observable out of program order, the implementation actually allows loads to complete out of order. It would be very costly to stall issuing a load request until all previous loads have completed. Consider the following example:
load X
load Y
load Z
Assume that line X is not present in the cache hierarchy and has to be fetched from memory. However, both Y and Z are present in the L1 cache. One way to maintain the x86 load ordering requirement is by not issuing loads Y and Z until load X gets the data. However, this would stall all instructions that depend on Y and Z, resulting in a potentially massive performance hit.
Multiple solutions have been proposed and studied extensively in the literature. The one that Intel has implemented in all of its processors is allowing loads to be issued out of order and then checking whether a memory ordering violation has occurred, in which case the violating load is reissued and all of its dependent instructions are replayed. But this violation can only occur when both of the following conditions are satisfied:

- A load has completed out of order, i.e., while at least one earlier load in program order is still pending.
- The cache line read by the completed load gets invalidated by a write from another core before the earlier pending load completes.

When both of these conditions occur, the logical core detects a memory ordering violation. Consider the following example:
------             ------
core1              core2
------             ------
load rdx, [X]      store [Y], 1
load rbx, [Y]      store [X], 2
add rdx, rbx
call printf
Assume that the initial state is [X] = [Y] = 0, and that core1 prints the final value of rdx (i.e., the sum of the two loaded values).
According to the x86 strong ordering model, the only possible legal outcomes are 0, 1, and 3. In particular, the outcome 2 is not legal.
The following sequence of events may occur:

- core1 issues both loads; the load from X misses the cache, while the load from Y hits in the L1 and completes out of order with rbx = 0.
- core2 performs store [Y], 1 followed by store [X], 2; the store to Y invalidates line Y in core1's private caches.
- The load from X eventually completes with rdx = 2. Without any corrective action, core1 would print 2 — an illegal outcome.
To maintain the ordering of loads, core1's load buffer has to snoop all invalidations to lines resident in its private caches. When it detects that line Y has been invalidated while there are pending loads that precede the completed load from the invalidated line in program order, a memory ordering violation occurs and the load has to be reissued after which it gets the most recent value. Note that if line Y has been evicted from core1's private caches before it gets invalidated and before the load from X completes, it may not be able to snoop the invalidation of line Y in the first place. So there needs to be a mechanism to handle this situation as well.
If core1 never uses one or both of the values loaded, a load ordering violation may occur, but it can never be observed. Similarly, if the values stored by core2 to lines X and Y are the same, a load ordering violation may occur, but is impossible to observe. However, even in these cases, core1 would still unnecessarily reissue the violating load and replay all of its dependencies.