Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Out of Order Execution and Memory Fences

I know that modern CPUs can execute out of order, However they always retire the results in-order, as described by wikipedia.

"Out of Oder processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal."

Now memory fences are said to be required when using multicore platforms, because owing to Out of Order execution, wrong value of x can be printed here.

Processor #1:
 while f == 0
  ;
 print x; // x might not be 42 here

Processor #2:
 x = 42;
 // Memory fence required here
 f = 1

Now my question is, since Out of Order Processors (Cores in case of MultiCore Processors I assume) always retire the results In-Order, then what is the necessity of Memory fences. Don't the cores of a multicore processor sees results retired from other cores only or they also see results which are in-flight?

I mean in the example I gave above, when Processor 2 will eventually retire the results, the result of x should come before f, right? I know that during out of order execution it might have modified f before x but it must have not retired it before x, right?

Now with In-Order retiring of results and cache coherence mechanism in place, why would you ever need memory fences in x86?

like image 436
MetallicPriest Avatar asked Sep 08 '11 10:09

MetallicPriest


1 Answers

The memory fence ensures that all changes to variables before the fence are visible to all other cores, so that all cores have an up to date view of the data.

If you don't put a memory fence, the cores might be working with wrong data, this can be seen especially in scenario's, where multiple cores would be working on the same datasets. In this case you can ensure that when CPU 0 has done some action, that all changes done to the dataset are now visible to all other cores, whom can then work with up to date information.

Some architectures, including the ubiquitous x86/x64, provide several memory barrier instructions including an instruction sometimes called "full fence". A full fence ensures that all load and store operations prior to the fence will have been committed prior to any loads and stores issued following the fence.

If a core were to start working with outdated data on the dataset, how could it ever get the correct results? It couldn't no matter if the end result were to be presented as-if all was done in the right order.

The key is in the store buffer, which sits between the cache and the CPU, and does this:

Store buffer invisible to remote CPUs

Store buffer allows writes to memory and/or caches to be saved to optimize interconnect accesses

That means that things will be written to this buffer, and then at some point will the buffer be written to the cache. So the cache could contain a view of data that is not the most recent, and therefore another CPU, through cache coherency, will also not have the latest data. A store buffer flush is necessary for the latest data to be visible, this, I think is essentially what the memory fence will cause to happen at hardware level.

EDIT:

For the code you used as an example, Wikipedia says this:

A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f.

like image 73
Tony The Lion Avatar answered Oct 21 '22 13:10

Tony The Lion