When an out-of-order processor encounters a sequence like this:
LOAD R1, 0x1337
LOAD R2, $R1
LOAD R3, 0x42
Assuming that every access results in a cache miss, can the processor ask the memory controller for the contents of 0x42 before it asks for the contents of $R1, or even of 0x1337? If so, and assuming that accessing $R1 raises an exception (e.g., a segmentation fault), can we say that 0x42 was loaded speculatively?
And by the way, when a load-store unit sends a request to the memory controller, can it send a second request before receiving the answer to the previous one?
My question doesn't target any particular architecture. Answers about any mainstream architecture are welcome.
The answer to your question depends on your CPU's memory ordering model, which is a separate matter from whether the CPU executes instructions out of order. If the CPU implements total store ordering (TSO; e.g., x86 or SPARC), then, as far as any observer can tell, 0x42 will not be loaded before 0x1337.
If the CPU implements a relaxed memory model (e.g., IA-64, PowerPC, Alpha), then in the absence of a memory fence instruction all bets are off as to which address is accessed first. This should matter little unless you are doing I/O or writing multithreaded code.
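The difference between the two kinds of model shows up in the classic message-passing litmus test: one thread writes some data and then a flag, the other reads the flag and then the data. Below is a toy Python enumeration of the allowed outcomes. It assumes the only relaxation is that the reader's two independent loads may be swapped, and it ignores store buffers, caches and fences. Because neither thread performs a store followed by a load, TSO permits exactly the same outcomes here as a plain in-order interleaving, so the in-order case stands in for TSO. The names `WRITER`, `READER`, `interleavings` and `outcomes` are hypothetical.

```python
from itertools import permutations

# Ops are ("store", location, value) or ("load", location, register).
WRITER = [("store", "data", 1), ("store", "flag", 1)]
READER = [("load", "flag", "r1"), ("load", "data", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(trace):
    """Execute one global order of operations against a single shared memory."""
    mem = {"data": 0, "flag": 0}
    regs = {}
    for kind, loc, x in trace:
        if kind == "store":
            mem[loc] = x
        else:
            regs[x] = mem[loc]
    return regs["r1"], regs["r2"]

def outcomes(reorder_loads):
    """Set of (r1, r2) results allowed when the reader's loads may (or may not) swap."""
    reader_orders = permutations(READER) if reorder_loads else [tuple(READER)]
    results = set()
    for reader in reader_orders:
        for trace in interleavings(WRITER, list(reader)):
            results.add(run(trace))
    return results

tso_like = outcomes(reorder_loads=False)
relaxed = outcomes(reorder_loads=True)
print("in-order/TSO:", sorted(tso_like))
print("relaxed:     ", sorted(relaxed))
```

Only the relaxed model admits `(r1, r2) == (1, 0)`, i.e. seeing the flag set but reading stale data; a fence between the reader's two loads would rule that outcome out again.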
Note also that some CPUs (e.g., Itanium) have relaxed memory models (so reads may be performed out of order) but have no out-of-order execution (OOE) logic at all: they expect the compiler to schedule instructions, including speculative ones, optimally, rather than spend silicon area on OOE.
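The compiler-driven approach can be illustrated with IA-64's control speculation: the compiler hoists a load above a branch as a speculative load (`ld.s`); if that load would fault, the target register is tagged with a NaT ("Not a Thing") bit instead, and a later `chk.s` branches to recovery code only if the value is actually needed. The Python below is a toy model under those assumptions. `NaT`, `load_speculative`, `check_speculative`, `compiled` and the `MEMORY` contents are hypothetical, and real recovery code re-executes the load and its dependent instructions rather than simply re-loading.

```python
class SegFault(Exception):
    """Stands in for the fault a bad address would raise."""

class NaT:
    """Toy stand-in for IA-64's NaT bit: a fault that has been deferred."""
    def __init__(self, addr):
        self.addr = addr

# Hypothetical memory: 0x1337 holds a bad pointer (0xDEAD is unmapped).
MEMORY = {0x1337: 0xDEAD, 0x1000: 0x2000, 0x2000: 7}

def load(addr):
    """Ordinary load: a bad address faults immediately."""
    if addr not in MEMORY:
        raise SegFault(hex(addr))
    return MEMORY[addr]

def load_speculative(addr):
    """Like ld.s: a bad address produces a NaT instead of faulting."""
    if addr not in MEMORY:
        return NaT(addr)
    return MEMORY[addr]

def check_speculative(value, addr):
    """Like chk.s: a NaT sends control to recovery code, here a plain re-load."""
    if isinstance(value, NaT):
        return load(addr)  # faults now, on the path that really needs the data
    return value

def compiled(ptr_addr, take_branch):
    """The load through the pointer has been hoisted above the branch."""
    ptr = load(ptr_addr)
    early = load_speculative(ptr)  # issued early; cannot fault here
    if take_branch:
        return check_speculative(early, ptr)
    return None  # branch not taken: the deferred fault is simply dropped
```

With a good pointer, `compiled(0x1000, True)` returns 7. With the bad pointer, `compiled(0x1337, False)` completes silently even though the hoisted load touched an unmapped address, and only `compiled(0x1337, True)` raises `SegFault`.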
This would seem to be a logical conclusion for superscalar CPUs with multiple load-store units, too. Multi-channel memory controllers are quite common these days.
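On the second question in the post (can a load-store unit send a request before the previous one has been answered?): caches that allow this are usually called non-blocking, and they track outstanding misses in miss status holding registers (MSHRs). Below is a toy Python model of the idea, assuming a fixed miss latency, 64-byte lines interleaved across two channels, and one request in service per channel at a time. `serve_requests` and all of its parameters are hypothetical. Each request carries a tag so that responses can come back in a different order than they were sent.

```python
import heapq

def serve_requests(requests, channels=2, line_size=64, latency=100):
    """Toy non-blocking memory interface.

    requests: list of (issue_cycle, addr) in the order the load-store unit sends them.
    Each request is tagged with its index and served by channel
    (addr // line_size) % channels; a channel serves one request at a time.
    Returns (tag, addr, done_cycle) tuples in the order the responses come back.
    """
    channel_free = [0] * channels  # cycle at which each channel becomes idle
    outstanding = []  # heap of (done_cycle, tag, addr), like a table of pending misses
    for tag, (issue, addr) in enumerate(requests):
        ch = (addr // line_size) % channels
        start = max(issue, channel_free[ch])  # the sender never waits for earlier responses
        channel_free[ch] = start + latency
        heapq.heappush(outstanding, (start + latency, tag, addr))
    responses = []
    while outstanding:
        done, tag, addr = heapq.heappop(outstanding)
        responses.append((tag, addr, done))
    return responses

# 0x1337 and 0x13B7 map to channel 0; 0x42 maps to channel 1.
for tag, addr, done in serve_requests([(0, 0x1337), (1, 0x13B7), (2, 0x42)]):
    print(f"tag {tag} ({addr:#x}) answered at cycle {done}")
```

Here the request for 0x42 (tag 2) is answered before tag 1 because it went to an idle channel; the tags are what let the load-store unit match each response to its request.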
With out-of-order execution, an enormous amount of logic is spent determining whether instructions depend on others in the stream: not just register dependencies, but memory dependencies as well. There is also a great deal of logic for handling exceptions: the CPU must complete all instructions in the stream up to the faulting one and discard the results of any younger instructions it has already executed (or, alternatively, offload part of this work onto the operating system).
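The scenario from the question can be modelled in a few lines. The Python sketch below is a toy single-core model, not how any particular CPU works: a load issues as soon as its address is known, results are held back until they retire in program order, and a fault is raised only when the faulting load retires, squashing everything younger. `Load`, `simulate`, `LATENCY` and the memory contents are hypothetical, and memory-ordering checks against other cores are ignored.

```python
from dataclasses import dataclass
from typing import Optional

LATENCY = 3  # hypothetical cache-miss latency, in cycles

@dataclass
class Load:
    dest: str
    addr: Optional[int] = None  # absolute address, as in LOAD R1, 0x1337
    base: Optional[str] = None  # address register, as in LOAD R2, $R1

def simulate(program, memory):
    """Issue each load once its address is known; retire loads in program order.

    Returns (issue_order, architectural_registers, index_of_faulting_load_or_None).
    """
    n = len(program)
    done_at = [None] * n   # cycle at which each load's data arrives
    value = [None] * n
    faulted = [False] * n
    issue_order = []       # indices in the order they were sent to memory
    regs = {}              # architectural registers, written only at retirement
    retired, cycle = 0, 0
    while retired < n:
        for i, ins in enumerate(program):
            if done_at[i] is not None:
                continue  # already issued
            addr = ins.addr
            if ins.base is not None:
                # The youngest older load that writes the base register supplies the address.
                prod = next((j for j in range(i - 1, -1, -1)
                             if program[j].dest == ins.base), None)
                if prod is None:
                    addr = regs[ins.base]
                elif done_at[prod] is None or cycle < done_at[prod] or faulted[prod]:
                    continue  # address not known (yet)
                else:
                    addr = value[prod]
            done_at[i] = cycle + LATENCY
            issue_order.append(i)
            if addr in memory:
                value[i] = memory[addr]
            else:
                faulted[i] = True  # noted now, raised only at retirement
        # Retire completed loads strictly in program order.
        while retired < n and done_at[retired] is not None and cycle >= done_at[retired]:
            if faulted[retired]:
                # Precise exception: younger loads are squashed even if already issued.
                return issue_order, regs, retired
            regs[program[retired].dest] = value[retired]
            retired += 1
        cycle += 1
    return issue_order, regs, None

PROGRAM = [Load("R1", addr=0x1337), Load("R2", base="R1"), Load("R3", addr=0x42)]

print(simulate(PROGRAM, {0x1337: 0x2000, 0x2000: 5, 0x42: 9}))
print(simulate(PROGRAM, {0x1337: 0xDEAD, 0x42: 9}))  # 0xDEAD is unmapped
```

In both runs the load of 0x42 is sent to memory before the load through R1. In the faulting run it never reaches R3, because the fault is taken when the older load retires; that is the sense in which 0x42 was loaded speculatively.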
In the programming model seen by most applications, these effects are never visible. As seen from memory, loads will not always happen in the expected sequence, but that is already the case whenever caches are in use.
Clearly, where the order of loads and stores does matter, for instance when accessing device registers, this reordering must be prevented. The POWER architecture has the wonderful EIEIO (Enforce In-order Execution of I/O) instruction for exactly this purpose.
Some members of the ARM Cortex-A family offer OOE. Given the power constraints of these devices, and the apparent lack of instructions for forcing ordering, I suspect that their loads and stores always complete in order.