 

Why is memory reordering not a problem on single-core/processor machines?

Consider the following example taken from Wikipedia, slightly adapted, where the steps of the program correspond to individual processor instructions:

x = 0;
f = 0;

Thread #1:
   while (f == 0);
   print x;

Thread #2: 
   x = 42;
   f = 1;

I'm aware that the print statement might print different values (42 or 0) when the threads are running on two different physical cores/processors, due to out-of-order execution.

However, I don't understand why this is not a problem on a single-core machine, with those two threads running on the same core (through preemption). According to Wikipedia:

When a program runs on a single-CPU machine, the hardware performs the necessary bookkeeping to ensure that the program executes as if all memory operations were performed in the order specified by the programmer (program order), so memory barriers are not necessary.

As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak), so what makes sure the program order is preserved?
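
For concreteness, the scenario can be reproduced like this (a minimal sketch, assuming Linux/glibc so the process can be pinned to core 0; volatile is only there to stop the compiler from hoisting the load of f out of the loop, and the data race makes this illustration-only rather than portable C):

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  volatile int x = 0;
  volatile int f = 0;

  static void *thread1(void *arg)      /* Thread #1 from above */
  {
      while (f == 0)
          ;                            /* spin until the flag is set */
      printf("%d\n", x);
      return NULL;
  }

  static void *thread2(void *arg)      /* Thread #2 from above */
  {
      x = 42;
      f = 1;
      return NULL;
  }

  int main(void)
  {
      /* Pin the whole process to core 0 before creating the threads;
       * the new threads inherit this affinity, so they can only
       * interleave through preemption on that one core. */
      cpu_set_t one_core;
      CPU_ZERO(&one_core);
      CPU_SET(0, &one_core);
      sched_setaffinity(0, sizeof(one_core), &one_core);

      pthread_t t1, t2;
      pthread_create(&t1, NULL, thread1, NULL);
      pthread_create(&t2, NULL, thread2, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;
  }

(Built with gcc -pthread; pinned to one core like this, the print always shows 42, which is exactly the guarantee I'm asking about.)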

asked Dec 06 '19 by Ignorant



2 Answers

The CPU would not be aware that these are two threads. Threads are a software construct (1).

So the CPU sees these instructions, in this order:

store x = 42
store f = 1
test f == 0
jump if true ; not taken
load x

If the CPU were to re-order the store of x to the end, after the load, it would change the result. While the CPU is allowed out-of-order execution, it only does so when it doesn't change the result. If it were allowed to change results, virtually every sequence of instructions could fail, and it would be impossible to produce a working program.

In this case, a single CPU is not allowed to re-order a store past a load of the same address. At least, as far as the CPU can see, it is not re-ordered. As far as the L1, L2, L3 cache and main memory (and other CPUs!) are concerned, maybe the store has not been committed yet.

(1) Something like Hyper-Threading (two hardware threads per core), common in modern CPUs, wouldn't count as "single-CPU" with respect to your question.

answered Oct 11 '22 by TrentP


The CPU doesn't know or care about "context switches" or software threads. All it sees is some store and load instructions (e.g. in the OS's context-switch code, which saves the old register state and loads the new register state).

The cardinal rule of out-of-order execution is that it must not break a single instruction stream. Code must run as if every instruction executed in program order, and all its side effects finished before the next instruction starts. This includes software context-switching between threads on a single core, e.g. on a single-core machine, or green threads within one process.

(Usually we state this rule as not breaking single-threaded code, with the understanding of what exactly that means; weirdness can only happen when an SMP system loads from memory locations stored by other cores).

As far as I know single-core CPUs too reorder memory accesses (if their memory model is weak)

But remember, other threads aren't observing memory directly with a logic analyzer, they're just running load instructions on that same CPU core that's doing and tracking the reordering.

If you're writing a device driver, yes you might have to actually use a memory barrier after a store to make sure it's actually visible to off-chip hardware before doing a load from another MMIO location.

Or when interacting with DMA, making sure data is actually in memory, not in a CPU-private write-back cache, can be a problem. Also, MMIO is usually done in uncacheable memory regions that imply strong memory ordering. (x86 has cache-coherent DMA, so you don't have to actually flush back to DRAM, only make sure it's globally visible with an instruction like x86 mfence that waits for the store buffer to drain. But some non-x86 systems, whose ISAs had cache-control instructions designed in from the start, do require the OS to be aware of this: i.e. to make sure the cache is invalidated before reading in new contents from disk, and to make sure dirty data is at least written back to somewhere DMA can read from before asking a device to read from a page.)
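
To make the driver pattern concrete, here is a rough sketch of the "publish a DMA descriptor, then ring a doorbell" sequence; the dma_desc layout, the doorbell register, and the flags value are hypothetical, and a real Linux driver would use wmb()/writel() rather than a C11 fence:

  #include <stdatomic.h>
  #include <stdint.h>

  struct dma_desc {               /* hypothetical descriptor layout */
      uint64_t buf_addr;
      uint32_t len;
      uint32_t flags;
  };

  void submit(volatile struct dma_desc *desc,   /* in DMA-able memory   */
              volatile uint32_t *doorbell,      /* mapped MMIO register */
              uint64_t buf, uint32_t len)
  {
      desc->buf_addr = buf;
      desc->len      = len;
      desc->flags    = 1;                       /* "descriptor valid"   */

      /* Make the descriptor writes visible before the doorbell write;
       * this is where the memory barrier discussed above goes. */
      atomic_thread_fence(memory_order_seq_cst);

      *doorbell = 1;                            /* tell the device to start */
  }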

And BTW, even x86's "strong" memory model is only acq/rel, not seq_cst (except for RMW operations which are full barriers). (Or more specifically, a store buffer with store forwarding on top of sequential consistency). Stores can be delayed until after later loads. (StoreLoad reordering). See https://preshing.com/20120930/weak-vs-strong-memory-models/
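
That StoreLoad reordering is easy to observe with the classic store-buffer litmus test. A minimal sketch with C11 atomics and pthreads (relaxed ordering, so the only ordering in play is the hardware's; the r1 == 0 && r2 == 0 outcome can show up on real x86, and disappears if the stores and loads are made memory_order_seq_cst):

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  atomic_int X, Y;
  int r1, r2;

  static void *writer_x(void *arg)
  {
      atomic_store_explicit(&X, 1, memory_order_relaxed);
      r1 = atomic_load_explicit(&Y, memory_order_relaxed);
      return NULL;
  }

  static void *writer_y(void *arg)
  {
      atomic_store_explicit(&Y, 1, memory_order_relaxed);
      r2 = atomic_load_explicit(&X, memory_order_relaxed);
      return NULL;
  }

  int main(void)
  {
      for (int i = 0; i < 100000; i++) {
          atomic_store(&X, 0);
          atomic_store(&Y, 0);
          pthread_t a, b;
          pthread_create(&a, NULL, writer_x, NULL);
          pthread_create(&b, NULL, writer_y, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          /* both 0 => each store was still sitting in its core's
           * store buffer when the other core's load executed */
          if (r1 == 0 && r2 == 0)
              printf("StoreLoad reordering on iteration %d\n", i);
      }
      return 0;
  }

If the whole process is pinned to a single core, that outcome goes away: whichever thread runs second executes its load on the same core whose store buffer (or cache) already holds the first thread's store, so it sees it.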

so what makes sure the program order is preserved?

Hardware dependency tracking: loads snoop the store buffer for older stores to the same address. This makes sure loads take their data from the last program-order write to any given memory location (see footnote 1).

Without this, code like

  x = 1;
  int tmp = x;

might load a stale value for x. It would be insane and unusable (and would kill performance) if you had to put a memory barrier after every store just for your own reloads to reliably see the stored values.
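
A minimal sketch of that guarantee, using C11 atomics just to make the memory ordering explicit: even with relaxed ordering and no fences, a thread always observes its own most recent store, because its loads snoop the core's own store buffer, so the assert below never fires:

  #include <assert.h>
  #include <stdatomic.h>

  atomic_int x;

  int main(void)
  {
      for (int i = 0; i < 1000000; i++) {
          atomic_store_explicit(&x, i, memory_order_relaxed);
          int tmp = atomic_load_explicit(&x, memory_order_relaxed);
          assert(tmp == i);   /* no barrier needed to see your own
                                 store in program order */
      }
      return 0;
  }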

We need all instructions running on a single core to give the illusion of running in program order, according to the ISA rules. Only DMA or other CPU cores can observe reordering.


Footnote 1: If the address for older stores isn't available yet, a CPU may even speculate that it will be to a different address than the load, and load from cache instead of waiting for the store's address (and data) to become available. If it guessed wrong, it has to roll back to a known-good state, just like with branch misprediction. This is called "memory disambiguation". See also Store-to-Load Forwarding and Memory Disambiguation in x86 Processors for a technical look at it, including cases where a narrow reload takes part of its data from a wider store, possibly unaligned or spanning a cache-line boundary.

answered Oct 11 '22 by Peter Cordes