
In C++11 threads, what guarantees does a std::mutex have about memory visibility?

I am currently trying to learn the C++11 threading API, and I am finding that the various resources leave out an essential piece of information: how the CPU cache is handled. Modern CPUs have a cache for each core (meaning different threads may use different caches). This means that it is possible for one thread to write a value to memory and for another thread not to see it, even if it sees other changes that the first thread made.

Of course, any good threading API provides some way to solve this. In C++'s threading API, however, it is not clear how this works. I know that a std::mutex, for example, protects memory somehow, but it isn't clear what it does: does it flush the entire CPU cache, does it flush just the objects accessed inside the critical section from the current thread's cache, or something else?

Also, apparently, read-only access does not require a mutex. But if thread 1, and only thread 1, is continually writing to memory to modify an object, won't other threads potentially see an outdated version of that object, making some sort of cache clearing necessary?

Do the atomic types simply bypass the cache and read the value from main memory using a single CPU instruction? Do they make any guarantees about the other places in memory being accessed?

How does memory access in C++11's threading API work, in the context of CPU caches?

Some questions, such as this one, talk about memory fences and a memory model, but no source seems to explain these in the context of CPU caches, which is what this question asks about.

asked May 26 '18 by john01dav

People also ask

Is std::mutex thread-safe?

It is totally safe for multiple threads to read the same variable, but std::mutex cannot be locked by multiple threads simultaneously, even if those threads only want to read a value. Shared mutexes and shared locks allow this.
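For example (a minimal sketch of my own using C++17's std::shared_mutex; the names are illustrative), many readers may hold the lock at once, while a writer gets exclusive access:

#include <mutex>
#include <shared_mutex>

std::shared_mutex rw_mutex;
int shared_value = 0;

int read_value() {
    // Any number of threads can hold a shared lock concurrently.
    std::shared_lock<std::shared_mutex> lock(rw_mutex);
    return shared_value;
}

void write_value(int v) {
    // Only one thread at a time can hold the exclusive lock, and
    // no shared locks may be held while it is.
    std::unique_lock<std::shared_mutex> lock(rw_mutex);
    shared_value = v;
}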

What is the role of std::mutex?

The std::mutex class is a synchronization primitive that can be used to protect shared data from being simultaneously accessed by multiple threads.
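A typical use (a minimal sketch; the counter and function names are my own) pairs the mutex with a std::lock_guard, so the lock is released even if an exception is thrown:

#include <mutex>

std::mutex counter_mutex;
long counter = 0;

void increment() {
    // std::lock_guard locks in its constructor and unlocks in its
    // destructor (RAII), so every exit path releases the mutex.
    std::lock_guard<std::mutex> lock(counter_mutex);
    ++counter;
}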

Where is a mutex stored in memory?

The mutex in memory is not part of your process's memory; it's in the OS. If you have a nice class that you use to handle mutexes, it's really only a wrapper: the mutex will not disappear when the class goes out of scope (depending on what's in your destructor).

What is a thread mutex?

A mutual exclusion (mutex) is used cooperatively between threads to ensure that only one of the cooperating threads is allowed to access the data or run certain application code at a time. The word mutex is shorthand for a primitive object that provides MUTual EXclusion between threads.


2 Answers

std::mutex has release-acquire memory-ordering semantics: everything that happened in thread A before it released the mutex (from thread A's point of view) must be visible to thread B once thread B acquires that same mutex and enters the critical section.
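As a concrete illustration (a minimal sketch of my own; the variables and thread structure are assumptions, not from the question), the plain int written before the unlock in one thread is guaranteed to be visible after the lock in the other, with no atomics needed for the data itself:

#include <iostream>
#include <mutex>
#include <thread>

int shared_data = 0;  // plain, non-atomic data
bool ready = false;   // also protected by the mutex
std::mutex m;

void producer() {
    std::lock_guard<std::mutex> lock(m);
    shared_data = 42;  // happens-before the unlock (release)
    ready = true;
}

void consumer() {
    for (;;) {
        std::lock_guard<std::mutex> lock(m);  // acquire
        if (ready) {
            // Guaranteed to print 42: the unlock in producer()
            // synchronizes-with this lock.
            std::cout << shared_data << '\n';
            return;
        }
    }
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}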

Have a read of http://en.cppreference.com/w/cpp/atomic/memory_order to get started. Another good resource is the book C++ Concurrency in Action. Having said this, when using the high-level synchronization primitives you should be able to get away with ignoring most of these details, unless you are curious or want to get your hands dirty.

answered Oct 20 '22 by Preet Kukreti


I think I understand what you are getting at. There are three things at play here.

  • The C++11 standard describes what happens at the language level: locking a std::mutex is a synchronization operation. The C++ standard does not describe how this is implemented; CPU caches do not exist as far as the C++ standard is concerned.

  • The C++ implementation, at some point, puts some machine code in your application that implements a mutex lock. The engineers creating this implementation must take into account both the C++11 spec and the architecture spec.

  • The CPU itself manages the cache in such a way as to provide the semantics the C++ implementation needs.

This may be easier to understand if you look at atomics, which translate to much smaller snippets of assembly code but still provide synchronization. For example, try this one on GodBolt:

#include <atomic>

std::atomic<int> value;

int acquire() {
    // Acquire load: reads and writes in this thread cannot be
    // reordered before it.
    return value.load(std::memory_order_acquire);
}

void release() {
    // Release store: reads and writes in this thread cannot be
    // reordered after it.
    value.store(0, std::memory_order_release);
}

You can see the assembly:

acquire():
  mov eax, DWORD PTR value[rip]
  ret
release():
  mov DWORD PTR value[rip], 0
  ret
value:
  .zero 4

So on x86, nothing extra is necessary; the CPU already provides the required memory-ordering semantics (an explicit mfence instruction exists, but for acquire and release it's implied by ordinary loads and stores). This is definitely not how it works on all processors; see the Power output:

acquire():
.LCF0:
0: addis 2,12,.TOC.-.LCF0@ha
  addi 2,2,.TOC.-.LCF0@l
  addis 3,2,.LANCHOR0@toc@ha # gpr load fusion, type int
  lwz 3,.LANCHOR0@toc@l(3)
  cmpw 7,3,3
  bne- 7,$+4
  isync
  extsw 3,3
  blr
  .long 0
  .byte 0,9,0,0,0,0,0,0
release():
.LCF1:
0: addis 2,12,.TOC.-.LCF1@ha
  addi 2,2,.TOC.-.LCF1@l
  lwsync
  li 9,0
  addis 10,2,.LANCHOR0@toc@ha
  stw 9,.LANCHOR0@toc@l(10)
  blr
  .long 0
  .byte 0,9,0,0,0,0,0,0
value:
  .zero 4

Here there are explicit barrier instructions (isync on the acquire path, lwsync on the release path) because the Power memory model provides fewer ordering guarantees without them.
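For comparison (my own addition, not part of the original answer), asking for the strongest ordering makes even x86 pay for an explicit barrier: compilers typically implement a sequentially consistent store (added to the snippet above) as an xchg, or a plain store followed by mfence, while Power needs a full sync barrier:

void release_seq_cst() {
    // Stronger than release: typically xchg (or mov + mfence) on
    // x86, and a full sync (hwsync) barrier on Power.
    value.store(0, std::memory_order_seq_cst);
}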

This just punts things down to a lower level, however. The CPU itself keeps the per-core caches coherent using a protocol like MESI.

In the MESI protocol, when a core wants to modify a cache line, it must first gain exclusive ownership of that line; the other cores mark their copies invalid, writing the contents out to main memory first if necessary. This is inefficient, but necessary. For this reason you don't want to pack a bunch of commonly used mutexes or atomic variables into a small region of memory, because you can end up with multiple cores fighting over the same cache line (this is called false sharing). The Wikipedia article is fairly comprehensive and has more detail than I'm giving here.
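A common mitigation (a sketch of my own, not from the original answer) is to align hot variables so each gets its own cache line; a 64-byte line is assumed here, and C++17 names this quantity std::hardware_destructive_interference_size:

#include <atomic>
#include <cstddef>

// Assumed line size; C++17 provides
// std::hardware_destructive_interference_size for this.
constexpr std::size_t kCacheLine = 64;

struct Counters {
    // Without alignas, these two counters could share a cache line,
    // and two cores incrementing them would ping-pong that line
    // between their caches (false sharing).
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};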

Something I'm omitting is the fact that mutexes typically require some kind of kernel-level support in order for threads to go to sleep or wake up.

answered Oct 20 '22 by Dietrich Epp