Modern multicore CPUs keep their caches coherent by snooping: each core broadcasts its memory accesses and watches the broadcasts generated by other cores, so that writes made by core A become visible to core B.
This is good in that if you have data that really does need to be shared between threads, it minimizes the amount of code you have to write to make sure it does get shared.
It's bad in that if you have data that should be local to just one thread, the snooping still happens, constantly dissipating energy to no purpose.
Does the snooping still happen if you declare the relevant variables thread_local? Unfortunately, according to the accepted answer to Can other threads modify thread-local memory?, the answer is yes.
Does any currently extant platform (combination of CPU and operating system) provide any way to turn off snooping for thread-local data? Doesn't have to be a portable way; if it requires issuing OS-specific API calls, or even dropping into assembly, I'm still interested.
Most modern processors use a directory coherence protocol to maintain coherence between all the cores in the same NUMA node, and another directory coherence protocol to maintain coherence between all NUMA nodes and IO hubs in the same coherence domain, where each NUMA node could be an active socket, part of an active socket, or a node controller. A brief introduction to coherence in real processors can be found at: Cache coherency(MESI protocol) between different levels of cache namely L1, L2 and L3.
Directory coherence protocols significantly reduce the need for broadcasting snoops because they maintain additional coherence state per cache line to track which caches may hold a copy of the line. Unnecessary snoops can still occur in some cases, for example when a thread migrates to another core while its lines are still cached on the old one.
The cost of unnecessary snooping is not just extra energy consumption, but also latency, because a request cannot be considered to have completed non-speculatively until all of its coherence transactions have completed. This can significantly increase the time to complete a request, which in turn limits bandwidth, because each outstanding request consumes certain hardware resources.
You don't have to worry about unnecessary snoops to cache lines storing thread-local variables as long as they are truly being used as thread-local and the thread that owns these variables rarely migrates between physical cores.
There is a basic invalidation-based protocol, MESI, which is somewhat foundational. Other protocols extend it, but it serves to minimize the number of bus transactions on a read or write. MESI encodes the states a cache line can be in: Modified, Exclusive, Shared, Invalid. A basic schematic of MESI involves two views. A dash (-) means at most an internal state change, with no external bus operation required. From the CPU to its cache:

            M   E   S   I
    Read    -   -   -   2
    Write   -   -   1   3
These states also "listen" to the external bus, so from the bus to the cache:

            M   E   S   I
    Read    4   -   -   -
    Write   5   -   -   -
So the bus-agents co-operate to only generate the minimum necessary transactions.
Many CPUs, particularly embedded controllers, have CPU-private memory, which could be a great candidate for thread-local storage; however, migrating a thread from one core to another would require chasing down all of its thread-local variables and copying them (somehow) to the new core's private memory.
Depending upon the workload, this may be viable, but for the general workload, minimizing the bus traffic and loosening the affinity is a win.