
Eliding cache snooping for thread-local memory

Modern multicore CPUs keep their per-core caches coherent by snooping, i.e. each core broadcasts its memory accesses and watches the broadcasts generated by other cores, cooperating to make sure writes from core A are seen by core B.

This is good in that if you have data that really does need to be shared between threads, it minimizes the amount of code you have to write to make sure it does get shared.

It's bad in that if you have data that should be local to just one thread, the snooping still happens, constantly dissipating energy to no purpose.

Does the snooping still happen if you declare the relevant variables thread_local? Unfortunately, the answer is yes, according to the accepted answer to Can other threads modify thread-local memory?

Does any currently extant platform (combination of CPU and operating system) provide any way to turn off snooping for thread-local data? Doesn't have to be a portable way; if it requires issuing OS-specific API calls, or even dropping into assembly, I'm still interested.

asked Mar 01 '23 by rwallace


2 Answers

Most modern processors use a directory coherence protocol to maintain coherence between all the cores in the same NUMA node, and another directory coherence protocol to maintain coherence between all NUMA nodes and IO hubs in the same coherence domain, where each NUMA node could be an active socket, part of an active socket, or a node controller. A brief introduction to coherence in real processors can be found at: Cache coherency (MESI protocol) between different levels of cache namely L1, L2 and L3.

Directory coherence protocols significantly reduce the need for broadcasting snoops because they provide additional coherence state per cache line to basically track who may possibly have a copy of the line. Unnecessary snoops can still occur in the following cases:

  • A line gets silently evicted from a core or NUMA node without notifying the directory controller.
  • The directory state may be protected with an error detection code. If the state is deemed corrupted, a broadcast is required.
  • Depending on the microarchitecture, the in-memory directory may not have the capability of tracking cache lines per NUMA node but rather at the granularity of "any other NUMA node."

The cost of unnecessary snooping is not just extra energy consumption, but also latency, because a request cannot be considered to have completed non-speculatively until all of its coherence transactions have completed. This can significantly increase the time to complete a request, which in turn limits bandwidth, because each outstanding request consumes certain hardware resources.

You don't have to worry about unnecessary snoops to cache lines storing thread-local variables as long as they are truly used as thread-local and the thread that owns them rarely migrates between physical cores.

answered Mar 07 '23 by Hadi Brais

There is a basic invalidation-based protocol, MESI, which is somewhat foundational. There are other extensions of it, but it serves to minimize the number of bus transactions on a read or write. MESI encodes the states a cache line can be in: Modified, Exclusive, Shared, Invalid. A basic schematic of MESI involves two views. A dash (-) means possibly an internal state change, but no external operation is required. From the CPU to its cache:

           M   E   S   I
Read       -   -   -   2
Write      -   -   1   3


  1. Issue a bus invalidate, change state to M.
  2. Issue a bus read, change state to S.
  3. Issue a bus read + bus invalidate, change state to M.

Also, these states "listen" to the exterior bus, so from the bus to the cache:

           M   E   S   I
Read       4   -   -   -
Write      5   -   -   -

  4. Flush from cache, change to S.
  5. Flush from cache, change to I.

So the bus-agents co-operate to only generate the minimum necessary transactions.

Many CPUs, particularly embedded controllers, have CPU-private memory, which could be a great candidate for thread-local storage; however, migrating a thread from one core to another would require chasing down all of its thread-local storage variables and copying them (somehow) to the new core's private memory.

Depending upon the workload, this may be viable, but for general workloads, minimizing bus traffic while loosening affinity is the winning trade-off.

answered Mar 07 '23 by mevets