As I understand things, for performance on NUMA systems, there are two cases to avoid:

1. threads in the same socket writing to the same cache line
2. threads in different sockets writing to the same virtual page
A simple example will help. Let's assume I have a two-socket system and each socket has a CPU with two physical cores (and two logical cores, i.e. no Intel Hyper-Threading or AMD two-cores-per-module). Let me borrow the diagram from OpenMP: for schedule:
| socket 0 | core 0 | thread 0 |
| | core 1 | thread 1 |
| socket 1 | core 2 | thread 2 |
| | core 3 | thread 3 |
So, based on case 1, it's best to avoid e.g. thread 0 and thread 1 writing to the same cache line, and based on case 2, it's best to avoid e.g. thread 0 writing to the same virtual page as thread 2.
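To make the two cases concrete, here is a minimal sketch of what I mean (the 64-byte cache-line size, the names `packed`/`spread`, and the loop counts are just assumptions for illustration, not from any particular codebase):

```c
/* Sketch: how per-thread counters can land in case 1 or case 2
 * (assumes 64-byte cache lines and 4 KiB pages). */
#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64          /* assumed cache-line size in bytes   */
#define NTHREADS   4

/* Case 1: adjacent longs share one cache line -> false sharing.     */
long packed[NTHREADS];

/* Padding each counter to its own cache line avoids case 1, but all
 * four counters can still sit on the same virtual page (case 2).    */
struct padded { long value; char pad[CACHE_LINE - sizeof(long)]; };
struct padded spread[NTHREADS];

int main(void)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < 100000000; ++i) {
            packed[t]++;        /* contended: same line as neighbour  */
            spread[t].value++;  /* own line, but possibly same page   */
        }
    }
    printf("%ld %ld\n", packed[0], spread[0].value);
    return 0;
}
```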
However, I have been informed that on modern processors the second case is no longer a concern. Threads on different sockets can write to the same virtual page efficiently (as long as they don't write to the same cache line).
Is case two no longer a problem? And if it is still a problem, what's the correct terminology for it? Is it correct to call both cases a kind of false sharing?
You're right about case 1. Some more details about case 2:
Depending on the operating system's NUMA policy and any page migration that happens later, the physical page that threads 0 and 2 are writing to could live on socket 0 or socket 1. The two cases are symmetrical, so let's say there is a first-touch policy and that thread 0 gets there first. The sequence of operations could then be:

1. Thread 0 touches the page first, so the OS backs it with physical memory attached to socket 0.
2. Thread 0 writes to its cache line on that page; this is a socket-local access.
3. Thread 2 writes to its own cache line on the same page; its request has to cross the inter-socket interconnect to socket 0's memory before the line can be held in a modified state on socket 1.
You could swap the order of 2. and 3. without affecting the outcome. Either way, the round trip between sockets in step 3 is going to take longer than the socket-local access in step 2, but that cost is only incurred once for each time thread 2 needs to put its line into a modified state. If execution continues for long enough in between transitions in the state of that cache line, the extra cost will amortize.
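If page placement does turn out to matter for your workload, the usual way to exploit a first-touch policy is to let each thread initialize (first-touch) the data it will later work on. Below is a minimal sketch, assuming OpenMP with threads pinned to cores (e.g. via `OMP_PROC_BIND`); the array name and size are placeholders:

```c
/* Sketch of first-touch-aware initialization: each thread touches the
 * chunk of the array it will later use, so those pages are allocated
 * on that thread's local NUMA node (assuming threads are pinned).    */
#include <omp.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1L << 24;
    double *a = malloc((size_t)n * sizeof *a);  /* pages not yet placed */

    /* Parallel first touch: the static schedule gives each thread the
     * same contiguous chunk here and in the compute loop below.       */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] = 0.0;

    /* Later accesses by the same thread hit socket-local memory.      */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        a[i] += 1.0;

    free(a);
    return 0;
}
```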