NUMA systems, virtual pages, and false sharing

As I understand things, for performance on NUMA systems, there are two cases to avoid:

  1. threads in the same socket writing to the same cache line (usually 64 bytes)
  2. threads from different sockets writing to the same virtual page (usually 4096 bytes)

A simple example will help. Let's assume I have a two-socket system and each socket has a CPU with two physical cores (and two logical cores, i.e. no Intel Hyper-Threading or AMD two-cores-per-module). Let me borrow the diagram from OpenMP: for schedule

| socket 0 | core 0 | thread 0 |
|          | core 1 | thread 1 |
| socket 1 | core 2 | thread 2 |
|          | core 3 | thread 3 |

So, based on case 1, it's best to avoid e.g. thread 0 and thread 1 writing to the same cache line, and based on case 2, it's best to avoid e.g. thread 0 writing to the same virtual page as thread 2.
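
For concreteness, here's a minimal OpenMP/C sketch of case 1 (not from the question; the struct names, iteration count, and the 64-byte line size are illustrative assumptions), contrasting two counters that share a cache line with two counters padded onto separate lines:

```c
/* A sketch of case 1 (false sharing) vs. a padded layout.
 * Compile with: gcc -O2 -fopenmp false_sharing.c
 * The 64-byte line size and iteration count are assumptions. */
#include <omp.h>
#include <stdio.h>

#define ITERS 100000000L

/* Both counters live in one 64-byte cache line: threads 0 and 1
 * will bounce that line between their cores. */
struct shared_line {
    volatile long a;
    volatile long b;
};

/* Padding each counter out to its own 64-byte line means each core
 * can keep its line in the modified state undisturbed. */
struct padded {
    _Alignas(64) volatile long v;
    char pad[64 - sizeof(long)];
};

int main(void)
{
    struct shared_line s = {0, 0};
    struct padded p[2] = {{0}, {0}};
    double t;

    t = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) {
            if (id == 0) s.a++;  /* same cache line as s.b */
            else         s.b++;
        }
    }
    printf("shared line: %.3f s\n", omp_get_wtime() - t);

    t = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            p[id].v++;           /* separate cache lines */
    }
    printf("padded:      %.3f s\n", omp_get_wtime() - t);
    return 0;
}
```

On a typical machine the padded version should run noticeably faster, since each core can keep its line without it ping-ponging between caches.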

However, I have been informed that on modern processors the second case is no longer a concern. Threads on different sockets can write to the same virtual page efficiently (as long as they don't write to the same cache line).

Is case 2 no longer a problem? And if it is still a problem, what's the correct terminology for it? Is it correct to call both cases a kind of false sharing?

Asked Nov 02 '22 by Z boson

1 Answer

You're right about case 1. Some more details about case 2:

Based on the operating system's NUMA policy and any related migration issues, the physical location of the page that threads 0 and 2 are writing to could be socket 0 or socket 1. The cases are symmetrical, so let's say there's a first-touch policy and that thread 0 gets there first. The sequence of operations could be:

  1. Thread 0 allocates the page.
  2. Thread 0 does a write to the cache line it'll be working on. That cache line transitions from invalid to modified within cache(s) on socket 0.
  3. Thread 2 does a write to the cache line it'll be working on. To put that line in exclusive state, socket 1 has to send a Read For Ownership to socket 0 and receive a response.
  4. Threads 0 and 2 can go about their business. As long as thread 0 doesn't touch thread 2's cache line or vice versa and nobody else does anything that would change the state of either line, all operations that thread 0 and thread 2 are doing are socket- (and possibly core-) local.

You could swap the order of steps 2 and 3 without affecting the outcome. Either way, the round trip between sockets in step 3 will take longer than the socket-local access in step 2, but that cost is only incurred once each time thread 2 needs to put its line into the modified state. If execution continues for long enough between transitions in the state of that cache line, the extra cost amortizes.
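
To make the sequence concrete, here's a minimal OpenMP/C sketch (my assumptions, not part of the answer: Linux, a first-touch policy, 64-byte lines, 4 KiB pages, and one thread pinned per socket via OMP_PLACES) in which both threads write the same virtual page but disjoint cache lines:

```c
/* A sketch of case 2: two threads, pinned to different sockets,
 * writing the same 4 KiB virtual page but disjoint 64-byte lines.
 * Compile with: gcc -O2 -fopenmp same_page.c
 * Run with, e.g.: OMP_PROC_BIND=true OMP_PLACES="{0},{2}" ./a.out
 * (Linux, a first-touch policy, and the CPU numbering from the
 * question's diagram are all assumptions.) */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE  64
#define ITERS 100000000L

int main(void)
{
    /* One page, allocated but not yet touched: under first touch the
     * backing physical page lands on whichever socket faults it in
     * first (step 1 above). */
    volatile char *page = aligned_alloc(4096, 4096);
    if (!page) return 1;

    double t = omp_get_wtime();
    #pragma omp parallel num_threads(2)
    {
        /* Thread 0 writes line 0; thread 1 writes line 32: same
         * virtual page, disjoint cache lines (steps 2 and 3). */
        volatile long *slot =
            (volatile long *)(page + omp_get_thread_num() * 32 * LINE);
        for (long i = 0; i < ITERS; i++)
            (*slot)++;  /* after the one-off RFO, these writes stay
                           socket-local (step 4) */
    }
    printf("same page, separate lines: %.3f s\n", omp_get_wtime() - t);

    free((void *)page);
    return 0;
}
```

The offset of 32 lines is arbitrary; any two distinct line offsets within the page would do, since the point is only that the writers share a page without sharing a line. With the pinning shown, each iteration's write stays socket-local after the one-time cross-socket round trip.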

Answered Nov 17 '22 by Aaron Altman