Why does using the same cache-line from multiple threads not cause serious slowdown?

Look at this snippet:

#include <atomic>
#include <thread>

typedef volatile unsigned char Type;
// typedef std::atomic_uchar Type;

void fn(Type *p) {
    for (int i=0; i<500000000; i++) {
        (*p)++;
    }
}

int main() {
    const int N = 4;

    std::thread thr[N];
    alignas(64) Type buffer[N*64];

    for (int i=0; i<N; i++) {
        thr[i] = std::thread(&fn, &buffer[i*1]);
    }

    for (int i=0; i<N; i++) {
        thr[i].join();
    }
}

This little program increments four adjacent bytes a lot of times from four different threads. Until now I have followed the rule: don't use the same cache line from different threads, because cache-line sharing is bad. So I expected the four-thread version (N=4) to be much slower than the one-thread version (N=1).

However, these are my measurements (on a Haswell CPU):

  • N=1: 1 sec
  • N=4: 1.2 sec

So N=4 is not much slower. If I use different cache lines (replace *1 with *64), then N=4 becomes a little faster: 1.1 sec.

The same measurements for atomic access (swap the comments at typedef), same cache line:

  • N=1: 3.1 sec
  • N=4: 48 sec

So the N=4 case is much slower (as I expected). If different cache lines are used, then N=4 has performance similar to N=1: 3.3 sec.

I don't understand the reason behind these results. Why don't I get a serious slowdown in the non-atomic N=4 case? The four cores have the same memory in their caches, so they must synchronize it somehow, don't they? How can they run almost perfectly in parallel? Why does only the atomic case get a serious slowdown?


I think I need to understand how memory gets updated in this case. In the beginning, no core has buffer in its cache. After one for-loop iteration (in fn), all 4 cores have buffer's cache line in their caches, but each core writes a different byte. How do these cache lines get synchronized (in the non-atomic case)? How does the cache know which byte is dirty? Or is there some other mechanism to handle this case? Why is that mechanism so much cheaper (actually, almost free) than the atomic one?

asked Oct 24 '17 by geza



1 Answer

What you are seeing is basically the effect of the store buffer combined with store-to-load forwarding allowing each core to work mostly independently, despite sharing a cache line. As we will see below, it is truly a weird case where more contention is bad, up to a point, then even more contention suddenly makes things really fast!

Under the conventional view of contention, your code looks like something that will be heavily contended and therefore much slower than ideal. What happens, however, is that as soon as each core gets a single pending write in its write buffer, all later reads can be satisfied from the write buffer (store forwarding), and later writes just go into the buffer as well, even after the core has lost ownership of the cache line. This turns most of the work into a totally local operation. The cache line is still bouncing around between the cores, but it's decoupled from the core execution path and is only needed to actually commit the stores now and then1.

The std::atomic version can't use this magic at all since it has to use locked operations to maintain atomicity and defeat the store buffer, so you see both the full cost of contention and the cost of the long-latency atomic operations2.
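
For concreteness, here is a minimal sketch of the atomic variant the question describes (just the typedef swapped, as in the commented-out line); the key point is that on x86 the increment compiles to a lock-prefixed read-modify-write rather than a plain load/add/store sequence:

#include <atomic>

typedef std::atomic_uchar Type;   // instead of: typedef volatile unsigned char Type;

void fn(Type *p) {
    for (int i = 0; i < 500000000; i++) {
        // operator++ on std::atomic is a full read-modify-write with
        // seq_cst ordering; on x86 it becomes a `lock`-prefixed add/inc,
        // which must own the cache line exclusively and cannot be
        // satisfied out of the store buffer.
        (*p)++;
    }
}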

Let's try to actually collect some evidence that this is what's occurring. All of the discussion below deals with the non-atomic version of the benchmark that uses volatile to force reads and writes from buffer.

Let's first check the assembly, to make sure it's what we expect:

0000000000400c00 <fn(unsigned char volatile*)>:
  400c00:   ba 00 65 cd 1d          mov    edx,0x1dcd6500
  400c05:   0f 1f 00                nop    DWORD PTR [rax]
  400c08:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400c0b:   83 c0 01                add    eax,0x1
  400c0e:   83 ea 01                sub    edx,0x1
  400c11:   88 07                   mov    BYTE PTR [rdi],al
  400c13:   75 f3                   jne    400c08 <fn(unsigned char volatile*)+0x8>
  400c15:   f3 c3                   repz ret

It's straightforward: a five-instruction loop with a byte load, an increment of the loaded byte, a byte store, and finally the loop counter update and a conditional jump back to the top. Here, gcc has missed an optimization by separating the sub and jne (the intervening store inhibits macro-fusion), but overall it's OK, and the store-forwarding latency is going to limit the loop in any case.

Next, let's take a look at the number of L1D misses. Every time a core needs to write into the line that has been stolen away, it will suffer an L1D miss, which we can measure with perf. First, the single threaded (N=1) case:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment

 Performance counter stats for './cache-line-increment':

       1070.188749      task-clock (msec)         #    0.998 CPUs utilized
     2,775,874,257      cycles                    #    2.594 GHz
     2,504,256,018      instructions              #    0.90  insn per cycle
       501,139,187      L1-dcache-loads           #  468.272 M/sec
            69,351      L1-dcache-load-misses     #    0.01% of all L1-dcache hits

       1.072119673 seconds time elapsed

It is about what we expect: essentially zero L1D misses (0.01% of the total, probably mostly from interrupts and other code outside the loop), and just over 500,000,000 hits (matching almost exactly the number of loop iterations). Note also that we can easily calculate the cycles per iteration: about 5.55 (2,775,874,257 cycles / 500,000,000 iterations). This primarily reflects the cost of store-to-load forwarding5, plus one cycle for the increment, which form a carried dependency chain since the same location is repeatedly updated (and volatile means it can't be hoisted into a register).

Let's take a look at the N=4 case:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment

 Performance counter stats for './cache-line-increment':

       5920.758885      task-clock (msec)         #    3.773 CPUs utilized
    15,356,014,570      cycles                    #    2.594 GHz
    10,012,249,418      instructions              #    0.65  insn per cycle
     2,003,487,964      L1-dcache-loads           #  338.384 M/sec
        61,450,818      L1-dcache-load-misses     #    3.07% of all L1-dcache hits

       1.569040529 seconds time elapsed

As expected, the L1 loads jump from 500 million to 2 billion, since there are 4 threads each doing 500 million loads. The number of L1D misses also jumped by about a factor of 1,000, to about 60 million. Still, that number is not a lot compared to the 2 billion loads (and 2 billion stores, not shown, but we know they are there): that's ~33 loads and ~33 stores for every miss, and about 250 cycles between misses.

That doesn't really fit the model of the cache line bouncing around erratically between the cores, where as soon as a core gets the line, another core demands it. We know that lines bounce between cores sharing an L2 in perhaps 20-50 cycles, so the ratio of one miss every 250 cycles seems far too low.

Two Hypotheses

A couple of ideas spring to mind to explain the behavior described above:

  • Perhaps the MESI protocol variant used in this chip is "smart" and recognizes that one line is hot among several cores, but only a small amount of work is being done each time a core gets the line, and the line spends more time moving between L1 and L2 than actually satisfying loads and stores for some core. In light of this, some smart component in the coherence protocol decides to enforce some kind of minimum "ownership time" for each line: after a core gets the line, it will keep it for N cycles, even if demanded by another core (the other cores just have to wait).

    This would help balance out the overhead of cache line ping-pong with real work, at the cost of "fairness" and responsiveness of the other cores, kind of like the trade-off between unfair and fair locks, and counteracting the effect described here, where the faster & fairer the coherency protocol is, the worse some (usually synthetic) loops may perform.

    Now I've never heard of anything like that (and the immediately previous link shows that at least in the Sandy-Bridge era things were moving in the opposite direction), but it's certainly possible!

  • The store-buffer effect described is actually occurring, so most operations can complete almost locally.

Some Tests

Let's try to distinguish between these two cases with some modifications.

Reading and Writing Distinct Bytes

The obvious approach is to change the fn() work function so that the threads still contend on the same cache line, but where store-forwarding can't kick in.

How about we just read from location x and then write to location x + 1? We'll give each thread two consecutive locations (i.e., thr[i] = std::thread(&fn, &buffer[i*2])) so each thread is operating on two private bytes. The modified fn() looks like:

for (int i=0; i<500000000; i++) {
    unsigned char temp = p[0];
    p[1] = temp + 1;
}

The core loop is pretty much identical to earlier:

  400d78:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400d7b:   83 c0 01                add    eax,0x1
  400d7e:   83 ea 01                sub    edx,0x1
  400d81:   88 47 01                mov    BYTE PTR [rdi+0x1],al
  400d84:   75 f2                   jne    400d78

The only thing that's changed is that we write to [rdi+0x1] rather than [rdi].

Now as I mentioned above, the original (same location) loop is actually running fairly slowly at about 5.5 cycles per iteration even in the best-case single-threaded case, because of the loop-carried load->add->store->load... dependency. This new code breaks that chain! The load no longer depends on the store so we can execute everything pretty much in parallel and I expect this loop to run at about 1.25 cycles per iteration (5 instructions / CPU width of 4).

Here's the single threaded case:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment

 Performance counter stats for './cache-line-increment':

        318.722631      task-clock (msec)         #    0.989 CPUs utilized
       826,349,333      cycles                    #    2.593 GHz
     2,503,706,989      instructions              #    3.03  insn per cycle
       500,973,018      L1-dcache-loads           # 1571.815 M/sec
            63,507      L1-dcache-load-misses     #    0.01% of all L1-dcache hits

       0.322146774 seconds time elapsed

So about 1.65 cycles per iteration3, roughly three times faster than incrementing the same location.

How about 4 threads?

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment

 Performance counter stats for './cache-line-increment':

      22299.699256      task-clock (msec)         #    3.469 CPUs utilized
    57,834,005,721      cycles                    #    2.593 GHz
    10,038,366,836      instructions              #    0.17  insn per cycle
     2,011,160,602      L1-dcache-loads           #   90.188 M/sec
       237,664,926      L1-dcache-load-misses     #   11.82% of all L1-dcache hits

       6.428730614 seconds time elapsed

So it's about 4 times slower than the same-location case. Rather than being just a bit slower than the single-threaded case, it is now about 20 times slower. This is the contention you've been looking for! Note also that the number of L1D misses has increased by about a factor of 4 as well, which nicely explains the performance degradation and is consistent with the idea that when store-to-load forwarding can't hide the contention, misses increase by a lot.

Increasing the Distance Between Stores

Another approach would be to increase the distance in time/instructions between the store and the subsequent load. We can do this by incrementing SPAN consecutive locations in the fn() method, rather than always the same location. E.g., if SPAN is 4, increment 4 consecutive locations like:

for (long i=0; i<500000000 / 4; i++) {
    p[0]++;
    p[1]++;
    p[2]++;
    p[3]++;
}

Note that we still perform 500 million increments in total, just spread among 4 bytes. Intuitively you would expect overall performance to increase, since you now have SPAN parallel dependency chains, each about 1/SPAN as long; in the case above you might expect performance to improve by a factor of 4, since the 4 parallel chains can proceed at about 4 times the total throughput.

Here's what we actually get for time (measured in cycles) for the 1-thread and 3-thread4 cases, for SPAN values from 1 to 20:

[Figure: time in cycles versus increment distance (SPAN) for the 1-thread and 3-thread cases]

Initially you see performance increase substantially in both the single- and multi-threaded cases; the increase going from a SPAN of one to two and three is close to what you'd expect from perfect parallelism in both cases.

The single-threaded case reaches an asymptote of about 4.25x faster than the single-location write: at this point the store-forwarding latency isn't the bottleneck and other bottlenecks have taken over (max IPC and store port contention, mostly).

The multi-threaded case is very different, however! Once you hit a SPAN of about 7, the performance rapidly gets worse, leveling out at about 2.5 times worse than the SPAN=1 case and almost 10x worse than the best performance at SPAN=5. What happens is that store-to-load forwarding stops occurring because the store and the subsequent load are far enough apart in time/cycles that the store has already committed to L1, so the load actually has to get the line and participate in MESI.

Also plotted are the L1D misses, which, as mentioned above, are indicative of "cache line transfers" between cores. The single-threaded case has essentially none, and they are uncorrelated with performance. The performance of the multi-threaded case, however, tracks the cache misses almost exactly. With SPAN values in the 2 to 6 range, where store-forwarding is still working, there are proportionally fewer misses: evidently each core is able to "buffer up" more stores between cache-line transfers since its loop is faster.

Another way to think of it is that in the contended case L1D misses are basically constant per unit-time (which makes sense, since they are basically tied to the L1->L2->L1 latency, plus some coherency protocol overhead), so the more work you can do in between the cache line transfers, the better.

Here's the code for the multi-span case:

void fn(Type *p) {
    for (long i=0; i<500000000 / SPAN; i++) {
        for (int j = 0; j < SPAN; j++) {
            p[j]++;
        }
    }
}

The bash script to run perf for all SPAN values from 1 to 20:

PERF_ARGS=${1:--x, -r10}

for span in {1..20}; do
    g++ -std=c++11 -g -O2 -march=native -DSPAN=$span cache-line-increment.cpp -lpthread -o cache-line-increment
    perf stat ${PERF_ARGS} -e cycles,L1-dcache-loads,L1-dcache-load-misses,machine_clears.count,machine_clears.memory_ordering ./cache-line-increment
done

Finally, "transpose" the results into proper CSV:

FILE=result1.csv; for metric in cycles L1-dcache-loads L1-dcache-load-misses; do { echo $metric; grep $metric $FILE | cut -f1 -d,; } > ${metric}.tmp; done && paste -d, *.tmp 

A Final Test

There's a final test that you can do to show that each core is effectively doing most of its work in private: use the version of the benchmark where the threads work on the same location (which doesn't change the performance characteristics) and examine the sum of the final counter values (you'd need int counters rather than char). If everything were atomic, you'd have a sum of 2 billion; in the non-atomic case, how close the total is to that value is a rough measure of how frequently the cores were passing the line around. If the cores are working almost totally privately, the value would be closer to 500 million than to 2 billion, and I'd guess that's what you'll find (a value fairly close to 500 million).
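
A minimal sketch of that test, under the stated assumptions (a single shared int counter that all four threads increment non-atomically; the names here are illustrative, not from the original benchmark):

#include <cstdio>
#include <thread>

typedef volatile int Type;          // int so a sum of up to 2 billion doesn't wrap

void fn(Type *p) {
    for (int i = 0; i < 500000000; i++) {
        (*p)++;                     // non-atomic increment, as in the original loop
    }
}

int main() {
    const int N = 4;
    std::thread thr[N];
    alignas(64) Type counter = 0;   // one shared location for all threads

    for (int i = 0; i < N; i++) {
        thr[i] = std::thread(&fn, &counter);
    }
    for (int i = 0; i < N; i++) {
        thr[i].join();
    }

    // 2,000,000,000 would mean no increments were ever lost; a value near
    // 500,000,000 means each core mostly worked out of its own store buffer
    // and overwrote the other cores' increments when the line moved.
    std::printf("final value: %d\n", static_cast<int>(counter));
}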

With some more clever incrementing, you can even have each thread track how often the value it incremented came from its own last increment rather than another thread's increment (e.g., by using a few bits of the value to stash a thread identifier). With an even more clever test you could practically reconstruct the way the cache line moved around between the cores (is there a pattern, e.g., does core A prefer to hand off to core B?) and which cores contributed most to the final value, etc.

That's all left as an exercise :).
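
If you want a starting point for the first part of that exercise, one possible (hypothetical) encoding keeps the running count in the upper bits of a shared word and stashes the last writer's thread id in the low byte, roughly like this:

#include <cstdio>
#include <thread>

// Hypothetical layout: bits [63:8] = count, bits [7:0] = id of the last writer.
void fn(volatile unsigned long *p, unsigned long id, long *own_count) {
    long own = 0;
    for (long i = 0; i < 500000000; i++) {
        unsigned long v = *p;
        if ((v & 0xff) == id) {
            own++;                           // the value we see came from our own last write
        }
        *p = (((v >> 8) + 1) << 8) | id;     // bump the count, stamp our id
    }
    *own_count = own;
}

int main() {
    const int N = 4;
    std::thread thr[N];
    long own[N] = {};
    alignas(64) volatile unsigned long word = 0;

    for (int i = 0; i < N; i++) {
        thr[i] = std::thread(&fn, &word, (unsigned long)(i + 1), &own[i]);  // ids 1..4, 0 = initial state
    }
    for (int i = 0; i < N; i++) {
        thr[i].join();
    }
    for (int i = 0; i < N; i++) {
        // A high "own" fraction means the thread mostly saw its own previous
        // increment, i.e. it was working privately out of its store buffer.
        std::printf("thread %d saw its own last write %ld of 500000000 times\n", i, own[i]);
    }
}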


1 On top of that, if Intel has a coalescing store buffer where later stores that fully overlap earlier ones kill the earlier stores, it would only have to commit one value to L1 (the latest store) every time it gets the line.

2 You can't really separate the two effects here, but we will do it later by defeating store-to-load forwarding.

3 A bit more than I expected, perhaps because of bad scheduling leading to port pressure. If gcc would just allow the sub and jne to fuse, it runs at 1.1 cycles per iteration (still worse than the 1.0 I'd expect). It will do that if I use -march=haswell instead of -march=native, but I'm not going to go back and change all the numbers.

4 The results hold with 4 threads as well, but I only have 4 cores and I'm running stuff like Firefox in the background, so using one fewer core makes the measurements a lot less noisy. Measuring time in cycles helps a lot too.

5 On this CPU architecture, store forwarding where the load arrives before the store data is ready seems to alternate between 4 and 5 cycles, for an average of 4.5 cycles.

answered Oct 11 '22 by BeeOnRope