 

Low performance with CLWB (cacheline write-backs) to same location vs. cycling through a few lines

Why does the running time of the code below decrease when I increase kNumCacheLines?

In every iteration, the code modifies one of kNumCacheLines cache lines, writes the line back to the DIMM with the clwb instruction, and blocks until the store reaches the memory controller with sfence. This example requires a CPU with CLWB support: an Intel Skylake-server or newer Xeon, or an Ice Lake client processor.

#include <stdlib.h>
#include <stdint.h>

// Older assemblers don't know the clwb mnemonic. Its encoding is
// 66 0F AE /6, so this macro emits a 0x66 prefix byte followed by
// xsaveopt (0F AE /6), which supplies the memory-operand ModRM byte.
#define clwb(addr) \
  asm volatile(".byte 0x66; xsaveopt %0" : "+m"(*(volatile char *)(addr)));

static constexpr size_t kNumCacheLines = 1;

int main() {
  uint8_t *buf = new uint8_t[kNumCacheLines * 64];
  size_t data = 0;
  for (size_t i = 0; i < 10000000; i++) {
    size_t buf_offset = (i % kNumCacheLines) * 64;
    buf[buf_offset] = data++;
    clwb(&buf[buf_offset]);
    asm volatile("sfence" ::: "memory");
  }

  delete [] buf;
}

(editor's note: _mm_sfence() and _mm_clwb(void*) would avoid needing inline asm, but this inline asm looks correct, including the "memory" clobber).

Here are some performance numbers on my Skylake Xeon machine, reported by running time ./bench with different values of kNumCacheLines:

kNumCacheLines  Time (seconds)
1               2.00
2               2.14
3               1.74
4               1.82
5               1.00
6               1.17
7               1.04
8               1.06

Intuitively, I would expect kNumCacheLines = 1 to give the best performance because of hits in the memory controller's write pending queue. But, it is one of the slowest.

As an explanation for the unintuitive slowdown, it is possible that while the memory controller is completing a write to a cache line, it blocks other writes to the same cache line. I suspect that increasing kNumCacheLines increases performance because of higher parallelism available to the memory controller. The running time jumps from 1.82 seconds to 1.00 seconds when kNumCacheLines goes from four to five. This seems to correlate with the fact that the memory controller's write pending queue has space for 256 bytes from a thread [https://arxiv.org/pdf/1908.03583.pdf, Section 5.3].

Note that because buf is smaller than 4 KB, all accesses use the same DIMM (assuming it's aligned so it doesn't cross a page boundary).

asked Nov 07 '22 by Anuj Kalia

1 Answer

This is probably fully explained by CLWB invalidating the cache line on this CPU: it turns out SKX runs clwb the same as clflushopt, i.e. it's a stub implementation for forward compatibility, so persistent-memory software can start using clwb without checking CPU feature levels.

More cache lines means more memory-level parallelism in reloading the invalidated lines for the next store. Or it means the flush has finished before we try to reload. One or the other; there are a lot of details I don't have a specific explanation for.

In each iteration, you store a counter value into a cache line and clwb it. (and sfence). The previous activity on that cache line was kNumCacheLines iterations ago.

We were expecting that these stores could just commit into lines that were already in Exclusive state, but in fact they're going to be Invalid with eviction probably still in flight down the cache hierarchy, depending on exactly when sfence stalls, and for how long.

So each store needs to wait for an RFO (Read For Ownership) to get the line back into cache in Exclusive state before it can commit from the store buffer to L1d.

It seems that you're only getting a factor of 2 speedup from using more cache lines, even though Skylake(-X) has 12 LFBs (i.e. can track 12 in-flight cache lines incoming or outgoing). Perhaps sfence has something to do with that.


The big jump from 4 to 5 is surprising. (Basically two levels of performance, not a continuous transition). That lends some weight to the hypothesis that it's something to do with the store having made it all the way to DRAM before we try to reload, rather than having multiple RFOs in flight. Or at least casts doubt on the idea that it's just MLP for RFOs. CLWB forcing eviction is pretty clearly key, but the specific details of exactly what happens and why there's any speedup is just pure guesswork on my part.

A more detailed analysis might tell us something about microarchitectural details if anyone wants to do one. This hopefully isn't a very normal access pattern so probably we can just avoid doing stuff like this most of the time!

(Possibly related: apparently repeated writes to the same line of Optane DC PM memory are slower than sequential writes, so you don't want write-through caching or an access pattern like this on that kind of non-volatile memory either.)

answered Nov 15 '22 by Peter Cordes