Loading an entire cache line at once to avoid contention for multiple elements of it

Assuming that there are three pieces of data that I need from a heavily contended cache line, is there a way to load all three things "atomically" so as to avoid more than one roundtrip to any other core?

I don't actually need a correctness guarantee of atomicity for a snapshot of all 3 members; I just want all three items to be read in the same clock cycle in the normal case. I want to avoid the case where the cache line arrives, but then an invalidate request comes in before all 3 objects have been read. That would result in the 3rd access needing to send another request to share the line, making contention even worse.

For example,

class alignas(std::hardware_destructive_interference_size) Something {
public:
    std::atomic<std::uint64_t> one;
    std::uint64_t two;
    std::uint64_t three;
};

void bar(std::uint64_t, std::uint64_t, std::uint64_t);
void baz(std::uint64_t);

void f1(Something& something) {
    auto one = something.one.load(std::memory_order_relaxed);
    auto two = something.two;
    if (one == 0) {
        bar(one, two, something.three);
    } else {
        bar(one, two, 0);
    }

}

void f2(Something& something) {
    while (true) {
        baz(something.one.exchange(...));
    }
}

Can I somehow ensure that one, two and three all get loaded together without multiple RFOs under heavy contention (assume f1 and f2 are running concurrently)?

The target architecture / platform for the purposes of this question is Intel x86 Broadwell, but if there is a technique or compiler intrinsic that allows doing something best-effort like this somewhat portably, that would be great as well.

asked May 30 '19 by Curious
1 Answer

Terminology: a load won't generate an RFO because it doesn't need ownership; it only sends a request to share the data. Multiple cores can be reading from the same physical address in parallel, each with a copy of it hot in their L1d cache.

Other cores writing the line will send RFOs which invalidate the shared copy in our cache, though, and yes that could come in after reading one or two elements of a cache line before all have been read. (I updated your question with a description of the problem in those terms.)


Hadi's SIMD load is a good idea to grab all the data with one instruction.

As far as we know, _mm_load_si128() is in practice atomic for its 8-byte chunks, so it can safely replace the .load(mo_relaxed) of the atomic. But see Per-element atomicity of vector load/store and gather/scatter? - there's no clear written guarantee of this.
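
As a minimal sketch of that idea (read_one_two is a made-up helper, not part of the question's code; it assumes the question's Something definition is in scope): one aligned 16-byte load covers one and two. Covering three as well needs a 32-byte load, shown further down.

#include <immintrin.h>
#include <cstdint>

// Sketch only: relies on the in-practice per-element atomicity described
// above, not on any documented guarantee, and the cast past the std::atomic
// member is technically outside the standard.
void read_one_two(const Something& something,
                  std::uint64_t& one, std::uint64_t& two) {
    const __m128i v =
        _mm_load_si128(reinterpret_cast<const __m128i*>(&something)); // movdqa
    one = _mm_cvtsi128_si64(v);      // low 8 bytes:  something.one
    two = _mm_extract_epi64(v, 1);   // high 8 bytes: something.two (SSE4.1)
}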

If you used _mm256_loadu_si256(), beware of GCC's default tuning -mavx256-split-unaligned-load: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? So that's another good reason to use an aligned load, besides needing to avoid a cache-line split.

But we're writing in C++, not asm, so we need to worry about some of the other things that std::atomic with mo_relaxed does for us: specifically, the compiler must not assume that repeated loads from the same address give the same value. You probably need to dereference a volatile __m256i* to roughly simulate what load(mo_relaxed) does.

You can use atomic_thread_fence() if you want stronger ordering; I think in practice C++11 compilers that support Intel intrinsics will order volatile dereferences wrt. fences the same way as std::atomic loads/stores. In ISO C++, volatile objects are still subject to data-race UB, but in real implementations that can for example compile a Linux kernel, volatile can be used for multi-threading. (Linux rolls its own atomics with volatile and inline asm, and this is I think considered supported behaviour by gcc/clang.) Given what volatile actually does (object in memory matches the C++ abstract machine), it basically just automatically works, despite any rules-lawyer concerns that it's technically UB. It's UB that compilers can't know or care about because that's the whole point of volatile.
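
Putting those pieces together, here's a hedged sketch of what a one-load f1 could look like (f1_simd is a made-up name; it assumes the question's Something and bar are in scope, and that the "works in practice" behaviour discussed below actually holds):

#include <immintrin.h>
#include <cstdint>

// Sketch: one aligned 32-byte load through a volatile pointer grabs one, two
// and three together.  The volatile dereference stands in for .load(mo_relaxed);
// nothing here is guaranteed by ISO C++ or by Intel's documentation.
void f1_simd(Something& something) {
    volatile __m256i* p = reinterpret_cast<volatile __m256i*>(&something);
    __m256i v = *p;   // vmovdqa: aligned, no cache-line split (struct is 64-byte aligned)

    std::uint64_t one   = _mm256_extract_epi64(v, 0);  // offset 0:  something.one
    std::uint64_t two   = _mm256_extract_epi64(v, 1);  // offset 8:  something.two
    std::uint64_t three = _mm256_extract_epi64(v, 2);  // offset 16: something.three
    // lane 3 is just padding inside the 64-byte-aligned struct

    if (one == 0)
        bar(one, two, three);
    else
        bar(one, two, 0);
}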

In practice there's good reason to believe that entire aligned 32-byte loads/stores on Haswell and later are atomic. Certainly for reading from L1d into the out-of-order backend, but even for transferring cache lines between cores. (e.g. multi-socket K10 can tear on 8-byte boundaries with HyperTransport, so this really is a separate issue.) The only problem for taking advantage of it is the lack of any written guarantee or CPU-vendor-approved way to detect this "feature".


Other than that, for portable code it could help to hoist auto three = something.three; out of the branch; a branch mispredict would delay the 3rd load, giving the other core much more time to invalidate the line first.

Compilers will probably not respect that source change, though, and will only load it on the path that needs it. Branchless code would always load it, so maybe we should encourage that with

    bar(one, two, one == 0 ? something.three : 0);
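
Spelled out as a full function, a sketch of the branchless version (f1_branchless is a made-up name, reusing the question's Something and bar; whether the compiler really keeps the load of three unconditional is not guaranteed):

void f1_branchless(Something& something) {
    auto one   = something.one.load(std::memory_order_relaxed);
    auto two   = something.two;
    auto three = something.three;           // hoisted: loaded unconditionally
    bar(one, two, one == 0 ? three : 0);    // just a select between registers
}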

Broadwell can run 2 loads per clock cycle (like all mainstream x86 since Sandybridge and K8); uops typically execute in oldest-ready-first order so it's likely (if this load did have to wait for data from another core) that our 2 load uops will execute in the first cycle possible after the data arrives.

The 3rd load uop will hopefully run in the cycle after that, leaving a very small window for an invalidate to cause a problem.

Or on CPUs with only 1-per-clock loads, having all 3 loads adjacent in the asm still reduces the window for invalidations.

But if one == 0 is rare, then three often isn't needed at all, so unconditional loading brings a risk of unnecessary requests for it. So you have to consider that tradeoff when tuning, if you can't cover all the data with one SIMD load.


As discussed in comments, software prefetch could potentially help to hide some of the inter-core latency.

But you have to prefetch much later than you would for a normal array, so finding places in your code that are often running ~50 to ~100 cycles before f1() is called is a hard problem and can "infect" a lot of other code with details unrelated to its normal operation. And you need a pointer to the right cache line.
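
A hypothetical sketch of the shape this takes (the caller and the do_unrelated_work placeholder are made up; the right prefetch distance has to be found by tuning, as described next):

#include <xmmintrin.h>

void caller_of_f1(Something& something) {
    // Request the line early; _MM_HINT_T0 = prefetch into all cache levels.
    _mm_prefetch(reinterpret_cast<const char*>(&something), _MM_HINT_T0);
    do_unrelated_work();   // hypothetical: roughly 50 to 100 cycles of other work
    f1(something);         // demand loads hopefully hit the still-in-flight line
}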

You need the prefetch to be late enough that the demand load happens a few (tens of) cycles before the prefetched data actually arrives. This is the opposite of the normal use-case, where L1d is a buffer to prefetch into and hold data from completed prefetches before demand loads get to them. But here you want load_hit_pre.sw_pf perf events (load hit prefetch), because that means the demand load happened while the data was still in flight, before there's any chance of it being invalidated.

That means tuning is even more brittle and difficult than usual, because instead of a nearly-flat sweet spot for prefetch distance where earlier or later doesn't hurt, earlier hides more latency right up until the point where it allows invalidations, so it's a slope all the way up to a cliff. (And any too-early prefetches just make overall contention even worse.)

answered Oct 01 '22 by Peter Cordes