I have an application with 2 threads: thread A is pinned to core 1 and thread B is pinned to core 2, and core 1 and core 2 are in the same x86 socket.
Thread A does a busy spin on an integer x, and thread B will increase x under some conditions. When thread B decides to increase x, it invalidates the cache line where x is located; according to the x86 MESI protocol, it stores the new x into its store buffer before core 2 receives the invalidate ack, and after core 2 receives the invalidate ack, core 2 flushes the store buffer.
I am wondering: does core 2 flush its store buffer immediately after it receives the invalidate ack? Is there any way I can force the CPU to flush the store buffer from C? Thread A spinning on x in core 1 should see the new value of x as early as possible in my case.
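A minimal sketch of the setup (C11 atomics; the core-pinning calls and thread B's actual condition are left out as placeholders):

```c
#include <stdatomic.h>
#include <stdbool.h>

extern bool some_condition(void);   /* placeholder for thread B's condition */

_Atomic int x = 0;

/* Thread A, pinned to core 1: busy-spin until x changes. */
void thread_a(void)
{
    int seen = atomic_load(&x);
    for (;;) {
        int v;
        while ((v = atomic_load(&x)) == seen)
            ;                        /* busy spin */
        seen = v;
        /* ... react to the new value as early as possible ... */
    }
}

/* Thread B, pinned to core 2: increase x under some conditions. */
void thread_b(void)
{
    if (some_condition())
        atomic_fetch_add(&x, 1);
}
```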
A core always tries to commit its store buffer to L1d cache (making the stores globally visible) as fast as possible, to make room for more stores.
You can use a barrier (like atomic_thread_fence(memory_order_seq_cst)) to make a thread wait for its stores to become globally visible before doing any more loads or stores, but that works by blocking this core, not by speeding up flushing of the store buffer.
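For example (a sketch; the fence makes the writer wait, it doesn't ship the data to the reader any faster):

```c
#include <stdatomic.h>

_Atomic int x;

void publish(int v)
{
    atomic_store_explicit(&x, v, memory_order_relaxed);
    /* On x86 this compiles to MFENCE (or a locked operation): this core
     * stalls until its store buffer has drained, i.e. until the store
     * above is globally visible, before doing any later loads/stores. */
    atomic_thread_fence(memory_order_seq_cst);
}
```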
Obviously, to avoid undefined behaviour in C11, the variable has to be _Atomic. If there's only one writer, you might use tmp = atomic_load_explicit(&x, memory_order_relaxed) and an atomic_store_explicit of tmp+1 to avoid a more expensive seq_cst store or atomic RMW. acq/rel ordering would work too; just avoid the default seq_cst, and avoid an atomic_fetch_add RMW if there's only one writer.
You don't need the whole RMW operation to be atomic if only one thread ever modifies it, and other threads access it read-only.
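That single-writer pattern might look like this (a sketch; the function name is mine):

```c
#include <stdatomic.h>

_Atomic int x;

/* Called only from the single writer thread: the load and store don't
 * need to be one atomic RMW, because no other thread ever modifies x
 * between them. Readers just use atomic_load_explicit. */
void increment(void)
{
    int tmp = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&x, tmp + 1, memory_order_release);
}
```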
Before another core can read data you wrote, it has to make its way from Modified state in the L1d of the core that wrote it out to L3 cache, and from there to the L1d of the reader core.
You might be able to speed this part along, which happens after the data leaves the store buffer. But there's not much you can usefully do. You don't want clflush/clflushopt, which would write back and evict the cache line entirely, so the other core would have to get it from DRAM, if it didn't try to read it at some point along the way (if that's even possible).
Ice Lake has clwb, which (hopefully) leaves the data cached as well as forcing write-back to DRAM. But again, that forces data to actually go all the way to DRAM, not just to a shared outer cache, so it costs DRAM bandwidth and is presumably slower than we'd like. (Skylake-Xeon has it too, but handles it the same as clflushopt. I expect and hope that Ice Lake client/server has/will have a proper implementation.)
Tremont (successor to Goldmont Plus, the atom/silvermont series) has _mm_cldemote (cldemote). That's like the opposite of a SW prefetch: it's an optional performance hint to write the cache line out to L3, but it doesn't force the data to go to DRAM or anything.
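With a compiler and CPU that support it, using the intrinsic might look like this (a sketch; assumes gcc/clang built with -mcldemote, and the instruction is only a hint, executing as a NOP on CPUs without the feature):

```c
#include <immintrin.h>
#include <stdatomic.h>

_Atomic int x;

void publish_and_demote(int v)
{
    atomic_store_explicit(&x, v, memory_order_release);
    /* Hint: push this line out toward shared L3 so the reader core's
     * request can hit there instead of having to snoop our L1d/L2. */
    _mm_cldemote((const void *)&x);
}
```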
Without special instructions, maybe you can write to 8 other locations that alias the same set in L2 and L1d cache, forcing a conflict eviction. That would cost extra time in the writing thread, but could make the data available sooner to other threads that want to read it. I haven't tried this.
And this would probably evict other lines too, costing more L3 traffic, i.e. system-wide shared resources, not just time in the producer thread. You'd only ever consider this for latency, not throughput, unless the other lines were ones you wanted to write and evict anyway.
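An untested sketch of that idea, assuming a typical 32 KiB, 8-way, 64-byte-line L1d (set index = address bits [11:6], so lines 4 KiB apart alias the same L1d set); the buffer, stride, and alignment here are illustrative, not tuned:

```c
#include <stdint.h>
#include <stddef.h>

#define L1_STRIDE 4096                 /* alias distance for one L1d set */

/* GCC/clang extended alignment so the offsets below line up with sets. */
static char dummy[8 * L1_STRIDE] __attribute__((aligned(4096)));

/* Touch 8 lines that map to the same L1d set as *p, hoping to force a
 * conflict eviction of p's line. Note this only pushes it out of L1d
 * into L2; also evicting from L2 would need a larger stride matching
 * the L2 set indexing, and more aliasing lines. */
void force_conflict_evict(const void *p)
{
    uintptr_t set_offset = (uintptr_t)p & (L1_STRIDE - 1);
    for (int i = 0; i < 8; i++)
        *(volatile char *)(dummy + (size_t)i * L1_STRIDE + set_offset) = 0;
}
```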
You need to use atomics.
You can use atomic_thread_fence if you really want to (the question is a bit XY-problem-ish), but it would probably be better to make x atomic and use atomic_store and atomic_load, or maybe something like atomic_compare_exchange_weak.
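For example, if more than one thread might ever increment x, a CAS loop keeps the read-modify-write atomic (a sketch):

```c
#include <stdatomic.h>

_Atomic int x;

void increment(void)
{
    int expected = atomic_load(&x);
    /* On failure, expected is reloaded with the current value of x and
     * we retry with that. */
    while (!atomic_compare_exchange_weak(&x, &expected, expected + 1))
        ;
}
```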