Two different threads within a single process can share a common memory location by reading and/or writing to it.
Usually, such (intentional) sharing is implemented with atomic operations, using the lock prefix on x86. The lock prefix has fairly well-known costs, both for the prefix itself (i.e., the uncontended cost) and additional coherence costs when the cache line is actually shared (true or false sharing).
Here I'm interested in producer-consumer costs, where a single thread P writes to a memory location, and another thread C reads from it, both using plain reads and writes.
What are the latency and throughput of such an operation when performed on separate cores on the same socket, and, in comparison, when performed on sibling hyperthreads on the same physical core, on recent x86 cores?
In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.
Okay, I couldn't find any authoritative source, so I figured I'd give it a go myself.
```cpp
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cstdint>
#include <iostream>

alignas(128) static uint64_t data[SIZE];
alignas(128) static std::atomic<unsigned> shared;
#ifdef EMPTY_PRODUCER
alignas(128) std::atomic<unsigned> unshared;
#endif
alignas(128) static std::atomic<bool> stop_producer;
alignas(128) static std::atomic<uint64_t> elapsed;

static inline uint64_t rdtsc()
{
    unsigned int l, h;
    __asm__ __volatile__ (
        "rdtsc"
        : "=a" (l), "=d" (h)
    );
    return ((uint64_t)h << 32) | l;
}

static void *consume(void *)
{
    uint64_t value = 0;
    uint64_t start = rdtsc();

    for (unsigned n = 0; n < LOOPS; ++n) {
        for (unsigned idx = 0; idx < SIZE; ++idx) {
            value += data[idx] + shared.load(std::memory_order_relaxed);
        }
    }

    elapsed = rdtsc() - start;
    return reinterpret_cast<void*>(value);
}

static void *produce(void *)
{
    do {
#ifdef EMPTY_PRODUCER
        unshared.store(0, std::memory_order_relaxed);
#else
        shared.store(0, std::memory_order_relaxed);
#endif
    } while (!stop_producer);
    return nullptr;
}

int main()
{
    pthread_t consumerId, producerId;
    pthread_attr_t consumerAttrs, producerAttrs;
    cpu_set_t cpuset;

    for (unsigned idx = 0; idx < SIZE; ++idx) {
        data[idx] = 1;
    }
    shared = 0;
    stop_producer = false;

    pthread_attr_init(&consumerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(CONSUMER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&consumerAttrs, sizeof(cpuset), &cpuset);

    pthread_attr_init(&producerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(PRODUCER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&producerAttrs, sizeof(cpuset), &cpuset);

    pthread_create(&consumerId, &consumerAttrs, consume, NULL);
    pthread_create(&producerId, &producerAttrs, produce, NULL);

    pthread_attr_destroy(&consumerAttrs);
    pthread_attr_destroy(&producerAttrs);

    pthread_join(consumerId, NULL);
    stop_producer = true;
    pthread_join(producerId, NULL);

    std::cout << "Elapsed cycles: " << elapsed << std::endl;
    return 0;
}
```
Compile with the following command, adjusting the defines as needed:

```shell
gcc -std=c++11 -DCONSUMER_CPU=3 -DPRODUCER_CPU=0 -DSIZE=131072 -DLOOPS=8000 timing.cxx -lstdc++ -lpthread -O2 -o timing
```
Where:

- CONSUMER_CPU and PRODUCER_CPU are the logical CPUs each thread is pinned to,
- SIZE and LOOPS control the amount of work done by the consumer,
- defining EMPTY_PRODUCER makes the producer write to its own unshared variable instead of shared.
Here are the generated loops:
Consumer thread
```
400cc8:       ba 80 24 60 00          mov    $0x602480,%edx
400ccd:       0f 1f 00                nopl   (%rax)
400cd0:       8b 05 2a 17 20 00       mov    0x20172a(%rip),%eax   # 602400 <shared>
400cd6:       48 83 c2 08             add    $0x8,%rdx
400cda:       48 03 42 f8             add    -0x8(%rdx),%rax
400cde:       48 01 c1                add    %rax,%rcx
400ce1:       48 81 fa 80 24 70 00    cmp    $0x702480,%rdx
400ce8:       75 e6                   jne    400cd0 <_ZL7consumePv+0x20>
400cea:       83 ee 01                sub    $0x1,%esi
400ced:       75 d9                   jne    400cc8 <_ZL7consumePv+0x18>
```
Producer thread, with an empty loop (no writing to shared):
```
400c90:       c7 05 e6 16 20 00 00    movl   $0x0,0x2016e6(%rip)   # 602380 <unshared>
400c97:       00 00 00
400c9a:       0f b6 05 5f 16 20 00    movzbl 0x20165f(%rip),%eax   # 602300 <stop_producer>
400ca1:       84 c0                   test   %al,%al
400ca3:       74 eb                   je     400c90 <_ZL7producePv>
```
Producer thread, writing to shared:
```
400c90:       c7 05 66 17 20 00 00    movl   $0x0,0x201766(%rip)   # 602400 <shared>
400c97:       00 00 00
400c9a:       0f b6 05 5f 16 20 00    movzbl 0x20165f(%rip),%eax   # 602300 <stop_producer>
400ca1:       84 c0                   test   %al,%al
400ca3:       74 eb                   je     400c90 <_ZL7producePv>
```
The program counts the number of CPU cycles consumed, on the consumer's core, to complete the whole loop. We compare the first producer, which does nothing but burn CPU cycles, to the second producer, which disrupts the consumer by repeatedly writing to shared.
My system has an i5-4210U: 2 cores, 2 threads per core. They are exposed by the kernel as:

```
Core#1 → cpu0, cpu2
Core#2 → cpu1, cpu3
```
Result without starting the producer at all:
```
CONSUMER    PRODUCER    cycles for 1M    cycles for 128k
    3          n/a          2.11G             1.80G
```
Results with the empty producer, for 1G operations in total (either 1000 × 1M or 8000 × 128k):
```
CONSUMER    PRODUCER    cycles for 1M    cycles for 128k
    3           3           3.20G             3.26G        # mono
    3           2           2.10G             1.80G        # other core
    3           1           4.18G             3.24G        # same core, HT
```
As expected, since both threads are CPU hogs and each gets a fair share, the producer burning cycles slows the consumer down by about half. That's just CPU contention.

With the producer on cpu#2, as there is no interaction, the consumer runs with no impact from the producer running on another CPU.

With the producer on cpu#1, we see hyperthreading at work.
Results with disruptive producer:
```
CONSUMER    PRODUCER    cycles for 1M    cycles for 128k
    3           3           4.26G             3.24G        # mono
    3           2           22.1G             19.2G        # other core
    3           1           36.9G             37.1G        # same core, HT
```
When we schedule both threads on the same logical thread of the same core, there is no impact. Expected again, as the producer's writes remain local, incurring no synchronization cost.
I cannot really explain why I get much worse performance for hyperthreading than for two cores. Advice welcome.
The killer problem is that the core makes speculative reads. Each time a write to the speculatively read address (or, more correctly, to the same cache line) lands before the read is "fulfilled", the CPU must undo the read (at least on x86), which effectively means it cancels all speculative instructions from that instruction onward.

At some point before the read retires it gets "fulfilled", i.e. no earlier instruction can fail and there is no longer any reason to reissue it, and the CPU can act as if it had executed all preceding instructions.
Other core example
These two threads are playing cache ping-pong in addition to cancelling instructions, so this should be worse than the HT version.
Let's start at some point in the process where the cache line with the shared data has just been marked shared because the Consumer has asked to read it.

So the Consumer can advance in the period between getting the shared cache line and it being invalidated again. It is unclear how many reads can be fulfilled at the same time, most likely 2, as the CPU has 2 read ports. And it probably doesn't need to rerun them once the internal state of the CPU is satisfied that they can't fail between each.
Same core HT
Here the two hyperthreads share the core and must share its resources.

The cache line should stay in the exclusive state all the time, as they share the cache and therefore have no need for the cache-coherence protocol.

Now why does it take so many cycles on the HT core? Let's start with the Consumer having just read the shared value.
So for every read of the shared value the Consumer is reset.
Conclusion
The different-core version apparently advances so much between each cache ping-pong that it performs better than the HT one.
What would have happened if the CPU waited to see if the value had actually changed?
For this test code the HT version would have run much faster, maybe even as fast as the private-write version. The different-core version would not have run faster, as the cache miss was covering the reissue latency.

But if the data had been different, the same problem would arise, except it would be worse for the different-core version, as it would then also have to wait for the cache line and then reissue.

So if the OP can swap some of the roles, letting the time-stamp producer read from the shared location and take the performance hit, it would be better.