Consider N threads doing some asynchronous tasks with a small result value like double or int64_t, so about 8 result values can fit in a single CPU cache line. N is equal to the number of CPU cores.

On one hand, if I just allocate an array of N items, each a double or int64_t, then 8 threads will share a CPU cache line, which seems inefficient.

On the other hand, if I allocate a whole cache line for each double/int64_t, the receiver thread will have to fetch N cache lines, each written by a different CPU core (except one).

So is there an efficient solution for this scenario? The CPU is x86-64. A solution in C++ is preferred.
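For concreteness, the two layouts I'm comparing look something like this (just a sketch; kNumThreads is a placeholder for N):

```cpp
#include <cstdint>

constexpr unsigned kNumThreads = 8;   // placeholder for N = number of CPU cores

// Option 1: densely packed -- all N results share one or two cache lines,
// so every worker's store invalidates the line in the other cores' caches.
int64_t packed_results[kNumThreads];

// Option 2: one whole cache line per result -- no false sharing between
// workers, but the collecting thread has to pull in N separate cache lines.
struct alignas(64) PaddedResult {     // 64 = x86-64 cache-line size
    int64_t value;
};
PaddedResult padded_results[kNumThreads];
```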
Clarification 1: thread launch/exit overhead is not significant because a thread pool is used, so the cost is mostly synchronization on a critical section.
Clarification 2: the parallel batches carry a dependency. The master thread can only launch the next batch of parallel computations after it has collected and processed the results of the previous batch, because those results serve as parameters of the next batch.
Update: I may have misunderstood. Are you looking for fast turnarounds on many tiny batches of work? In that case, you're probably better off with each thread writing to its own cache line, or maybe grouping them in pairs. If each worker thread has to get exclusive ownership (MESI/MESIF/MOESI) of the same cache line to write into it, that will serialize all the cores into some order.
Having the reader thread read the results from N threads lets all those cache misses happen in parallel.
From your comment:
I would like to scatter and gather millions of such parallel calculations per second. In other words, the head thread distributes the work, launches worker threads, then collects the results, does something with them, and then launches parallel computations again.
So you have millions of results to collect, but only one worker thread per core, which means each worker thread has to produce ~100k results.
Give each worker an output array, where it stores consecutive results from different tasks it has finished. The actual arrays might only be 4k entries long or something, with some synchronization to let the writer wrap around and reuse the first half once the reader has started on the second half of that thread's buffer.
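A minimal sketch of that per-thread layout, under the assumptions above (the 4096-entry size and all names here are illustrative, not anything from the question):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kEntriesPerThread = 4096;   // illustrative ring size

// One output buffer per worker thread.  The worker appends consecutive results
// and bumps write_pos; the collector trails behind, bumping read_pos.  Keeping
// each position on its own cache line avoids false sharing between the two
// threads' counters; wrap-around reuse of the front half is only allowed once
// read_pos shows the collector has moved past it.
struct WorkerOutput {
    alignas(64) std::int64_t results[kEntriesPerThread];
    alignas(64) std::atomic<std::size_t> write_pos{0};   // written by the worker only
    alignas(64) std::atomic<std::size_t> read_pos{0};    // written by the collector only
};
```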
When the collector thread reads a result from one of those arrays, it will bring that cache line into its own L2/L1D caches, bringing with it the 7 other results in that same cache line (assuming the usual case where the worker thread has already filled all 8 int64_t slots and won't write that cache line again for this group of tiny tasks).
Or better, collect them in batches aligned to cache lines, so conflict misses don't evict a cache line from the collector's L1D before it gets back to it. (Reduce the chance of this happening by skewing the result arrays with a different offset for each thread, so the collector thread isn't reading N cache lines that are all offset from each other by a multiple of 4kiB or something.)
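One way to get that skew (a sketch: the one-extra-cache-line-per-thread stride and the aligned-new allocation are just illustrative choices):

```cpp
#include <cstddef>
#include <cstdint>
#include <new>      // aligned operator new (C++17)

constexpr std::size_t kResultsPerThread = 4096;                        // illustrative
constexpr std::size_t kArrayBytes = kResultsPerThread * sizeof(std::int64_t);
constexpr std::size_t kStride     = kArrayBytes + 64;                  // skew: one extra line per thread

// One 64-byte-aligned backing allocation.  Thread t's array starts at
// t * kStride, so consecutive arrays are NOT offset from each other by a
// multiple of 4 KiB and don't all compete for the same L1D sets.
std::byte* backing = nullptr;

void init_result_arrays(unsigned num_threads) {
    backing = static_cast<std::byte*>(
        ::operator new(num_threads * kStride, std::align_val_t{64}));
}

std::int64_t* results_for(unsigned tid) {
    return reinterpret_cast<std::int64_t*>(backing + tid * kStride);
}
```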
If you can use a sentinel value in your output arrays, that's probably ideal. If the collector sees that, it knows it got ahead of the worker and should check other threads. (Or sleep if it got through all output arrays without finding new results).
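A sketch of the sentinel approach; picking INT64_MIN as the "not written yet" value is an assumption that only holds if real results can never take that value:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <limits>

// Assumed sentinel: a value that real results can never take.
constexpr std::int64_t kNotReady = std::numeric_limits<std::int64_t>::min();

// Worker side: results[i].store(value, std::memory_order_release);
// Collector side: consume results from one worker's (sentinel-initialized)
// array until it reaches a slot the worker hasn't written yet, then return so
// the caller can move on to the other workers (or sleep if nobody had news).
template <class Consume>
std::size_t drain_one_worker(const std::atomic<std::int64_t>* results,
                             std::size_t start, std::size_t size,
                             Consume&& consume) {
    std::size_t i = start;
    for (; i < size; ++i) {
        std::int64_t v = results[i].load(std::memory_order_acquire);
        if (v == kNotReady)
            break;                     // got ahead of the worker
        consume(v);
    }
    return i;                          // resume position for the next pass
}
```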
Otherwise you also need current-output-position shared variables which the workers update (with a release-store) after writing the output array. (Maybe batch these position-counter updates to one per 8 array results. But make sure you do it with a pure atomic store, not a += 8. Since the producer thread is the only one that writes that variable, it would be silly to have the overhead of a lock add.)
These position counters would easily cause false sharing between worker threads if packed into one array, and they also definitely need to be in cacheable memory (not UC or WC memory), so a worker thread can rewrite its counter in place efficiently. So you definitely want each thread to have its own cache line for these. The collector will just have to suffer the penalty of reading N different cache lines (and probably suffering memory mis-speculation machine clears; see What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?).
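A sketch of those counters, assuming per-thread output arrays as above; the batch size of 8 matches one cache line of int64_t results, and safe wrap-around handling is omitted:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kRingSize = 4096;            // illustrative, power of 2

// Per-worker state: the results ring plus a published-count that lives on its
// own cache line so the workers' counters don't false-share with each other.
struct PerWorker {
    alignas(64) std::int64_t results[kRingSize];
    alignas(64) std::atomic<std::size_t> published{0};   // results visible to the collector
};

// Worker side: write 8 results with plain stores, then publish them all with
// ONE release-store of the new count.  A plain store is enough because this
// thread is the only writer of `published`; a fetch_add (lock add) would be
// pointless overhead.  (Feeding the collector's position back for safe
// wrap-around reuse is omitted here.)
void publish_batch(PerWorker& me, const std::int64_t (&batch)[8], std::size_t& local_count) {
    for (int i = 0; i < 8; ++i)
        me.results[(local_count + i) % kRingSize] = batch[i];
    local_count += 8;
    me.published.store(local_count, std::memory_order_release);
}

// Collector side: an acquire-load of the counter makes every result up to
// that index visible.
std::size_t available(const PerWorker& w) {
    return w.published.load(std::memory_order_acquire);
}
```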
Actually, the best option in that case would probably be to use one of the 8 qwords in every cache line of the output arrays as a "complete" flag or bitmap, so the collector thread can check that to see if the 7 results in a cache line are ready.
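A sketch of that 7-results-plus-flag layout (the bitmap encoding is just one way to do it):

```cpp
#include <atomic>
#include <cstdint>

// One cache line = 7 results + 1 qword used as a "which slots are ready" bitmap.
// Only the owning worker ever writes `ready`, so it can rebuild the bitmap with
// a plain read-modify-write and publish it with a release-store; the collector
// polls it with an acquire-load and only reads slots whose bits are set.
struct alignas(64) ResultLine {
    std::int64_t value[7];
    std::atomic<std::uint64_t> ready{0};          // bit i set => value[i] is valid
};
static_assert(sizeof(ResultLine) == 64, "exactly one cache line");

// Worker: publish one slot of this line.
void publish(ResultLine& line, unsigned slot, std::int64_t v) {
    line.value[slot] = v;
    std::uint64_t bits = line.ready.load(std::memory_order_relaxed) | (std::uint64_t{1} << slot);
    line.ready.store(bits, std::memory_order_release);   // value store ordered before the flag
}

// Collector: bitmap of slots that are safe to read.
std::uint64_t ready_slots(const ResultLine& line) {
    return line.ready.load(std::memory_order_acquire);
}
```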
If just getting the results from worker to collector threads is your main bottleneck, then your threading is probably too fine-grained. You should break your tasks up more coarsely, or have each worker thread do some of the combining on the multiple results it produced, while they're still hot in its L1D. That's much better bandwidth than getting them to another core through L3 or DRAM.
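For example, if the per-batch "does something on it" step is an associative reduction (a sum here, purely as an assumption), each worker can fold its own results while they're hot in its L1D and hand the collector one value per chunk:

```cpp
#include <cstddef>
#include <cstdint>

// Instead of exporting every tiny result, the worker combines the results of
// the tasks it ran while they're still in registers/L1D, and publishes a single
// partial value per chunk; the collector then merges N partials instead of
// N * ~100k individual results.
std::int64_t run_and_combine(const std::int64_t* inputs, std::size_t count,
                             std::int64_t (*tiny_task)(std::int64_t)) {
    std::int64_t partial = 0;
    for (std::size_t i = 0; i < count; ++i)
        partial += tiny_task(inputs[i]);   // assumed reduction: a sum
    return partial;                        // one value to hand to the collector
}
```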
If the number of accesses/writes by the worker threads far exceeds the reporting to / reading by the head/master thread, then giving each worker its own cache line is the better trade-off: avoiding false sharing on the side that's written constantly matters more than the master paying to read N cache lines once per batch.