Does anyone have any tips for efficiently parallelizing std::partition using TBB? Has this been done already?
Here is what I'm thinking:
*I am hoping in the average case this region will be small compared to the length of the array or compared to the swaps required if partitioning the array in contiguous chunks.
Any thoughts before I try it?
I'd treat it as a degenerate case of parallel sample sort. (Parallel code for sample sort can be found here.) Let N be the number of items. The degenerate sample sort will require Θ(N) temporary space, has Θ(N) work, and Θ(P+ lg N) span (critical path). The last two values are important for analysis, since speedup is limited to work/span.
I'm assuming the input is a random-access sequence. The steps are:
There is a way to embed the scan of step 4 into steps 3 and 5, so that the span can be reduced to Θ(lg N), but I doubt it's worth the additional complexity.
If using tbb::parallel_for loops to parallelize steps 3 and 5, consider using affinity_partitioner to help threads in step 5 pick up what they left in cache from step 3.
Note that partitioning requires only Θ(N) work for Θ(N) memory loads and stores. Memory bandwidth could easily become the limiting resource for speedup.
Why not to parallel something similar to std::partition_copy
instead? The reasons are:
std::partition
, in-place swaps as in Adam's solution require logarithmic complexity due to recursive merge of the results.It's pretty straight-forward to apply a parallel_for
(for random-access iterators) or tbb::parallel_for_each
(for non-random-access iterators) to start processing the input range. each task can store the 'true' and 'false' results independently. There are lots of ways to store the results, some from the top of my head:
tbb::parallel_reduce
(only for random-access iterators), store the results locally to the task body and move-append them in join()
from another tasktbb::concurrent_vector
's method grow_by()
to copy local results in a bunch or just push()
each result separately on arrival.tbb::combinable
TLS container and combine them laterThe exact semantics of std::partition_copy
can be achieved by copy from the temporary storage from above or
atomic<size_t>
cursors to synchronize where to store the results (assuming there is enough space)If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With