 

How to parallelize std::partition using TBB

Does anyone have any tips for efficiently parallelizing std::partition using TBB? Has this been done already?

Here is what I'm thinking:

  1. if the array is small, std::partition it (serial) and return
  2. else, treat the array as 2 interleaved arrays using custom iterators (interleave in cache-sized blocks)
  3. start a parallel partition task for each pair of iterators (recurse to step 1)
  4. swap elements between the two partition/middle pointers*
  5. return the merged partition/middle pointer

*I am hoping that, in the average case, this region will be small relative to the length of the array, or relative to the number of swaps required when partitioning the array in contiguous chunks.
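For comparison, here is a minimal serial-recursive sketch of this kind of scheme, with a few simplifications: the two halves are contiguous rather than interleaved, `std::rotate` performs the merge instead of swapping between the two middle pointers, and `std::async` stands in for a TBB task so the example compiles without TBB:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <iterator>
#include <vector>

// Serial cutoff below which we just call std::partition.
constexpr std::ptrdiff_t kCutoff = 8;

template <typename It, typename Pred>
It parallel_partition(It first, It last, Pred pred) {
    std::ptrdiff_t n = std::distance(first, last);
    if (n <= kCutoff) return std::partition(first, last, pred);
    It mid = first + n / 2;
    // Partition the two halves concurrently; std::async stands in for a TBB task.
    auto left = std::async([=] { return parallel_partition(first, mid, pred); });
    It m2 = parallel_partition(mid, last, pred);
    It m1 = left.get();
    // Merge: [m1, mid) holds "false" items, [mid, m2) holds "true" items;
    // rotating them past each other yields the combined partition point.
    return std::rotate(m1, mid, m2);
}
```

The rotate-based merge is what makes the recursive approach's span logarithmic rather than constant; the question's swap-between-middles idea is an attempt to cheapen exactly this step.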

Any thoughts before I try it?

asked May 28 '14 by atb

2 Answers

I'd treat it as a degenerate case of parallel sample sort. (Parallel code for sample sort can be found here.) Let N be the number of items. The degenerate sample sort will require Θ(N) temporary space, has Θ(N) work, and Θ(P + lg N) span (critical path). The last two values are important for analysis, since speedup is limited to work/span.

I'm assuming the input is a random-access sequence. The steps are:

  1. Allocate a temporary array big enough to hold a copy of the input sequence.
  2. Divide the input into K blocks. K is a tuning parameter. For a system with P hardware threads, K=max(4*P,L) might be good, where L is a constant for avoiding ridiculously small blocks. The "4*P" allows some load balancing.
  3. Move each block to its corresponding position in the temporary array and partition it using std::partition. Blocks can be processed in parallel. Remember the offset of the "middle" for each block. You might want to consider writing a custom routine that both moves (in the C++11 sense) and partitions a block.
  4. Compute the offset to where each part of a block should go in the final result. The offsets for the first part of each block can be done using an exclusive prefix sum over the offsets of the middles from step 3. The offsets for the second part of each block can be computed similarly by using the offset of each middle relative to the end of its block. The running sums in the latter case become offsets from the end of the final output sequence. Unless you're dealing with more than 100 hardware threads, I recommend using a serial exclusive scan.
  5. Move the two parts of each block from the temporary array back to the appropriate places in the original sequence. Copying each block can be done in parallel.
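The five steps above might be sketched like this (serial, with comments marking where `tbb::parallel_for` would go; one liberty taken: the false-part scan runs forward from the global middle instead of backward from the end, which produces the same final layout):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Serial sketch of the block/scan/scatter scheme described above.
// Returns the index of the partition point.
template <typename T, typename Pred>
std::size_t blocked_partition(std::vector<T>& a, Pred pred, std::size_t K) {
    const std::size_t n = a.size();
    std::vector<T> tmp(a);                               // step 1: temporary copy
    std::vector<std::size_t> begin(K + 1), midoff(K);
    for (std::size_t k = 0; k <= K; ++k)                 // step 2: K blocks
        begin[k] = k * n / K;
    for (std::size_t k = 0; k < K; ++k) {                // step 3 (parallel loop)
        auto m = std::partition(tmp.begin() + begin[k],
                                tmp.begin() + begin[k + 1], pred);
        midoff[k] = m - (tmp.begin() + begin[k]);        // size of the "true" part
    }
    // Step 4: exclusive prefix sums give each part's destination offset.
    std::vector<std::size_t> trueDst(K), falseDst(K);
    std::size_t t = 0;
    for (std::size_t k = 0; k < K; ++k) { trueDst[k] = t; t += midoff[k]; }
    std::size_t f = t;                                   // total "true" count = global middle
    for (std::size_t k = 0; k < K; ++k) {
        falseDst[k] = f;
        f += (begin[k + 1] - begin[k]) - midoff[k];
    }
    for (std::size_t k = 0; k < K; ++k) {                // step 5 (parallel loop)
        std::copy(tmp.begin() + begin[k],
                  tmp.begin() + begin[k] + midoff[k], a.begin() + trueDst[k]);
        std::copy(tmp.begin() + begin[k] + midoff[k],
                  tmp.begin() + begin[k + 1], a.begin() + falseDst[k]);
    }
    return t;
}
```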

There is a way to embed the scan of step 4 into steps 3 and 5, so that the span can be reduced to Θ(lg N), but I doubt it's worth the additional complexity.

If using tbb::parallel_for loops to parallelize steps 3 and 5, consider using affinity_partitioner to help threads in step 5 pick up what they left in cache from step 3.

Note that partitioning requires only Θ(N) work for Θ(N) memory loads and stores. Memory bandwidth could easily become the limiting resource for speedup.

answered Oct 01 '22 by Arch D. Robison

Why not parallelize something similar to std::partition_copy instead? The reasons are:

  • for std::partition, in-place swaps as in Adam's solution add a logarithmic factor to the span, due to the recursive merging of results.
  • you pay memory for parallelism anyway when using threads and tasks.
  • if the objects are heavy, it is more reasonable to swap (shared) pointers anyway.
  • if the results can be stored concurrently, then threads can work independently.

It's pretty straightforward to apply tbb::parallel_for (for random-access iterators) or tbb::parallel_for_each (for non-random-access iterators) to start processing the input range. Each task can store the 'true' and 'false' results independently. There are lots of ways to store the results, some off the top of my head:

  • using tbb::parallel_reduce (only for random-access iterators), store the results locally to the task body and move-append them in join() from another task
  • use tbb::concurrent_vector's grow_by() method to copy local results in a batch, or push_back() each result separately on arrival
  • cache thread-local results in a tbb::combinable TLS container and combine them later
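As a rough illustration of the first idea, here is a sketch using `std::async` with per-chunk local vectors standing in for `tbb::parallel_reduce` bodies (or `tbb::combinable` slots), so it compiles without TBB; joining the locals in chunk order even makes the result stable:

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Each chunk collects its "true" and "false" results locally; an ordered
// join then appends the locals, mimicking parallel_reduce's join() step.
template <typename T, typename Pred>
std::pair<std::vector<T>, std::vector<T>>
parallel_partition_copy(const std::vector<T>& in, Pred pred, std::size_t chunks) {
    struct Local { std::vector<T> t, f; };
    std::vector<std::future<Local>> futs;
    const std::size_t n = in.size();
    for (std::size_t c = 0; c < chunks; ++c) {
        std::size_t lo = c * n / chunks, hi = (c + 1) * n / chunks;
        futs.push_back(std::async([&in, &pred, lo, hi] {
            Local l;
            for (std::size_t i = lo; i < hi; ++i)
                (pred(in[i]) ? l.t : l.f).push_back(in[i]);
            return l;
        }));
    }
    std::pair<std::vector<T>, std::vector<T>> out;
    for (auto& fu : futs) {  // ordered join preserves relative order
        Local l = fu.get();
        out.first.insert(out.first.end(), l.t.begin(), l.t.end());
        out.second.insert(out.second.end(), l.f.begin(), l.f.end());
    }
    return out;
}
```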

The exact semantics of std::partition_copy can be achieved by copying from the temporary storage above, or:

  • (only for random-access output iterators) use std::atomic<size_t> cursors to synchronize where to store the results (assuming there is enough space)
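A sketch of this atomic-cursor variant, again with `std::async` in place of TBB tasks; `fetch_add` claims destination slots in preallocated output arrays, so the output order is nondeterministic:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Two atomic cursors hand out destination slots; the caller preallocates
// outTrue and outFalse (each of size in.size() is always enough).
// Returns the final counts of true and false results.
template <typename T, typename Pred>
std::pair<std::size_t, std::size_t>
atomic_partition_copy(const std::vector<T>& in, std::vector<T>& outTrue,
                      std::vector<T>& outFalse, Pred pred, std::size_t chunks) {
    std::atomic<std::size_t> tCur{0}, fCur{0};
    std::vector<std::future<void>> futs;
    const std::size_t n = in.size();
    for (std::size_t c = 0; c < chunks; ++c) {
        std::size_t lo = c * n / chunks, hi = (c + 1) * n / chunks;
        futs.push_back(std::async([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) {
                if (pred(in[i]))
                    outTrue[tCur.fetch_add(1, std::memory_order_relaxed)] = in[i];
                else
                    outFalse[fCur.fetch_add(1, std::memory_order_relaxed)] = in[i];
            }
        }));
    }
    for (auto& f : futs) f.get();
    return {tCur.load(), fCur.load()};
}
```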
answered Oct 01 '22 by Anton