 

In theory, is find_end parallelizable?

I'm currently working on an open-std proposal to bring parallel functionality to the project I am working on, but I've run into a road block with find_end.

Now find_end can be described as:

An algorithm that searches for the last occurrence of the subsequence of elements [s_first, s_last) in the range [first, last). The first version uses operator== to compare the elements, the second version uses the given binary predicate p.

Its requirements are laid out on cppreference. I had no problem parallelizing find / find_if / find_if_not etc.: those split easily into separate partitions that are searched asynchronously. The problem with find_end is that naively splitting the range into chunks is not a solution, because if we have, say, the vector:

1 2 3 4 5 1 2 3 8

and we want to search for 1 2.

OK, first I separate the vector into chunks and search for the range in each chunk asynchronously, right? That seemed easy enough to me, but what happens if there are only 3 available cores, so the vector is split into 3 chunks:

1 2 3|4 5 1|2 3 8

Now I have a problem: the second 1 2 occurrence is split across two partitions. This will produce invalid results whenever someone's x cores happen to split a match across partition boundaries. I was thinking of a recursive-style pass: search the chunks, merge y chunks into y/2 chunks, search again, and so on, but that just seems so inefficient, and the whole point of this algorithm is to improve efficiency. I might be overthinking this as well.
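For reference, the straightforward chunked dispatch that works fine for find / find_if looks roughly like this (a simplified sketch, not my actual proposal code; the helper name is made up):

    #include <algorithm>
    #include <future>
    #include <iterator>
    #include <utility>
    #include <vector>

    // Simplified sketch: each worker scans its own partition with std::find_if,
    // and because the partitions are in order, the first partition that reports
    // a hit holds the overall first match.
    template <class Iter, class Pred>
    Iter parallel_find_if(Iter first, Iter last, Pred pred, std::size_t num_chunks)
    {
        const auto len = static_cast<std::size_t>(std::distance(first, last));
        if (len == 0 || num_chunks == 0) return last;
        const std::size_t chunk = (len + num_chunks - 1) / num_chunks;

        std::vector<std::pair<std::future<Iter>, Iter>> tasks;  // {result, end of that chunk}
        for (std::size_t i = 0; i * chunk < len; ++i) {
            Iter c_first = std::next(first, static_cast<std::ptrdiff_t>(i * chunk));
            Iter c_last  = std::next(first, static_cast<std::ptrdiff_t>(std::min(len, (i + 1) * chunk)));
            tasks.emplace_back(std::async(std::launch::async,
                                   [=] { return std::find_if(c_first, c_last, pred); }),
                               c_last);
        }

        for (auto& t : tasks) {
            Iter r = t.first.get();
            if (r != t.second) return r;   // found in this chunk -> earliest overall match
        }
        return last;
    }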

tl;dr, is there a way to parallelize find_end in a way I am not thinking of?

Syntactic Fructose asked Jul 24 '14 18:07


2 Answers

Yes, there is a way.

Let N be the size of the range you are looking for.

Once you've separated your vector into 3 chunks (3 separate worker threads):

1 2 3|4 5 1|2 3 8

You can allow each thread to run N-1 elements into its right-adjacent chunk (if any); since only read operations are performed on the sequence, this is perfectly fine and thread-safe.

In this case (N = 2):

  • Core 1 runs on 1 2 3 4

  • Core 2 runs on 4 5 1 2

  • Core 3 runs on 2 3 8
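
A rough sketch of what that could look like (the chunking details and the helper name are mine, not a reference implementation):

    #include <algorithm>
    #include <future>
    #include <iterator>
    #include <utility>
    #include <vector>

    // Each worker runs std::find_end on its own chunk, extended N-1 elements
    // into the right-adjacent chunk so matches straddling a boundary are still
    // seen. Only reads happen on the sequence, so the overlap is thread-safe.
    template <class Iter>
    Iter parallel_find_end(Iter first, Iter last, Iter s_first, Iter s_last,
                           std::size_t num_chunks)
    {
        const auto n   = static_cast<std::size_t>(std::distance(s_first, s_last));  // N
        const auto len = static_cast<std::size_t>(std::distance(first, last));
        if (n == 0 || len < n || num_chunks == 0) return last;

        const std::size_t chunk = (len + num_chunks - 1) / num_chunks;
        std::vector<std::pair<std::future<Iter>, Iter>> tasks;  // {result, end of search range}

        for (std::size_t i = 0; i * chunk < len; ++i) {
            const std::size_t lo = i * chunk;
            const std::size_t hi = std::min(len, (i + 1) * chunk + (n - 1));  // overlap N-1 to the right
            Iter c_first = std::next(first, static_cast<std::ptrdiff_t>(lo));
            Iter c_last  = std::next(first, static_cast<std::ptrdiff_t>(hi));
            tasks.emplace_back(std::async(std::launch::async,
                                   [=] { return std::find_end(c_first, c_last, s_first, s_last); }),
                               c_last);
        }

        // A match that *starts* in chunk i is always found by chunk i thanks to
        // the overlap, so the chunks' matches are ordered and the last hit wins.
        Iter result = last;
        for (auto& t : tasks) {
            Iter r = t.first.get();
            if (r != t.second) result = r;
        }
        return result;
    }

Each chunk's std::find_end already returns the last match within that chunk, so keeping the result of the last chunk that found anything yields the overall last occurrence.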

quantdev answered Nov 17 '22 04:11


Since the point of find_end is to find the last occurrence of a needle in a haystack, parallelizing by splitting the haystack into contiguous segments often produces no benefit: if the needle is actually in the last segment, the work done by every processor other than the one assigned to that segment is wasted, and the elapsed time is precisely what it would have been with a single processor. In theory, parallel evaluation lets you cap the maximum search time, which is of benefit if (1) the processors are not in competition with other tasks and (2) there are relatively few instances of the needle in the haystack.

In addition, you need to be able to coordinate process termination: each process can abandon the search when it finds a match, or when the process handling the segment nearer the end of the haystack has either found a match or abandoned its own search. Once process 0 (the one handling the final segment) has found a match or run out of places to look, the lowest-index process with a match wins.
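
A minimal sketch of that coordination, assuming the segments are numbered from the end of the haystack (the indices and helper name are mine):

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Segment 0 is the rightmost, so the lowest-numbered segment with a match
    // necessarily holds the last occurrence. Each worker scans its segment's
    // candidate start positions right-to-left and abandons as soon as a
    // lower-numbered segment reports a match.
    template <class T>
    std::ptrdiff_t segmented_find_end(const std::vector<T>& hay,
                                      const std::vector<T>& needle,
                                      std::size_t num_segments)
    {
        const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(needle.size());
        const std::ptrdiff_t h = static_cast<std::ptrdiff_t>(hay.size());
        if (n == 0 || h < n || num_segments == 0) return -1;

        const auto k   = static_cast<std::ptrdiff_t>(num_segments);
        const auto seg = (h + k - 1) / k;                      // segment length
        std::atomic<std::size_t> best_seg{num_segments};       // lowest segment index with a match
        std::vector<std::ptrdiff_t> found(num_segments, -1);   // match start found per segment

        std::vector<std::future<void>> tasks;
        for (std::size_t i = 0; i < num_segments; ++i) {
            tasks.push_back(std::async(std::launch::async, [&, i] {
                const std::ptrdiff_t hi = h - static_cast<std::ptrdiff_t>(i) * seg;  // exclusive end
                const std::ptrdiff_t lo = std::max<std::ptrdiff_t>(0, hi - seg);     // start
                // Candidate start positions owned by this segment; a match may
                // read into the next segment, which is safe because these are reads.
                for (std::ptrdiff_t p = std::min(hi - 1, h - n); p >= lo; --p) {
                    if (best_seg.load() < i) return;  // a segment nearer the end already has a match
                    if (std::equal(needle.begin(), needle.end(), hay.begin() + p)) {
                        found[i] = p;
                        std::size_t cur = best_seg.load();
                        while (i < cur && !best_seg.compare_exchange_weak(cur, i)) {}
                        return;
                    }
                }
            }));
        }
        for (auto& t : tasks) t.wait();

        const std::size_t winner = best_seg.load();
        return winner < num_segments ? found[winner] : -1;  // start index of the last match, or -1
    }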

An alternative is to interleave the searches. If you have k processors, then processor 0 is given the sequences which end at last-0, last-k, last-2k..., processor 1 is given the sequences which end at last-1, last-k-1, last-2k-1... and in general processor i (0 ≤ i < k) works on last-i, last-k-i, last-2k-i...

Process coordination is slightly different from the first alternative. Again, each individual process can stop as soon as it finds a match. Also, any process can stop as soon as its current target ends earlier in the haystack than the latest match found by another process.
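
A sketch of the interleaved variant under the same assumptions (the helper name is mine; "best" records the end position of the latest match found so far):

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Worker i tests the candidate matches ending at last-i, last-k-i,
    // last-2k-i, ...; it stops once its next candidate could no longer beat
    // the best match published so far.
    template <class T>
    std::ptrdiff_t interleaved_find_end(const std::vector<T>& hay,
                                        const std::vector<T>& needle,
                                        std::size_t k /* number of workers */)
    {
        const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(needle.size());
        const std::ptrdiff_t h = static_cast<std::ptrdiff_t>(hay.size());
        if (n == 0 || h < n || k == 0) return -1;

        std::atomic<std::ptrdiff_t> best{-1};  // one-past-the-end index of the best match so far
        std::vector<std::future<void>> tasks;

        for (std::size_t i = 0; i < k; ++i) {
            tasks.push_back(std::async(std::launch::async, [&, i] {
                for (std::ptrdiff_t end = h - static_cast<std::ptrdiff_t>(i);
                     end >= n; end -= static_cast<std::ptrdiff_t>(k)) {
                    if (end < best.load()) return;  // another worker already found a later match
                    if (std::equal(needle.begin(), needle.end(), hay.begin() + (end - n))) {
                        std::ptrdiff_t cur = best.load();
                        while (end > cur && !best.compare_exchange_weak(cur, end)) {}
                        return;
                    }
                }
            }));
        }
        for (auto& t : tasks) t.wait();

        const std::ptrdiff_t b = best.load();
        return b < 0 ? -1 : b - n;  // start index of the last occurrence, or -1
    }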

While that should result in reasonable parallelization of the search, it's not clear to me that it will do better than a non-parallelized linear-time algorithm such as Knuth-Morris-Pratt or Boyer-Moore, either of which can be trivially modified to search right-to-left. These algorithms are particularly useful in the not-uncommon case where the needle is a compile-time constant, since the necessary shift tables can then be precomputed. The non-interleaved parallelization can also benefit from KMP or BM, with the same caveat as above: most of the participating processes are likely to prove not to have been useful.
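
For illustration, the right-to-left idea on its own, single-threaded: in C++17 you can run std::search with a std::boyer_moore_searcher over reverse iterators, which yields the last occurrence directly (a sketch, assuming the element type is usable with std::hash, which boyer_moore_searcher requires by default):

    #include <algorithm>
    #include <cstddef>
    #include <functional>  // std::boyer_moore_searcher (C++17)
    #include <vector>

    // Boyer-Moore over the *reversed* haystack looking for the *reversed*
    // needle: the first hit found is the last occurrence in forward order,
    // which is what find_end reports.
    template <class T>
    typename std::vector<T>::const_iterator
    find_last_occurrence(const std::vector<T>& hay, const std::vector<T>& needle)
    {
        if (needle.empty() || hay.size() < needle.size())
            return hay.end();

        std::boyer_moore_searcher searcher(needle.rbegin(), needle.rend());
        auto rit = std::search(hay.rbegin(), hay.rend(), searcher);
        if (rit == hay.rend())
            return hay.end();

        // rit points at the last element of the match (in reverse order);
        // rit.base() is one past that element in forward order.
        return rit.base() - static_cast<std::ptrdiff_t>(needle.size());
    }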

rici answered Nov 17 '22 03:11