If we have a vector of size N that was previously sorted, and replace up to M elements with arbitrary values (where M is much smaller than N), is there an easy way to re-sort them at lower cost (i.e. generate a sorting network of reduced depth) than a full sort?
For example if N=10 and M=2 the input might be
10 20 30 40 999 60 70 80 90 -1
Note: the indices of the modified elements are not known (until we compare them with the surrounding elements.)
Here is an example where I know the solution because the input size is small and I was able to find it with a brute-force search:
if N = 5 and M is 1, these would be valid inputs:
0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 1 1 1 0
0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1
0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 0 0 0 1 1 0 1 1
0 0 0 1 1 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 1 1 1 0 1
For example the input may be 0 1 1 0 1
if the previously sorted vector was 0 1 1 1 1
and the 4th element was modified, but there is no way to form 0 1 0 1 0
as a valid input, because it differs in at least 2 elements from any sorted vector.
This would be a valid sorting network for re-sorting these inputs:
>--*---*-----*-------->
| | |
>--*---|-----|-*---*-->
| | | |
>--*---|-*---*-|---*-->
| | | |
>--*---*-|-----*---*-->
| |
>--------*---------*-->
We do not care that this network fails to sort some invalid inputs (e.g. 0 1 0 1 0
.)
And this network has depth 4, a saving of 1 compared with the general case (a depth of 5 generally necessary to sort a 5-element vector.)
Unfortunately the brute-force approach is not feasible for larger input sizes.
Is there a known method for constructing a network to re-sort a larger vector?
My N values will be in the order of a few hundred, with M not much more than √N.
Ok, I'm posting this as an answer since the comment restriction on length drives me nuts :)
You should try this out:
clGetDeviceInfo
with CL_DEVICE_LOCAL_MEM_SIZE
) and derive the maximum number of work-items per work-group,
because with this approach your number of work-items will most likely be limited by the amount of local memory.This will probably work rather well I suspect, because:
Let me know if you have problems with my ideas.
I just realized I used a technique that may be confusing to others: My proposal for using local memory is not for synchronization or using multiple work items for a single input vector/array. I simply use it to get a low read/write memory latency. Since we use rather large chunks of memory I fear that using private memory may cause swapping to slow global memory without us realizing it. This also means you have to allocate local memory for each work-item. Each work-item will access its own chunk of local memory and use it for sorting (exclusively). I'm not sure how good this idea is, but I've read that using too much private memory may cause swapping to global memory and the only way to notice is by looking at the performance (not sure if I'm right about this).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With