For my application I have to handle a bunch of objects (let's say ints) that get subsequently divided and sorted into smaller buckets. To this end, I store the elements in a single contiguous array
arr = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14...}
and the information about the buckets (sublists) is given by offsets to the first element in the respective bucket and by the lengths of the sublists.
So, for instance, given
offsets = {0,3,8,..}
sublist_lengths = {3,5,2,...}
would result in the following splits:
0 1 2 || 3 4 5 6 7 || 8 9 || ...
What I am looking for is a somewhat general and efficient way to run algorithms, like reductions, on the buckets only, using either custom kernels or the thrust library. Summing the buckets should give:
3 || 25 || 17 || ...
What I've come up with:
option 1: custom kernels require quite a bit of tinkering: copies into shared memory, a proper choice of block and grid sizes, and my own implementation of the algorithms, like scan, reduce, etc. Also, every single operation would require its own custom kernel. In general it is clear to me how to do this, but after having used thrust for the last couple of days I have the impression that there might be a smarter way.
option 2: generate an array of keys from the offsets ({0,0,0,1,1,1,1,1,2,2,3,...} in the above example) and use thrust::reduce_by_key. I don't like the extra key list generation, though.
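For concreteness, this is roughly what I have in mind for option 2 (a hypothetical sketch; arr and offsets are device vectors here, and the scatter/scan key generation assumes there are no empty buckets):

    #include <thrust/device_vector.h>
    #include <thrust/iterator/constant_iterator.h>
    #include <thrust/scatter.h>
    #include <thrust/scan.h>
    #include <thrust/reduce.h>

    // keys[i] will hold the bucket index of arr[i]
    thrust::device_vector<int> keys(arr.size(), 0);

    // drop a 1 at the first element of every bucket except bucket 0...
    thrust::scatter(thrust::make_constant_iterator(1),
                    thrust::make_constant_iterator(1) + (offsets.size() - 1),
                    offsets.begin() + 1,
                    keys.begin());

    // ...and scan the flags into {0,0,0,1,1,1,1,1,2,2,...}
    thrust::inclusive_scan(keys.begin(), keys.end(), keys.begin());

    // one sum per bucket: {3, 25, 17, ...}
    thrust::device_vector<int> bucket_ids(offsets.size());
    thrust::device_vector<int> sums(offsets.size());
    thrust::reduce_by_key(keys.begin(), keys.end(), arr.begin(),
                          bucket_ids.begin(), sums.begin());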
option 3: use thrust::transform_iterator together with thrust::counting_iterator to generate the above key list on the fly. Unfortunately, I can't come up with an implementation that doesn't require serial increments of an index into the offset list on the device, which defeats parallelism.
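For illustration only, here is one hypothetical shape such a key functor could take: a per-element binary search over offsets instead of an incrementing index (reusing bucket_ids and sums from the sketch above). Whether the extra log-cost lookup per element is acceptable is part of what I'm unsure about:

    #include <thrust/device_vector.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/reduce.h>

    // maps a flat element index to its bucket index by binary-searching
    // the offsets array (no serial state, so it stays parallel)
    struct index_to_key
    {
        const int* offsets;   // raw device pointer to the offsets array
        int num_buckets;

        __host__ __device__
        int operator()(int i) const
        {
            int lo = 0, hi = num_buckets;      // invariant: offsets[lo] <= i
            while (lo + 1 < hi)
            {
                int mid = (lo + hi) / 2;
                if (offsets[mid] <= i) lo = mid;
                else                   hi = mid;
            }
            return lo;                         // last bucket starting at or before i
        }
    };

    // generate the keys on the fly and reduce in a single pass
    index_to_key f = { thrust::raw_pointer_cast(offsets.data()), (int)offsets.size() };
    thrust::counting_iterator<int> idx(0);
    thrust::reduce_by_key(thrust::make_transform_iterator(idx, f),
                          thrust::make_transform_iterator(idx + arr.size(), f),
                          arr.begin(),
                          bucket_ids.begin(), sums.begin());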
What would be the most sane way to implement this?
Within Thrust, I can't think of a better solution than Option 2. The performance will not be terrible, but it's certainly not optimal.
Your data structure bears similarity to the Compressed Sparse Row (CSR) format for storing sparse matrices, so you could use techniques developed for computing sparse matrix-vector multiplies (SpMV) for such matrices if you want better performance. Note that the "offsets" array of the CSR format has length (N+1) for a matrix with N rows (i.e. buckets in your case), where the last offset value is the length of arr. The CSR SpMV code in Cusp is a bit convoluted, but it serves as a good starting point for your kernel. Simply remove any reference to Aj or x from the code and pass offsets and arr into the Ap and Av arguments respectively.
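To give the flavor, here is a stripped-down scalar version of that kernel (a sketch under the assumptions above, not the actual Cusp code; one thread per bucket, with offsets extended to N+1 entries):

    // scalar CSR-style segmented sum: thread t sums bucket t
    // assumes offsets[num_buckets] == length of arr (the CSR convention)
    __global__ void bucket_sums(const int* offsets, const int* arr,
                                int* sums, int num_buckets)
    {
        int bucket = blockIdx.x * blockDim.x + threadIdx.x;
        if (bucket < num_buckets)
        {
            int sum = 0;
            for (int i = offsets[bucket]; i < offsets[bucket + 1]; ++i)
                sum += arr[i];
            sums[bucket] = sum;
        }
    }

Cusp's vector variant instead assigns a warp to each row and finishes with an intra-warp reduction, which gives coalesced loads and is the version you'd want once the buckets get long.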
You didn't mention how big the buckets are. If the buckets are big enough, maybe you can get away with copying the offsets and sublist_lengths to the host, iterating over them and doing a separate Thrust call for each bucket. Fermi can have 16 kernels in flight at the same time, so on that architecture you might be able to handle smaller buckets and still get good utilization.
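A minimal sketch of that approach (hypothetical; arr, offsets and sublist_lengths are assumed to be device vectors):

    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/reduce.h>

    // copy the (small) bookkeeping arrays to the host...
    thrust::host_vector<int> h_offsets = offsets;
    thrust::host_vector<int> h_lengths = sublist_lengths;
    thrust::host_vector<int> h_sums(h_offsets.size());

    // ...and launch one reduction per bucket
    for (size_t b = 0; b < h_offsets.size(); ++b)
    {
        h_sums[b] = thrust::reduce(arr.begin() + h_offsets[b],
                                   arr.begin() + h_offsets[b] + h_lengths[b]);
    }

Note that as written these reductions run back to back in the default stream; actually getting kernels to overlap would require issuing the work into separate streams.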