
Efficient partial reductions given arrays of elements, offsets to and lengths of sublists

Tags: c++, cuda, thrust

For my application I have to handle a bunch of objects (let's say ints) that are subsequently divided and sorted into smaller buckets. To this end, I store the elements in a single contiguous array

arr = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14...}

and the information about the buckets (sublists) is given by offsets to the first element of each bucket and the lengths of the sublists.

So, for instance, given

offsets = {0,3,8,..}
sublist_lengths = {3,5,2,...}

would result in the following splits:

0 1 2 || 3 4 5 6 7 || 8 9 || ...

What I am looking for is a somewhat general and efficient way to run algorithms, like reductions, on the buckets only using either custom kernels or the thrust library. Summing the buckets should give:

3 || 25 || 17 || ...

What I've come up with:

  • option 1: custom kernels require quite a bit of tinkering: copies into shared memory, a proper choice of block and grid sizes, and my own implementation of the algorithms, like scan, reduce, etc. Also, every single operation would require its own custom kernel. In general it is clear to me how to do this, but after having used thrust for the last couple of days I have the impression that there might be a smarter way.

  • option 2: generate an array of keys from the offsets ({0,0,0,1,1,1,1,1,2,2,3,...} in the above example) and use thrust::reduce_by_key. I don't like the extra list generation, though.

  • option 3: use thrust::transform_iterator together with thrust::counting_iterator to generate the above key list on the fly. Unfortunately, I can't come up with an implementation that avoids serially incrementing an index into the offset list on the device, which defeats parallelism.
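To make option 2's key list concrete, here is a host-side sketch of the key generation (the helper name `make_keys` is mine, not from the question). The key of element i is simply the number of offsets less than or equal to i, minus one, which is a binary search per element rather than a serial scan; Thrust's vectorized `thrust::upper_bound` over a `counting_iterator` range computes the same thing in parallel, and the result can be fed straight into `thrust::reduce_by_key`:

```cpp
#include <algorithm>
#include <vector>

// Host-side sketch of option 2's key generation. The key of element i is the
// index of the bucket containing i, i.e. (number of offsets <= i) - 1.
// With Thrust, the same keys come from a single vectorized
// thrust::upper_bound(offsets.begin(), offsets.end(),
//                     counting_iterator(0), counting_iterator(n),
//                     keys.begin()) call -- no serial index increments.
std::vector<int> make_keys(const std::vector<int>& offsets, int n) {
    std::vector<int> keys(n);
    for (int i = 0; i < n; ++i) {
        // First offset strictly greater than i; the bucket index is one less.
        auto it = std::upper_bound(offsets.begin(), offsets.end(), i);
        keys[i] = static_cast<int>(it - offsets.begin()) - 1;
    }
    return keys;
}
```

For offsets = {0,3,8} and n = 10 this yields the key list {0,0,0,1,1,1,1,1,2,2} from the example above.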

What would be the most sane way to implement this?

bbtrb asked Feb 01 '12


2 Answers

Within Thrust, I can't think of a better solution than Option 2. The performance will not be terrible, but it's certainly not optimal.

Your data structure bears similarity to the Compressed Sparse Row (CSR) format for storing sparse matrices, so you could use techniques developed for computing sparse matrix-vector multiplies (SpMV) for such matrices if you want better performance. Note that the "offsets" array of the CSR format has length (N+1) for a matrix with N rows (i.e. buckets in your case) where the last offset value is the length of arr. The CSR SpMV code in Cusp is a bit convoluted, but it serves as a good starting point for your kernel. Simply remove any reference to Aj or x from the code and pass offsets and arr into the Ap and Av arguments respectively.

wnbell answered Sep 17 '22


You didn't mention how big the buckets are. If the buckets are big enough, maybe you can get away with copying the offsets and sublist_lengths to the host, iterating over them and doing a separate Thrust call for each bucket. Fermi can have 16 kernels in flight at the same time, so on that architecture you might be able to handle smaller buckets and still get good utilization.
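A host-side sketch of this approach (names are illustrative): copy offsets and sublist_lengths to the host, then issue one reduction per bucket. Here `std::accumulate` stands in for the per-bucket `thrust::reduce` call over the device range [arr.begin()+off, arr.begin()+off+len):

```cpp
#include <numeric>
#include <vector>

// Host-side sketch: one reduction call per bucket, driven by a host loop over
// the (offset, length) pairs. On the device, each std::accumulate below would
// be a separate thrust::reduce on the corresponding sub-range of arr.
std::vector<int> sum_buckets(const std::vector<int>& arr,
                             const std::vector<int>& offsets,
                             const std::vector<int>& lengths) {
    std::vector<int> sums;
    sums.reserve(offsets.size());
    for (std::size_t b = 0; b < offsets.size(); ++b) {
        int off = offsets[b];
        int len = lengths[b];
        sums.push_back(std::accumulate(arr.begin() + off,
                                       arr.begin() + off + len, 0));
    }
    return sums;
}
```

This is simple but serializes the kernel launches on the host side, which is why it only pays off when the buckets are large (or when concurrent kernels keep the device busy, as noted above).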

Roger Dahl answered Sep 19 '22