I have the following problem: I need to compute the inclusive scans (e.g. prefix sums) of values based on a tree structure on the GPU. These scans are either from the root node (top-down) or from the leaf nodes (bottom-up). The case of a simple chain is easily handled, but the tree structure makes parallelization rather difficult to implement efficiently.
For instance, after a top-down inclusive scan, (12)
would hold (0)[op](6)[op](7)[op](8)[op](11)[op](12)
, and for a bottom-up inclusive scan, (8)
would hold (8)[op](9)[op](10)[op](11)[op](12)
, where [op]
is a given binary operator (matrix addition, multiplication etc.).
One also needs to consider the following points:
(6)
as the new root). Nonetheless, a typical tree should not be too unbalanced.In this case, what would be the "best" data structure (for the tree structure) and the best algorithms (for the inclusive scans/prefix sums) to solve this problem?
Probably a harebrained idea, but imagine that you insert nodes of 0 value into the tree, in such a way that you get a 2D matrix. For instance, there would be 3 zero value nodes below the 5 node in your example. Then use one thread to travel each level of the matrix horizontally. For the top-down prefix sum, offset the threads in such a way that each lower thread is delayed by the maximum number of branches the tree can have in that location. So, you get a "wave" with a slanted edge running over the matrix. The upper threads, being further along, calculate those nodes in time for them to be processed further by threads running further down. You would need the same number of threads as the tree is deep.
I think parallel prefix scan may not suitable for your case because:
parallel prefix scan algorithm will increase the total number of [op]
, in your link of prefix sum, a 16-input parallel prefix scan requires 26 [op]
, while a sequential version only need 15. parallel algorithm performs better is based on a assumption that there's enough hardware resources to run multiple [op]
in parallel.
You could evaluate the cost of your [op]
before try the parallel prefix scan.
On the other hand, since the size of the tree is small, I think you could simply consider your tree as 4 (number of the leaf nodes) independent simple chains, and use concurrent kernel execution to improve the performance of these 4 prefix scan kernels
0-1-2-3
0-4-5
0-6-7-8-9-10
0-6-7-8-11-12
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With