GPU-based inclusive scan on an unbalanced tree

Question

I have the following problem: I need to compute the inclusive scans (e.g. prefix sums) of values based on a tree structure on the GPU. These scans are either from the root node (top-down) or from the leaf nodes (bottom-up). The case of a simple chain is easily handled, but the tree structure makes parallelization rather difficult to implement efficiently.

[Figure: example tree rooted at (0), with nodes numbered (0) to (12)]

For instance, after a top-down inclusive scan, (12) would hold (0)[op](6)[op](7)[op](8)[op](11)[op](12), and for a bottom-up inclusive scan, (8) would hold (8)[op](9)[op](10)[op](11)[op](12), where [op] is a given binary operator (matrix addition, multiplication etc.).
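To pin down the two definitions, here is a minimal sequential reference in C++, not a GPU implementation. It assumes a hypothetical parent-array encoding of the example tree (the edges are taken from the chains listed in the second answer) and assumes nodes are numbered so that every parent has a smaller index than its children. The names `example_parents`, `node_labels`, `scan_top_down` and `scan_bottom_up` are made up for illustration. Children are combined in ascending index order, which matters when [op] is not commutative.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical encoding of the example tree: parent[i] is the parent of
// node i, -1 marks the root. Assumption: parent[i] < i for every non-root.
inline std::vector<int> example_parents() {
    return {-1, 0, 1, 2, 0, 4, 0, 6, 7, 8, 9, 8, 11};
}

// Labels "(0)", "(1)", ... so that string concatenation can stand in for [op].
inline std::vector<std::string> node_labels(int n) {
    std::vector<std::string> v;
    for (int i = 0; i < n; ++i) v.push_back("(" + std::to_string(i) + ")");
    return v;
}

// Top-down inclusive scan: out[i] = value[root] op ... op value[i].
template <typename T, typename Op>
std::vector<T> scan_top_down(const std::vector<int>& parent,
                             const std::vector<T>& value, Op op) {
    assert(parent.size() == value.size());
    std::vector<T> out(value);
    for (std::size_t i = 0; i < parent.size(); ++i) {
        const int p = parent[i];
        if (p < 0) continue;                   // root keeps its own value
        assert(p < static_cast<int>(i));       // parent already finished
        out[i] = op(out[p], value[i]);
    }
    return out;
}

// Bottom-up inclusive scan: out[i] = value[i] op (results of its children,
// in ascending child order). T must be default-constructible.
template <typename T, typename Op>
std::vector<T> scan_bottom_up(const std::vector<int>& parent,
                              const std::vector<T>& value, Op op) {
    assert(parent.size() == value.size());
    const std::size_t n = parent.size();
    std::vector<T> out(value);
    std::vector<T> below(n);                   // combined results of children
    std::vector<char> has_below(n, 0);
    for (std::size_t i = n; i-- > 0;) {        // children before parents
        if (has_below[i]) out[i] = op(value[i], below[i]);
        const int p = parent[i];
        if (p < 0) continue;
        assert(p < static_cast<int>(i));
        // Children arrive in descending order, so prepend to keep them ascending.
        below[p] = has_below[p] ? op(out[i], below[p]) : out[i];
        has_below[p] = 1;
    }
    return out;
}
```

With string concatenation as [op], `scan_top_down` leaves `(0)(6)(7)(8)(11)(12)` in node 12 and `scan_bottom_up` leaves `(8)(9)(10)(11)(12)` in node 8, matching the expressions above.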

One also needs to consider the following points:

In a typical scenario, branches are short (about 10 nodes each) and there are roughly 5 to 10 of them, so a single tree will be processed inside one block, with the work split between its threads. Different blocks will simply handle different sets of node values. This is obviously not optimal for occupancy, but that constraint will be tackled later. For now, I will rely on instruction-level parallelism.
The structure of the graph cannot be changed (it describes an actual system), thus it cannot be balanced (or only by changing the root of the tree, e.g. using (6) as the new root). Nonetheless, a typical tree should not be too unbalanced.
I currently use CUDA for GPGPU, so I am open to any CUDA-enabled template library that can solve this issue.
Node data is already in global memory and the result will be used by other CUDA kernels, so the objective is just to achieve this without making it a huge bottleneck.
There is no "cycle", i.e. branches cannot merge down the tree.
The structure of the tree is fixed and set in an initialization phase.
A single binary operation can be quite expensive (e.g. multiplication of polynomial matrices, i.e. each element is a polynomial of a given order).

In this case, what would be the "best" data structure (for the tree structure) and the best algorithms (for the inclusive scans/prefix sums) to solve this problem?

Roger Dahl · Accepted Answer

Probably a harebrained idea, but imagine inserting nodes of value 0 into the tree so that it fills out a 2D matrix. In your example, there would be 3 zero-value nodes below node (5). Then use one thread to traverse each level of the matrix horizontally. For the top-down prefix sum, offset the threads so that each lower thread is delayed by the maximum number of branches the tree can have at that location. You get a "wave" with a slanted edge sweeping over the matrix: the upper threads, being further along, compute their nodes in time for the threads further down to process them. You would need as many threads as the tree is deep.
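The padding step can be sketched on the host as follows. This only computes how many padding nodes each leaf would need to reach the maximum depth; it does not implement the staggered wave itself. The parent-array encoding, the struct `PaddingPlan` and the function `plan_padding` are assumptions made for illustration. For a general [op], the padding value would have to be the identity element of [op] (0 for addition, the identity matrix for matrix multiplication).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical result type: which leaves exist and how much padding each needs.
struct PaddingPlan {
    std::vector<int> depth;   // depth[i] = number of edges from the root to node i
    int max_depth = 0;        // depth of the deepest node
    std::vector<int> leaves;  // leaf node ids, ascending
    std::vector<int> pad;     // pad[k] = padding nodes to append below leaves[k]
};

// parent[i] is the parent of node i, -1 marks the root.
// Assumption: parent[i] < i, so a parent's depth is known before its children.
PaddingPlan plan_padding(const std::vector<int>& parent) {
    const int n = static_cast<int>(parent.size());
    PaddingPlan plan;
    plan.depth.assign(n, 0);
    std::vector<char> has_child(n, 0);
    for (int i = 0; i < n; ++i) {
        if (parent[i] < 0) continue;
        assert(parent[i] < i);
        plan.depth[i] = plan.depth[parent[i]] + 1;
        has_child[parent[i]] = 1;
        plan.max_depth = std::max(plan.max_depth, plan.depth[i]);
    }
    for (int i = 0; i < n; ++i) {
        if (has_child[i]) continue;            // only leaves get padded
        plan.leaves.push_back(i);
        plan.pad.push_back(plan.max_depth - plan.depth[i]);
    }
    return plan;
}
```

For the example tree encoded as `{-1, 0, 1, 2, 0, 4, 0, 6, 7, 8, 9, 8, 11}`, the leaves are 3, 5, 10 and 12, and leaf 5 needs 3 padding nodes, which agrees with the count given above.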

kangshiyin · Answer

I think a parallel prefix scan may not be suitable for your case because:

A parallel prefix scan algorithm increases the total number of [op] applications. In the prefix sum article you linked, a 16-input parallel prefix scan requires 26 [op], while the sequential version needs only 15. The parallel algorithm performs better only under the assumption that there are enough hardware resources to run multiple [op] in parallel.

You could evaluate the cost of your [op] before trying the parallel prefix scan.
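The operation counts can be checked with a small host-side sketch. The parallel variant below is written as a work-efficient (Brent-Kung style) scan executed sequentially, which is an assumption about which scheme the 26 figure refers to; it happens to reproduce that count for 16 inputs. The function names are hypothetical, and the code only counts [op] applications; it says nothing about actual GPU timings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential inclusive scan, in place. Returns the number of op applications.
template <typename T, typename Op>
std::size_t scan_sequential(std::vector<T>& x, Op op) {
    std::size_t count = 0;
    for (std::size_t i = 1; i < x.size(); ++i) {
        x[i] = op(x[i - 1], x[i]);
        ++count;
    }
    return count;
}

// Work-efficient inclusive scan, in place, for a power-of-two input size.
// Each inner loop is one parallel step on real hardware; here the steps are
// simply executed one after another to count the op applications.
template <typename T, typename Op>
std::size_t scan_work_efficient(std::vector<T>& x, Op op) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);      // power of two only
    std::size_t count = 0;
    // Up-sweep: build partial results over blocks of size 2, 4, 8, ...
    for (std::size_t d = 1; d < n; d *= 2) {
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            x[i] = op(x[i - d], x[i]);
            ++count;
        }
    }
    // Down-sweep: fill in the prefixes that the up-sweep skipped.
    for (std::size_t d = n / 4; d >= 1; d /= 2) {
        for (std::size_t i = 3 * d - 1; i < n; i += 2 * d) {
            x[i] = op(x[i - d], x[i]);
            ++count;
        }
    }
    return count;
}
```

For 16 inputs, `scan_sequential` applies [op] 15 times and `scan_work_efficient` applies it 26 times, so the parallel version only pays off if the extra 11 applications are hidden by running several of them at once.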

On the other hand, since the tree is small, I think you could simply treat it as 4 independent simple chains (one per leaf node) and use concurrent kernel execution to improve the performance of the 4 prefix scan kernels:

0-1-2-3
0-4-5
0-6-7-8-9-10
0-6-7-8-11-12
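The chain decomposition can be derived from the tree structure once, during initialization. Below is a minimal host-side sketch, assuming a hypothetical parent-array encoding of the tree (`parent[i]` is the parent of node `i`, `-1` for the root); the function names are made up for illustration. It also counts how many [op] applications the independent chains need, since nodes shared by several chains are scanned more than once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns one root-to-leaf chain per leaf, ordered by leaf index.
std::vector<std::vector<int>> root_to_leaf_chains(const std::vector<int>& parent) {
    const int n = static_cast<int>(parent.size());
    std::vector<char> has_child(n, 0);
    for (int i = 0; i < n; ++i) {
        if (parent[i] < 0) continue;
        assert(parent[i] < n);
        has_child[parent[i]] = 1;
    }
    std::vector<std::vector<int>> chains;
    for (int leaf = 0; leaf < n; ++leaf) {
        if (has_child[leaf]) continue;
        std::vector<int> chain;
        // Walk up to the root, then reverse to get root-to-leaf order.
        for (int v = leaf; v >= 0; v = parent[v]) chain.push_back(v);
        std::reverse(chain.begin(), chain.end());
        chains.push_back(chain);
    }
    return chains;
}

// Number of op applications when every chain is scanned sequentially
// and independently of the others.
std::size_t ops_for_independent_chains(const std::vector<std::vector<int>>& chains) {
    std::size_t ops = 0;
    for (const auto& c : chains) ops += c.size() - 1;
    return ops;
}
```

For the example tree this yields exactly the four chains listed above and 15 [op] applications in total, compared with 12 (one per non-root node) when the shared prefix 0-6-7-8 is computed only once. Whether that redundancy matters depends on how expensive [op] is.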

Tags: algorithm, tree, cuda, gpgpu
BenC
