I have to solve a pretty standard problem on the GPU, but since I'm quite new to practical GPGPU, I'm looking for ideas on how to approach this problem.
I have many points in 3-space which are assigned to a very small number of groups (each point belongs to exactly one group), specifically 15 in this case (it doesn't ever change). Now I want to compute the mean and covariance matrix of each group. On the CPU it's roughly the same as:
for each point p
{
    mean[p.group] += p.pos;
    covariance[p.group] += p.pos * p.pos;   // outer product of p.pos with itself (symmetric 3x3)
    ++count[p.group];
}
for each group g
{
    mean[g] /= count[g];
    covariance[g] = covariance[g]/count[g] - mean[g]*mean[g];   // E[x x^T] - mean*mean^T
}
Since the number of groups is extremely small, the last step can be done on the CPU (I need those values on the CPU, anyway). The first step is actually just a segmented reduction, but with the segments scattered around.
So the first idea I came up with was to first sort the points by their groups. I thought about a simple bucket sort using atomic_inc to compute bucket sizes and per-point relocation indices (got a better idea for sorting? atomics may not be the best idea). After that they're sorted by groups and I could possibly come up with an adaptation of the segmented scan algorithms presented here.
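A minimal sketch of how that two-pass bucket sort might look in OpenCL C (kernel and buffer names are made up; the counts have to be zeroed before the first pass and prefix-summed between the two passes):

// Pass 1: count how many points fall into each of the 15 buckets.
// 'counts' must be zeroed before launch.
__kernel void count_groups(__global const int *group,   // group id per point, 0..14
                           __global int *counts,        // 15 counters
                           const int n)
{
    int i = get_global_id(0);
    if (i < n)
        atomic_inc(&counts[group[i]]);
}

// Pass 2: scatter each point into its group's contiguous segment.
// 'cursors' starts as an exclusive prefix sum over 'counts' (15 values,
// trivially computed on the host) and is consumed as per-group write cursors.
__kernel void scatter_points(__global const float4 *pos_in,   // xyz (+ padding)
                             __global const int *group,
                             __global int *cursors,
                             __global float4 *pos_out,
                             const int n)
{
    int i = get_global_id(0);
    if (i < n) {
        int dst = atomic_inc(&cursors[group[i]]);
        pos_out[dst] = pos_in[i];
    }
}

The scatter isn't stable within a group, which doesn't matter for a reduction, and the atomics will see heavy contention with only 15 counters; a per-work-group histogram in local memory that gets merged afterwards would be one way to reduce that contention.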
But in this special case I have a rather large amount of data per point (9-10 floats, maybe even doubles if the need arises), so the standard algorithms that use one shared memory element per thread and one thread per point might cause problems with per-multiprocessor resources such as shared memory or registers (OK, much more so on compute capability 1.x than 2.x, but still).
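For concreteness, that per-element payload is just the accumulator: 3 floats for the position sum, 6 for the unique entries of the symmetric second-moment matrix, optionally one more for the count, which is where the 9-10 floats come from. A purely illustrative layout (names made up):

// Illustrative accumulator: 9-10 floats per element, or doubles later if needed.
typedef struct {
    float sx, sy, sz;              // sum of positions      -> mean after division
    float xx, xy, xz, yy, yz, zz;  // sum of outer products -> covariance after division
    float count;                   // optional; also recoverable from the bucket sizes
} Accum;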
Due to the very small and constant number of groups I thought there might be better approaches. Maybe there are already existing ideas suited to these specific properties of such a standard problem. Or maybe my general approach isn't that bad and you have ideas for improving the individual steps, like a good sorting algorithm suited to a very small number of keys or a segmented reduction algorithm that minimizes shared memory/register usage.
I'm looking for general approaches and don't want to use external libraries. FWIW I'm using OpenCL, but it shouldn't really matter as the general concepts of GPU computing don't really differ across the major frameworks.
Even though there are few groups, I don't think you will be able to avoid the initial sorting into groups while still keeping the reduction step efficient. You will probably also want to perform the full sort, not just sort the indexes, because that will help keep memory access efficient in the reduction step.
For sorting, read about general strategies here:
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter46.html
For reduction (old but still good):
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
For an example implementation of parallel reduction:
http://developer.nvidia.com/cuda-cc-sdk-code-samples#reduction
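To make the reduction step concrete, here is a rough OpenCL sketch under the layout described in the question: the points are already sorted by group, segment boundaries come from the prefix-summed bucket counts, and one work-group handles one group (simplistic, but a reasonable starting point with only 15 groups). Kernel/buffer names and the work-group size are illustrative.

#define WG_SIZE 128   // local work size; illustrative

// One work-group per group (15 groups -> 15 work-groups).
// seg_start/seg_end delimit each group's contiguous slice in the sorted array.
__kernel void group_moments(__global const float4 *pos,      // sorted by group
                            __global const int *seg_start,
                            __global const int *seg_end,
                            __global float *partials)        // 15 x 10 floats
{
    int g   = get_group_id(0);
    int lid = get_local_id(0);

    // Each work-item accumulates a strided subset of its group's points
    // in private registers (10 floats), so local memory is only needed
    // for the final intra-work-group reduction.
    float acc[10] = {0};
    for (int i = seg_start[g] + lid; i < seg_end[g]; i += WG_SIZE) {
        float4 p = pos[i];
        acc[0] += p.x;       acc[1] += p.y;       acc[2] += p.z;
        acc[3] += p.x * p.x; acc[4] += p.x * p.y; acc[5] += p.x * p.z;
        acc[6] += p.y * p.y; acc[7] += p.y * p.z; acc[8] += p.z * p.z;
        acc[9] += 1.0f;
    }

    // Standard tree reduction in local memory, one 10-float slot per work-item.
    __local float scratch[WG_SIZE][10];
    for (int k = 0; k < 10; ++k) scratch[lid][k] = acc[k];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int stride = WG_SIZE / 2; stride > 0; stride >>= 1) {
        if (lid < stride)
            for (int k = 0; k < 10; ++k)
                scratch[lid][k] += scratch[lid + stride][k];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 writes the per-group partial sums; the host then divides by
    // the count and subtracts mean*mean^T, as in the question's pseudocode.
    if (lid == 0)
        for (int k = 0; k < 10; ++k)
            partials[g * 10 + k] = scratch[0][k];
}

Launched with a local size of WG_SIZE and a global size of 15 * WG_SIZE, this leaves 15 partial results of 10 floats each, and the final divide/subtract can stay on the CPU exactly as in the question. Local memory use is WG_SIZE * 10 floats (about 5 KB at 128), which should fit even on compute capability 1.x hardware. If one work-group per group doesn't expose enough parallelism for very large groups, several work-groups per group plus a tiny second combining pass would be the obvious extension.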