How to efficiently gather data from threads in CUDA?

I have an application that solves a system of equations in CUDA. I know for sure that each thread can find up to 4 solutions, but how can I copy them back to the host?

I'm passing a huge array with enough space for all threads to store 4 solutions (4 doubles per solution), and another one with the number of solutions per thread. However, that's a naive solution, and it is the current bottleneck of my kernel.

I'd really like to optimize this. The main problem is concatenating a variable number of solutions per thread into a single array.
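For reference, the layout described above can be sketched roughly as follows. The solver itself is elided; all names here are illustrative, not taken from the actual application:

```cuda
// Naive layout: each thread owns a fixed 4-slot region of `solutions`
// plus one entry in `counts`, leaving gaps the host must skip over.
__global__ void solve_naive(double *solutions, int *counts, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int found = 0;
    double local[4];
    // ... hypothetical solver fills local[0..found-1], with found <= 4 ...

    for (int i = 0; i < found; ++i)
        solutions[tid * 4 + i] = local[i];  // write into this thread's slots
    counts[tid] = found;                    // host scans counts to gather results
}
```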

asked Jun 22 '12 by RSFalcon7



1 Answer

The functionality you're looking for is called stream compaction.

You probably do need to provide an array with room for 4 solutions per thread, because attempting to store the results directly in compact form is likely to create so many dependencies between the threads that the performance gained by copying less data back to the host is lost to a longer kernel execution time.

The exception is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into a single output array: before storing a result, the thread calls atomicAdd() to increase the index by one. atomicAdd() returns the old value, so the thread can store its result using that old value as the index.
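A minimal sketch of that atomic approach, assuming for simplicity that each solution is a single double and that solutions are rare. `d_count` must be zeroed before the launch; the names are illustrative:

```cuda
// Atomic compaction: each thread that finds a solution claims the next
// free slot in d_out via atomicAdd, so results end up densely packed.
__global__ void solve_compact(double *d_out, int *d_count, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    double sol = 0.0;
    bool found = false;
    // ... hypothetical solver sets sol and found ...

    if (found) {
        int idx = atomicAdd(d_count, 1);  // old value = this thread's slot
        d_out[idx] = sol;
    }
}
```

After the kernel finishes, the host copies back `*d_count` and transfers only that many results.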

However, given a more common situation, where there's a fair number of results, the best solution will be to perform a compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
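As a rough sketch of the `thrust::copy_if` route, assuming the padded array marks unused slots with NaN (any sentinel the solver never produces would work; the predicate and names here are assumptions, not from the question):

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <cmath>

// Predicate selecting real solutions; assumes unused slots hold NaN.
struct is_valid
{
    __host__ __device__ bool operator()(const double &x) const
    {
        return !isnan(x);
    }
};

// Compact the padded per-thread array into a dense array of solutions.
void compact(thrust::device_vector<double> &padded,
             thrust::device_vector<double> &dense)
{
    dense.resize(padded.size());
    auto end = thrust::copy_if(padded.begin(), padded.end(),
                               dense.begin(), is_valid());
    dense.resize(end - dense.begin());  // keep only the real solutions
}
```

The compaction then runs on the device, and only the dense result is copied back to the host.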

answered Oct 17 '22 by Roger Dahl