 

How does CUDA 4.0 support recursion?

Tags:

cuda

I'm wondering: does CUDA 4.0 support recursion using local memory or shared memory? I have to maintain a stack in global memory myself, because the system-level recursion can't handle my program (probably too many levels of recursion). When the recursion gets deeper, the threads stop working.
So I really want to know how the default recursion works in CUDA: does it use local memory or shared memory? Thanks!

asked Dec 19 '22 by user2816677

1 Answer

Use of recursion requires the use of the ABI, which requires architecture >= sm_20. The ABI has a function calling convention that includes the use of a stack frame. The stack frame is allocated in local memory ("local" means "thread-local", that is, storage private to a thread). Please refer to the CUDA C Programming Guide for basic information on CUDA memory spaces. In addition, you may want to have a look at this previous question: Where does CUDA allocate the stack frame for kernels?
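As an aside, one way to inspect the per-thread stack frame size the compiler assigns is to request verbose ptxas output when building; the file name below is just a placeholder:

nvcc -arch=sm_20 -Xptxas -v mykernel.cu

ptxas then prints per-kernel statistics that include the number of bytes of stack frame used (per call frame, for recursive functions).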

For deeply recursive functions it is possible to exceed the default stack size. For example, on my current system the default stack size is 1024 bytes. You can retrieve the current stack size with the CUDA API function cudaDeviceGetLimit() and adjust it with cudaDeviceSetLimit():

cudaError_t stat;
// desired per-thread stack size in bytes
size_t myStackSize = [your preferred stack size];
// applies to all subsequent kernel launches on this device
stat = cudaDeviceSetLimit(cudaLimitStackSize, myStackSize);
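Here is a minimal, self-contained host-side sketch along those lines (error checking trimmed; doubling the default is just an arbitrary example value):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    size_t stackSize = 0;

    // query the current per-thread stack size
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("default per-thread stack size: %zu bytes\n", stackSize);

    // request a larger per-thread stack; doubling is just an example
    cudaDeviceSetLimit(cudaLimitStackSize, 2 * stackSize);

    // read the limit back to confirm the new setting
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("new per-thread stack size: %zu bytes\n", stackSize);
    return 0;
}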

Note that the total amount of memory needed for stack frames is at least the per-thread size multiplied by the number of threads specified in the kernel launch. Often it can be larger due to allocation granularity. So increasing the stack size can eat up memory pretty quickly, and you may find that a deeply recursive function requires more local memory than can be allocated on your GPU.
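To make the arithmetic concrete with made-up numbers: a 16 KB per-thread stack combined with a launch of 256 blocks of 256 threads (65,536 threads in total) already implies at least 16 KB × 65,536 = 1 GB of device memory for stack frames, before any granularity-related padding.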

While recursion is supported on modern GPUs, its use can lead to code with fairly low performance due to function call overhead, so you may want to check whether there is an iterative version of the algorithm you are implementing that may be better suited to the GPU.
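For a concrete (if trivial) illustration, here is a sketch of a self-recursive device function next to an iterative equivalent; the function names and the factorial example itself are purely illustrative:

// self-recursion requires compute capability 2.0 (sm_20) or higher;
// each recursive call consumes a stack frame in thread-local memory
__device__ unsigned long long factRecursive(unsigned int n)
{
    return (n < 2) ? 1ULL : n * factRecursive(n - 1);
}

// iterative equivalent: no per-level stack frames, just a loop
__device__ unsigned long long factIterative(unsigned int n)
{
    unsigned long long r = 1;
    for (unsigned int i = 2; i <= n; i++) {
        r *= i;
    }
    return r;
}

__global__ void factKernel(unsigned long long *out, unsigned int n)
{
    *out = factIterative(n);   // or factRecursive(n)
}

The iterative version carries its state in registers rather than in per-level stack frames, which is generally the friendlier pattern on the GPU.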

answered Jan 09 '23 by njuffa