I have a problem running my code on a GTX 480 (Compute Capability 2.0).
I always get the following error when I launch the kernel with 1024 threads per block:
========= CUDA-MEMCHECK
========= Program hit cudaErrorLaunchOutOfResources (error 7) due to "too many resources requested for launch" on CUDA API call to cudaLaunch.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2ef613]
========= Host Frame:/usr/local/cuda-6.5/lib64/libcudart.so.6.5 (cudaLaunch + 0x17e) [0x3686e]
========= Host Frame:./bin/myProgram [0x3a50]
========= Host Frame:./bin/myProgram [0x388a]
========= Host Frame:./bin/myProgram [0x38e3]
========= Host Frame:./bin/myProgram [0x2a99]
========= Host Frame:./bin/myProgram [0x1410]
========= Host Frame:./bin/myProgram [0x1da0]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xed) [0x2176d]
========= Host Frame:./bin/myProgram [0x1139]
=========
I ran the program multiple times with different block and thread counts:
5 Blocks, 512 Threads per Block => Works
5 Blocks, 1024 Threads per Block => Error
10 Blocks, 512 Threads per Block => Works
10 Blocks, 1024 Threads per Block => Error
15 Blocks, 512 Threads per Block => Works
15 Blocks, 1024 Threads per Block => Error
I checked the register usage, and it seems to be OK. "Function4", with 28 registers, is the kernel that is launched with this many threads. All other kernels are launched with only <<<1, 32>>> per call.
ptxas info : 0 bytes gmem
ptxas info : Function properties for _Z7function1Py
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z13function2PyS_i' for 'sm_20'
ptxas info : Function properties for _Z13function2PyS_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 52 bytes cmem[0]
ptxas info : Compiling entry function '_Z6function3PyiS_' for 'sm_20'
ptxas info : Function properties for _Z6function3PyiS_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 56 bytes cmem[0]
ptxas info : Compiling entry function '_Z17function4PyiiS_Phji' for 'sm_20'
ptxas info : Function properties for _Z17function4PyiiS_Phji
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 72 bytes cmem[0]
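The register figure above can be sanity-checked with a little arithmetic. The following is only a rough sketch in host-side C++: the helper names `block_registers` and `fits` are made up for illustration, and the limits (32 K 32-bit registers per multiprocessor on CC 2.x, 64 K on CC 3.0) are taken from the compute capability tables in the CUDA programming guide. It ignores register allocation granularity, so the real requirement can be somewhat higher than the plain product.

```cpp
#include <cstdint>

// Assumed register file sizes (32-bit registers per multiprocessor),
// as listed in the CUDA compute capability tables.
constexpr std::uint32_t kRegsCc20 = 32 * 1024;  // CC 2.x, e.g. GTX 480
constexpr std::uint32_t kRegsCc30 = 64 * 1024;  // CC 3.0, e.g. GTX 660

// Hypothetical helper: registers one block needs if every thread uses
// regs_per_thread registers. Allocation granularity is NOT modelled.
constexpr std::uint32_t block_registers(std::uint32_t regs_per_thread,
                                        std::uint32_t threads_per_block) {
    return regs_per_thread * threads_per_block;
}

// Hypothetical helper: true if that naive estimate fits into reg_limit.
constexpr bool fits(std::uint32_t regs_per_thread,
                    std::uint32_t threads_per_block,
                    std::uint32_t reg_limit) {
    return block_registers(regs_per_thread, threads_per_block) <= reg_limit;
}

// function4 from the ptxas output: 28 registers, 1024 threads per block.
static_assert(block_registers(28, 1024) == 28672, "28 * 1024 registers");
static_assert(fits(28, 1024, kRegsCc20), "naive estimate fits on CC 2.0");
```

By this estimate, 28 registers at 1024 threads per block stays under the CC 2.0 limit, which is why the register count looks fine at first glance. The estimate is only as good as the reported register count for the binary that actually runs on the device.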
I also ran this program on my GTX 660 (CC 3.0), and there it works with 1024 threads per block. I have no clue where the problem comes from. Does anyone have an idea?
I had the same error.
Thanks to http://cuda-programming.blogspot.fr/2013/01/handling-cuda-error-messages.html, I understood the error. They say:
"Too Many Resources Requested for Launch - This error means that the number of registers available on the multiprocessor is being exceeded. Reduce the number of threads per block to solve the problem."
In my case, I used to be able to launch a given number of threads per block (8x8x16 = 1024 for a 3D kernel). But if you nest your kernel calls, you further reduce the number of available registers, so the same block size no longer fits.
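To get a feeling for how far the block size has to come down for a given register count, here is a back-of-the-envelope sketch in host-side C++. The helper names `min_u32` and `max_block_threads` are hypothetical, the limits are assumptions passed in by the caller (for CC 2.x: 32768 registers and 1024 threads per block), and allocation granularity is ignored, so treat the result as an upper bound rather than a guarantee.

```cpp
#include <cstdint>

constexpr std::uint32_t kWarpSize = 32;  // threads per warp

// Hypothetical helper: smaller of two values (usable in constant expressions).
constexpr std::uint32_t min_u32(std::uint32_t a, std::uint32_t b) {
    return a < b ? a : b;
}

// Hypothetical helper: largest block size that is a multiple of the warp size
// and whose naive register demand (threads * regs_per_thread) stays within
// reg_limit, capped at hw_max_threads. Assumes regs_per_thread > 0.
constexpr std::uint32_t max_block_threads(std::uint32_t regs_per_thread,
                                          std::uint32_t reg_limit,
                                          std::uint32_t hw_max_threads) {
    return min_u32(reg_limit / regs_per_thread / kWarpSize * kWarpSize,
                   hw_max_threads);
}

// Example with an assumed kernel using 40 registers per thread on CC 2.x:
// 32768 / 40 = 819 threads, rounded down to 25 full warps = 800 threads.
static_assert(max_block_threads(40, 32768, 1024) == 800, "40 regs -> 800");
```

In a real program, the CUDA runtime can report the limit for a specific kernel directly: `cudaFuncGetAttributes` fills a `cudaFuncAttributes` struct whose `maxThreadsPerBlock` and `numRegs` fields reflect the binary that is actually loaded, which is more reliable than a hand calculation.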