Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is register overflowing a possible cause of a CUDA_EXCEPTION_5, Warp Out-Of-Range Address error?

Tags:

cuda

I'm getting a CUDA_EXCEPTION_5, Warp Out-of-range Address error and I'm trying to figure out the various scenarios that can cause that.

I'm working on porting a C project (written by somebody else) to CUDA. The C code is very register-heavy, instantiating many arrays in the stack. I'm assuming register overflowing is very likely to be occuring and that may be triggering the warp out-of-range error.

Note that I want to get it working running first then I will begin optimizing the code.

I'm using Compute Capable 3.0 hardware which according to Wikipedia has 512KB of "local memory per thread". I read elsewhere it has 512KB of register space per SM. Is it possible to have 512KB of register space per running thread?

I'm currently executing my kernel as follows (yes I know it's ultra-slow):

dim3 grid(28800,1);
cuPlotLRMap<<<grid,1>>>(...)

Some details (I don't know how helpful this will be):

My hardware has 7 SMs. There are 112 running blocks, so does this mean each block gets 1/16th of 512k worth of register space?

I also understand if a thread exceeds the register space it can overflow into global memory. Is it possible for concurrent threads to overflow into the same global memory space when this occurs?

like image 510
Sean Avatar asked Dec 08 '25 06:12

Sean


1 Answers

512KB of "local memory per thread". I read elsewhere it has 512KB of register space per SM. Is it possible to have 512KB of register space per running thread?

See Compute Capabilities table in the CUDA C Programming Guide. Compute capbility 2.x and above devices support a maximum of 512KB of local memory per thread. The function cudaDeviceSetLimit( cudaLimitStackSize, bytesPerThread ) can be used to set the value. I believe the default is 2 KB per thread.

My hardware has 7 SMs. There are 112 running blocks, so does this mean each block gets 1/16th of 512k worth of register space?

Compute capbility 3.x devices can have at most 16 resides blocks per multiprocessor. This assumes that your registers/thread, threads/block, or shared memory/block does not limit the kernel to less than the device maximum. The Visual Profiler and Nsight VSE CUDA Profiler the configuration used by your kernel.

Currently, you are only launching 1 thread/block. You should be launching a multiple of WARP_SIZE per block (32).

I also understand if a thread exceeds the register space it can overflow into global memory. Is it possible for concurrent threads to overflow into the same global memory space when this occurs?

At compile or JIT time the compiler will perform register allocation. If there are insufficient registers per thread then the compiler will spill to local memory. This operation is deterministic and not determined at runtime.

Compute capability 3.0 devices are limited to 63 registers/thread. Compute capability 3.5 devices are limited to 255 registers per thread.

like image 59
Greg Smith Avatar answered Dec 11 '25 09:12

Greg Smith



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!