Is register overflowing a possible cause of a CUDA_EXCEPTION_5, Warp Out-Of-Range Address error?

Question

I'm getting a CUDA_EXCEPTION_5, Warp Out-of-range Address error and I'm trying to figure out the various scenarios that can cause that.

I'm working on porting a C project (written by somebody else) to CUDA. The C code is very register-heavy, instantiating many arrays in the stack. I'm assuming register overflowing is very likely to be occuring and that may be triggering the warp out-of-range error.

Note that I want to get it working running first then I will begin optimizing the code.

I'm using Compute Capable 3.0 hardware which according to Wikipedia has 512KB of "local memory per thread". I read elsewhere it has 512KB of register space per SM. Is it possible to have 512KB of register space per running thread?

I'm currently executing my kernel as follows (yes I know it's ultra-slow):

dim3 grid(28800,1);
cuPlotLRMap<<<grid,1>>>(...)

Some details (I don't know how helpful this will be):

My hardware has 7 SMs. There are 112 running blocks, so does this mean each block gets 1/16th of 512k worth of register space?

I also understand if a thread exceeds the register space it can overflow into global memory. Is it possible for concurrent threads to overflow into the same global memory space when this occurs?

Greg Smith · Accepted Answer

512KB of "local memory per thread". I read elsewhere it has 512KB of register space per SM. Is it possible to have 512KB of register space per running thread?

See Compute Capabilities table in the CUDA C Programming Guide. Compute capbility 2.x and above devices support a maximum of 512KB of local memory per thread. The function cudaDeviceSetLimit( cudaLimitStackSize, bytesPerThread ) can be used to set the value. I believe the default is 2 KB per thread.

My hardware has 7 SMs. There are 112 running blocks, so does this mean each block gets 1/16th of 512k worth of register space?

Compute capbility 3.x devices can have at most 16 resides blocks per multiprocessor. This assumes that your registers/thread, threads/block, or shared memory/block does not limit the kernel to less than the device maximum. The Visual Profiler and Nsight VSE CUDA Profiler the configuration used by your kernel.

Currently, you are only launching 1 thread/block. You should be launching a multiple of WARP_SIZE per block (32).

I also understand if a thread exceeds the register space it can overflow into global memory. Is it possible for concurrent threads to overflow into the same global memory space when this occurs?

At compile or JIT time the compiler will perform register allocation. If there are insufficient registers per thread then the compiler will spill to local memory. This operation is deterministic and not determined at runtime.

Compute capability 3.0 devices are limited to 63 registers/thread. Compute capability 3.5 devices are limited to 255 registers per thread.

Is register overflowing a possible cause of a CUDA_EXCEPTION_5, Warp Out-Of-Range Address error?

Tags:

cuda

Sean

1 Answers

Greg Smith

Recent Activity

Donate For Us

Is register overflowing a possible cause of a CUDA_EXCEPTION_5, Warp Out-Of-Range Address error?

Tags:

cuda

Sean

1 Answers

Greg Smith

Related questions

Recent Activity

Donate For Us