
Cuda registers per thread

As I understand it, devices of compute capability 2.x have a limit of 63 registers per thread. What is the register limit per thread for devices of compute capability 1.3?

I have a big kernel which I'm testing on a GTX 260. I'm fairly sure I'm using a lot of registers, since the kernel is very complex and I need a lot of local variables. According to the CUDA profiler, my register usage is 63 (static shared memory is 68, although I'm not sure what that means, and dynamic shared memory is 0). Since I'm pretty sure I have more than 63 local variables, I figure the compiler is either reusing registers or spilling them into local memory.

Now, I thought devices of compute capability 1.3 had a higher per-thread register limit than the 2.x devices. My guess was that the compiler was choosing the 63 limit because I'm using blocks of 256 threads: 256*63 is 16128, while 256*64 is 16384, which is the total number of registers per SM on this device. So I guessed that if I lowered the number of threads per block, I could increase the number of registers in use, and I ran the kernel with blocks of 192 threads. But the profiler again shows 63 registers, even though 63*192 is 12096 and 64*192 is 12288, well inside the SM's 16384-register limit.
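The budget arithmetic above can be sketched quickly. This is a rough estimate only: the 16384-register figure for a cc 1.3 SM comes from the question, and real occupancy also depends on register allocation granularity and other per-SM limits.

```shell
# Rough per-thread register budget on a cc 1.3 SM (16384 32-bit registers),
# ignoring allocation granularity and other occupancy constraints.
REGFILE=16384
for THREADS in 256 192; do
  echo "$THREADS threads/block -> $((REGFILE / THREADS)) registers/thread"
done
```

With 192 threads per block the naive budget rises to 85 registers per thread, which is why the profiler still reporting 63 is surprising.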

So any idea why the compiler is limiting itself still to 63 registers? Could it be all because of register reuse or is it still spilling registers?

asked Jul 09 '13 by Atirag



1 Answer

The maximum number of registers per thread is documented in the CUDA Programming Guide.

It is 63 for cc 2.x and 3.0, 124 for cc 1.x, and 255 for cc 3.5.

The compiler may simply have decided that 63 registers is enough and has no use for additional ones. Registers can be reused, so having a lot of local variables doesn't necessarily mean the per-thread register count has to be high.

My suggestion would be to use the nvcc -maxrregcount option to experiment with various limits, and the -Xptxas -v option to have the compiler report how many registers it is actually using when it compiles the PTX down to machine code.
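A minimal sketch of those two options (kernel.cu is a placeholder file name, and the 48-register cap is an arbitrary example; requires the CUDA toolkit):

```shell
# Ask ptxas to report per-kernel register and shared-memory usage (-Xptxas -v),
# targeting the GTX 260's compute capability 1.3 (-arch=sm_13).
nvcc -arch=sm_13 -Xptxas -v -c kernel.cu

# Cap register usage at 48 per thread; anything beyond that is spilled
# to local memory, which shows up as "lmem" in the verbose output.
nvcc -arch=sm_13 -maxrregcount=48 -Xptxas -v -c kernel.cu
```

Comparing the verbose output with and without the cap shows whether the kernel is register-limited or whether the compiler was already satisfied at a lower count.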

answered Oct 11 '22 by Robert Crovella