
Cuda registers per thread

As I understand it, devices of compute capability 2.x have a limit of 63 registers per thread. What is the register limit per thread for devices of compute capability 1.3?

I have a big kernel which I'm testing on a GTX 260. I'm fairly sure I'm using a lot of registers, since the kernel is very complex and I need a lot of local variables. According to the CUDA profiler, my register usage is 63 (static shared memory is 68, although I'm not sure what that means, and dynamic shared memory is 0). Since I'm pretty sure I have more than 63 local variables, I figure the compiler is either reusing registers or spilling them into local memory.

Now, I thought devices of compute capability 1.3 had a higher per-thread register limit than the 2.x devices. My guess was that the compiler was choosing the 63 limit because I'm using blocks of 256 threads: 256*63 is 16128, while 256*64 is 16384, which is the total number of registers per SM on this device. So I guessed that if I lowered the number of threads per block, I could increase the number of registers in use, and I ran the kernel with blocks of 192 threads. But the profiler again shows 63 registers, even though 63*192 is 12096 and 64*192 is 12288, well inside the SM's 16384-register limit.
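The budget arithmetic above can be sketched quickly. This is a rough estimate only: the 16384-register figure for a cc 1.3 SM comes from the question, and real occupancy also depends on register allocation granularity and other per-SM limits.

```shell
# Rough per-thread register budget on a cc 1.3 SM (16384 32-bit registers),
# ignoring allocation granularity and other occupancy constraints.
REGFILE=16384
for THREADS in 256 192; do
  echo "$THREADS threads/block -> $((REGFILE / THREADS)) registers/thread"
done
```

With 192 threads per block the naive budget rises to 85 registers per thread, which is why the profiler still reporting 63 is surprising.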

So any idea why the compiler is limiting itself still to 63 registers? Could it be all because of register reuse or is it still spilling registers?

asked Jul 09 '13 by Atirag



1 Answer

The maximum number of registers per thread is documented in the CUDA Programming Guide.

It is 63 for cc 2.x and 3.0, 124 for cc 1.x, and 255 for cc 3.5.

The compiler may simply have decided that 63 registers is enough and has no use for additional ones. Registers can be reused, so having a lot of local variables doesn't necessarily mean the per-thread register count has to be high.

My suggestion would be to use the nvcc -maxrregcount option to experiment with various limits, and the -Xptxas -v option to have the compiler report how many registers it is actually using when it compiles the PTX down to machine code.
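A minimal sketch of those two options (kernel.cu is a placeholder file name, and the 48-register cap is an arbitrary example; requires the CUDA toolkit):

```shell
# Ask ptxas to report per-kernel register and shared-memory usage (-Xptxas -v),
# targeting the GTX 260's compute capability 1.3 (-arch=sm_13).
nvcc -arch=sm_13 -Xptxas -v -c kernel.cu

# Cap register usage at 48 per thread; anything beyond that is spilled
# to local memory, which shows up as "lmem" in the verbose output.
nvcc -arch=sm_13 -maxrregcount=48 -Xptxas -v -c kernel.cu
```

Comparing the verbose output with and without the cap shows whether the kernel is register-limited or whether the compiler was already satisfied at a lower count.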

answered Oct 11 '22 by Robert Crovella