If I understand correctly, devices of compute capability 2.x have a limit of 63 registers per thread. Do you know what the per-thread register limit is for devices of compute capability 1.3?
I have a big kernel that I'm testing on a GTX 260. I'm fairly sure I'm using a lot of registers, since the kernel is very complex and I need many local variables. According to the CUDA profiler, my register usage is 63 (static smem is 68, although I'm not sure what that means, and dynamic smem is 0). Since I'm fairly sure I have more than 63 local variables, I figure the compiler is either reusing registers or spilling them to local memory.
Now, I thought devices of compute capability 1.3 had a higher per-thread register limit than the 2.x devices. My guess was that the compiler was settling on 63 because I'm using blocks of 256 threads: 256 * 63 = 16128, while 256 * 64 = 16384, which is the total number of registers per SM on this device. So I figured that if I lowered the number of threads per block, I could raise the number of registers each thread gets, and I ran the kernel with blocks of 192 threads. But the profiler still shows 63 registers, even though 192 * 63 = 12096 and 192 * 64 = 12288, which is well within the SM's 16384-register limit.
So, any idea why the compiler is still limiting itself to 63 registers? Could it all be due to register reuse, or is it still spilling registers?
The GPU has 8K or 16K registers per multiprocessor (depending on the compute capability; see the programming guide for details), and they are shared among all the threads resident on that multiprocessor. So if you have, for example, a single resident block of 256 threads, each thread can use at most 32 or 64 registers.
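To make that arithmetic concrete, here is a minimal sketch (the 16384 figure is the register file size of a cc 1.3 SM such as the one in the GTX 260; assuming a single resident block is a simplification, since the real budget also depends on allocation granularity and how many blocks are resident at once):

    #include <cstdio>

    // Rough per-thread register budget for one resident block on a cc 1.3 SM.
    int main()
    {
        const int regsPerSM       = 16384;  // register file size on cc 1.3
        const int threadsPerBlock = 256;
        printf("register budget per thread: %d\n", regsPerSM / threadsPerBlock);  // prints 64
        return 0;
    }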
Each CUDA device has a maximum number of threads per block (512 or 1024, depending on compute capability). Each thread within a block also has a flat thread ID: threadId = x + y * Dx + z * Dx * Dy, where Dx and Dy are the block dimensions. This is just the 1D linearization of the 3D thread index, like indexing a flattened array in memory. If you are working with 1D data, y and z are zero, so the flat thread ID is simply threadIdx.x.
Here are a few noteworthy points: the CUDA architecture limits the number of threads per block (512 on a cc 1.3 device like yours, 1024 on cc 2.x and later), and the dimensions of the thread block are accessible within the kernel through the built-in blockDim variable.
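As a minimal sketch of that indexing (the kernel name and output array are placeholders, not from the original answer):

    // Compute the flat (linearized) thread ID within a block from the
    // built-in threadIdx and blockDim variables.
    __global__ void flatThreadId(int *out)
    {
        int tid = threadIdx.x
                + threadIdx.y * blockDim.x
                + threadIdx.z * blockDim.x * blockDim.y;
        out[tid] = tid;   // for a 1D block this reduces to threadIdx.x
    }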
In general, all scalar variables defined in CUDA device code are stored in registers. Registers are local to a thread, and each thread has exclusive access to its own: values held in registers cannot be read by other threads, even those in the same block, and are not accessible to the host.
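A small illustration of that distinction (the kernel is a made-up example and assumes it is launched with 256 threads per block): the scalar below typically lives in a register private to each thread, while the __shared__ array is visible to every thread in the block:

    // Per-thread register variable vs. block-wide shared memory.
    __global__ void registerVsShared(const float *in, float *out)
    {
        __shared__ float tile[256];   // one copy per block, visible to all its threads
        int   tid = threadIdx.x;
        int   idx = blockIdx.x * blockDim.x + tid;
        float x   = in[idx] * 2.0f;   // 'x' normally lives in a register, private to this thread
        tile[tid] = x;                // publish it so other threads in the block can read it
        __syncthreads();
        out[idx] = tile[tid];
    }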
The maximum number of registers per thread is documented in the CUDA programming guide (table of technical specifications per compute capability).
It is 63 for cc 2.x and 3.0, 124 for cc 1.x, and 255 for cc 3.5.
The compiler may simply have decided that 63 registers is enough and has no use for more. Registers can be reused, so having many local variables does not necessarily mean the per-thread register count has to be high.
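As a contrived sketch of what that reuse looks like (the kernel is hypothetical, not taken from your code): the two temporaries below are never live at the same time, so the compiler is free to place them in the same physical register even though the source declares two variables:

    // 'a' is dead before 'b' is defined, so both can occupy one register.
    __global__ void reuseExample(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float a = in[i] * 3.0f;   // last use of 'a' is on the next line
        out[i] = a + 1.0f;
        float b = in[i] - 2.0f;   // 'b' can reuse the register that held 'a'
        out[i] += b;
    }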
My suggestion would be to use the nvcc -maxrregcount option to experiment with different limits, and the -Xptxas -v option to have the compiler report how many registers (and how much local memory, which indicates spilling) each kernel uses when the PTX is compiled to machine code.
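For example (kernel.cu and the cap of 32 are placeholders for your own source file and whatever limit you want to try; sm_13 matches the GTX 260's compute capability):

    # Print per-kernel register, shared, local and constant memory usage
    nvcc -arch=sm_13 -Xptxas -v -c kernel.cu

    # Additionally cap register usage per thread; values that don't fit get spilled to local memory
    nvcc -arch=sm_13 -maxrregcount 32 -Xptxas -v -c kernel.cu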