CUDA manual specifies the number of 32-bit registers per multiprocessor. Does it mean that:
Double variable takes two registers?
Pointer variable takes two registers? - It has to be more than one register on Fermi with 6 GB memory, right?
If answer to question 2 is yes, it must be better to use less pointer variables and more int
indices.
E. g., this kernel code:
float* p1; // two regs
float* p2 = p1 + 1000; // two regs
int i; // one reg
for ( i = 0; i < n; i++ )
{
CODE THAT USES p1[i] and p2[i]
}
theoretically requires more registers than this kernel code:
float* p1; // two regs
int i; // one reg
int j; // one reg
for ( i = 0, j = 1000; i < n; i++, j++ )
{
CODE THAT USES p1[i] and p1[j]
}
The short answer to your three questions are:
To expand on point 3, consider the following two simple memory copy kernels:
__global__
void debunk(float *in, float *out, int n)
{
int i = n * (threadIdx.x + blockIdx.x*blockDim.x);
for(int j=0; j<n; j++) {
out[i+j] = in[i+j];
}
}
__global__
void debunk2(float *in, float *out, int n)
{
int i = n * (threadIdx.x + blockIdx.x*blockDim.x);
float *x = in + i;
float *y = out + i;
for(int j=0; j<n; j++, x++, y++) {
*x = *y;
}
}
By your reckoning, debunk
must use less registers because it has only two local integer variables, whereas debunk2
has two additional pointers. And yet, when I compile them using the CUDA 5 release toolchain:
$ nvcc -m64 -arch=sm_20 -c -Xptxas="-v" pointer_size.cu
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z6debunkPfS_i' for 'sm_20'
ptxas info : Function properties for _Z6debunkPfS_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]
ptxas info : Compiling entry function '_Z7debunk2PfS_i' for 'sm_20'
ptxas info : Function properties for _Z7debunk2PfS_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 52 bytes cmem[0]
They compile to the exact same register count. And if you disassemble the toolchain output you will see that apart from the setup code, the final instruction streams are almost identical. There are a number of reasons for this, but it basically comes down to two simple rules:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With