CUDA register usage

Question

CUDA manual specifies the number of 32-bit registers per multiprocessor. Does it mean that:

Double variable takes two registers?
Pointer variable takes two registers? - It has to be more than one register on Fermi with 6 GB memory, right?

If answer to question 2 is yes, it must be better to use less pointer variables and more int indices.

E. g., this kernel code:

float* p1;               // two regs
float* p2 = p1 + 1000;   // two regs
int i;                   // one reg
for ( i = 0; i < n; i++ )
{
    CODE THAT USES p1[i] and p2[i]
}

theoretically requires more registers than this kernel code:

float* p1;               // two regs
int i;                   // one reg
int j;                   // one reg
for ( i = 0, j = 1000; i < n; i++, j++ )
{
    CODE THAT USES p1[i] and p1[j]
}

talonmies · Accepted Answer

The short answer to your three questions are:

Yes.
Yes, if the code is compiled for a 64 bit host operating system. Device pointer size always matches host application pointer size in CUDA.
No.

To expand on point 3, consider the following two simple memory copy kernels:

__global__
void debunk(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);

    for(int j=0; j<n; j++) {
        out[i+j] = in[i+j];
    }
}

__global__
void debunk2(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);
    float *x = in + i;
    float *y = out + i;

    for(int j=0; j<n; j++, x++, y++) {
        *x = *y;
    }
}

By your reckoning, debunk must use less registers because it has only two local integer variables, whereas debunk2 has two additional pointers. And yet, when I compile them using the CUDA 5 release toolchain:

$ nvcc -m64 -arch=sm_20 -c -Xptxas="-v"  pointer_size.cu 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6debunkPfS_i' for 'sm_20'
ptxas info    : Function properties for _Z6debunkPfS_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 52 bytes cmem[0]
ptxas info    : Compiling entry function '_Z7debunk2PfS_i' for 'sm_20'
ptxas info    : Function properties for _Z7debunk2PfS_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 52 bytes cmem[0]

They compile to the exact same register count. And if you disassemble the toolchain output you will see that apart from the setup code, the final instruction streams are almost identical. There are a number of reasons for this, but it basically comes down to two simple rules:

Trying to determine the register count from C code (or even PTX assembler) is mostly futile
Trying to second guess a very sophisticated compiler and assembler is also mostly futile.

CUDA register usage

Tags:

cuda

gpu

user2052436

1 Answers

talonmies

Recent Activity

Donate For Us

CUDA register usage

Tags:

cuda

gpu

user2052436

1 Answers

talonmies

Related questions

Recent Activity

Donate For Us