Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CUDA register usage

Tags:

cuda

gpu

CUDA manual specifies the number of 32-bit registers per multiprocessor. Does it mean that:

  1. Double variable takes two registers?

  2. Pointer variable takes two registers? - It has to be more than one register on Fermi with 6 GB memory, right?

  3. If answer to question 2 is yes, it must be better to use less pointer variables and more int indices.

    E. g., this kernel code:

    float* p1;               // two regs
    float* p2 = p1 + 1000;   // two regs
    int i;                   // one reg
    for ( i = 0; i < n; i++ )
    {
        CODE THAT USES p1[i] and p2[i]
    }
    

    theoretically requires more registers than this kernel code:

    float* p1;               // two regs
    int i;                   // one reg
    int j;                   // one reg
    for ( i = 0, j = 1000; i < n; i++, j++ )
    {
        CODE THAT USES p1[i] and p1[j]
    }
    
like image 880
user2052436 Avatar asked Jan 14 '23 05:01

user2052436


1 Answers

The short answer to your three questions are:

  1. Yes.
  2. Yes, if the code is compiled for a 64 bit host operating system. Device pointer size always matches host application pointer size in CUDA.
  3. No.

To expand on point 3, consider the following two simple memory copy kernels:

__global__
void debunk(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);

    for(int j=0; j<n; j++) {
        out[i+j] = in[i+j];
    }
}

__global__
void debunk2(float *in, float *out, int n)
{
    int i = n * (threadIdx.x + blockIdx.x*blockDim.x);
    float *x = in + i;
    float *y = out + i;

    for(int j=0; j<n; j++, x++, y++) {
        *x = *y;
    }
}

By your reckoning, debunk must use less registers because it has only two local integer variables, whereas debunk2 has two additional pointers. And yet, when I compile them using the CUDA 5 release toolchain:

$ nvcc -m64 -arch=sm_20 -c -Xptxas="-v"  pointer_size.cu 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6debunkPfS_i' for 'sm_20'
ptxas info    : Function properties for _Z6debunkPfS_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 52 bytes cmem[0]
ptxas info    : Compiling entry function '_Z7debunk2PfS_i' for 'sm_20'
ptxas info    : Function properties for _Z7debunk2PfS_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 8 registers, 52 bytes cmem[0]

They compile to the exact same register count. And if you disassemble the toolchain output you will see that apart from the setup code, the final instruction streams are almost identical. There are a number of reasons for this, but it basically comes down to two simple rules:

  1. Trying to determine the register count from C code (or even PTX assembler) is mostly futile
  2. Trying to second guess a very sophisticated compiler and assembler is also mostly futile.
like image 143
talonmies Avatar answered Feb 06 '23 08:02

talonmies