Tracking down cuda kernel register usage

Question

I am trying to track down register usage and came across an interesting scenario. Consider the following source:

#define OL 20
#define NHS 10

__global__ void loop_test( float ** out, const float ** in,int3 gdims,int stride){

        const int idx = blockIdx.x*blockDim.x + threadIdx.x;
        const int idy = blockIdx.y*blockDim.y + threadIdx.y;
        const int idz = blockIdx.z*blockDim.z + threadIdx.z;

        const int index = stride*gdims.y*idz + idy*stride + idx;
        int i = 0,j =0;
        float sum =0.f;
        float tmp;
        float lf;
        float u2, tW;

        u2 = 1.0;
        tW = 2.0;

        float herm[NHS];

        for(j=0; j < OL; ++j){
                for(i = 0; i < NHS; ++i){
                        herm[i] += in[j][index];
                }
        }

        for(j=0; j<OL; ++j){
                for(i=0;i<NHS; ++i){
                        tmp = sum + herm[i]*in[j][index];
                        sum = tmp;
                }
                out[j][index] = sum;
                sum =0.f;
        }

}

As a side note on the source - the running sum I could do +=, but was playing with how changing that effects register usage (seems it doesn't - just adds an extra mov instruction). Additionally this source is oriented for accessing memory mapped to 3D space.

Counting out the registers it would seem there are 22 registers ( I believe a float[N] takes up N+1 registers - please correct me if I'm wronge) based on the declarations.

However compiling with:

nvcc -cubin -arch=sm_20 -Xptxas="-v" src/looptest.cu

yields:

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 25 registers, 72 bytes cmem[0]

Ok so the number is different that what is 'expected'. Additionally if compiled with :

nvcc -cubin -arch=sm_13 -Xptxas="-v" src/looptest.cu

The register usage is far less - 8 to be exact ( apparently due to stronger adherence in sm_20 than sm_13 to IEEE floating point math standards?):

ptxas info    : Compiling entry function '_Z9loop_testPPfPPKfS2_4int3i' for 'sm_13'
ptxas info    : Used 17 registers, 40+16 bytes smem, 8 bytes cmem[1]

As a final note, change the macro OL to 40, and suddenly:

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 28 registers, 72 bytes cmem[0]

In conclusion I would like to know where registers are being eaten up, and what results in the couple observations I have made.

I don't have enough experience with assembly to get through a cuobjdump - the answer certainly lies buried in there - maybe someone can enlighten me about what I should be looking for or show me a guide as to how to approach the assembly dump.

harrism · Accepted Answer

sm_20 and sm_13 are very different architectures, with very different instruction set (ISA) design. The main difference that causes the increase in register usage that you see is that sm_1x has special-purpose address registers, while sm_2x and later do not. Instead, addresses are stored in general-purpose registers just like values are, which means most programs require more registers on sm_2x than on sm_1x.

sm_20 also has twice the register file size of sm_13, to compensate for this affect.

Tracking down cuda kernel register usage

Tags:

c

optimization

cuda

Marm0t

1 Answers

harrism

Recent Activity

Donate For Us

Tracking down cuda kernel register usage

Tags:

c

optimization

cuda

Marm0t

1 Answers

harrism

Related questions

Recent Activity

Donate For Us