I am trying to track down register usage and came across an interesting scenario. Consider the following source:
#define OL 20
#define NHS 10
__global__ void loop_test( float ** out, const float ** in,int3 gdims,int stride){
const int idx = blockIdx.x*blockDim.x + threadIdx.x;
const int idy = blockIdx.y*blockDim.y + threadIdx.y;
const int idz = blockIdx.z*blockDim.z + threadIdx.z;
const int index = stride*gdims.y*idz + idy*stride + idx;
int i = 0,j =0;
float sum =0.f;
float tmp;
float lf;
float u2, tW;
u2 = 1.0;
tW = 2.0;
float herm[NHS];
for(j=0; j < OL; ++j){
for(i = 0; i < NHS; ++i){
herm[i] += in[j][index];
}
}
for(j=0; j<OL; ++j){
for(i=0;i<NHS; ++i){
tmp = sum + herm[i]*in[j][index];
sum = tmp;
}
out[j][index] = sum;
sum =0.f;
}
}
As a side note on the source - the running sum I could do +=, but was playing with how changing that effects register usage (seems it doesn't - just adds an extra mov instruction). Additionally this source is oriented for accessing memory mapped to 3D space.
Counting out the registers it would seem there are 22 registers ( I believe a float[N] takes up N+1 registers - please correct me if I'm wronge) based on the declarations.
However compiling with:
nvcc -cubin -arch=sm_20 -Xptxas="-v" src/looptest.cu
yields:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 25 registers, 72 bytes cmem[0]
Ok so the number is different that what is 'expected'. Additionally if compiled with :
nvcc -cubin -arch=sm_13 -Xptxas="-v" src/looptest.cu
The register usage is far less - 8 to be exact ( apparently due to stronger adherence in sm_20 than sm_13 to IEEE floating point math standards?):
ptxas info : Compiling entry function '_Z9loop_testPPfPPKfS2_4int3i' for 'sm_13'
ptxas info : Used 17 registers, 40+16 bytes smem, 8 bytes cmem[1]
As a final note, change the macro OL to 40, and suddenly:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 72 bytes cmem[0]
In conclusion I would like to know where registers are being eaten up, and what results in the couple observations I have made.
I don't have enough experience with assembly to get through a cuobjdump - the answer certainly lies buried in there - maybe someone can enlighten me about what I should be looking for or show me a guide as to how to approach the assembly dump.
sm_20 and sm_13 are very different architectures, with very different instruction set (ISA) design. The main difference that causes the increase in register usage that you see is that sm_1x has special-purpose address registers, while sm_2x and later do not. Instead, addresses are stored in general-purpose registers just like values are, which means most programs require more registers on sm_2x than on sm_1x.
sm_20 also has twice the register file size of sm_13, to compensate for this affect.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With