 

Counting registers/thread in CUDA kernel

The Nsight profiler tells me that the following kernel uses 52 registers per thread:

// Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
                             float dt, float currTime,
                             float *dev_voles, float *dev_weasels,
                             curandStateMtgp32 *state)
{
    __shared__ float dev_params[9];
    __shared__ int BuYeSimStep[4];

    if (threadIdx.x < 4)
    {
        BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
    }

    if (threadIdx.x < 9)
    {
        dev_params[threadIdx.x] = params[threadIdx.x];
    }

    __syncthreads();

    float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
    float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
    float oldVole = currVole;
    float oldWeas = currWeas;

    int jj;

    if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
    {
        int dayIndex = 0;

        /* Not declaring any new variable from here on, just doing arithmetic.
           ....... */

If each register holds 4 bytes, I don't understand how we get to 52 registers per thread, even assuming that the arrays params[9] and ctrl_params[4] end up in registers (in which case copying them to shared memory as I did doesn't make sense). I would like to increase occupancy, but I don't understand why I'm using so many registers. Any ideas?

asked Dec 21 '22 by Matteo Fasiolo


2 Answers

It's generally difficult to look at C code and predict register usage from it. The compiler may optimize aggressively in ways that increase register usage, for instance to save an instruction here or there. You seem to be assuming that register usage can be read off from the variable declarations in your C code; while there is some connection between the two, register usage cannot be computed directly from C code variable allocations.

Since you haven't provided the complete code, nobody can account for the register usage in detail. If you want to understand it better, look at the generated PTX directly: compile your code with nvcc using the -ptx switch and inspect the resulting .ptx file. You may want to refer to the PTX documentation as well as the nvcc documentation for the various compiler options.
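For example, assuming your source file is named voles_kernel.cu (substitute your own file name), you could generate the PTX with:

    nvcc -ptx voles_kernel.cu

and then open the resulting voles_kernel.ptx in a text editor.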

Without the complete code it's not possible to make specific suggestions, but you may be able to reduce register usage by reducing the use of constants, reducing or refactoring arithmetic, or switching from double to float; there are many other possibilities as well. Register usage will also be affected if you are passing the -G (debug) switch to the compiler.
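As one illustration (a sketch based on the snippet in your question, not a claim about where your 52 registers are actually going): the literal 3.0 in your curand_uniform line is a double constant, so the addition is performed in double precision, which can cost extra registers and instructions compared to using a float literal:

    // 3.0 is a double literal, so the float returned by curand_uniform
    // is promoted and the add is done in double precision
    float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;

    // 3.0f keeps the arithmetic in single precision
    float currVole = curand_uniform(&state[blockIdx.x]) + 3.0f;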

You can limit the compiler's use of registers per thread by passing the -maxrregcount switch to nvcc with an appropriate value, such as -maxrregcount 20, which instructs the compiler to limit itself to 20 registers per thread. This tactic may not give good results, and you may need to tune the value so that not too much performance is sacrificed; with some experimentation you may find a setting that preserves most of the basic performance while improving occupancy. If you constrain the compiler too much, it will spill registers to local memory, which generally reduces performance.
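For example (again with a hypothetical file name):

    nvcc -maxrregcount 20 -o voles voles_kernel.cu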

You should also be aware that you can pass -Xptxas -v to nvcc, which prints useful information about the compiler's register usage and other related data (spilling, etc.) at compile time.
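For example:

    nvcc -Xptxas -v -c voles_kernel.cu

will make ptxas report register usage, memory usage, and any spill loads/stores for each kernel it compiles.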

answered Dec 22 '22 by Robert Crovella


If you want to increase occupancy, a direct way is to use the -maxrregcount compiler flag to restrict register usage, but this may cost performance because some registers will be spilled to local memory, which is very slow.

answered Dec 22 '22 by Xiaolong Xie