
What kind of variables consume registers in CUDA?

Tags:

cuda

__global__ void add( int *c, const int* a, const int* b )
{
    int x = blockIdx.x;                 // block coordinates in a 2-D grid
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;     // linearise (x, y) into an index
    c[offset] = a[offset] + b[offset];  // one element per block
}

In the above example, I guess that x, y, and offset are stored in registers, and the tools seem to confirm it:

  • nvcc -Xptxas -v gives 4 registers, 24+16 bytes smem

  • profiler shows 4 registers

  • and the head of the PTX file reads:

    .reg .u16 %rh<4>;
    .reg .u32 %r<9>;    
    .reg .u64 %rd<10>;  
    .loc    15  21  0   
    
    $LDWbegin__Z3addPiPKiS1_:   
    .loc    15  26  0  
    

Can anyone clarify the usage of registers? On Fermi, the maximum is 63 registers per thread. In my program I want to test the case where a kernel consumes too many registers, so that variables have to be spilled to local memory automatically, which leads to a performance decrease. At that point I could split the kernel into two, so that each thread has enough registers (assume the SM resources are sufficient for concurrent kernels).

I'm not sure if I am right.
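
For example, a sketch of the kind of register-hungry kernel I have in mind (made-up code, not my real program):

    #define N_ACC 32

    // Made-up test kernel: the fully unrolled acc[] array keeps many
    // scalars live at once, which drives up per-thread register use.
    // Compile with  nvcc -Xptxas -v  and compare the register count
    // with and without  -maxrregcount=<n>  to see spilling appear.
    __global__ void reg_pressure(float *out, const float *in)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc[N_ACC];

        #pragma unroll
        for (int i = 0; i < N_ACC; ++i)
            acc[i] = in[tid] * (i + 1);   // all N_ACC values stay live

        float sum = 0.0f;
        #pragma unroll
        for (int i = 0; i < N_ACC; ++i)
            sum += acc[i] * acc[i];

        out[tid] = sum;
    }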

asked Jul 14 '12 by user1525320


People also ask

What are registers in CUDA?

In general, all scalar variables defined in CUDA code are stored in registers. Registers are local to a thread, and each thread has exclusive access to its own registers: values in registers cannot be accessed by other threads, even from the same block, and are not visible to the host.
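
A tiny illustration (hypothetical kernel): tid and acc below are automatic scalars, so each thread gets its own register-resident copy, and the only way another thread or the host can see the value is through memory:

    __global__ void private_scalars(int *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread scalar
        int acc = tid * 2;        // private to this thread, held in a register
        out[tid] = acc;           // sharing requires a write to memory
    }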

What is constant memory in CUDA?

The constant memory in CUDA is a dedicated memory space of 65536 bytes. It is dedicated because it has some special features, such as caching and broadcasting. The constant memory space resides in device memory and is cached in the dedicated constant cache.
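
A minimal usage sketch (hypothetical names): the array is declared with __constant__ at file scope and filled from the host with cudaMemcpyToSymbol before the launch:

    __constant__ float coeffs[16];   // lives in the 64 KB constant space

    __global__ void apply_coeffs(float *data)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // When all threads of a warp read the same address, the constant
        // cache services them with a single broadcast.
        data[tid] *= coeffs[tid % 16];
    }

    // Host side, before the launch:
    //   float h_coeffs[16] = { /* ... */ };
    //   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));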

What is function of __ global __ qualifier in CUDA program?

__global__ is a qualifier added to standard C. It alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
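
For example (hypothetical kernel): the function is defined with __global__, compiled for the device, and launched from the host with the <<<grid, block>>> syntax:

    __global__ void scale(float *v, float s)   // runs on the GPU
    {
        v[threadIdx.x] *= s;
    }

    // Host code: launch 1 block of 256 threads.
    //   scale<<<1, 256>>>(d_v, 2.0f);   // d_v must be a device pointer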

What are the three general sections of a CUDA program?

To execute any CUDA program, there are three main steps: Copy the input data from host memory to device memory, also known as host-to-device transfer. Load the GPU program and execute, caching data on-chip for performance. Copy the results from device memory to host memory, also called device-to-host transfer.
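
Put together, a minimal host-side skeleton for the add kernel from the question (hypothetical sizes; error checking omitted; assumes add is in the same file):

    #include <cuda_runtime.h>

    int main(void)
    {
        const int n = 1024;
        const size_t bytes = n * sizeof(int);
        int h_a[1024], h_b[1024], h_c[1024];   // host buffers
        // ... fill h_a and h_b ...

        int *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);

        // 1. host-to-device transfer
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // 2. execute: n blocks of 1 thread, matching the 1-D use of add
        add<<<n, 1>>>(d_c, d_a, d_b);

        // 3. device-to-host transfer
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }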


1 Answer

The register allocation in PTX is completely irrelevant to the final register consumption of the kernel. PTX is only an intermediate representation of the final machine code and uses static single assignment (SSA) form, meaning that each virtual register in PTX is assigned exactly once. A piece of PTX using hundreds of registers can compile into a kernel using only a few.

Register assignment is done by ptxas as a completely standalone compilation pass (either statically, just-in-time by the driver, or both), and it can perform a lot of code reordering and optimisation on the input PTX to improve throughput and conserve registers. As a result, there is little or no relationship between the variables in the original C code, or the registers in PTX, and the final register count of the assembled kernel.

nvcc does provide some ways to influence the register allocation behaviour of the assembler. You can use __launch_bounds__ to give the compiler heuristic hints that influence register allocation, and the compiler/assembler accepts a -maxrregcount argument to cap per-thread register usage (at the potential expense of register spilling to local memory, which can lower performance). The volatile keyword used to make a difference with older versions of the Open64-based compiler (nvopencc) and could influence local memory spill behaviour. But you cannot arbitrarily control or steer register allocation from the original C code or from PTX assembly language.
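
For example, a sketch of the two mechanisms (hypothetical kernel):

    // __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) tells ptxas
    // the kernel will never be launched with more than 256 threads per
    // block and that at least 4 resident blocks per SM are desired; it
    // may then cap register usage (spilling if necessary) to get there.
    __global__ void __launch_bounds__(256, 4)
    my_kernel(float *data)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        data[tid] *= 2.0f;
    }

    // The blunt, whole-compilation version of the same cap:
    //   nvcc -maxrregcount=32 kernel.cu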

answered Oct 14 '22 by talonmies