
What kind of variables consume registers in CUDA?

Tags:

cuda

__global__ void add( int *c, const int* a, const int* b )
{
    int x = blockIdx.x;                 // block coordinates in a 2-D grid
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;     // linearise (x, y) into an index
    c[offset] = a[offset] + b[offset];  // one element per block
}

In the above example, I guess that x, y, and offset are stored in registers, and the tools seem to confirm it:

  • nvcc -Xptxas -v gives 4 registers, 24+16 bytes smem

  • profiler shows 4 registers

  • and the head of the PTX file reads:

    .reg .u16 %rh<4>;
    .reg .u32 %r<9>;    
    .reg .u64 %rd<10>;  
    .loc    15  21  0   
    
    $LDWbegin__Z3addPiPKiS1_:   
    .loc    15  26  0  
    

Can anyone clarify the usage of registers? On Fermi, the maximum is 63 registers per thread. In my program I want to test the case where a kernel consumes too many registers, so that variables have to be spilled to local memory automatically, which leads to a performance decrease. At that point I could split the kernel into two, so that each thread has enough registers (assume the SM resources are sufficient for concurrent kernels).

I'm not sure if I am right.
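
For example, a sketch of the kind of register-hungry kernel I have in mind (made-up code, not my real program):

    #define N_ACC 32

    // Made-up test kernel: the fully unrolled acc[] array keeps many
    // scalars live at once, which drives up per-thread register use.
    // Compile with  nvcc -Xptxas -v  and compare the register count
    // with and without  -maxrregcount=<n>  to see spilling appear.
    __global__ void reg_pressure(float *out, const float *in)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc[N_ACC];

        #pragma unroll
        for (int i = 0; i < N_ACC; ++i)
            acc[i] = in[tid] * (i + 1);   // all N_ACC values stay live

        float sum = 0.0f;
        #pragma unroll
        for (int i = 0; i < N_ACC; ++i)
            sum += acc[i] * acc[i];

        out[tid] = sum;
    }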

asked Jul 14 '12 by user1525320


People also ask

What are registers in CUDA?

In general, all scalar variables defined in CUDA code are stored in registers. Registers are local to a thread, and each thread has exclusive access to its own registers: values in registers cannot be accessed by other threads, even from the same block, and are not visible to the host.
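
A tiny illustration (hypothetical kernel): tid and acc below are automatic scalars, so each thread gets its own register-resident copy, and the only way another thread or the host can see the value is through memory:

    __global__ void private_scalars(int *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread scalar
        int acc = tid * 2;        // private to this thread, held in a register
        out[tid] = acc;           // sharing requires a write to memory
    }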

What is constant memory in CUDA?

The constant memory in CUDA is a dedicated memory space of 65536 bytes. It is dedicated because it has some special features, such as caching and broadcasting. The constant memory space resides in device memory and is cached in the dedicated constant cache.
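
A minimal usage sketch (hypothetical names): the array is declared with __constant__ at file scope and filled from the host with cudaMemcpyToSymbol before the launch:

    __constant__ float coeffs[16];   // lives in the 64 KB constant space

    __global__ void apply_coeffs(float *data)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // When all threads of a warp read the same address, the constant
        // cache services them with a single broadcast.
        data[tid] *= coeffs[tid % 16];
    }

    // Host side, before the launch:
    //   float h_coeffs[16] = { /* ... */ };
    //   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));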

What is function of __ global __ qualifier in CUDA program?

__global__ is a qualifier added to standard C. It alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
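
For example (hypothetical kernel): the function is defined with __global__, compiled for the device, and launched from the host with the <<<grid, block>>> syntax:

    __global__ void scale(float *v, float s)   // runs on the GPU
    {
        v[threadIdx.x] *= s;
    }

    // Host code: launch 1 block of 256 threads.
    //   scale<<<1, 256>>>(d_v, 2.0f);   // d_v must be a device pointer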

What are the three general sections of a CUDA program?

To execute any CUDA program, there are three main steps: Copy the input data from host memory to device memory, also known as host-to-device transfer. Load the GPU program and execute, caching data on-chip for performance. Copy the results from device memory to host memory, also called device-to-host transfer.
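
Put together, a minimal host-side skeleton for the add kernel from the question (hypothetical sizes; error checking omitted; assumes add is in the same file):

    #include <cuda_runtime.h>

    int main(void)
    {
        const int n = 1024;
        const size_t bytes = n * sizeof(int);
        int h_a[1024], h_b[1024], h_c[1024];   // host buffers
        // ... fill h_a and h_b ...

        int *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);

        // 1. host-to-device transfer
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // 2. execute: n blocks of 1 thread, matching the 1-D use of add
        add<<<n, 1>>>(d_c, d_a, d_b);

        // 3. device-to-host transfer
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }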


1 Answer

The register allocation in PTX is completely irrelevant to the final register consumption of the kernel. PTX is only an intermediate representation of the final machine code and uses static single assignment (SSA) form, meaning that each virtual register in PTX is assigned exactly once. A piece of PTX using hundreds of registers can compile into a kernel using only a few.

Register assignment is done by ptxas as a completely standalone compilation pass (either statically, just-in-time by the driver, or both), and it can perform a lot of code reordering and optimisation on the input PTX to improve throughput and conserve registers. As a result, there is little or no relationship between the variables in the original C code, or the registers in PTX, and the final register count of the assembled kernel.

nvcc does provide some ways to influence the register allocation behaviour of the assembler. You can use __launch_bounds__ to give the compiler heuristic hints that influence register allocation, and the compiler/assembler accepts a -maxrregcount argument to cap per-thread register usage (at the potential expense of register spilling to local memory, which can lower performance). The volatile keyword used to make a difference with older versions of the Open64-based compiler (nvopencc) and could influence local memory spill behaviour. But you cannot arbitrarily control or steer register allocation from the original C code or from PTX assembly language.
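
For example, a sketch of the two mechanisms (hypothetical kernel):

    // __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) tells ptxas
    // the kernel will never be launched with more than 256 threads per
    // block and that at least 4 resident blocks per SM are desired; it
    // may then cap register usage (spilling if necessary) to get there.
    __global__ void __launch_bounds__(256, 4)
    my_kernel(float *data)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        data[tid] *= 2.0f;
    }

    // The blunt, whole-compilation version of the same cap:
    //   nvcc -maxrregcount=32 kernel.cu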

answered Oct 14 '22 by talonmies