Forcing CUDA to use register for a variable

Tags:

cuda

I have many unused registers in my kernel. I'd like to tell CUDA to use a few registers to hold some data, rather than doing a global data read every time I need it. (I'm not able to use shared mem.)

__global__ void simple(float *gData) {
    float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    // work on the data here
}

Compiling with nvcc -arch sm_20 --ptxas-options=-v simple.cu, I get:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 2 registers, 40 bytes cmem[0]

__global__ void simple(float *gData) {
    register float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    // work on the data here
}

The register declaration does nothing:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 2 registers, 40 bytes cmem[0]

__global__ void simple(float *gData) {
    volatile float rData[1024];
    for (int i = 0; i < 1024; i++) {
        rData[i] = gData[i];
    }
    // work on the data here
}

The volatile declaration creates stack storage:
4096 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Used 21 registers, 40 bytes cmem[0]

1) Is there an easy way to tell the compiler to use register space for a variable?
2) Where is 'stack frame': register, global mem, local mem,...? What is a stack frame? (Since when does the GPU have a stack? A virtual stack?)
3) The simple.ptx file (from nvcc -arch sm_20 -ptx simple.cu) is basically empty:

.loc 2 14 2
ret;

Any idea where I can find the real machine/compiled code?

asked Aug 28 '12 by Doug

2 Answers

  • Dynamically indexed arrays cannot be stored in registers, because the GPU register file is not dynamically addressable.
  • Scalar variables are automatically stored in registers by the compiler.
  • Small arrays with only static indices (i.e. where every index can be determined at compile time), say fewer than 16 floats, may be stored in registers by the compiler (see the sketch just below this list).
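
For example (a rough sketch of the last point, not something the compiler guarantees): a small array whose every index is a compile-time constant after unrolling may be promoted to registers.

__global__ void small_array(const float *gData, float *out) {
    // Small array, fully unrolled loops: every index is a compile-time
    // constant, so the compiler *may* keep rData entirely in registers.
    float rData[8];
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        rData[i] = gData[threadIdx.x * 8 + i];
    }
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        sum += rData[i];
    }
    out[threadIdx.x] = sum;  // the result is used, so nothing is dead code
}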

SM 2.0 GPUs (Fermi) only support up to 63 registers per thread. If this is exceeded, register values will be spilled/filled from local (off-chip) memory, supported by the cache hierarchy. SM 3.5 GPUs expand this to up to 255 registers per thread.

In general, as Jared mentions, using too many registers per thread is not desirable because it reduces occupancy, and therefore reduces the kernel's latency-hiding ability. GPUs thrive on parallelism, covering memory latency with work from other threads.

Therefore, you should probably not optimize arrays into registers. Instead, ensure that your memory accesses to those arrays across threads are as close to sequential as possible so you maximize coalescing (i.e. minimize memory transactions).
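
For instance, a pattern like the following (an illustrative sketch, not derived from the question's kernel) lets consecutive threads read consecutive elements, so a warp's loads coalesce into a minimal number of transactions:

__global__ void coalesced(const float *gData, float *out, int n) {
    // Thread 0 reads gData[0], thread 1 reads gData[1], ... within a warp
    // the addresses are contiguous, so the loads coalesce.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        out[idx] = gData[idx] * 2.0f;  // "work on the data here"
    }
}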

The example you give may be a case for shared memory if:

  1. Many threads in the block use the same data, or
  2. The per-thread array is small enough that shared memory can hold it for all threads in multiple thread blocks (1024 floats per thread is far too much); a rough shared-memory sketch follows this list.
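
If condition 1 holds, a shared-memory staging version might look roughly like this (a sketch only; the 256-element tile and the block-wide sum are assumptions, not taken from the question):

#define TILE 256  // assumed block size for this sketch

__global__ void staged(const float *gData, float *out) {
    __shared__ float sData[TILE];
    int tid = threadIdx.x;
    // The block cooperatively loads one tile that all of its threads reuse.
    sData[tid] = gData[blockIdx.x * TILE + tid];
    __syncthreads();
    float sum = 0.0f;
    for (int i = 0; i < TILE; i++) {
        sum += sData[i];  // every thread can read the whole tile on-chip
    }
    out[blockIdx.x * TILE + tid] = sum;
}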

As njuffa mentioned, the reason your kernel only uses 2 registers is that you don't do anything useful with the data in the kernel, so the compiler eliminated all of it as dead code.

answered Nov 02 '22 by harrism

As noted already, registers (and the PTX "param space") cannot be indexed dynamically. To do that, the compiler would have to emit the equivalent of a switch...case block that turns the dynamic index into an immediate; I'm not sure it ever does this automatically. You can help it happen with a fixed-size tuple structure and a switch...case, and C/C++ template metaprogramming is likely the weapon of choice to keep such code manageable.
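
A hand-written version of that trick might look like this (illustrative only; whether the scalars actually stay in registers is still the compiler's decision):

__device__ float select4(float r0, float r1, float r2, float r3, int i) {
    // The dynamic index is turned into a selection between scalar values,
    // each of which the compiler is free to keep in a register.
    switch (i) {
        case 0:  return r0;
        case 1:  return r1;
        case 2:  return r2;
        default: return r3;
    }
}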

Also, for CUDA 4.0 use the command-line switch -Xopencc=-O3 to have anything but plain scalars (such as data structures) mapped to registers (see this post). For CUDA > 4.0 you have to disable debug support (no -G command-line option); optimization happens only when debugging is disabled.

The PTX level allows many more virtual registers than the hardware has; those are mapped to hardware registers at load time. The register limit you specify lets you set an upper bound on the hardware resources used by the generated binary. It also serves as a heuristic for the compiler, already when compiling to PTX, to decide when to spill registers (see below) so that certain concurrency needs can be met (see "launch bounds", "occupancy" and "concurrent kernel execution" in the CUDA documentation - you might also enjoy this most interesting presentation).
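
For example, launch bounds can be attached directly to a kernel; the numbers below are placeholders, not recommendations:

// Promise at most 256 threads per block and ask for at least 4 resident
// blocks per SM; the compiler then limits registers per thread accordingly,
// spilling to local memory if necessary.
__global__ void __launch_bounds__(256, 4)
bounded(const float *gData, float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = gData[idx];
}

A file-wide ceiling can be set instead with nvcc's --maxrregcount=N switch.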

For Fermi GPUs there are at most 64 hardware registers per thread. The 64th (or the last one, when using fewer than the hardware maximum) is used by the ABI as the stack pointer and thus for "register spilling" (freeing up registers by temporarily storing their values on the stack when more registers are needed than are available), so it is untouchable.

answered Nov 02 '22 by Dude