I have many unused registers in my kernel. I'd like to tell CUDA to use a few registers to hold some data, rather than doing a global data read every time I need it. (I'm not able to use shared mem.)
    __global__ void simple(float *gData) {
        float rData[1024];
        for (int i = 0; i < 1024; i++) {
            rData[i] = gData[i];
        }
        // work on the data here
    }
Compiling with nvcc -arch sm_20 --ptxas-options=-v simple.cu, I get:

    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    Used 2 registers, 40 bytes cmem[0]
    __global__ void simple(float *gData) {
        register float rData[1024];
        for (int i = 0; i < 1024; i++) {
            rData[i] = gData[i];
        }
        // work on the data here
    }
The register declaration does nothing:

    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    Used 2 registers, 40 bytes cmem[0]
    __global__ void simple(float *gData) {
        volatile float rData[1024];
        for (int i = 0; i < 1024; i++) {
            rData[i] = gData[i];
        }
        // work on the data here
    }
The volatile declaration creates stack storage:

    4096 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    Used 21 registers, 40 bytes cmem[0]
1) Is there an easy way to tell the compiler to use register space for a variable?
2) Where is 'stack frame': register, global mem, local mem,...? What is a stack frame? (Since when does the GPU have a stack? A virtual stack?)
3) The simple.ptx file is basically empty (nvcc -arch sm_20 -ptx simple.cu):

    .loc 2 14 2
    ret;
Any idea where I can find the real machine/compiled code?
SM 2.0 GPUs (Fermi) only support up to 63 registers per thread. If this is exceeded, register values will be spilled/filled from local (off-chip) memory, supported by the cache hierarchy. SM 3.5 GPUs expand this to up to 255 registers per thread.
In general, as Jared mentions, using too many registers per thread is not desirable because it reduces occupancy, and therefore reduces the kernel's latency-hiding ability. GPUs thrive on parallelism and do so by covering memory latency with work from other threads.
Therefore, you should probably not optimize arrays into registers. Instead, ensure that your memory accesses to those arrays across threads are as close to sequential as possible so you maximize coalescing (i.e. minimize memory transactions).
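As a sketch of what that looks like (an assumed rewrite, not the original kernel: one thread per element instead of each thread copying all 1024 values, so consecutive threads read consecutive addresses and the loads coalesce):

```cuda
// Hypothetical restructuring: launch with e.g. simple<<<1, 1024>>>(gData)
// so thread i touches only element i. Adjacent threads access adjacent
// addresses, which the hardware coalesces into few memory transactions.
__global__ void simple(float *gData) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = gData[i];   // one scalar per thread; lives in a register
    // work on r here, then make the result observable:
    gData[i] = r * 2.0f;  // placeholder computation
}
```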
The example you give may be a case for shared memory if:
As njuffa mentioned, the reason your kernel only uses 2 registers is because you don't do anything useful with the data in the kernel, and the dead code was all eliminated by the compiler.
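To actually see registers consumed, the result has to be observable. A minimal variation (my own sketch, with a small fixed-size array and fully unrollable loops so the indices can become immediates):

```cuda
__global__ void simple(float *gData) {
    float rData[32];               // small, fixed size: a candidate for registers
    for (int i = 0; i < 32; i++)
        rData[i] = gData[i];
    float sum = 0.0f;
    for (int i = 0; i < 32; i++)   // constant trip count: the compiler can
        sum += rData[i];           // unroll, turning rData[i] into scalars
    gData[0] = sum;                // the store keeps the computation live
}
```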
As noted already, registers (and the PTX "param space") cannot be indexed dynamically. To do that, the compiler would have to emit code such as a switch...case block to turn the dynamic index into an immediate, and I'm not sure it ever does so automatically. You can help it along by using a fixed-size tuple structure and a switch...case yourself. C/C++ metaprogramming is likely the weapon of choice to keep code like this manageable.
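A sketch of that fixed-size-tuple-plus-switch idea (shown as plain C++ for illustration; the struct and member names are mine, not from any CUDA API). In device code the same pattern gives the compiler a chance to keep the scalars in registers, since no indexed array ever exists:

```cpp
#include <cassert>

// Hypothetical fixed-size "tuple" of scalars, accessed only through a
// switch...case that maps a dynamic index onto an immediate access.
struct Reg4 {
    float r0, r1, r2, r3;  // individual scalars, never an indexed array

    float get(int i) const {
        switch (i) {           // dynamic index -> immediate member access
            case 0:  return r0;
            case 1:  return r1;
            case 2:  return r2;
            default: return r3;
        }
    }
    void set(int i, float v) {
        switch (i) {
            case 0:  r0 = v; break;
            case 1:  r1 = v; break;
            case 2:  r2 = v; break;
            default: r3 = v; break;
        }
    }
};
```

For larger tuples, template metaprogramming (a recursive pair-like structure with a compile-time generated switch) keeps this from exploding into hand-written cases.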
Also, for CUDA 4.0 use the command-line switch -Xopencc=-O3 in order to have anything but plain scalars (such as data structures) mapped to registers (see this post). For CUDA > 4.0 you have to disable debug support (no -G command-line option); optimization happens only when debugging is disabled.
The PTX level allows many more virtual registers than the hardware provides; these are mapped to hardware registers at load time. The register limit you specify sets an upper bound on the hardware resources used by the generated binary. It serves as a heuristic for the compiler to decide when to spill registers (see below) already while compiling to PTX, so that certain concurrency needs can be met (see "launch bounds", "occupancy" and "concurrent kernel execution" in the CUDA documentation - you might also enjoy this most interesting presentation).
For Fermi GPUs there are at most 64 hardware registers. The 64th (or the last one, when using fewer than the hardware maximum) is used by the ABI as the stack pointer and thus for "register spilling" (freeing up registers by temporarily storing their values on the stack, which happens when more registers are needed than are available), so it is untouchable.
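The per-kernel register limit mentioned above can be hinted with CUDA's __launch_bounds__ qualifier (or capped globally with nvcc's -maxrregcount flag). A minimal sketch, assuming a kernel launched with 1024-thread blocks:

```cuda
// Tell the compiler the kernel will run with at most 1024 threads per
// block and that at least 2 blocks should be resident per SM; ptxas then
// limits register use (spilling if necessary) to make that possible.
__global__ void __launch_bounds__(1024, 2) simple(float *gData) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    gData[i] *= 2.0f;  // placeholder work
}
```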