I just noticed that my CUDA kernel uses exactly twice the shared memory that 'theory' predicts. For example:
__global__ void foo()
{
    __shared__ double t;
    t = 1;
}
The ptxas info shows:
ptxas info : Function properties for _Z3foov, 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 16 bytes smem, 32 bytes cmem[0]
But a double is only 8 bytes, so I would expect 8 bytes of smem, not 16.
Another example:
__global__ void foo()
{
    __shared__ int t[1024];
    t[0] = 1;
}
ptxas info : Used 3 registers, 8192 bytes smem, 32 bytes cmem[0]
Again, 1024 ints should take 1024 * 4 = 4096 bytes, yet ptxas reports 8192, exactly double. Could someone explain why?
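For what it's worth, the static shared memory size can also be queried at runtime with cudaFuncGetAttributes as a cross-check against the ptxas numbers. A minimal sketch, reusing the first foo() kernel above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo()
{
    __shared__ double t;  // 8 bytes of static shared memory, in theory
    t = 1;
}

int main()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, foo) != cudaSuccess)
        return 1;
    // sharedSizeBytes is the statically allocated shared memory per block
    printf("static smem: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}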
It seems the problem has gone away in the current CUDA compiler.
__shared__ int a[1024];
compiled with the command 'nvcc -m64 -Xptxas -v -ccbin /opt/gcc-4.6.3/bin/g++-4.6.3 shmem.cu' gives
ptxas info : Used 1 registers, 4112 bytes smem, 4 bytes cmem[1]
There is some shared memory overhead in this case (4112 reported minus 4096 for the array leaves 16 bytes), but the usage is no longer doubled.
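For reference, a complete shmem.cu matching that command might look like the sketch below; the kernel wrapper, the write to a[0], and the empty main are assumptions, only the array declaration comes from the post.

// shmem.cu
__global__ void foo()
{
    __shared__ int a[1024];  // 1024 * sizeof(int) = 4096 bytes expected
    a[0] = 1;                // touch the array so it is not optimized away
}

int main() { return 0; }  // ptxas -v prints the resource usage at compile time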