To avoid really long and incohesive functions I am calling
a number of __device__ functions from a kernel. I allocate a shared
buffer at the beginning of the kernel call (which is per-thread-block)
and pass pointers to it to all the __device__ functions that are
performing some processing steps in the kernel.
I was wondering about the following:
If I allocate a shared memory buffer in a __global__ function,
how can other __device__ functions that I pass a pointer distinguish between the possible address types (global device or shared memory) that the pointer could refer to?
Note that it is invalid to decorate the formal parameters with a __shared__ modifier
according to the CUDA programming guide. The only way it could be implemented IMHO is by
a) putting markers on the allocated memory.
b) passing invisible parameters with the call.
c) having a virtual unified address space that has separate segments for global and shared memory and a threshold check on the pointer can be used?
So my question is: Do I need to worry about it or how should one proceed alternatively without inlining all functions into the main kernel?
On the side I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called
'external calls from global functions', requiring them to be inlined. This means in effect
I have to declare all ___device___ functions inline and the separation of header / source
files is broken. This is of course quite ugly, but is there an alternative?
If I allocate a shared memory buffer in a global function how can other device functions that I pass a pointer distinguish between the possible address types (global device or shared mem) that the pointer could refer to.
Note that "shared" memory, in the context of CUDA, specifically means the on-chip memory that is shared between all threads in a block. So, if you mean an array declared with the __shared__ qualifier, it normally doesn't make sense to use it for passing information between device functions (as all the threads see the very same memory). I think the compiler might put regular arrays in shared memory? Or maybe it was in the register file. Anyway, there's a good chance that it ends up in global memory, which would be an inefficient way of passing information between the device functions (especially on < 2.0 devices).
On the side I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect I have to declare all device functions inline and the separation of header / source files is broken. This is of course quite ugly, but is there an alternative?
CUDA does not include a linker for device code so you must keep the kernel(s) and all related device functions in the same .cu file.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With