I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write:
template<typename T>
__global__ void kernel(T* ptr)
{
extern __shared__ T smem[];
// calculations here ...
}
template<typename T>
void call_kernel( T* ptr, const int n )
{
dim3 dimBlock(n), dimGrid;
kernel<<<dimGrid, dimBlock, n*sizeof(T)>>>(ptr);
}
int main(int argc, char *argv[])
{
const int n = 32;
float *float_ptr;
double *double_ptr;
cudaMalloc( (void**)&float_ptr, n*sizeof(float) );
cudaMalloc( (void**)&double_ptr, n*sizeof(double) );
call_kernel( float_ptr, n );
call_kernel( double_ptr, n ); // problem, 2nd instantiation
cudaFree( (void*)float_ptr );
cudaFree( (void*)double_ptr );
return 0;
}
However, this code cannot be compiled. nvcc gives me the following error message:
main.cu(4): error: declaration is incompatible with previous "smem"
(4): here
detected during:
instantiation of "void kernel(T *) [with T=double]"
(12): here
instantiation of "void call_kernel(T *, int) [with T=double]"
(24): here
I understand that I am running into a name conflict because the shared memory is declared as extern. Nevertheless there is no way around that if I want to define its size during runtime, as far as I know.
So, my question is: Is there any elegant way to obtain the desired behavior? With elegant I mean without code duplication etc.
Dynamically allocated shared memory is really just a size (in bytes) and a pointer being set up for the kernel. So something like this should work:
replace this:
extern __shared__ T smem[];
with this:
extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
T *smem = reinterpret_cast<T *>(my_smem);
You can see other examples of re-casting of dynamically allocated shared memory pointers in the programming guide which can serve other needs.
EDIT: updated my answer to reflect the comment by @njuffa.
(A variation on @RobertCrovella's answer)
NVCC is not willing to accept two extern __shared__
arrays of the same name but different types - even if they're never in each other's scope. We'll need to satisfy NVCC by having our template instances all use the same type for the shared memory under the hood, while letting the kernel code using them see the type it likes.
So we replace this instruction:
extern __shared__ T smem[];
with this one:
auto smem = shared_memory_proxy<T>();
where:
template <typename T>
__device__ T* shared_memory_proxy()
{
// do we need an __align__() here? I don't think so...
extern __shared__ unsigned char memory[];
return reinterpret_cast<T*>(memory);
}
is in some device-side code include file.
Advantages:
extern
, or alignment specifiers, or a reinterpret cast etc.edit: This is implemented as part of my CUDA kernel author's tools header-only library: shared_memory.cuh
(where it's named shared_memory::dynamic::proxy()
).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With