I have a templated wrapper function that calls a kernel (__global__) defined in a .cu file, like this:
template<typename T, class M>
__global__ void compute_kernel(T* input, T* output, int n) {
    M m;
    // compute stuff using m
}
template<typename T, class M>
void compute(T* input, T* output, int n) {
    // ... compute blocks, threads, etc.
    compute_kernel<T, M><<<dim_grid, dim_block>>>(input, output, n);
    // ...
}
and a header file, to be included in host code, that contains only the declaration:
template<typename T, class M>
void compute(T* input, T* output, int n);
However, when I call compute() from the host with arbitrary template parameters, the build fails at link time with undefined reference to 'void compute<...>(...)', and only if I add explicit instantiations to the end of the .cu file does the code link:
template void compute<int, Method1<int> >(int* input, int* output, int n);
template void compute<float, Method1<float> >(float* input, float* output, int n);
template void compute<int, Method2<int> >(int* input, int* output, int n);
template void compute<float, Method2<float> >(float* input, float* output, int n);
So, is it necessary to explicitly instantiate every templated function in order to make it callable from the host? (That's quite a drawback.)
Thanks for your comments!
This is a C++ FAQ, not limited to CUDA.
If you have a template implementation in a .cpp or .cu file, then when the compiler processes that translation unit it cannot possibly know which combinations of template parameters you will need, so it emits code for none of them. When you link, you therefore get undefined-reference errors.
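Here is a minimal plain-C++ sketch of the same failure (the file and function names are made up for illustration, not from the original post):

// mylib.cpp -- the template definition lives only in this translation unit
template<typename T>
T twice(T x) { return x + x; }

// Without the next line, main.cpp below fails to link with
// "undefined reference to 'int twice<int>(int)'".
template int twice<int>(int);

// main.cpp -- sees only a declaration, just like the compute() header above
template<typename T> T twice(T x);

int main() { return twice(21); }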
You could put the implementation in a header file (which must then be included only from .cu files compiled by nvcc, since it contains a kernel launch), or you will have to explicitly instantiate all the required combinations. If you need many of these, you can use a macro to generate the instantiations.
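As a sketch of the macro approach, assuming the template definitions above live in the same .cu file (the macro name INSTANTIATE_COMPUTE is made up here):

// Expands to one explicit instantiation per (type, method) pair.
#define INSTANTIATE_COMPUTE(T, M) \
    template void compute<T, M<T> >(T* input, T* output, int n);

INSTANTIATE_COMPUTE(int, Method1)
INSTANTIATE_COMPUTE(float, Method1)
INSTANTIATE_COMPUTE(int, Method2)
INSTANTIATE_COMPUTE(float, Method2)

#undef INSTANTIATE_COMPUTE

Adding a new (type, method) pair is then a one-line change instead of another hand-written instantiation.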