The CUDA runtime API allows us to launch kernels using the variable-number-of-arguments triple-chevron syntax:
my_kernel<<<grid_dims, block_dims, shared_mem_size>>>(
first_arg, second_arg, and_as_many, as_we, want_to, etc, etc);
but as regards "collaborative" kernels, the CUDA Programming Guide says (section C.3):
To enable grid synchronization, when launching the kernel it is necessary to use, instead of the
<<<...>>>
execution configuration syntax, thecuLaunchCooperativeKernel
CUDA runtime launch API:cudaLaunchCooperativeKernel( const T *func, dim3 gridDim, dim3 blockDim, void **args, size_t sharedMem = 0, cudaStream_t stream = 0 )
(or the CUDA driver equivalent).
I would rather not have to write my own wrapper code for building an array of pointers... is there really no facility in the runtime API to avoid that?
FWIW you can pass arbitrary structs (not immediately obvious from API docs) by just passing it via void* args. It's not obvious that the sizeof gets computed by the compiler in this case from the function signature and the right size is copied to the kernel. The API docs don't seem to elaborate on that.
struct Param { int a, b; void* device_ptr; };
Param param{aa, bb, d_ptr};
void *kArgs = {¶m};
cudaLaunchCooperativeKernel(..., kArgs, ...);
We can use something like the following workaround (requires --std=c++11
or a more recent C++ language standard):
namespace detail {
template <typename F, typename... Args>
void for_each_argument_address(F f, Args&&... args) {
[](...){}((void)(f( (void*) &std::forward<Args>(args) ), 0)...);
}
} // namespace detail
template<typename KernelFunction, typename... KernelParameters>
inline void cooperative_launch(
const KernelFunction& kernel_function,
stream::id_t stream_id,
launch_configuration_t launch_configuration,
KernelParameters... parameters)
{
void* arguments_ptrs[sizeof...(KernelParameters)];
auto arg_index = 0;
detail::for_each_argument_address(
[&](void * x) {arguments_ptrs[arg_index++] = x;},
parameters...);
cudaLaunchCooperativeKernel<KernelFunction>(
&kernel_function,
launch_configuration.grid_dimensions,
launch_configuration.block_dimensions,
arguments_ptrs,
launch_configuration.dynamic_shared_memory_size,
stream_id);
}
Note: This uses Sean Parent's classic for_each_arg()
one-liner. See also this post about it at FluentCPP.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With