
Calling a kernel from a kernel

Tags:

cuda

A follow-up question to: CUDA: Calling a __device__ function from a kernel

I'm trying to speed up a sort operation. A simplified pseudo version follows:

// some costly swap operation
__device__ void swap(float* ptrA, float* ptrB){
  float saveData;         // swap some
  saveData = *ptrA;       //   big complex
  *ptrA = *ptrB;          //     data chunk
  *ptrB = saveData;
}

// a rather simple sort operation
__global__ void sort(float data[]){
  for (int i = 0; i < limit; i++){
    // find left swap point
    // find right swap point
    swap<<<1,1>>>(left, right);
  }
}

(Note: This simple version doesn't show the reduction techniques in the blocks.) The idea is that it is easy (fast) to identify the swap points, while the swap operation itself is costly (slow). So: use one block to find/identify the swap points, and use other blocks to do the swap operations, i.e., do the actual swapping in parallel. This sounds like a decent plan. But if the compiler inlines the device calls, then there is no parallel swapping taking place. Is there a way to tell the compiler NOT to inline a device call?

asked Jul 31 '12 by Doug


People also ask

Can a CUDA kernel call another kernel?

Dynamic Parallelism in CUDA 5.0 enables a CUDA kernel to create and synchronize new nested work, using the CUDA runtime API to launch other kernels, optionally synchronize on kernel completion, perform device memory management, and create and use streams and events, all without CPU involvement.
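A minimal sketch of that pattern (the kernel names here are hypothetical; note that device-side cudaDeviceSynchronize() belonged to the original CUDA 5.x dynamic-parallelism API and has since been removed from recent toolkits):

__global__ void childKernel(float *data){
    // nested work runs here as its own grid
}

__global__ void parentKernel(float *data){
    childKernel<<<1, 32>>>(data);   // device-side launch, no CPU involvement
    cudaDeviceSynchronize();        // device-side wait; CUDA 5.x API, removed in CUDA 12
}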

What is kernel invocation?

Invoking a GPU kernel is very similar to calling a function. CUDA offers the Chevron Syntax to configure and execute a kernel. The following is an example of a kernel invocation. The "<<<, >>>" brackets contain configuration parameters that specify the degree of parallelism at runtime.
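(Sketch only; the kernel name, arguments, and launch sizes below are hypothetical.)

// a representative kernel invocation
__global__ void myKernel(float *data, int n){ /* ... */ }

void launch(float *d_data, int n){
    dim3 blocks(16);                            // 16 blocks in the grid
    dim3 threads(256);                          // 256 threads per block
    myKernel<<<blocks, threads>>>(d_data, n);   // chevron syntax sets the parallelism
}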

Are kernel calls asynchronous?

Kernel calls are asynchronous from the point of view of the CPU, so if you call two kernels in succession, the second one will be launched without the CPU waiting for the first one to finish.
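For instance (hypothetical kernels; an explicit synchronization is needed before the host reads any results):

__global__ void kernelA(float *d){ /* ... */ }   // hypothetical kernels
__global__ void kernelB(float *d){ /* ... */ }

void run(float *d_data){
    kernelA<<<64, 128>>>(d_data);   // returns to the CPU immediately
    kernelB<<<64, 128>>>(d_data);   // enqueued without the CPU waiting for kernelA
    cudaDeviceSynchronize();        // CPU blocks here until both kernels have finished
}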

How does CUDA achieve parallelism?

In CUDA Dynamic Parallelism, a parent grid launches kernels called child grids. A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size. Note that every thread that encounters a kernel launch executes it.
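That last point matters in practice: if a whole block reaches a launch statement, every thread spawns its own child grid. A common pattern (sketched here with hypothetical kernels) is to guard the launch so only a single thread performs it:

__global__ void child(float *d){ /* ... */ }   // hypothetical child kernel

__global__ void parent(float *d){
    // without this guard, every thread of the parent grid
    // would launch its own copy of the child grid
    if (threadIdx.x == 0 && blockIdx.x == 0){
        child<<<1, 32>>>(d);
    }
}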


2 Answers

It has been a long time since this question was asked. When I googled the same problem, I landed on this page, and it seems I found the solution.

Solution:

I ended up [here][1] somehow and saw this neat approach to launching a kernel from within another kernel.

// child kernel: does the actual work on the data
__global__ void kernel_child(float *var1, int N){
    //do data operations here
}

// parent kernel: launches the child kernel from device code
__global__ void kernel_parent(float *var1, int N)
{
    kernel_child<<<1,2>>>(var1,N);
}

Dynamic parallelism, introduced in CUDA 5.0, makes this possible. When compiling, make sure you target the compute_35 architecture or above.
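For completeness, here is a minimal host-side harness (the buffer size and launch configuration are hypothetical) that could drive the parent kernel above:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    const int N = 64;                        // hypothetical problem size
    float *d_var1;
    cudaMalloc(&d_var1, N * sizeof(float));  // device buffer for the kernels

    kernel_parent<<<1, 1>>>(d_var1, N);      // parent launches the child on the device

    cudaDeviceSynchronize();                 // wait for parent and child grids to finish
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_var1);
    return 0;
}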

Terminal way: You can compile and run the above parent kernel (which in turn launches the child kernel) from the terminal. Verified on a Linux machine.

$ nvcc -arch=sm_35 -rdc=true yourFile.cu
$ ./a.out

Hope it helps. Thank you!

[1]: http://developer.download.nvidia.com/assets/cuda/docs/TechBrief_Dynamic_Parallelism_in_CUDA_v2.pdf

answered Oct 12 '22 by Nabin


Edit (2016):

Dynamic parallelism was introduced in the second generation of Kepler architecture GPUs. Launching kernels from the device is supported on devices of compute capability 3.5 and higher.
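For reference, a small host-side check (a sketch using the standard cudaGetDeviceProperties call) can verify that the installed GPU meets that requirement:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    bool ok = (prop.major > 3) || (prop.major == 3 && prop.minor >= 5);
    printf("compute capability %d.%d -> dynamic parallelism %s\n",
           prop.major, prop.minor, ok ? "supported" : "not supported");
    return 0;
}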


Original Answer:

You will have to wait until the end of the year when the next generation of hardware is available. No current CUDA devices can launch kernels from other kernels - it is presently unsupported.

answered Oct 13 '22 by talonmies