
Calling a kernel from a kernel

Tags:

cuda

A follow-up question to: CUDA: Calling a __device__ function from a kernel

I'm trying to speed up a sort operation. A simplified pseudo version follows:

// some costly swap operation
__device__ void swap(float* ptrA, float* ptrB){
  float saveData;         // swap some
  saveData = *ptrA;       //   big complex
  *ptrA = *ptrB;          //     data chunk
  *ptrB = saveData;
}

// a rather simple sort operation
__global__ void sort(float data[]){
  for (int i = 0; i < limit; i++){
    // find left swap point
    // find right swap point
    swap<<<1,1>>>(left, right);
  }
}

(Note: This simple version doesn't show the reduction techniques in the blocks.) The idea is that it is easy (fast) to identify the swap points, while the swap operation itself is costly (slow). So: use one block to find/identify the swap points, and use other blocks to do the swap operations, i.e., do the actual swapping in parallel. This sounds like a decent plan. But if the compiler inlines the device calls, then there is no parallel swapping taking place. Is there a way to tell the compiler NOT to inline a device call?

asked Jul 31 '12 by Doug


People also ask

Can a CUDA kernel call another kernel?

Dynamic Parallelism in CUDA 5.0 enables a CUDA kernel to create and synchronize new nested work, using the CUDA runtime API to launch other kernels, optionally synchronize on kernel completion, perform device memory management, and create and use streams and events, all without CPU involvement.
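A minimal sketch of that pattern (the kernel names here are hypothetical; note that device-side cudaDeviceSynchronize() belonged to the original CUDA 5.x dynamic-parallelism API and has since been removed from recent toolkits):

__global__ void childKernel(float *data){
    // nested work runs here as its own grid
}

__global__ void parentKernel(float *data){
    childKernel<<<1, 32>>>(data);   // device-side launch, no CPU involvement
    cudaDeviceSynchronize();        // device-side wait; CUDA 5.x API, removed in CUDA 12
}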

What is kernel invocation?

Invoking a GPU kernel is very similar to calling a function. CUDA offers the Chevron Syntax to configure and execute a kernel. The following is an example of a kernel invocation. The "<<<, >>>" brackets contain configuration parameters that specify the degree of parallelism at runtime.
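(Sketch only; the kernel name, arguments, and launch sizes below are hypothetical.)

// a representative kernel invocation
__global__ void myKernel(float *data, int n){ /* ... */ }

void launch(float *d_data, int n){
    dim3 blocks(16);                            // 16 blocks in the grid
    dim3 threads(256);                          // 256 threads per block
    myKernel<<<blocks, threads>>>(d_data, n);   // chevron syntax sets the parallelism
}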

Are kernel calls asynchronous?

Kernel calls are asynchronous from the point of view of the CPU, so if you call two kernels in succession, the second one will be launched without the CPU waiting for the first one to finish.
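For instance (hypothetical kernels; an explicit synchronization is needed before the host reads any results):

__global__ void kernelA(float *d){ /* ... */ }   // hypothetical kernels
__global__ void kernelB(float *d){ /* ... */ }

void run(float *d_data){
    kernelA<<<64, 128>>>(d_data);   // returns to the CPU immediately
    kernelB<<<64, 128>>>(d_data);   // enqueued without the CPU waiting for kernelA
    cudaDeviceSynchronize();        // CPU blocks here until both kernels have finished
}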

How does CUDA achieve parallelism?

In CUDA Dynamic Parallelism, a parent grid launches kernels called child grids. A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size. Note that every thread that encounters a kernel launch executes it.
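That last point matters in practice: if a whole block reaches a launch statement, every thread spawns its own child grid. A common pattern (sketched here with hypothetical kernels) is to guard the launch so only a single thread performs it:

__global__ void child(float *d){ /* ... */ }   // hypothetical child kernel

__global__ void parent(float *d){
    // without this guard, every thread of the parent grid
    // would launch its own copy of the child grid
    if (threadIdx.x == 0 && blockIdx.x == 0){
        child<<<1, 32>>>(d);
    }
}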


2 Answers

It has been a long time since this question was asked. When I googled the same problem, I landed on this page, and it seems I found the solution.

Solution:

I ended up [here][1] somehow and saw this neat approach to launching a kernel from within another kernel.

// child kernel: does the actual work on the data
__global__ void kernel_child(float *var1, int N){
    //do data operations here
}

// parent kernel: launches the child kernel from device code
__global__ void kernel_parent(float *var1, int N)
{
    kernel_child<<<1,2>>>(var1,N);
}

Dynamic parallelism, introduced in CUDA 5.0, makes this possible. When compiling, make sure you target the compute_35 architecture or above.
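For completeness, here is a minimal host-side harness (the buffer size and launch configuration are hypothetical) that could drive the parent kernel above:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    const int N = 64;                        // hypothetical problem size
    float *d_var1;
    cudaMalloc(&d_var1, N * sizeof(float));  // device buffer for the kernels

    kernel_parent<<<1, 1>>>(d_var1, N);      // parent launches the child on the device

    cudaDeviceSynchronize();                 // wait for parent and child grids to finish
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_var1);
    return 0;
}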

Terminal way: You can compile and run the above parent kernel (which in turn launches the child kernel) from the terminal. Verified on a Linux machine.

$ nvcc -arch=sm_35 -rdc=true yourFile.cu
$ ./a.out

Hope it helps. Thank you!

[1]: http://developer.download.nvidia.com/assets/cuda/docs/TechBrief_Dynamic_Parallelism_in_CUDA_v2.pdf

answered Oct 12 '22 by Nabin


Edit (2016):

Dynamic parallelism was introduced in the second generation of Kepler architecture GPUs. Launching kernels from the device is supported on devices of compute capability 3.5 and higher.
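For reference, a small host-side check (a sketch using the standard cudaGetDeviceProperties call) can verify that the installed GPU meets that requirement:

#include <cstdio>
#include <cuda_runtime.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    bool ok = (prop.major > 3) || (prop.major == 3 && prop.minor >= 5);
    printf("compute capability %d.%d -> dynamic parallelism %s\n",
           prop.major, prop.minor, ok ? "supported" : "not supported");
    return 0;
}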


Original Answer:

You will have to wait until the end of the year when the next generation of hardware is available. No current CUDA devices can launch kernels from other kernels - it is presently unsupported.

answered Oct 13 '22 by talonmies