
CUDA: Calling a __device__ function from a kernel


I have a kernel that calls a device function inside an if statement. The code is as follows:

__device__ void SetValues(int *ptr, int id)
{
    if (ptr[threadIdx.x] == id) // question related to here
        ptr[threadIdx.x]++;
}

__global__ void Kernel(int *ptr)
{
    if (threadIdx.x < 2)
        SetValues(ptr, threadIdx.x);
}

In the kernel, threads 0 and 1 both call SetValues concurrently. What happens after that? There are now two concurrent calls to SetValues: does each function call execute serially, so that they behave like two separate kernel launches?

asked Apr 19 '11 06:04 by scatman




1 Answer

The CUDA compiler inlines device functions by default (although Fermi and newer architectures also support a proper ABI with function pointers and real function calls). So your example code gets compiled to something like this:

__global__ void Kernel(int *ptr)
{
    if(threadIdx.x<2)
        if(ptr[threadIdx.x]==threadIdx.x)
            ptr[threadIdx.x]++;
}

Execution happens in parallel, just like normal code. If you engineer a memory race into a function, there is no serialization mechanism that can save you.
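
To make the point above concrete, here is a minimal sketch of a complete program, assuming a CUDA-capable GPU and compilation with nvcc. The first kernel is the one from the question: each thread only reads and writes its own element ptr[threadIdx.x], so there is no race even though the calls run in parallel. BumpRacy, BumpAtomic and RaceKernel are hypothetical names added purely for illustration. They show what happens when many threads call a device function that updates the same location, and one common way (atomicAdd) to make that update safe. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device function from the question: each thread touches only its own element.
__device__ void SetValues(int *ptr, int id)
{
    if (ptr[threadIdx.x] == id)
        ptr[threadIdx.x]++;
}

__global__ void Kernel(int *ptr)
{
    if (threadIdx.x < 2)
        SetValues(ptr, threadIdx.x);
}

// Hypothetical racy helper: every calling thread performs an unsynchronized
// read-modify-write on the same address, so updates can be lost.
__device__ void BumpRacy(int *counter)
{
    counter[0]++;
}

// Hypothetical safe helper: atomicAdd makes the read-modify-write indivisible.
__device__ void BumpAtomic(int *counter)
{
    atomicAdd(&counter[0], 1);
}

__global__ void RaceKernel(int *racy, int *safe)
{
    BumpRacy(racy);   // all threads call this in parallel, no serialization
    BumpAtomic(safe); // still parallel calls, but the update itself is atomic
}

int main()
{
    // Part 1: the question's kernel, with ptr[i] == i so both branches are taken.
    int h[2] = {0, 1};
    int *d = nullptr;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    Kernel<<<1, 32>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost); // waits for the kernel
    printf("ptr = {%d, %d}\n", h[0], h[1]); // expected: ptr = {1, 2}

    // Part 2: 256 threads incrementing one shared counter.
    int hc[2] = {0, 0};
    int *dc = nullptr;
    cudaMalloc(&dc, sizeof(hc));
    cudaMemcpy(dc, hc, sizeof(hc), cudaMemcpyHostToDevice);
    RaceKernel<<<1, 256>>>(dc, dc + 1);
    cudaMemcpy(hc, dc, sizeof(hc), cudaMemcpyDeviceToHost);
    // The racy result is unspecified; the atomic result should be 256.
    printf("racy = %d, atomic = %d\n", hc[0], hc[1]);

    cudaFree(d);
    cudaFree(dc);
    return 0;
}
```

Whether the compiler inlines the helpers or emits real calls makes no difference to this behavior: each thread executes the function body on its own, and only the atomic version guarantees that no increment is lost.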

answered Oct 06 '22 17:10 by talonmies