
How is a CUDA kernel launched?

I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel will be launched by all the threads and what the flow inside CUDA will be. I mean, in what fashion will every thread execute each element of the matrices?

I know this is a very basic concept, but I don't understand it yet. I am confused about the flow.

asked Aug 29 '12 by ATG


People also ask

How do CUDA kernels work?

In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.
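
For instance, here is a minimal sketch of that host/device split (the kernel name and sizes are purely illustrative): the kernel is marked __global__, the host launches it with the <<<blocks, threads>>> syntax, and then copies the result back.

#include <stdio.h>
#include <cuda_runtime.h>

//a trivial kernel: each thread writes its own global index into out
__global__ void fill_indices(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main()
{
    const int n = 64;
    int *d_out;
    cudaMalloc((void**)&d_out, n * sizeof(int));

    //host code launches the kernel: 2 blocks of 32 threads cover all 64 elements
    fill_indices<<<2, 32>>>(d_out, n);
    cudaDeviceSynchronize();                 //wait for the device to finish

    int h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_out[63] = %d\n", h_out[63]);

    cudaFree(d_out);
    return 0;
}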

Can a CUDA kernel launch another kernel?

CUDA Dynamic Parallelism Under dynamic parallelism, one kernel may launch another kernel, and that kernel may launch another, and so on. Each subordinate launch is considered a new “nesting level,” and the total number of levels is the “nesting depth” of the program.
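
A minimal sketch of dynamic parallelism (the kernels are hypothetical; this needs compute capability 3.5 or newer and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void child_kernel(int parent_block)
{
    printf("child thread %d launched by parent block %d\n", threadIdx.x, parent_block);
}

__global__ void parent_kernel()
{
    //nesting level 1 launches nesting level 2: one child grid per parent block
    if (threadIdx.x == 0)
        child_kernel<<<1, 4>>>(blockIdx.x);
}

int main()
{
    parent_kernel<<<2, 32>>>();
    cudaDeviceSynchronize();     //waits for the parent grid and all child grids
    return 0;
}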

What is cudaLaunchKernel?

According to the CUDA docs, cudaLaunchKernel is called to launch a device function, which, in short, is code that is run on a GPU device. The profiler, therefore, states that a lot of computation is run on the GPU (as you probably expected), and this requires the data structures to be transferred to the device.
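
As a rough sketch (the kernel and variable names are made up), the usual triple-chevron launch boils down to a cudaLaunchKernel call that takes a grid size, a block size, and an array of pointers to the kernel arguments:

#include <cuda_runtime.h>

__global__ void scale(float *data, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;
}

int main()
{
    const int n = 256;
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    float factor = 2.0f;
    void *args[] = { &d_data, &factor };          //pointers to each kernel argument

    //equivalent to: scale<<<n / 64, 64>>>(d_data, factor);
    cudaLaunchKernel((void*)scale, dim3(n / 64), dim3(64), args, 0, 0);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}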

Can CUDA run on CPU?

The CUDA code cannot run directly on the CPU but can be emulated. Threads are computed in parallel as part of a vectorized loop.


2 Answers

You launch a grid of blocks.

Blocks are indivisibly assigned to multiprocessors (where the number of blocks on the multiprocessor determines the amount of available shared memory).

Blocks are further split into warps. On a Fermi GPU a warp is 32 threads that either execute the same instruction or are inactive (because they branched away, e.g. by exiting a loop earlier than their neighbors within the same warp, or by not taking an if that the others did). On a Fermi GPU at most two warps are executed on one multiprocessor at a time (it has two warp schedulers).

Whenever there is latency (that is, execution stalls waiting for a memory access or a data dependency to complete), another warp is run. The number of warps that fit onto one multiprocessor - of the same or different blocks - is determined by the number of registers used by each thread and the amount of shared memory used by each block.

This scheduling happens transparently; you do not have to think about it too much. However, you might want to use the predefined integer vectors threadIdx (where is my thread within the block?), blockDim (how large is one block?), blockIdx (where is my block in the grid?) and gridDim (how large is the grid?) to split up the work (read: input and output) among the threads. You might also want to read up on how to access the different types of memory effectively (so that multiple threads can be serviced within a single transaction) - but that's leading off topic.
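
As a sketch of how those variables are typically combined (the kernel name is hypothetical): each thread derives a global index from blockIdx, blockDim and threadIdx, and a grid-stride loop uses gridDim so the same kernel covers any input size with any launch configuration.

//each thread starts at its global index and strides by the total thread count
__global__ void add_arrays(const float *a, const float *b, float *c, int n)
{
    int start  = blockIdx.x * blockDim.x + threadIdx.x;   //where is my thread in the grid?
    int stride = gridDim.x * blockDim.x;                  //total number of threads in the grid

    for (int i = start; i < n; i += stride)
        c[i] = a[i] + b[i];
}

//launch example: 4 blocks of 256 threads, regardless of n
//add_arrays<<<4, 256>>>(d_a, d_b, d_c, n);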

NSight provides a graphical debugger that gives you a good idea of what's happening on the device once you get through the jargon jungle. The same goes for its profiler regarding the things you won't see in the debugger (e.g. stall reasons or memory pressure).

You can synchronize all threads within the grid (all there are) by another kernel launch. For non-overlapping, sequential kernel execution no further synchronization is needed.
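
A small sketch of that pattern (the kernels are hypothetical): if the second pass needs results the first pass wrote to global memory, issuing them as two launches on the same stream gives you exactly that grid-wide synchronization point.

#include <cuda_runtime.h>

__global__ void pass_one(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = (float)i;
}

__global__ void pass_two(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;              //safely reads what pass_one wrote
}

int main()
{
    const int n = 1024;
    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));

    pass_one<<<n / 256, 256>>>(d, n);     //first kernel: all threads of the grid run
    pass_two<<<n / 256, 256>>>(d, n);     //second launch on the same stream starts only
                                          //after pass_one has completely finished
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}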

The threads within one grid (or one kernel run - however you want to call it) can communicate via global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
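
For example, a hedged sketch of grid-wide communication through global memory with an atomic (the names are hypothetical): threads from all blocks add into one counter, and the atomic guarantees no update is lost.

#include <cuda_runtime.h>

//every thread of every block contributes to the same global counter
__global__ void count_positive(const float *data, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(counter, 1);    //safe even when threads of different blocks collide
}

//launched e.g. as: count_positive<<<(n + 255) / 256, 256>>>(d_data, n, d_counter);
//with *d_counter initialized to 0 via cudaMemset beforehand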

You can synchronize all threads within one block with the intrinsic instruction __syncthreads() (all threads will be active afterwards - although, as always, at most two warps run on a Fermi multiprocessor at a time). The threads within one block can communicate via shared or global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
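
Here is a sketch of that block-level cooperation (the kernel is hypothetical): the threads of a block stage data in shared memory, and __syncthreads() separates the write phase from the read phases of a simple per-block sum.

//each block sums 256 of its input elements into one partial result
__global__ void block_sum(const float *in, float *block_results, int n)
{
    __shared__ float cache[256];                     //shared by the threads of this block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                 //all 256 values written before anyone reads

    //tree reduction inside the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                             //every step needs a full-block barrier
    }

    if (tid == 0)
        block_results[blockIdx.x] = cache[0];        //one result per block
}

//launched with blockDim.x == 256, e.g.:
//block_sum<<<(n + 255) / 256, 256>>>(d_in, d_partial, n);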

As mentioned earlier, all threads within a warp are always "synchronized", although some might be inactive. They can communicate through shared or global memory (or "lane swapping" on upcoming hardware with compute capability 3). You can use atomic operations (for arithmetic) and volatile-qualified shared or global variables (load or store access happening sequentially within the same warp). The volatile qualifier tells the compiler to always access memory and never rely on registers, whose state cannot be seen by other threads.
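
A sketch of that Fermi-era warp-synchronous idiom (illustrative only; on newer architectures you would use warp shuffles and __syncwarp instead): the volatile qualifier on the shared array is what makes the unsynchronized final steps of a reduction safe within one warp.

//final stage of a block reduction: the last 64 partial sums are combined by
//the first warp with no __syncthreads(), relying on warp-synchronous execution;
//volatile forces every access to go to shared memory instead of a register
__device__ void warp_reduce(volatile float *cache, int tid)
{
    cache[tid] += cache[tid + 32];
    cache[tid] += cache[tid + 16];
    cache[tid] += cache[tid + 8];
    cache[tid] += cache[tid + 4];
    cache[tid] += cache[tid + 2];
    cache[tid] += cache[tid + 1];
}

//called from a reduction kernel once only 64 partial sums are left, e.g.:
//if (tid < 32) warp_reduce(cache, tid);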

Further, there are warp-wide vote functions that can help you make branch decisions or compute integer (prefix) sums.
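
A small sketch of those vote functions (Fermi-era intrinsics; current CUDA prefers the _sync variants such as __ballot_sync, and the kernel here is hypothetical): a warp decides collectively whether it has any work, and counts how many of its lanes satisfy a predicate.

__global__ void vote_demo(const float *data, int *warp_counts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (data[i] > 0.0f);

    if (!__any(pred))                    //no lane in this warp has work...
        return;                          //...so the whole warp branches away together

    unsigned ballot = __ballot(pred);    //bit k is set if lane k's predicate is true
    int count = __popc(ballot);          //population count = positives in this warp

    if ((threadIdx.x & 31) == 0)         //lane 0 of each warp records the result
        warp_counts[i / 32] = count;
}

//launched e.g. as: vote_demo<<<(n + 255) / 256, 256>>>(d_data, d_warp_counts, n);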

OK, that's basically it. Hope that helps. Had a good flow writing :-).

answered Oct 12 '22 by Dude


Let's take the example of adding two 4*4 matrices: you have two matrices A and B, each of dimension 4*4.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add_matrices(float *ad, float *bd, float *cd, int N);   //forward declaration; the kernel is defined below main

int main()
{
 float *a, *b, *c;          //To store matrices A & B in RAM. Result will be stored in matrix C
 float *ad, *bd, *cd;       //To store the matrices in the GPU's RAM
 int N = 4;                 //No of rows and columns
 int i, j;                  //Loop indices

 size_t size=sizeof(float)* N * N;

 a=(float*)malloc(size);     //Allocate space in RAM for matrix A
 b=(float*)malloc(size);     //Allocate space in RAM for matrix B
 c=(float*)malloc(size);     //Allocate space in RAM for the result matrix C

//allocate memory on device
  cudaMalloc((void**)&ad, size);
  cudaMalloc((void**)&bd, size);
  cudaMalloc((void**)&cd, size);

//initialize host memory with its own indices
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            a[i * N + j] =  (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }

//copy data from host memory to device memory
     cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
     cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

//calculate execution configuration 
   dim3 grid (1, 1, 1); 
   dim3 block (16, 1, 1);

//each block contains N * N threads, each thread calculates 1 data element

    add_matrices<<<grid, block>>>(ad, bd, cd, N);

   cudaMemcpy(c,cd,size,cudaMemcpyDeviceToHost);  
   printf("Matrix A was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            printf("%f ",a[i*N+j]);
        printf("\n");
    }

   printf("\nMatrix B was---\n");
   for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            printf("%f ",b[i*N+j]);
        printf("\n");
    }

    printf("\nAddition of A and B gives C----\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            printf("%f ",c[i*N+j]);   //if correctly evaluated, all values will be 0
        printf("\n");
    }



    //deallocate host and device memories
    cudaFree(ad); 
    cudaFree(bd); 
    cudaFree (cd);

    free(a);
    free(b);
    free(c);

    getchar();      //wait for a key press (the original used the non-standard getch())
    return 0;
}

/////Kernel Part

__global__ void add_matrices(float *ad,float *bd,float *cd,int N)
{
  int index;
  index = blockIdx.x * blockDim.x + threadIdx.x;    //global thread index: one thread per matrix element

  cd[index] = ad[index] + bd[index];
}

Now let's walk through how this example works, step by step.

First of all, you have to decide your thread configuration. You launch a kernel function, which performs the parallel computation of your matrix addition and gets executed on your GPU device.

Now, one grid is launched per kernel launch. A grid can have at most 65,535 blocks per dimension, and the blocks can be arranged in up to three dimensions (65535 * 65535 * 65535).

Every block in the grid can have at most 1024 threads, and those threads can also be arranged in up to three dimensions (with per-dimension limits of 1024 * 1024 * 64).

Now our problem is the addition of two 4 * 4 matrices (16 elements each):

A | 1  2  3  4 |        B | 1  2  3  4 |      C| 1  2  3  4 |
  | 5  6  7  8 |   +      | 5  6  7  8 |   =   | 5  6  7  8 | 
  | 9 10 11 12 |          | 9 10 11 12 |       | 9 10 11 12 |  
  | 13 14 15 16|          | 13 14 15 16|       | 13 14 15 16|

We need 16 threads to perform the computation.

i.e. A(1,1) + B (1,1) = C(1,1)
     A(1,2) + B (1,2) = C(1,2) 
     .        .          .
     .        .          . 
     A(4,4) + B (4,4) = C(4,4) 

All these threads will get executed simultaneously, so we need a block with 16 threads. For our convenience we will arrange the threads as (16 * 1 * 1) in a block. Since the number of threads is 16, we need only one block to hold those 16 threads.

So the grid configuration will be dim3 grid(1,1,1), i.e. the grid will have only one block, and the block configuration will be dim3 block(16,1,1), i.e. the block will have 16 threads arranged along one dimension.

The program above should give you a clear idea of its execution. Understanding the indexing part (i.e. threadIdx, blockDim, blockIdx) is the important bit. You need to go through the CUDA literature; once you have a clear idea about indexing, you will have won half the battle! So spend some time with CUDA books, different algorithms and paper and pencil, of course!
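
As an aside, if the matrices were bigger than one block can cover (say 1024 * 1024), you would spread the elements over many blocks. Here is a hedged sketch of a 2D configuration (the kernel name is illustrative, and the host fragment assumes ad, bd, cd were allocated with the larger size):

__global__ void add_matrices_2d(float *ad, float *bd, float *cd, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        cd[row * N + col] = ad[row * N + col] + bd[row * N + col];
}

//host side: 16*16 = 256 threads per block, enough blocks to cover the whole matrix
int N = 1024;
dim3 block(16, 16, 1);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y, 1);
add_matrices_2d<<<grid, block>>>(ad, bd, cd, N);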

answered Oct 12 '22 by sandeep.ganage