
How is a CUDA kernel launched?

I have created a simple CUDA application to add two matrices. It compiles fine. I want to know how the kernel will be launched by all the threads and what the flow inside CUDA will be. I mean, in what fashion will every thread execute each element of the matrices?

I know this is a very basic concept, but I don't understand it yet. I am confused about the flow.

asked Aug 29 '12 by ATG


People also ask

How do CUDA kernels work?

In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.
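
For instance, here is a minimal sketch of that host/device split (the kernel name and sizes are purely illustrative): the kernel is marked __global__, the host launches it with the <<<blocks, threads>>> syntax, and then copies the result back.

#include <stdio.h>
#include <cuda_runtime.h>

//a trivial kernel: each thread writes its own global index into out
__global__ void fill_indices(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main()
{
    const int n = 64;
    int *d_out;
    cudaMalloc((void**)&d_out, n * sizeof(int));

    //host code launches the kernel: 2 blocks of 32 threads cover all 64 elements
    fill_indices<<<2, 32>>>(d_out, n);
    cudaDeviceSynchronize();                 //wait for the device to finish

    int h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_out[63] = %d\n", h_out[63]);

    cudaFree(d_out);
    return 0;
}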

Can a CUDA kernel launch another kernel?

CUDA Dynamic Parallelism Under dynamic parallelism, one kernel may launch another kernel, and that kernel may launch another, and so on. Each subordinate launch is considered a new “nesting level,” and the total number of levels is the “nesting depth” of the program.
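
A minimal sketch of dynamic parallelism (the kernels are hypothetical; this needs compute capability 3.5 or newer and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void child_kernel(int parent_block)
{
    printf("child thread %d launched by parent block %d\n", threadIdx.x, parent_block);
}

__global__ void parent_kernel()
{
    //nesting level 1 launches nesting level 2: one child grid per parent block
    if (threadIdx.x == 0)
        child_kernel<<<1, 4>>>(blockIdx.x);
}

int main()
{
    parent_kernel<<<2, 32>>>();
    cudaDeviceSynchronize();     //waits for the parent grid and all child grids
    return 0;
}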

What is cudaLaunchKernel?

According to the CUDA docs, cudaLaunchKernel is called to launch a device function, which, in short, is code that is run on a GPU device. The profiler, therefore, states that a lot of computation is run on the GPU (as you probably expected), and this requires the data structures to be transferred to the device.
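
As a rough sketch (the kernel and variable names are made up), the usual triple-chevron launch boils down to a cudaLaunchKernel call that takes a grid size, a block size, and an array of pointers to the kernel arguments:

#include <cuda_runtime.h>

__global__ void scale(float *data, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;
}

int main()
{
    const int n = 256;
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    float factor = 2.0f;
    void *args[] = { &d_data, &factor };          //pointers to each kernel argument

    //equivalent to: scale<<<n / 64, 64>>>(d_data, factor);
    cudaLaunchKernel((void*)scale, dim3(n / 64), dim3(64), args, 0, 0);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}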

Can CUDA run on CPU?

The CUDA code cannot run directly on the CPU but can be emulated. Threads are computed in parallel as part of a vectorized loop.


2 Answers

You launch a grid of blocks.

Blocks are indivisibly assigned to multiprocessors (where the number of blocks on the multiprocessor determines the amount of available shared memory).

Blocks are further split into warps. On a Fermi GPU a warp is 32 threads that either execute the same instruction or are inactive (because they branched away, e.g. by exiting a loop earlier than their neighbors within the same warp, or by not taking an if that the others did). On a Fermi GPU at most two warps are executed on one multiprocessor at a time (it has two warp schedulers).

Whenever there is latency (that is, execution stalls waiting for a memory access or a data dependency to complete), another warp is run. The number of warps that fit onto one multiprocessor - of the same or different blocks - is determined by the number of registers used by each thread and the amount of shared memory used by each block.

This scheduling happens transparently; you do not have to think about it too much. However, you might want to use the predefined integer vectors threadIdx (where is my thread within the block?), blockDim (how large is one block?), blockIdx (where is my block in the grid?) and gridDim (how large is the grid?) to split up the work (read: input and output) among the threads. You might also want to read up on how to access the different types of memory effectively (so that multiple threads can be serviced within a single transaction) - but that's leading off topic.
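
As a sketch of how those variables are typically combined (the kernel name is hypothetical): each thread derives a global index from blockIdx, blockDim and threadIdx, and a grid-stride loop uses gridDim so the same kernel covers any input size with any launch configuration.

//each thread starts at its global index and strides by the total thread count
__global__ void add_arrays(const float *a, const float *b, float *c, int n)
{
    int start  = blockIdx.x * blockDim.x + threadIdx.x;   //where is my thread in the grid?
    int stride = gridDim.x * blockDim.x;                  //total number of threads in the grid

    for (int i = start; i < n; i += stride)
        c[i] = a[i] + b[i];
}

//launch example: 4 blocks of 256 threads, regardless of n
//add_arrays<<<4, 256>>>(d_a, d_b, d_c, n);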

NSight provides a graphical debugger that gives you a good idea of what's happening on the device once you get through the jargon jungle. The same goes for its profiler regarding the things you won't see in the debugger (e.g. stall reasons or memory pressure).

You can synchronize all threads within the grid (all there are) by another kernel launch. For non-overlapping, sequential kernel execution no further synchronization is needed.
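
A small sketch of that pattern (the kernels are hypothetical): if the second pass needs results the first pass wrote to global memory, issuing them as two launches on the same stream gives you exactly that grid-wide synchronization point.

#include <cuda_runtime.h>

__global__ void pass_one(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = (float)i;
}

__global__ void pass_two(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;              //safely reads what pass_one wrote
}

int main()
{
    const int n = 1024;
    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));

    pass_one<<<n / 256, 256>>>(d, n);     //first kernel: all threads of the grid run
    pass_two<<<n / 256, 256>>>(d, n);     //second launch on the same stream starts only
                                          //after pass_one has completely finished
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}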

The threads within one grid (or one kernel run - however you want to call it) can communicate via global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
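
For example, a hedged sketch of grid-wide communication through global memory with an atomic (the names are hypothetical): threads from all blocks add into one counter, and the atomic guarantees no update is lost.

#include <cuda_runtime.h>

//every thread of every block contributes to the same global counter
__global__ void count_positive(const float *data, int n, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(counter, 1);    //safe even when threads of different blocks collide
}

//launched e.g. as: count_positive<<<(n + 255) / 256, 256>>>(d_data, n, d_counter);
//with *d_counter initialized to 0 via cudaMemset beforehand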

You can synchronize all threads within one block with the intrinsic instruction __syncthreads() (all threads will be active afterwards - although, as always, at most two warps run on a Fermi multiprocessor at a time). The threads within one block can communicate via shared or global memory using atomic operations (for arithmetic) or appropriate memory fences (for load or store access).
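
Here is a sketch of that block-level cooperation (the kernel is hypothetical): the threads of a block stage data in shared memory, and __syncthreads() separates the write phase from the read phases of a simple per-block sum.

//each block sums 256 of its input elements into one partial result
__global__ void block_sum(const float *in, float *block_results, int n)
{
    __shared__ float cache[256];                     //shared by the threads of this block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                 //all 256 values written before anyone reads

    //tree reduction inside the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                             //every step needs a full-block barrier
    }

    if (tid == 0)
        block_results[blockIdx.x] = cache[0];        //one result per block
}

//launched with blockDim.x == 256, e.g.:
//block_sum<<<(n + 255) / 256, 256>>>(d_in, d_partial, n);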

As mentioned earlier, all threads within a warp are always "synchronized", although some might be inactive. They can communicate through shared or global memory (or "lane swapping" on upcoming hardware with compute capability 3). You can use atomic operations (for arithmetic) and volatile-qualified shared or global variables (load or store access happening sequentially within the same warp). The volatile qualifier tells the compiler to always access memory and never rely on registers, whose state cannot be seen by other threads.
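
A sketch of that Fermi-era warp-synchronous idiom (illustrative only; on newer architectures you would use warp shuffles and __syncwarp instead): the volatile qualifier on the shared array is what makes the unsynchronized final steps of a reduction safe within one warp.

//final stage of a block reduction: the last 64 partial sums are combined by
//the first warp with no __syncthreads(), relying on warp-synchronous execution;
//volatile forces every access to go to shared memory instead of a register
__device__ void warp_reduce(volatile float *cache, int tid)
{
    cache[tid] += cache[tid + 32];
    cache[tid] += cache[tid + 16];
    cache[tid] += cache[tid + 8];
    cache[tid] += cache[tid + 4];
    cache[tid] += cache[tid + 2];
    cache[tid] += cache[tid + 1];
}

//called from a reduction kernel once only 64 partial sums are left, e.g.:
//if (tid < 32) warp_reduce(cache, tid);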

Further, there are warp-wide vote functions that can help you make branch decisions or compute integer (prefix) sums.
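
A small sketch of those vote functions (Fermi-era intrinsics; current CUDA prefers the _sync variants such as __ballot_sync, and the kernel here is hypothetical): a warp decides collectively whether it has any work, and counts how many of its lanes satisfy a predicate.

__global__ void vote_demo(const float *data, int *warp_counts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (data[i] > 0.0f);

    if (!__any(pred))                    //no lane in this warp has work...
        return;                          //...so the whole warp branches away together

    unsigned ballot = __ballot(pred);    //bit k is set if lane k's predicate is true
    int count = __popc(ballot);          //population count = positives in this warp

    if ((threadIdx.x & 31) == 0)         //lane 0 of each warp records the result
        warp_counts[i / 32] = count;
}

//launched e.g. as: vote_demo<<<(n + 255) / 256, 256>>>(d_data, d_warp_counts, n);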

OK, that's basically it. Hope that helps. Had a good flow writing :-).

answered Oct 12 '22 by Dude


Let's take the example of adding two 4*4 matrices: you have two matrices A and B, each of dimension 4*4.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void add_matrices(float *ad, float *bd, float *cd, int N);   //forward declaration; the kernel is defined below main

int main()
{
 float *a, *b, *c;          //To store matrices A & B in RAM. Result will be stored in matrix C
 float *ad, *bd, *cd;       //To store the matrices in the GPU's RAM
 int N = 4;                 //No of rows and columns
 int i, j;                  //Loop indices

 size_t size=sizeof(float)* N * N;

 a=(float*)malloc(size);     //Allocate space in RAM for matrix A
 b=(float*)malloc(size);     //Allocate space in RAM for matrix B
 c=(float*)malloc(size);     //Allocate space in RAM for the result matrix C

//allocate memory on device
  cudaMalloc((void**)&ad, size);
  cudaMalloc((void**)&bd, size);
  cudaMalloc((void**)&cd, size);

//initialize host memory with its own indices
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            a[i * N + j] =  (float)(i * N + j);
            b[i * N + j] = -(float)(i * N + j);
        }
    }

//copy data from host memory to device memory
     cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
     cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

//calculate execution configuration 
   dim3 grid (1, 1, 1); 
   dim3 block (16, 1, 1);

//each block contains N * N threads, each thread calculates 1 data element

    add_matrices<<<grid, block>>>(ad, bd, cd, N);

   cudaMemcpy(c,cd,size,cudaMemcpyDeviceToHost);  
   printf("Matrix A was---\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            printf("%f ",a[i*N+j]);
        printf("\n");
    }

   printf("\nMatrix B was---\n");
   for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            printf("%f ",b[i*N+j]);
        printf("\n");
    }

    printf("\nAddition of A and B gives C----\n");
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
            printf("%f ",c[i*N+j]);   //if correctly evaluated, all values will be 0
        printf("\n");
    }



    //deallocate host and device memories
    cudaFree(ad); 
    cudaFree(bd); 
    cudaFree (cd);

    free(a);
    free(b);
    free(c);

    getchar();      //wait for a key press (the original used the non-standard getch())
    return 0;
}

/////Kernel Part

__global__ void add_matrices(float *ad,float *bd,float *cd,int N)
{
  int index;
  index = blockIdx.x * blockDim.x + threadIdx.x;    //global thread index: one thread per matrix element

  cd[index] = ad[index] + bd[index];
}

Now let's walk through how this example works, step by step.

First of all, you have to decide your thread configuration. You launch a kernel function, which performs the parallel computation of your matrix addition and gets executed on your GPU device.

Now, one grid is launched per kernel launch. A grid can have at most 65,535 blocks per dimension, and the blocks can be arranged in up to three dimensions (65535 * 65535 * 65535).

Every block in the grid can have at most 1024 threads, and those threads can also be arranged in up to three dimensions (with per-dimension limits of 1024 * 1024 * 64).

Now our problem is the addition of two 4 * 4 matrices (16 elements each):

A | 1  2  3  4 |        B | 1  2  3  4 |      C| 1  2  3  4 |
  | 5  6  7  8 |   +      | 5  6  7  8 |   =   | 5  6  7  8 | 
  | 9 10 11 12 |          | 9 10 11 12 |       | 9 10 11 12 |  
  | 13 14 15 16|          | 13 14 15 16|       | 13 14 15 16|

We need 16 threads to perform the computation.

i.e. A(1,1) + B (1,1) = C(1,1)
     A(1,2) + B (1,2) = C(1,2) 
     .        .          .
     .        .          . 
     A(4,4) + B (4,4) = C(4,4) 

All these threads will get executed simultaneously, so we need a block with 16 threads. For our convenience we will arrange the threads as (16 * 1 * 1) in a block. Since the number of threads is 16, we need only one block to hold those 16 threads.

So the grid configuration will be dim3 grid(1,1,1), i.e. the grid will have only one block, and the block configuration will be dim3 block(16,1,1), i.e. the block will have 16 threads arranged along one dimension.

The program above should give you a clear idea of its execution. Understanding the indexing part (i.e. threadIdx, blockDim, blockIdx) is the important bit. You need to go through the CUDA literature; once you have a clear idea about indexing, you will have won half the battle! So spend some time with CUDA books, different algorithms and paper and pencil, of course!
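
As an aside, if the matrices were bigger than one block can cover (say 1024 * 1024), you would spread the elements over many blocks. Here is a hedged sketch of a 2D configuration (the kernel name is illustrative, and the host fragment assumes ad, bd, cd were allocated with the larger size):

__global__ void add_matrices_2d(float *ad, float *bd, float *cd, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        cd[row * N + col] = ad[row * N + col] + bd[row * N + col];
}

//host side: 16*16 = 256 threads per block, enough blocks to cover the whole matrix
int N = 1024;
dim3 block(16, 16, 1);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y, 1);
add_matrices_2d<<<grid, block>>>(ad, bd, cd, N);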

answered Oct 12 '22 by sandeep.ganage