
CUDA gridDim and blockDim

Tags:

cuda


  • blockDim.x,y,z gives the number of threads in a block, in the particular direction
  • gridDim.x,y,z gives the number of blocks in a grid, in the particular direction
  • blockDim.x * gridDim.x gives the number of threads in a grid (in the x direction, in this case)

Block and grid variables can be 1, 2, or 3 dimensional. It's common practice when handling 1-D data to create only 1-D blocks and grids.
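For 2-D data, the natural counterpart is a 2-D grid of 2-D blocks. Here is a minimal, self-contained sketch; the kernel name and array dimensions are hypothetical, chosen just for illustration:

#include <stdio.h>

// Each thread computes its (x, y) coordinate from the built-in variables,
// then flattens it to a 1-D offset into a row-major array.
__global__ void scale2d(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row

    if (x < width && y < height)        // guard threads that fall past the edges
        data[y * width + x] *= 2.0f;    // row-major flattening
}

int main(void)
{
    const int width = 100, height = 70;
    float *d_data;
    cudaMalloc((void**)&d_data, width * height * sizeof(float));
    cudaMemset(d_data, 0, width * height * sizeof(float));

    // 16x16 threads per block; grid rounded up to cover the whole array
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scale2d<<<grid, block>>>(d_data, width, height);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}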

In the CUDA documentation, these variables are defined here.

In particular, when the total number of threads in the x-dimension (gridDim.x * blockDim.x) is less than the size of the array I wish to process, it's common practice to create a loop and have the grid of threads move through the entire array. In this case, after processing one loop iteration, each thread must move to the next unprocessed location, which is given by tid += blockDim.x * gridDim.x; in effect, the entire grid of threads jumps through the 1-D array of data, a grid-width at a time. This topic, sometimes called a "grid-striding loop", is further discussed in this blog article.
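Put together, a grid-striding loop looks like the following sketch; the kernel name and its parameters are hypothetical, and it simply doubles every element of an array:

__global__ void scale(float *data, int n)
{
    // unique starting index of this thread within the grid
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // total number of threads in the grid (x-direction) = the stride
    int stride = blockDim.x * gridDim.x;

    // the whole grid jumps through the array, a grid-width at a time
    for (int i = tid; i < n; i += stride)
        data[i] *= 2.0f;
}

This works for any n, regardless of the grid and block dimensions chosen at launch.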

You might want to consider taking a couple of the introductory CUDA webinars available on the NVIDIA webinar page. For example, these 2:

  • GPU Computing using CUDA C – An Introduction (2010) An introduction to the basics of GPU computing using CUDA C. Concepts will be illustrated with walkthroughs of code samples. No prior GPU computing experience is required.
  • GPU Computing using CUDA C – Advanced 1 (2010) First-level optimization techniques such as global memory optimization and processor utilization. Concepts will be illustrated using real code examples.

It would be 2 hours well spent if you want to understand these concepts better.

The general topic of grid-striding loops is covered in some detail here.


Paraphrased from the CUDA Programming Guide:

gridDim: This variable contains the dimensions of the grid.

blockIdx: This variable contains the block index within the grid.

blockDim: This variable contains the dimensions of the block.

threadIdx: This variable contains the thread index within the block.
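To see these variables in action, here is a minimal, self-contained sketch that prints them from the device (device-side printf requires compute capability 2.0 or later):

#include <stdio.h>

__global__ void show(void)
{
    printf("gridDim=(%u,%u,%u) blockIdx=(%u,%u,%u) blockDim=(%u,%u,%u) threadIdx=(%u,%u,%u)\n",
           gridDim.x, gridDim.y, gridDim.z,
           blockIdx.x, blockIdx.y, blockIdx.z,
           blockDim.x, blockDim.y, blockDim.z,
           threadIdx.x, threadIdx.y, threadIdx.z);
}

int main(void)
{
    show<<<2, 3>>>();          // 2 blocks of 3 threads: 6 lines of output
    cudaDeviceSynchronize();   // wait for the device-side printf to flush
    return 0;
}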

You seem to be a bit confused about the thread hierarchy that CUDA has; in a nutshell, for a kernel there will be 1 grid (which I always visualize as a 3-dimensional cube). Each of its elements is a block, such that a grid declared as dim3 grid(10, 10, 2); would have 10*10*2 = 200 total blocks. In turn, each block is a 3-dimensional cube of threads.
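As a quick sanity check of that block count (the block dimensions below are hypothetical, chosen just for illustration):

#include <stdio.h>

int main(void)
{
    dim3 grid(10, 10, 2);   // the grid from the example above
    dim3 block(8, 8, 4);    // a hypothetical 3-D block of threads

    printf("blocks in grid:    %u\n", grid.x * grid.y * grid.z);      // 200
    printf("threads per block: %u\n", block.x * block.y * block.z);   // 256
    return 0;
}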

With that said, it's common to use only the x-dimension of the blocks and grid, which is what the code in your question appears to be doing. This is especially relevant if you're working with 1-D arrays. In that case, threadIdx.x + blockIdx.x * blockDim.x gives each thread a unique index within your grid, while the blockDim.x * gridDim.x added in your tid += line is the total number of threads in the grid. This is because blockDim.x is the size of each block and gridDim.x is the total number of blocks.

So if you launch a kernel with parameters

dim3 block_dim(128,1,1);
dim3 grid_dim(10,1,1);
kernel<<<grid_dim,block_dim>>>(...);

then in your kernel, threadIdx.x + blockIdx.x*blockDim.x would effectively give you:

threadIdx.x ranging over [0, 128)

blockIdx.x ranging over [0, 10)

blockDim.x equal to 128

gridDim.x equal to 10

Hence, in calculating threadIdx.x + blockIdx.x*blockDim.x, you would have values within the range defined by [0, 128) + 128 * [0, 10), which means your tid values would range over {0, 1, 2, ..., 1279}. This is useful when you want to map threads to tasks, as it provides a unique identifier for every thread in your kernel.
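For example, a one-thread-per-element mapping using that unique identifier might look like this sketch (the kernel name and parameters are hypothetical; note the bounds check for when the array size is not a multiple of 1280):

__global__ void fill(int *out, int n)
{
    // unique in [0, 1280) for a <<<10, 128>>> launch
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < n)          // threads whose tid falls past the end do nothing
        out[tid] = tid;
}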

However, if you have

int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid += blockDim.x * gridDim.x;

then you'll essentially have tid = [0, 128) + 128 * [0, 10) + (128 * 10), and your tid values would range over {1280, 1281, ..., 2559}. I'm not sure where that would be relevant, but it all depends on your application and how you map your threads to your data. This mapping is central to any kernel launch: when you launch your kernel you specify the grid and block dimensions, and you're the one who has to enforce the mapping to your data inside your kernel. Any mapping works as long as you don't exceed your hardware limits (for modern cards, a maximum of 2^10 threads per block and 2^31 - 1 blocks in the x-dimension of the grid).
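Rather than hard-coding those limits, you can query them at runtime with cudaGetDeviceProperties; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size:         %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}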


In this source code, even though we have only 4 threads, the kernel function can access all 10 elements of the array. How?

#include <stdio.h>

#define N 10 //(33*1024)

__global__ void add(int *c){
    // unique starting index of this thread within the grid
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // grid-stride loop: 4 threads cover all N elements
    while( tid < N )
    {
        c[tid] = 1;
        tid += blockDim.x * gridDim.x;
    }
}

int main(void)
{
    int c[N];
    int *dev_c;
    cudaMalloc( (void**)&dev_c, N*sizeof(int) );

    for(int i=0; i<N; ++i)
    {
        c[i] = -1;
    }

    cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

    add<<< 2, 2>>>(dev_c);
    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );

    for(int i=0; i< N; ++i)
    {
        printf("c[%d] = %d \n" ,i, c[i] );
    }

    cudaFree( dev_c );
    return 0;
}

Why don't we create 10 threads, e.g. add<<<2,5>>> or add<<<5,2>>>? Because we have to create a reasonably small number of threads, since N may be much larger than 10, e.g. 33*1024.

This source code is an example of that case: the array has 10 elements, but there are only 4 CUDA threads. How can all 10 elements be accessed by only 4 threads?

See the page about the meaning of threadIdx, blockIdx, blockDim, and gridDim in the CUDA documentation for details.

In this source code:

gridDim.x : 2    (the number of blocks in the x-direction)

gridDim.y : 1    (the number of blocks in the y-direction)

blockDim.x : 2   (the number of threads per block in the x-direction)

blockDim.y : 1   (the number of threads per block in the y-direction)

Our total number of threads is 4, because 2 * 2 (blocks * threads per block).

In the add kernel function, the four threads get initial indices 0, 1, 2, 3:

tid = threadIdx.x + blockIdx.x * blockDim.x

① 0 + 0*2 = 0

② 1 + 0*2 = 1

③ 0 + 1*2 = 2

④ 1 + 1*2 = 3

How do we access the remaining indices 4, 5, 6, 7, 8, 9? Through the calculation in the while loop:

tid += blockDim.x * gridDim.x

** first thread (tid = 0) **

iteration 1: 0 + 2*2 = 4

iteration 2: 4 + 2*2 = 8

iteration 3: 8 + 2*2 = 12 (12 >= N, so the while loop exits)

** second thread (tid = 1) **

iteration 1: 1 + 2*2 = 5

iteration 2: 5 + 2*2 = 9

iteration 3: 9 + 2*2 = 13 (13 >= N, so the while loop exits)

** third thread (tid = 2) **

iteration 1: 2 + 2*2 = 6

iteration 2: 6 + 2*2 = 10 (10 >= N, so the while loop exits)

** fourth thread (tid = 3) **

iteration 1: 3 + 2*2 = 7

iteration 2: 7 + 2*2 = 11 (11 >= N, so the while loop exits)

So every index 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 is reached by some tid value.

Refer to this page: http://study.marearts.com/2015/03/to-process-all-arrays-by-reasonably.html