
Cuda block/grid dimensions: when to use dim3?

Tags:

cuda

gpu

I need some clearing up regarding the use of dim3 to set the number of threads in my CUDA kernel.

I have an image in a 1D float array, which I'm copying to the device with:

checkCudaErrors(cudaMemcpy( img_d, img.data, img.row * img.col * sizeof(float), cudaMemcpyHostToDevice));

Now I need to set the grid and block sizes to launch my kernel:

dim3 blockDims(512);
dim3 gridDims((unsigned int) ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<< gridDims, blockDims>>>(...)

I'm wondering: in this case, since the data is 1D, does it matter if I use a dim3 structure? Any benefits over using

unsigned int num_blocks = ceil(img.row * img.col * 3 / 512);
myKernel<<<num_blocks, 512>>>(...)

instead?

Also, is my understanding correct that when using dim3, I'll reference the thread ID with 2 indices inside my kernel:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

And when I'm not using dim3, I'll just use one index?

Thank you very much!

asked Jun 30 '15 by user2121792




1 Answer

The way you arrange the data in memory is independent of how you configure the threads of your kernel.

Memory is always a 1D contiguous space of bytes. The access pattern, however, depends on how you interpret your data and on whether you access it with 1D, 2D, or 3D blocks of threads.

dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
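For instance (a small sketch), components left out default to 1, and a bare integer in a launch configuration is implicitly converted to a dim3:

```cuda
dim3 blockDims(512);      // same as dim3(512, 1, 1)
dim3 gridDims(64, 2);     // same as dim3(64, 2, 1)

// These two launches are therefore identical:
myKernel<<<dim3(64, 1, 1), dim3(512, 1, 1)>>>(...);
myKernel<<<64, 512>>>(...);
```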

The same happens for the blocks and the grid.

Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#dim3

So in both cases, dim3 blockDims(512); and myKernel<<<num_blocks, 512>>>(...), you always have access to threadIdx.y and threadIdx.z inside the kernel.
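As a quick sketch (the kernel name here is made up for illustration), a kernel launched with a purely 1D configuration can still read the y and z components; they are simply 0, and blockDim.y and blockDim.z are 1:

```cuda
#include <cstdio>

__global__ void whereAmI()
{
    // Launched e.g. as whereAmI<<<4, 512>>>():
    // threadIdx.y, threadIdx.z, blockIdx.y, blockIdx.z are all 0,
    // and blockDim.y == blockDim.z == 1.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("y=%u z=%u blockDim.y=%u\n",
               threadIdx.y, threadIdx.z, blockDim.y);
}
```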

Since thread ids start at zero, you can compute a memory position in row-major order, using the y dimension as well:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

int gid = img.col * y + x; 

In the 1D case this reduces to gid = x, because blockIdx.y and threadIdx.y are zero.
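Putting it together for the 1D case, here is a minimal sketch (the kernel body, the n parameter, and the round-up idiom are illustrative assumptions, not from the question):

```cuda
// n = img.row * img.col * 3 elements in the flat array
__global__ void myKernel(float* img, int cols, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // 0 under a 1D launch
    int gid = cols * y + x;                         // reduces to gid = x

    if (gid < n)          // guard: the rounded-up grid may overshoot n
        img[gid] *= 2.0f; // example: scale every element
}

// Host side: integer round-up avoids the float ceil() and the
// truncating integer division in the question's grid computation.
int n = img.row * img.col * 3;
dim3 blockDims(512);
dim3 gridDims((n + blockDims.x - 1) / blockDims.x);
myKernel<<<gridDims, blockDims>>>(img_d, img.col, n);
```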

To sum up: functionally it does not matter whether you use a dim3 structure, since a single integer is implicitly converted to a dim3 with the remaining components set to 1. Using dim3 does make the thread configuration explicit, and the 1D, 2D, or 3D access pattern depends on how you interpret your data and on whether you access it with 1D, 2D, or 3D blocks of threads.

answered Oct 17 '22 by pQB