I need some clearing up regarding the use of dim3 to set the number of threads in my CUDA kernel.
I have an image in a 1D float array, which I'm copying to the device with:
checkCudaErrors(cudaMemcpy( img_d, img.data, img.row * img.col * sizeof(float), cudaMemcpyHostToDevice));
Now I need to set the grid and block sizes to launch my kernel:
dim3 blockDims(512);
dim3 gridDims((unsigned int) ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<< gridDims, blockDims>>>(...)
I'm wondering: in this case, since the data is 1D, does it matter if I use a dim3 structure? Any benefits over using
unsigned int num_blocks = ceil(img.row * img.col * 3 / blockDims.x);
myKernel<<<num_blocks, 512>>>(...)
instead?
Also, is my understanding correct that when using dim3, I'll reference the thread ID with 2 indices inside my kernel:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
And when I'm not using dim3, I'll just use one index?
Thank you very much,
dim3 is an integer vector type that can be used in CUDA code. Its most common application is to pass the grid and block dimensions in a kernel invocation. It can also be used in any user code for holding values of 3 dimensions.
• dim3 gridDim; – Dimensions of the grid in blocks (gridDim.z is unused only on compute capability 1.x devices; later devices support 3D grids)
• dim3 blockDim; – Dimensions of the block in threads.
A group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads (Figure 2). Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in GPU (except during preemption, debugging, or CUDA dynamic parallelism).
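Each CUDA device has a maximum number of threads per block: 512 on compute capability 1.x and 1024 on 2.0 and later. Each thread also has a linear thread ID within its block. For a thread with index (x, y, z) in a block of size (Dx, Dy, Dz): threadId = x + y * Dx + z * Dx * Dy. The thread ID works like the 1D representation of a multidimensional array in memory.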
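As a minimal sketch of the thread ID formula above, the helper below computes the linear ID from a thread's index and the block size. The names linearThreadId and writeLinearIds are hypothetical, chosen only for illustration; inside a kernel the arguments come from the built-in threadIdx and blockDim variables.

```cuda
#include <cuda_runtime.h>

// Linear ID of thread (x, y, z) inside a block whose x and y sizes are Dx and Dy.
// Hypothetical helper for illustration; callable from host and device code.
__host__ __device__ inline unsigned int linearThreadId(
    unsigned int x, unsigned int y, unsigned int z,
    unsigned int Dx, unsigned int Dy)
{
    return x + y * Dx + z * Dx * Dy;
}

// Example kernel: each thread stores its own linear ID at that position.
// Assumes `out` holds at least blockDim.x * blockDim.y * blockDim.z elements
// and that the kernel is launched with a single block.
__global__ void writeLinearIds(unsigned int *out)
{
    unsigned int tid = linearThreadId(threadIdx.x, threadIdx.y, threadIdx.z,
                                      blockDim.x, blockDim.y);
    out[tid] = tid;
}
```

For example, in a block of size (4, 5, 6), thread (1, 2, 3) gets the ID 1 + 2 * 4 + 3 * 4 * 5 = 69, and the last thread (3, 4, 5) gets 119.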
The way you arrange the data in memory is independent of how you configure the threads of your kernel.
Memory is always a 1D contiguous space of bytes. However, the access pattern depends on how you interpret your data and on whether you access it with 1D, 2D, or 3D blocks of threads.
dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
The same happens for the blocks and the grid.
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#dim3
So, in both cases:
dim3 blockDims(512);
and
myKernel<<<num_blocks, 512>>>(...)
you will always have access to threadIdx.y and threadIdx.z.
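One way to check this yourself is to have a kernel record the launch configuration it sees. The sketch below is only an illustration: recordDims and queryDims are hypothetical names, and the grid size of 4 is an arbitrary choice. It launches the same kernel once with plain integers and once with dim3 values, so the two results can be compared.

```cuda
#include <cuda_runtime.h>

// Thread 0 of block 0 records the launch configuration as the kernel sees it.
__global__ void recordDims(unsigned int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        out[0] = gridDim.x;  out[1] = gridDim.y;  out[2] = gridDim.z;
        out[3] = blockDim.x; out[4] = blockDim.y; out[5] = blockDim.z;
    }
}

// Launches recordDims with either plain integers or dim3 values and copies
// the six recorded dimensions into `host`. Returns the first CUDA error seen.
cudaError_t queryDims(bool useDim3, unsigned int host[6])
{
    unsigned int *dev = nullptr;
    cudaError_t err = cudaMalloc((void **)&dev, 6 * sizeof(unsigned int));
    if (err != cudaSuccess) return err;

    if (useDim3) {
        dim3 gridDims(4), blockDims(512);   // y and z default to 1
        recordDims<<<gridDims, blockDims>>>(dev);
    } else {
        recordDims<<<4, 512>>>(dev);        // integers convert to dim3
    }

    err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(host, dev, 6 * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}
```

Calling queryDims(false, a) and queryDims(true, b) should fill both arrays with the same six values, with the y and z entries equal to 1.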
Because thread indices start at zero, you can still compute a memory position in row-major order using the y dimension as well:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int gid = img.col * y + x;
This still works for a 1D launch, because blockIdx.y and threadIdx.y are both zero, so gid reduces to x.
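For the 1D image in the question, a launch could be sketched as follows. The names blocksFor, scalePixels and scaleOnDevice are hypothetical, and the per-pixel operation (a simple scale) is only a placeholder. Note that if img.row and img.col are integers, the question's ceil(img.row * img.col * 3 / blockDims.x) divides integers first, so the result is already truncated before ceil is applied. Rounding up in integer arithmetic avoids that, and a bounds check in the kernel handles the partially filled last block.

```cuda
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements, rounded up with integer
// arithmetic. Hypothetical helper for illustration.
inline unsigned int blocksFor(unsigned int n, unsigned int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical 1D kernel: multiplies every float in the image by `factor`.
__global__ void scalePixels(float *img, unsigned int n, float factor)
{
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {   // the last block may extend past the end of the data
        img[gid] *= factor;
    }
}

// Launches scalePixels over n floats that already reside on the device.
cudaError_t scaleOnDevice(float *img_d, unsigned int n, float factor)
{
    if (n == 0) return cudaSuccess;   // a grid of zero blocks is not a valid launch
    const unsigned int threads = 512;
    scalePixels<<<blocksFor(n, threads), threads>>>(img_d, n, factor);
    return cudaGetLastError();
}
```

With 512 threads per block, 1000 elements need 2 blocks, 1024 elements need 2 blocks, and 1025 elements need 3.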
To sum up, it does not matter whether you use a dim3 structure. Just be clear about where the thread configuration is defined; the 1D, 2D, or 3D access pattern depends on how you interpret your data and on how you access it with 1D, 2D, or 3D blocks of threads.