CUDA block/grid dimensions: when to use dim3?

Tags: cuda, gpu

I need some clarification on using dim3 to set the number of threads in my CUDA kernel.

I have an image in a 1D float array, which I'm copying to the device with:

checkCudaErrors(cudaMemcpy( img_d, img.data, img.row * img.col * sizeof(float), cudaMemcpyHostToDevice));

Now I need to set the grid and block sizes to launch my kernel:

dim3 blockDims(512);
dim3 gridDims((unsigned int) ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<< gridDims, blockDims>>>(...)

I'm wondering: in this case, since the data is 1D, does it matter if I use a dim3 structure? Any benefits over using

unsigned int num_blocks = ceil(img.row * img.col * 3 / blockDims.x);
myKernel<<<num_blocks, 512>>>(...)

instead?

Also, is my understanding correct that when using dim3, I'll reference the thread ID with 2 indices inside my kernel:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

And when I'm not using dim3, I'll just use one index?

Thank you very much.

asked Jun 30 '15 14:06

user2121792

1 Answer

The way you arrange the data in memory is independent of how you configure the threads of your kernel.

Memory is always a 1D contiguous space of bytes. The access pattern, however, depends on how you interpret your data and on whether you index it from 1D, 2D or 3D blocks of threads.

dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.

The same applies to both the block dimensions and the grid dimensions.

Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#dim3
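As a small illustration of the quoted rule, here is a host-side sketch. The helper names sameShape and launchArgsEquivalent are made up for this example; only dim3 itself comes from the CUDA headers. It checks that dim3(512) is the same shape as dim3(512, 1, 1), and that a plain integer converts to a dim3 in the same way, which is what happens when you pass 512 directly in the launch configuration.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: true when two launch shapes match in all three components.
inline bool sameShape(dim3 a, dim3 b)
{
    return a.x == b.x && a.y == b.y && a.z == b.z;
}

// Hypothetical helper: compares the two ways of writing a 1D block size.
inline bool launchArgsEquivalent()
{
    dim3 viaDim3(512);   // y and z are left unspecified, so both become 1
    dim3 viaInt = 512;   // an integer converts implicitly to dim3(512, 1, 1)
    return sameShape(viaDim3, dim3(512, 1, 1)) && sameShape(viaInt, viaDim3);
}
```

Calling launchArgsEquivalent() on the host should return true, so the choice between the two spellings is a matter of readability.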

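So in both cases, dim3 blockDims(512); and myKernel<<<num_blocks, 512>>>(...), you will always have access to threadIdx.y and threadIdx.z, because a plain integer in the launch configuration is converted to a dim3 whose remaining components are 1.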

Since thread indices start at zero, you can still compute a memory position in row-major order that also uses the y dimension:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

int gid = img.col * y + x;

This works because blockIdx.y and threadIdx.y will both be zero in a 1D launch, so y is 0 and gid is equal to x.
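To make this concrete, here is a minimal sketch of a 1D launch that still computes y. The names scaleKernel, divUp and scaleOnDevice, and the width, n and factor parameters, are assumptions for illustration and not part of the question's code. The image width is passed as a kernel argument because a host-side struct such as img cannot be read directly inside the kernel. The sketch also uses integer ceiling division for the grid size: assuming img.row and img.col are integers, ceil(a / b) truncates before ceil is applied, so the last partial block would be dropped.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: multiplies each of the n floats in a flat buffer by factor.
__global__ void scaleKernel(float* data, int width, int n, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // 0 for a 1D launch
    int gid = width * y + x;                        // equals x when y is 0
    if (gid < n)                                    // the last block may overshoot n
        data[gid] *= factor;
}

// Integer ceiling division, so a partially filled last block is still launched.
inline unsigned int divUp(unsigned int a, unsigned int b)
{
    return (a + b - 1) / b;
}

// Hypothetical host helper: copy to the device, run a 1D launch, copy back.
inline cudaError_t scaleOnDevice(std::vector<float>& host, int width, float factor)
{
    if (host.empty()) return cudaSuccess;           // a grid of 0 blocks is invalid
    const int n = static_cast<int>(host.size());
    const size_t bytes = host.size() * sizeof(float);
    float* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const unsigned int threads = 512;
        const unsigned int blocks = divUp(static_cast<unsigned int>(n), threads);
        scaleKernel<<<blocks, threads>>>(dev, width, n, factor);  // plain integers
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err;
}
```

Replacing the plain integers in the launch with dim3 gridDims(blocks) and dim3 blockDims(threads) would launch exactly the same configuration.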

To sum up, it does not matter whether you use a dim3 structure. Using one can make it clearer where the thread configuration is defined. Whether the access pattern is 1D, 2D or 3D depends on how you interpret your data and on how you index it from 1D, 2D or 3D blocks of threads.
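For comparison, here is a sketch of what a genuinely 2D configuration over the same flat buffer could look like. The names invertKernel, divUp and gridFor are hypothetical, and the 16 x 16 block shape in the usage note is only a common choice, not a requirement. The buffer is still 1D in memory; only the way threads map to elements changes.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per pixel of a width x height single-channel image.
__global__ void invertKernel(float* img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)                    // edge blocks may overshoot
        img[y * width + x] = 1.0f - img[y * width + x];
}

// Integer ceiling division.
inline unsigned int divUp(unsigned int a, unsigned int b)
{
    return (a + b - 1) / b;
}

// Hypothetical helper: grid shape that covers the whole image for a given 2D block.
inline dim3 gridFor(int width, int height, dim3 block)
{
    return dim3(divUp(static_cast<unsigned int>(width), block.x),
                divUp(static_cast<unsigned int>(height), block.y));
}
```

A launch would then read dim3 block(16, 16); invertKernel<<<gridFor(w, h, block), block>>>(img_d, w, h);, where w and h stand for the image dimensions. Here dim3 is needed, since a plain integer can only describe the x component.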

answered Oct 17 '22 22:10

pQB
