Confusion over grid and block dimensions

Tags:

cuda

I am trying to solve the problem at the end of lesson 1 of the Udacity course but I'm not sure if I have just made a typo or if the actual code is wrong.

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage, unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    size_t totalPixels = numRows * numCols;
    size_t gridRows = totalPixels / 32;
    size_t gridCols = totalPixels / 32;
    const dim3 blockSize(32,32,1);
    const dim3 gridSize(gridCols,gridRows,1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}

The other method is:

void rgba_to_greyscale(const uchar4* const rgbaImage, unsigned char* const greyImage, int numRows, int numCols)
{   
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
    uchar4 rgba = rgbaImage[x * numCols + y];
    float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
    greyImage[x * numCols + y] = channelSum;
}

Error message says the following:

libdc1394 error: failed to initialize libdc1394
Cuda error at student_func.cu:76
unspecified launch failure cudaGetLastError()
we were unable to execute your code. Did you set the grid and/or block size correctly?

But then, it says that the code has compiled,

Your code compiled!
error output: libdc1394 error: Failed to initialize libdc1394
Cuda error at student_func.cu:76
unspecified launch failure cudaGetLastError()

Line 76 is the last line in the first code block and as far as I'm aware I haven't changed anything in it. Line 76 is as follows,

rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);

I can't actually find the declaration of cudaGetLastError().

I'm mainly concerned with my understanding on setting up the grid/block dimensions + whether the first methods approach was right with regards to mapping between a 1D array of pixel positions and my threads.

EDIT: I guess I've misunderstood something. Is numRows the number of pixels in the vertical? And is numCols the pixels in horizontal direction?

My block is made up of 8 x 8 threads, where each thread represents 1 pixel? If so, I'm assuming that's why I had to divide by 4 when calculating gridRows since the image is not square? I'm assuming I could have also made a block that was 2:1 columns : rows?

Screen shot

EDIT 2: I just tried to change my block so that it was 2:1 ratio, so I could then divide numRows and numCol by the same number but its now showing blank areas at the bottom and side. Why is there blank areas both at the bottom and side. I haven't changed the y dimensions of by grid or block.

enter image description here

993

asked Jun 13 '13 11:06

Hans Rudel

1 Answers

each blocks processes 32*32 pixels, and there are (totalPixels / 32) * (totalPixels / 32) blocks, so you process totalPixels ^ 2 pixels - that seems wrong

1st was wrong, this should be the correct one:

const dim3 blockSize(32,32,1);

size_t gridCols = (numCols + blockSize.x - 1) / blockSize.x;
size_t gridRows = (numRows + blockSize.y - 1) / blockSize.y;

it is a pretty common pattern for 2d - you can remember it

in the sample image size is not power of two and you want block to cover all your image(or even more)

so next must be correct: gridCols * blockSize.x >= numCols gridRows * blockSize.y >= numRows

you choose block size and basing on it you compute amount of blocks you need to cover all image

after that, in the kernel, you must check that you are not 'out of image', for cases with bad size

another problem is in the kernel, it must be (y * numCols + x), not oposite

kernel:

int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;

if(x < numCols && y < numRows)
{
    uchar4 rgba = rgbaImage[y * numCols + x];
    float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
    greyImage[y * numCols + x] = channelSum;
}

calling code:

const dim3 blockSize(4,32,1); // may be any

size_t gridCols = (numCols + blockSize.x - 1) / blockSize.x;
size_t gridRows = (numRows + blockSize.y - 1) / blockSize.y;

const dim3 gridSize(gridCols,gridRows,1);
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
cudaDeviceSynchronize(); 
checkCudaErrors(cudaGetLastError());

damn it, i feel like i am doing things even harder to understand (

188

answered Sep 24 '22 23:09

ShPavel

Related questions
                            
                                Analyzing memory access coalescing of my CUDA kernel
                            
                                How to prevent two CUDA programs from interfering
                            
                                conditional syncthreads & deadlock (or not)
                            
                                Error : argument of type "int" is incompatible with parameter of type "const void *"
                            
                                What's the efficient way to swap two register variables in CUDA?
                            
                                How can I check the progress of matrix multiplication?
                            
                                CMake does not properly find CUDA library
                            
                                Using multiple CUDA GPUs
                            
                                How to define CUDA device constant like a C++ const/constexpr?
                            
                                ERROR: The Compose file './docker-compose.yaml' is invalid because: Unsupported config option for services.nvidia-smi-test: 'runtime'
                            
                                Is it possible to enable syntax highlighting for CUDA 4.0 in Visual Studio 2010?
                            
                                Programming CUDA using Delphi or FreePascal
                            
                                Issue with production release of CUDA Toolkit 4.0 and Nsight 2.0
                            
                                Cuda Bayer/CFA demosaicing example
                            
                                initializer not allowed for __shared__ variable for cuda
                            
                                /usr/bin/ld: cannot find -lcudart
                            
                                L1 cache persistance across CUDA kernels
                            
                                CUDA-capability and CUDA version: compatible?
                            
                                How to get properties from active CUDA device?
                            
                                Why CUDA Occupancy is defined in terms of number of active warps over max warps supported

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With