I am trying to solve the problem at the end of lesson 1 of the Udacity course but I'm not sure if I have just made a typo or if the actual code is wrong.
void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage, unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
size_t totalPixels = numRows * numCols;
size_t gridRows = totalPixels / 32;
size_t gridCols = totalPixels / 32;
const dim3 blockSize(32,32,1);
const dim3 gridSize(gridCols,gridRows,1);
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}
The other method is:
void rgba_to_greyscale(const uchar4* const rgbaImage, unsigned char* const greyImage, int numRows, int numCols)
{
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
uchar4 rgba = rgbaImage[x * numCols + y];
float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
greyImage[x * numCols + y] = channelSum;
}
Error message says the following:
libdc1394 error: failed to initialize libdc1394
Cuda error at student_func.cu:76
unspecified launch failure cudaGetLastError()
we were unable to execute your code. Did you set the grid and/or block size correctly?
But then, it says that the code has compiled,
Your code compiled!
error output: libdc1394 error: Failed to initialize libdc1394
Cuda error at student_func.cu:76
unspecified launch failure cudaGetLastError()
Line 76 is the last line in the first code block and as far as I'm aware I haven't changed anything in it. Line 76 is as follows,
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
I can't actually find the declaration of cudaGetLastError()
.
I'm mainly concerned with my understanding on setting up the grid/block dimensions + whether the first methods approach was right with regards to mapping between a 1D array of pixel positions and my threads.
EDIT:
I guess I've misunderstood something. Is numRows
the number of pixels in the vertical? And is numCols
the pixels in horizontal direction?
My block is made up of 8 x 8 threads, where each thread represents 1 pixel? If so, I'm assuming that's why I had to divide by 4 when calculating gridRows
since the image is not square? I'm assuming I could have also made a block that was 2:1 columns : rows?
EDIT 2:
I just tried to change my block so that it was 2:1 ratio, so I could then divide numRows
and numCol
by the same number but its now showing blank areas at the bottom and side. Why is there blank areas both at the bottom and side. I haven't changed the y dimensions of by grid or block.
' The maximum x, y and z dimensions of a block are 1024, 1024 and 64, and it should be allocated such that x × y × z ≤ 1024, which is the maximum number of threads per block.
Choosing the number of threads per block is very complicated. Most CUDA algorithms admit a large range of possibilities, and the choice is based on what makes the kernel run most efficiently. It is almost always a multiple of 32, and at least 64, because of how the thread scheduling hardware works.
dim3 is an integer vector type that can be used in CUDA code. Its most common application is to pass the grid and block dimensions in a kernel invocation. It can also be used in any user code for holding values of 3 dimensions.
each blocks processes 32*32 pixels, and there are (totalPixels / 32) * (totalPixels / 32) blocks, so you process totalPixels ^ 2 pixels - that seems wrong
1st was wrong, this should be the correct one:
const dim3 blockSize(32,32,1);
size_t gridCols = (numCols + blockSize.x - 1) / blockSize.x;
size_t gridRows = (numRows + blockSize.y - 1) / blockSize.y;
it is a pretty common pattern for 2d - you can remember it
in the sample image size is not power of two and you want block to cover all your image(or even more)
so next must be correct: gridCols * blockSize.x >= numCols gridRows * blockSize.y >= numRows
you choose block size and basing on it you compute amount of blocks you need to cover all image
after that, in the kernel, you must check that you are not 'out of image', for cases with bad size
another problem is in the kernel, it must be (y * numCols + x), not oposite
kernel:
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
if(x < numCols && y < numRows)
{
uchar4 rgba = rgbaImage[y * numCols + x];
float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
greyImage[y * numCols + x] = channelSum;
}
calling code:
const dim3 blockSize(4,32,1); // may be any
size_t gridCols = (numCols + blockSize.x - 1) / blockSize.x;
size_t gridRows = (numRows + blockSize.y - 1) / blockSize.y;
const dim3 gridSize(gridCols,gridRows,1);
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
cudaDeviceSynchronize();
checkCudaErrors(cudaGetLastError());
damn it, i feel like i am doing things even harder to understand (
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With