I am having some difficulty understanding the batch loading referred to in the code comments. To compute the convolution at a pixel, the mask of size 5 must be centered on that specific pixel. The image is divided into tiles; after the convolution mask is applied, these become the final output tiles of size TILE_WIDTH*TILE_WIDTH. For pixels on the border of an output tile, the mask must borrow some pixels from the neighboring tile, unless that neighbor lies outside the image, in which case the borrowed values are set to zero. These two cases are handled by
if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
    N_ds[destY][destX] = I[src];
else
    N_ds[destY][destX] = 0;
For that reason the shared memory array has TILE_WIDTH + Mask_width - 1 elements in each dimension. The following parts of the code are unclear to me:
1) The destY and destX indices. What does dividing the output index by the input tile width mean?

2) The srcY and srcX indices. Why do the destY and destX indices take part in the srcY and srcX indices?

srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;

3) Why do we use the offset TILE_WIDTH * TILE_WIDTH in the second loading?

4) Generally, what is the intuitive explanation of having two loadings?

EDIT: Image added. The output tiles are shown in green, and in red is the mask centered on index 114. It is clear that the mask borrows elements from different tiles. Finally, this image refers to a single channel.
Example: Based on the image below I have tried to write an example. The output tile has blockIdx.x = 1 and blockIdx.y = 1, and based on that destY = 0 and destX = 0. Also, srcY = 1*6+0-3 = 3, srcX = 3 and src = (3*18+3)*3+0 = 171. Based on these calculations and the image example we do not have a match: the value that should be stored in the first shared memory position is the one with global index 57. What is wrong with the above calculations? Can anyone help, please?
#define Mask_width 5
#define Mask_radius Mask_width/2
#define TILE_WIDTH 16
#define w (TILE_WIDTH + Mask_width - 1)
#define clamp(x) (min(max((x), 0.0), 1.0))

__global__ void convolution(float *I, const float* __restrict__ M, float *P,
                            int channels, int width, int height) {
    __shared__ float N_ds[w][w];
    int k;
    for (k = 0; k < channels; k++) {
        // First batch loading: the first TILE_WIDTH*TILE_WIDTH shared memory
        // locations, one per thread
        int dest = threadIdx.y * TILE_WIDTH + threadIdx.x,
            destY = dest / w, destX = dest % w,
            srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius,
            srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius,
            src = (srcY * width + srcX) * channels + k;
        if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
            N_ds[destY][destX] = I[src];
        else
            N_ds[destY][destX] = 0;   // zero padding outside the image

        // Second batch loading: the remaining shared memory locations, covered
        // by re-running the same mapping with an offset of TILE_WIDTH*TILE_WIDTH
        dest = threadIdx.y * TILE_WIDTH + threadIdx.x + TILE_WIDTH * TILE_WIDTH;
        destY = dest / w, destX = dest % w;
        srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
        srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
        src = (srcY * width + srcX) * channels + k;
        if (destY < w) {              // only threads that still fall inside N_ds
            if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
                N_ds[destY][destX] = I[src];
            else
                N_ds[destY][destX] = 0;
        }
        __syncthreads();              // all loads must finish before the convolution

        float accum = 0;
        int y, x;
        for (y = 0; y < Mask_width; y++)
            for (x = 0; x < Mask_width; x++)
                accum += N_ds[threadIdx.y + y][threadIdx.x + x] * M[y * Mask_width + x];
        y = blockIdx.y * TILE_WIDTH + threadIdx.y;
        x = blockIdx.x * TILE_WIDTH + threadIdx.x;
        if (y < height && x < width)
            P[(y * width + x) * channels + k] = clamp(accum);
        __syncthreads();              // don't overwrite N_ds before every thread is done
    }
}
Your question is similar in concept to my first question on StackOverflow: Moving a (BS_X+1)*(BS_Y+1) global memory matrix by BS_X*BS_Y threads.
You are facing the following problem: each thread block of size TILE_WIDTH x TILE_WIDTH should fill a shared memory area of size (TILE_WIDTH + Mask_width - 1) x (TILE_WIDTH + Mask_width - 1).
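Before going through the individual points, note that the kernel's output indexing (y = blockIdx.y * TILE_WIDTH + threadIdx.y, and similarly for x) implies one TILE_WIDTH x TILE_WIDTH block per output tile. The launch code is not shown in the question, so the following is only a sketch with assumed device pointer names d_I, d_M and d_P:

    // Hypothetical launch (not part of the posted code): one block per
    // TILE_WIDTH x TILE_WIDTH output tile, grid rounded up to cover the image.
    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid((width  + TILE_WIDTH - 1) / TILE_WIDTH,
              (height + TILE_WIDTH - 1) / TILE_WIDTH);
    convolution<<<grid, block>>>(d_I, d_M, d_P, channels, width, height);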
4) Generally, what is the intuitive explanation of having two loadings?

Since the shared memory area of (TILE_WIDTH + Mask_width - 1) x (TILE_WIDTH + Mask_width - 1) elements is larger than the block size TILE_WIDTH x TILE_WIDTH, and assuming it is smaller than 2 x TILE_WIDTH x TILE_WIDTH, each thread has to move at most two elements from global memory to shared memory. This is the reason for the two-stage loading.
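As a side note, the two hard-coded stages can also be written as a single strided loop in which each thread keeps stepping through the flattened shared memory indices by TILE_WIDTH * TILE_WIDTH until all w * w locations are covered. The following is only a sketch equivalent to the posted loading code, using the same variable names inside the channel loop; it is not a tested drop-in replacement:

    // Generic loading loop: each thread writes the shared memory locations
    // dest, dest + TILE_WIDTH*TILE_WIDTH, ... as long as they fall inside N_ds.
    for (int dest = threadIdx.y * TILE_WIDTH + threadIdx.x;
         dest < w * w;
         dest += TILE_WIDTH * TILE_WIDTH) {
        int destY = dest / w;
        int destX = dest % w;
        int srcY  = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
        int srcX  = blockIdx.x * TILE_WIDTH + destX - Mask_radius;
        if (srcY >= 0 && srcY < height && srcX >= 0 && srcX < width)
            N_ds[destY][destX] = I[(srcY * width + srcX) * channels + k];
        else
            N_ds[destY][destX] = 0.0f;   // zero padding outside the image
    }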
1) The destY and destX indices. What does dividing the output index by the input tile width mean?

This concerns the first loading stage, which loads TILE_WIDTH x TILE_WIDTH elements from global memory and fills the uppermost part of the shared memory area.

So, the operation

dest = threadIdx.y * TILE_WIDTH + threadIdx.x;

flattens the 2D coordinates of the generic thread, while

destX = dest % w;
destY = dest / w;

performs the inverse operation, calculating the 2D coordinates of the generic thread with respect to the shared memory area.
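As a worked example with TILE_WIDTH = 16 and w = 20: the thread with threadIdx.y = 3 and threadIdx.x = 7 gets dest = 3 * 16 + 7 = 55, which unflattens to destY = 55 / 20 = 2 and destX = 55 % 20 = 15. Consecutive values of dest therefore fill the shared memory row by row, independently of the block's own 16 x 16 shape.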
2) The srcY and srcX indices. Why do the destY and destX indices take part in the srcY and srcX indices?

srcY = blockIdx.y * TILE_WIDTH + destY - Mask_radius;
srcX = blockIdx.x * TILE_WIDTH + destX - Mask_radius;

(blockIdx.x * TILE_WIDTH, blockIdx.y * TILE_WIDTH) would be the coordinates of the global memory location if the block size and the shared memory size were the same. Since you are "borrowing" memory values also from neighboring tiles, you have to shift the above coordinates by (destX - Mask_radius, destY - Mask_radius).
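As a concrete check with TILE_WIDTH = 16 and Mask_radius = 2: in the block with blockIdx.x = blockIdx.y = 1, the thread that gets destY = destX = 0 reads srcY = 1 * 16 + 0 - 2 = 14 and srcX = 14, i.e. the top-left corner of the halo, two rows and two columns up and to the left of the tile's first output pixel at (x, y) = (16, 16).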
3) Why do we use the offset TILE_WIDTH * TILE_WIDTH in the second loading?

You have this offset because in the first loading stage you have already filled the "first" TILE_WIDTH x TILE_WIDTH locations of the shared memory.
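Numerically, with TILE_WIDTH = 16 and w = 20: the first stage covers dest = 0 ... 255, so the second stage restarts the same mapping at dest = 256 ... 511, and only the indices with dest < w * w = 400 correspond to actual shared memory locations, which is exactly what the if (destY < w) guard enforces.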
EDIT

The picture below illustrates the correspondence between the flattened thread index dest and the shared memory locations. In the picture, the blue boxes represent the elements of the generic tile, while the red boxes represent the elements of the neighboring tiles. The union of the blue and red boxes covers the whole set of shared memory locations. As you can see, all 256 threads of a thread block are involved in filling the upper part of the shared memory above the green line, while only 144 of them fill the lower part below the green line. Now you should also understand the TILE_WIDTH * TILE_WIDTH offset.
Please notice that you have at most 2 memory loads per thread only because of the particular choice of your parameters. For example, if you had TILE_WIDTH = 8, then the number of threads in a thread block would be 64, while the shared memory size would be 12 x 12 = 144; since 144 / 64 = 2.25, some threads would have to perform 3 shared memory writes, and the two-stage loading would no longer be enough.
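More generally, the number of loading stages needed is ceil((w * w) / (TILE_WIDTH * TILE_WIDTH)); the strided loading loop sketched earlier in this answer covers any such value automatically.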