I have read the article Optimizing Parallel Reduction in CUDA by Mark Harris and found it very useful, but I am still unable to understand one or two concepts. On page 18 it is written:
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
Optimized code ("first add during load"), with two loads and the first add of the reduction:
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;                             // line 1
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;   // line 2
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];            // line 3
__syncthreads();                                            // line 4
I am unable to understand line 2: if I have 256 elements, and I choose 128 as my block size, why am I multiplying by 2? Also, please explain how to determine the block size.
In general, parallel reduction can be applied for any binary associative operator *, i.e. one satisfying (A*B)*C = A*(B*C). With such an operator, the parallel reduction algorithm repeatedly groups the array elements in pairs. Each pair is computed in parallel with the others, halving the overall array size in one step.
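As a minimal sketch of that halving idea on the host (plain C++; pairwise_reduce is a hypothetical helper, and n is assumed to be a power of two):

// At every step element k absorbs element k + half, so the live part
// of the array shrinks from n to n/2. Assumes n is a power of two.
void pairwise_reduce(int *a, int n)
{
    while (n > 1) {
        int half = n / 2;
        for (int k = 0; k < half; ++k)
            a[k] = a[k] + a[k + half];   // '+' stands for any associative operator
        n = half;
    }
    // a[0] now holds the reduction of all the original elements
}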
In computer science, the reduction operator is a type of operator that is commonly used in parallel programming to reduce the elements of an array into a single result. Reduction operators are associative and often (but not necessarily) commutative.
Parallel algorithms need to optimize one more resource: the communication between different processors. There are two ways parallel processors communicate: shared memory or message passing.
A parallel programming paradigm implies basic assumptions about how fundamental hardware components, such as processors and memory, are connected to one another. These basic assumptions are crucial to writing functionally correct and efficient parallel code.
Basically, the optimized code performs the first step of the reduction during the load itself: half of the threads perform the reading from global memory and the writing to shared memory, and each of those threads adds two elements.
You execute a kernel and you want to reduce some values, so you restrict the code above to only half of the total number of threads you would otherwise run. Imagine you have data for 4 blocks of 512 threads each, that is g_idata[4*512]; you launch only the first two blocks, and each thread reads two elements (see the launch sketch after the list below):
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;  // each block covers 2*blockDim.x input elements
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];           // two loads and one add per thread
So:

thread 0 of block 0 copies positions 0 and 512;
thread 1 of block 0 copies positions 1 and 513;
thread 511 of block 0 copies positions 511 and 1023;
thread 0 of block 1 copies positions 1024 and 1536;
thread 511 of block 1 copies positions 1535 and 2047.
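For completeness, here is what the launch for that 2048-element case could look like (a sketch; the kernel name reduce and the variable names are assumptions, not from the original code):

// 2048 elements, 512 threads per block; each thread loads two elements,
// so only 2048 / (512 * 2) = 2 blocks are needed.
int n = 2048;
int threads = 512;
int blocks = n / (threads * 2);                                   // = 2
reduce<<<blocks, threads, threads * sizeof(int)>>>(g_idata, g_odata);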
The blockDim.x*2 is used because each thread accesses positions i and i+blockDim.x, so the block index has to be multiplied by 2 to guarantee that the threads of the next block do not read positions of g_idata that have already been read.
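Putting the pieces together, a complete kernel along the lines of the one in Harris's slides might look like this (a sketch; the kernel name and the int element type are assumptions):

// First-add-during-load reduction: each block reduces 2*blockDim.x elements.
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    // Each thread loads two elements and writes their sum to shared memory.
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // Tree reduction in shared memory: the active half shrinks each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial result to global memory.
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}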
In the optimized code you run the kernel with blocks half as large as in the non-optimized implementation.
Let's call the block size of the non-optimized code work, and let half of this size be called unit; these sizes keep the same numerical values in the optimized code as well.

In the non-optimized code you run the kernel with as many threads per block as the work, that is blockDim = 2 * unit. The code in each block just copies a part of g_idata to an array in shared memory of size 2 * unit.

In the optimized code blockDim = unit, so there are now half as many threads, and the array in shared memory is half as large. In line 3 the first summand comes from the even units and the second from the odd units: thread t of block b reads g_idata[b*2*unit + t] (element t of the even unit 2b) and g_idata[b*2*unit + unit + t] (element t of the odd unit 2b+1). In this way all the data required for the reduction is taken into account.
Example:
If you run the non-optimized kernel with blockDim = 256 = work (a single block, unit = 128), then the optimized code has a single block of blockDim = 128 = unit. Since this block gets blockIdx = 0, the *2 does not matter; the first thread does g_idata[0] + g_idata[0 + 128].

If you had 512 elements and ran the non-optimized kernel with 2 blocks of size 256 (work = 256, unit = 128), then the optimized code has 2 blocks, but now of size 128. The first thread in the second block (blockIdx = 1) does g_idata[2*128] + g_idata[2*128 + 128].
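To contrast the two launch configurations for that 512-element case (the kernel names reduce_naive and reduce_opt are hypothetical, used only to show the difference):

// Same grid of 2 blocks, but the optimized kernel uses half the threads.
reduce_naive<<<2, 256, 256 * sizeof(int)>>>(g_idata, g_odata);  // work = 256, one load per thread
reduce_opt  <<<2, 128, 128 * sizeof(int)>>>(g_idata, g_odata);  // unit = 128, two loads + first add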