I have read the article Optimizing Parallel Reduction in CUDA by Mark Harris and found it very useful, but I am still unable to understand one or two concepts. On page 18 it is written:
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
Optimized code ("first add during load"), with two loads and the first add of the reduction:
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;                             // line 1
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;   // line 2
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];            // line 3
__syncthreads();                                            // line 4
I am unable to understand line 2: if I have 256 elements, and I choose 128 as my block size, why am I multiplying by 2? Also, please explain how to determine the block size.
In general, parallel reduction can be applied for any binary associative operator *, i.e. one satisfying (A*B)*C = A*(B*C). With such an operator, the parallel reduction algorithm repeatedly groups the array elements in pairs. Each pair is computed in parallel with the others, halving the overall array size in one step.
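As a minimal sketch of that halving idea on the host (plain C++; pairwise_reduce is a hypothetical helper, and n is assumed to be a power of two):

// At every step element k absorbs element k + half, so the live part
// of the array shrinks from n to n/2. Assumes n is a power of two.
void pairwise_reduce(int *a, int n)
{
    while (n > 1) {
        int half = n / 2;
        for (int k = 0; k < half; ++k)
            a[k] = a[k] + a[k + half];   // '+' stands for any associative operator
        n = half;
    }
    // a[0] now holds the reduction of all the original elements
}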
In computer science, the reduction operator is a type of operator that is commonly used in parallel programming to reduce the elements of an array into a single result. Reduction operators are associative and often (but not necessarily) commutative.
Parallel algorithms need to optimize one more resource: the communication between different processors. There are two ways parallel processors communicate: shared memory or message passing.
A parallel programming paradigm implies basic assumptions about how fundamental hardware components, such as processors and memory, are connected to one another. These basic assumptions are crucial to writing functionally correct and efficient parallel code.
Basically, the optimized code performs the first step of the reduction during the load itself: half of the threads perform the reading from global memory and the writing to shared memory, and each of those threads adds two elements.
You execute a kernel and you want to reduce some values, so you restrict the code above to only half of the total number of threads you would otherwise run. Imagine you have data for 4 blocks of 512 threads each, that is g_idata[4*512]; you launch only the first two blocks, and each thread reads two elements (see the launch sketch after the list below):
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;  // each block covers 2*blockDim.x input elements
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];           // two loads and one add per thread
So:

thread 0 of block 0 copies positions 0 and 512;
thread 1 of block 0 copies positions 1 and 513;
thread 511 of block 0 copies positions 511 and 1023;
thread 0 of block 1 copies positions 1024 and 1536;
thread 511 of block 1 copies positions 1535 and 2047.
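For completeness, here is what the launch for that 2048-element case could look like (a sketch; the kernel name reduce and the variable names are assumptions, not from the original code):

// 2048 elements, 512 threads per block; each thread loads two elements,
// so only 2048 / (512 * 2) = 2 blocks are needed.
int n = 2048;
int threads = 512;
int blocks = n / (threads * 2);                                   // = 2
reduce<<<blocks, threads, threads * sizeof(int)>>>(g_idata, g_odata);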
The blockDim.x*2 is used because each thread accesses positions i and i+blockDim.x, so the block index has to be multiplied by 2 to guarantee that the threads of the next block do not read positions of g_idata that have already been read.
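Putting the pieces together, a complete kernel along the lines of the one in Harris's slides might look like this (a sketch; the kernel name and the int element type are assumptions):

// First-add-during-load reduction: each block reduces 2*blockDim.x elements.
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    // Each thread loads two elements and writes their sum to shared memory.
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // Tree reduction in shared memory: the active half shrinks each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial result to global memory.
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}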
In the optimized code you run the kernel with blocks half as large as in the non-optimized implementation.
Let's call the block size of the non-optimized code work, and let half of this size be called unit; these sizes keep the same numerical values in the optimized code as well.

In the non-optimized code you run the kernel with as many threads per block as the work, that is blockDim = 2 * unit. The code in each block just copies a part of g_idata to an array in shared memory of size 2 * unit.

In the optimized code blockDim = unit, so there are now half as many threads, and the array in shared memory is half as large. In line 3 the first summand comes from the even units and the second from the odd units: thread t of block b reads g_idata[b*2*unit + t] (element t of the even unit 2b) and g_idata[b*2*unit + unit + t] (element t of the odd unit 2b+1). In this way all the data required for the reduction is taken into account.
Example:
If you run the non-optimized kernel with blockDim = 256 = work (a single block, unit = 128), then the optimized code has a single block of blockDim = 128 = unit. Since this block gets blockIdx = 0, the *2 does not matter; the first thread does g_idata[0] + g_idata[0 + 128].

If you had 512 elements and ran the non-optimized kernel with 2 blocks of size 256 (work = 256, unit = 128), then the optimized code has 2 blocks, but now of size 128. The first thread in the second block (blockIdx = 1) does g_idata[2*128] + g_idata[2*128 + 128].
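To contrast the two launch configurations for that 512-element case (the kernel names reduce_naive and reduce_opt are hypothetical, used only to show the difference):

// Same grid of 2 blocks, but the optimized kernel uses half the threads.
reduce_naive<<<2, 256, 256 * sizeof(int)>>>(g_idata, g_odata);  // work = 256, one load per thread
reduce_opt  <<<2, 128, 128 * sizeof(int)>>>(g_idata, g_odata);  // unit = 128, two loads + first add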