
Block reduction in CUDA

I am new to CUDA and am trying to implement a reduction. I am currently studying sample code from NVIDIA.

I am not sure how to set up the block size and grid size, especially when my input array (512 x 512) is larger than a single block.

Here is the code.

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

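    // Grid-stride loop: each thread adds up to two elements per iteration.
    while (i < n) 
    { 
        sdata[tid] += g_idata[i];
        // Guard the second load so it cannot read past the end of the array
        // when n is not a multiple of blockSize*2.
        if (i + blockSize < n) sdata[tid] += g_idata[i+blockSize];
        i += gridSize; 
    }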

    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

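    if (tid < 32) 
    {
        // This tail has no __syncthreads(), so read and write through a volatile
        // pointer to stop the compiler from keeping sdata values in registers.
        // GPUs with independent thread scheduling (Volta and later) additionally
        // need __syncwarp() between steps.
        volatile int *vsdata = sdata;
        if (blockSize >= 64) vsdata[tid] += vsdata[tid + 32];
        if (blockSize >= 32) vsdata[tid] += vsdata[tid + 16];
        if (blockSize >= 16) vsdata[tid] += vsdata[tid + 8];
        if (blockSize >= 8) vsdata[tid] += vsdata[tid + 4];
        if (blockSize >= 4) vsdata[tid] += vsdata[tid + 2];
        if (blockSize >= 2) vsdata[tid] += vsdata[tid + 1];
    }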

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

However, it seems to me that g_odata[blockIdx.x] stores only the partial sum of each block, so to get the final result I need to sum all the elements of the g_odata array.

I am wondering: is there a kernel that does the whole summation, or am I misunderstanding things here? I would really appreciate it if anyone could explain this to me. Thanks very much.

asked Apr 08 '14 13:04 by Ono


1 Answer

Your understanding is correct. The reductions demonstrated here end up with a sequence of block-sums deposited in global memory.

Summing all of these block sums together requires some form of global synchronization: you must wait until all the blocks are complete before adding their sums together. You have a number of options at this point, some of which are:

  1. Launch a new kernel after the main kernel to sum the block sums together.
  2. Add the block sums on the host.
  3. Use atomics to add the block sums together at the end of the main kernel.
  4. Use a method like threadfence reduction to add the block sums together in the main kernel.
  5. Use CUDA cooperative groups to place a grid-wide sync in the kernel code. Sum the block sums after the grid-wide sync (perhaps in one block).
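
As a rough illustration of option 1, here is a minimal sketch of a two-pass reduction. It does not reuse reduce6; the kernel blockSum, the wrapper sumOnDevice, and the launch configuration (64 blocks of 256 threads) are hypothetical choices made for this example. The kernel assumes a power-of-two block size, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block writes one partial sum to out[blockIdx.x].
// Assumes blockDim.x is a power of two and that the launch provides
// blockDim.x * sizeof(int) bytes of dynamic shared memory.
__global__ void blockSum(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;

    // Grid-stride loop, so the grid may cover fewer threads than n elements.
    int acc = 0;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += blockDim.x * gridDim.x)
        acc += in[i];
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory, synchronizing after every step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper: pass 1 produces one partial sum per block,
// pass 2 reduces those partial sums with a single block.
int sumOnDevice(const std::vector<int> &h_in)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const int threads = 256;  // assumed block size (power of two)
    const int blocks = 64;    // assumed grid size for the first pass

    int *d_in = nullptr, *d_partial = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Both launches go to the default stream, so the second kernel starts
    // only after every block of the first has finished. That ordering is
    // the global synchronization this approach relies on.
    blockSum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_partial, n);
    blockSum<<<1, threads, threads * sizeof(int)>>>(d_partial, d_out, blocks);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernels

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

For the 512 x 512 input in the question, n would be 262144 and each thread would accumulate 16 elements before the shared-memory step. Option 2 is the same first pass followed by copying the 64 partial sums back to the host and adding them in an ordinary loop, which can be reasonable when the number of blocks is small.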

If you search around the CUDA tag you can find examples of all these, and discussions of their pros and cons. To see how the main kernel you posted is used for a complete reduction, look at the parallel reduction sample code.
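
Option 3 from the list above (atomics) can be sketched in a similar way. Again this is only a sketch under assumptions: blockSumAtomic and sumWithAtomics are hypothetical names, the block size is assumed to be a power of two, and error checking is omitted. Each block reduces its share in shared memory, then one thread per block adds the block sum to a single global total with atomicAdd, so no second kernel is needed.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every block adds its partial sum to *total.
// Assumes blockDim.x is a power of two and that *total was zeroed
// before the launch.
__global__ void blockSumAtomic(const int *in, int *total, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;

    // Grid-stride loop over the input.
    int acc = 0;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += blockDim.x * gridDim.x)
        acc += in[i];
    sdata[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory, synchronizing after every step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One atomic per block combines the block sums in global memory.
    if (tid == 0) atomicAdd(total, sdata[0]);
}

// Hypothetical host wrapper around the single-kernel version.
int sumWithAtomics(const std::vector<int> &h_in)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const int threads = 256;  // assumed block size (power of two)
    const int blocks = 64;    // assumed grid size

    int *d_in = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));  // the accumulator must start at zero

    blockSumAtomic<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_total, n);

    int result = 0;
    cudaMemcpy(&result, d_total, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel

    cudaFree(d_in);
    cudaFree(d_total);
    return result;
}
```

With integers the result does not depend on the order in which the blocks finish. With floating-point data the atomic additions happen in an unspecified order, so the last bits of the result can differ from run to run.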

answered Oct 13 '22 01:10 by Robert Crovella