Block reduction in CUDA

Tags: algorithm, cuda, reduction, cub

I am new to CUDA and am trying to implement a reduction. I am currently studying sample code from NVIDIA.

I am not sure how to set up the block size and grid size, especially when my input array (512 x 512) is larger than a single block.

Here is the code.

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

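    // Grid-stride loop: each thread adds two elements per iteration.
    // The second load is unguarded, so this is only safe when n is a
    // multiple of 2*blockSize (true for 512 x 512 with power-of-two blocks).
    while (i < n)
    {
        sdata[tid] += g_idata[i] + g_idata[i+blockSize];
        i += gridSize;
    }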

    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

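    // Final warp: no __syncthreads() here, so this assumes the 32 threads of a
    // warp run in lockstep. On Volta and later GPUs that is not guaranteed and
    // explicit warp-level synchronization is needed.
    if (tid < 32)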
    {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

However, it seems to me that g_odata holds one partial sum per block (written to g_odata[blockIdx.x]), and that to get the final result I need to sum all the entries of the g_odata array.

Is there a kernel that does the whole summation, or am I misunderstanding something? I would really appreciate it if anyone could explain this. Thanks very much.

asked Apr 08 '14 13:04

Ono

1 Answer

Your understanding is correct. The reductions demonstrated here end up with a sequence of block-sums deposited in global memory.

Summing all of these block sums together requires some form of global synchronization: you must wait until all the blocks are complete before adding their sums together. You have a number of options at this point, some of which are:

1. Launch a new kernel after the main kernel to sum the block sums together.
2. Add the block sums on the host.
3. Use atomics to add the block sums together at the end of the main kernel.
4. Use a method like threadfence reduction to add the block sums together in the main kernel.
5. Use CUDA cooperative groups to place a grid-wide sync in the kernel code, then sum the block sums after the grid-wide sync (perhaps in one block).
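
As a rough sketch of the first option (a second kernel launch), the code below reduces the input to one sum per block and then runs the same kernel again with a single block over those block sums. It is not the NVIDIA sample: the names blockSum and sumTwoPass and the choice of 256 threads and 64 blocks are assumptions made for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel (not the NVIDIA sample): each block writes one partial sum.
__global__ void blockSum(const int *in, int *out, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;

    // Grid-stride loop with a bounds check, so n does not have to be a
    // multiple of the block size.
    int local = 0;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += blockDim.x * gridDim.x)
        local += in[i];
    sdata[tid] = local;
    __syncthreads();

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: pass 1 produces the block sums,
// pass 2 reduces them with a single block.
int sumTwoPass(const std::vector<int> &h_in)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int threads = 256;  // assumed block size (power of two)
    const unsigned int blocks = 64;    // assumed grid size
    const size_t smem = threads * sizeof(int);

    int *d_in = nullptr, *d_partial = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads, smem>>>(d_in, d_partial, n);
    // Kernels launched into the same stream run in order, so the launch
    // boundary acts as the global synchronization point.
    blockSum<<<1, threads, smem>>>(d_partial, d_out, blocks);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}
```

The second option is the same idea with the last step moved to the CPU: copy d_partial back to the host after the first launch and add its 64 entries in an ordinary loop.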

If you search around the CUDA tag you can find examples of all these, and discussions of their pros and cons. To see how the main kernel you posted is used for a complete reduction, look at the parallel reduction sample code.
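
The atomics option can be sketched as follows, under the same caveats: sumAtomic and sumWithAtomics are hypothetical names, the launch configuration is an assumption, and error checking is omitted. Each block reduces its share in shared memory and then thread 0 adds the block sum to a single global accumulator with atomicAdd.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: reduce within each block, then add the block sum
// to *total with one atomic operation per block.
__global__ void sumAtomic(const int *in, int *total, unsigned int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;

    // Grid-stride loop with a bounds check.
    int local = 0;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += blockDim.x * gridDim.x)
        local += in[i];
    sdata[tid] = local;
    __syncthreads();

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(total, sdata[0]);
}

// Hypothetical host helper around the kernel above.
int sumWithAtomics(const std::vector<int> &h_in)
{
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    const unsigned int threads = 256;  // assumed block size (power of two)
    const unsigned int blocks = 64;    // assumed grid size

    int *d_in = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));  // the accumulator must start at zero

    sumAtomic<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_total, n);

    // This blocking copy waits for the kernel to finish, so the total is
    // complete by the time it is read.
    int result = 0;
    cudaMemcpy(&result, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return result;
}
```

The accumulator only holds the full sum once the whole kernel has finished, so it should be read after the launch completes rather than from inside the kernel. With integers the order in which the blocks add their sums does not matter; with floating-point data the result can differ slightly from run to run.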

answered Oct 13 '22 01:10

Robert Crovella
