CUDA - Parallel Reduction Sum

Tags:

I am trying to implement a parallel reduction sum in CUDA 7.5. I have been trying to follow the NVIDIA PDF that walks you through the initial algorithm and then steadily more optimised versions. I am currently making an array that is filled with 1 as the value in every array position so that I can check the output is correct but I am getting a value of -842159451 for an array of size 64. I am expecting that the kernel code is correct as I have followed the exact code from NVIDIA for it but here is my kernel:

__global__ void reduce0(int *input, int *output) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = input[i];

    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }

        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}

Here is my code calling the kernel, which is where I expect my problem to be:

int main()
{
    int numThreadsPerBlock = 1024;

    int *hostInput;
    int *hostOutput; 
    int *deviceInput; 
    int *deviceOutput; 

    int numInputElements = 64;
    int numOutputElements; // number of elements in the output list, initialised below

    numOutputElements = numInputElements / (numThreadsPerBlock / 2);
    if (numInputElements % (numThreadsPerBlock / 2)) {
        numOutputElements++;
    }

    hostInput = (int *)malloc(numInputElements * sizeof(int));
    hostOutput = (int *)malloc(numOutputElements * sizeof(int));


    for (int i = 0; i < numInputElements; ++i) {
        hostInput[i] = 1;
    }

    const dim3 blockSize(numThreadsPerBlock, 1, 1);
    const dim3 gridSize(numOutputElements, 1, 1);

    cudaMalloc((void **)&deviceInput, numInputElements * sizeof(int));
    cudaMalloc((void **)&deviceOutput, numOutputElements * sizeof(int));

    cudaMemcpy(deviceInput, hostInput, numInputElements * sizeof(int), cudaMemcpyHostToDevice);

    reduce0 << <gridSize, blockSize >> >(deviceInput, deviceOutput);

    cudaMemcpy(hostOutput, deviceOutput, numOutputElements * sizeof(int), cudaMemcpyDeviceToHost);

    for (int ii = 1; ii < numOutputElements; ii++) {
        hostOutput[0] += hostOutput[ii]; //accumulates the sum in the first element
    }

    int sumGPU = hostOutput[0];

    printf("GPU Result: %d\n", sumGPU);

    std::string wait;
    std::cin >> wait;

    return 0;
}

I have also tried bigger and smaller array sizes for the input and I get the same result of a very large negative value no matter the size of the array.

721

asked Nov 02 '17 19:11

Terramet

1 Answers

Seems you are using a dynamically allocated shared array:

extern __shared__ int sdata[];

but you are not allocating it in the kernel invocation:

reduce0 <<<gridSize, blockSize >>>(deviceInput, deviceOutput);

You have two options:

Option 1

Allocate the shared memory statically in the kernel, e.g.

constexpr int threadsPerBlock = 1024;
__shared__ int sdata[threadsPerBlock];

More often than not I find this the cleanest approach, as it works without a problem when you have multiple arrays in shared memory. The drawback is that while the size usually depends on the number of threads in the block, you need the size to be known at compile-time.

Option 2

Specify the amount of dynamically allocated shared memory in the kernel invocation.

reduce0 <<<gridSize, blockSize, numThreadsPerBlock*sizeof(int) >>>(deviceInput, deviceOutput);

This will work for any value of numThreadsPerBlock (provided it is within the allowed range of course). The drawback is that if you have multiple extern shared arrays, you need to figure out how to put then in the memory yourself, so that one does not overwrite the other.

Note, there may be other problems in your code. I didn't test it. This is something I spotted immediately upon glancing over your code.

158

answered Nov 03 '22 18:11

CygnusX1

Related questions
                            
                                Check if a DLL is signed C++
                            
                                What is gsl::multi_span to be used for?
                            
                                Getting user input from R console: Rcpp and std::cin
                            
                                co_await appears to be suboptimal?
                            
                                C++ overloading and overriding
                            
                                Why would you create a class with one single member which is an operator()? [duplicate]
                            
                                How to count C++ array items with a template function while allowing for empty arrays
                            
                                VSCode intellisense with C++ headers
                            
                                Run threads in gtest
                            
                                Get item by index from boost::variant like it's possible with std::variant
                            
                                Call C++ member function pointer without knowing which class
                            
                                Why isn't std::array's operator==() marked constexpr?
                            
                                Using inner class with CRTP
                            
                                How to read std::queue shared with another thread?
                            
                                adding <iostream> breaks the code in g++-7
                            
                                Difference between cross GCC and MacOSX GCC on eclipse
                            
                                Can we use heterogeneous lookup comparator to perform a "partial-match" search on STL associative containers?
                            
                                Create cartesian product expansion of two variadic, non-type template parameter packs
                            
                                Default mode of fstream
                            
                                Fully statically build application with all dependencies (libgcc, etc.)?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

CUDA - Parallel Reduction Sum

Tags:

c++

parallel-processing

cuda

Terramet

People also ask

1 Answers

CygnusX1

Recent Activity

Donate For Us