First of all, let me state that I am fully aware that my question has already been asked: Block reduction in CUDA. However, as I hope to make clear, my question is a follow-up to that one, and I have particular needs that make the solution found by that OP unsuitable.
So, let me explain. In my current code, I run a CUDA kernel at every iteration of a while-loop to do some computations over the values of an array. As an example, think of it like the following:
__global__ void calcKernel(int* idata, int* odata, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        odata[i] = (idata[i] + 2) * 5;
    }
}

// ... on the host side:

int max_iterations = 1000;
int iteration = 0;
while (iteration < max_iterations)
{
    calcKernel<<<blocks, threads>>>(idata, odata, n);
    iteration++;
}
However, next I have to accomplish a seemingly hard task for the GPU. At each iteration of the while-loop that calls the kernel, I have to sum all the values generated within odata and save the result in an int array called result, at the position within that array that corresponds to the current iteration. It has to be accomplished inside the kernel, or at least still on the GPU, because due to performance constraints I can only retrieve the result array at the very end, after all iterations are completed.
A wrong, naïve attempt would look something like the following:
int max_iterations = 1000;
int iteration = 0;
while (iteration < max_iterations)
{
    calcKernel<<<blocks, threads>>>(idata, odata, n);

    // Naive (and wrong): odata lives in device memory, and this
    // serial sum would have to run on the host every iteration
    result[iteration] = 0;
    for (int j = 0; j < n; j++)
    {
        result[iteration] += odata[j];
    }
    iteration++;
}
Of course, the code above does not work, due to the GPU distributing the code across threads. In order to learn how to properly do that, I have been reading other questions here on the site about array reduction using CUDA. In particular, I found a mention of a very good NVIDIA PDF about the subject, which is also discussed in the former SO question I mentioned at the beginning: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
However, while I fully understand the steps of the code described in those slides, as well as the general optimizations, I don't get how that approach can sum-reduce an array to one number if the code actually outputs a full array (and one of unclear dimensions). Could someone please shed some light on this and show me an example of how that would work (i.e. how to get the one number out of the output array)?
Now, going back to the question I mentioned at the beginning (Block reduction in CUDA). Note that its accepted answer merely suggests reading the PDF I linked above, which does not talk about what to do with the output array generated by the code. In the comments, the OP there mentions that he/she was able to finish the job by summing the output array on the CPU, which is something I cannot do, since that would mean downloading the output array on every iteration of my while-loop. Lastly, the third answer in that link suggests the use of a library to accomplish this, but I am interested in learning the native way of doing so.
Alternatively, I would also be very interested in any other suggestions about how to implement what I described above.
You have already found the canonical information regarding block-parallel reductions, so I will not repeat that. If you don't want to write a lot of new code yourself to do this, I would suggest looking at the CUB library's cub::BlockReduce implementation, which provides an optimal block-wise reduction operation with the addition of about four lines of code to your existing kernel.
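As a rough illustration, here is a minimal sketch of what that could look like, assuming a fixed block size of 256 threads and adapting the calcKernel from your question (the kernel name and parameters are just the ones used above, not anything CUB prescribes):

#include <cub/cub.cuh>

__global__ void calcKernel(const int* idata, int* odata, int* result, int iter_num, int n)
{
    // CUB needs a per-block scratch area in shared memory
    typedef cub::BlockReduce<int, 256> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int val = 0;
    if (i < n)
    {
        val = (idata[i] + 2) * 5;
        odata[i] = val;
    }

    // Sum val across the block; the aggregate is only valid in thread 0
    int block_sum = BlockReduce(temp_storage).Sum(val);

    if (threadIdx.x == 0)
        atomicAdd(&result[iter_num], block_sum);
}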
On the real question here, you can do what you want if you do something like this:
__global__ void kernel(....., int* iter_result, int iter_num)
{
    // Your calculations first, so that each thread holds its result

    // Block-wise reduction, so that one thread in each block holds the
    // sum of the thread results for that block

    // The one thread holding the block result adds it to the global
    // iteration result
    if (threadIdx.x == 0)
        atomicAdd(iter_result + iter_num, block_result);
}
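Since you asked about the native way of doing this, a self-contained version of that pattern could look as follows. This is only a sketch of the classic shared-memory tree reduction from the NVIDIA slides, and it assumes the block size is a power of two (the name sumKernel and the parameters are carried over from your question, not fixed by anything):

__global__ void sumKernel(const int* idata, int* odata, int* result, int iter_num, int n)
{
    // Dynamically sized shared memory: one int per thread in the block
    extern __shared__ int sdata[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Step 1: each thread computes its own value (0 for out-of-range threads)
    int val = 0;
    if (i < n)
    {
        val = (idata[i] + 2) * 5;
        odata[i] = val;
    }
    sdata[tid] = val;
    __syncthreads();

    // Step 2: tree reduction in shared memory (assumes blockDim.x is a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Step 3: one thread per block atomically folds the block sum
    // into this iteration's slot of the result array
    if (tid == 0)
        atomicAdd(&result[iter_num], sdata[0]);
}

Launched as sumKernel<<<blocks, threads, threads * sizeof(int)>>>(...), the atomicAdd replaces the second reduction pass the slides describe: instead of writing one partial sum per block into an output array and reducing that array again, each block folds its partial sum straight into result[iter_num]. That is also the answer to how you get the one number out of the per-block output array.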
The key here is that an atomic function is used to safely update the kernel run result with the results from a given block, without a memory race. You absolutely must initialise iter_result before running the kernel, otherwise the code won't work, but that is the basic kernel design pattern.
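For the sake of illustration, the host side could look something like this, with the result array zeroed once up front and copied back only once after the loop; the variable names here are my own assumptions, and n is taken to be defined as in your question:

int h_result[1000];                 // host copy, filled only at the end
int *d_idata, *d_odata, *d_result;
cudaMalloc(&d_idata,  n * sizeof(int));
cudaMalloc(&d_odata,  n * sizeof(int));
cudaMalloc(&d_result, max_iterations * sizeof(int));

// Mandatory initialisation of the per-iteration results
cudaMemset(d_result, 0, max_iterations * sizeof(int));

int threads = 256;
int blocks  = (n + threads - 1) / threads;

for (int iteration = 0; iteration < max_iterations; iteration++)
{
    sumKernel<<<blocks, threads, threads * sizeof(int)>>>(
        d_idata, d_odata, d_result, iteration, n);
}

// Single device-to-host transfer after all iterations, as required
cudaMemcpy(h_result, d_result, max_iterations * sizeof(int),
           cudaMemcpyDeviceToHost);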