OpenCL float sum reduction

Question

I would like to apply a reduce on this piece of my kernel code (1 dimensional data):

__local float sum = 0;
int i;
for(i = 0; i < length; i++)
  sum += //some operation depending on i here;

Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and at the end having 1 thread to make the total sum.

In pseudo code, I would like to able to write something like this:

int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
  res = sum;

Is there a way?

I have a race condition on sum.

Bruce Dean · Accepted Answer

To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.

Keep in mind that the kernel below returns a number of partial sums: one for each local work group, back to the host. This means that you will have to carry out the final sum by adding up all the partial sums, back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.

If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below you will need to modify your host code to call the kernel repeatedly and then include logic to stop executing the kernel after the number of output vectors falls below the local size (details left to you or check the Scarpino reference).

EDIT: Added extra kernel argument for the output. Added dot product to sum over the float 4 vectors.

__kernel void reduction_vector(__global float4* data,__local float4* partial_sums, __global float* output) 
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);
    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}

Craig Chapman · Answer

I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.

The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in each iteration where the number of elements is not even. Yet it is essentially the same code as can be found in several search results.

I scratched my head on this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also it's tricky if not impossible to debug on the GPU. Eventually though, I found an answer which worked.

This is a combination of the answer by Bruce, and that from Adam. It copies the source from global memory into local, but then reduces by folding the top half onto the bottom repeatedly, until there is no data left.

The result is a buffer containing the same number of items as there are work-groups used (so that very large reductions can be broken down), which must be summed by the CPU, or else call from another kernel and do this last step on the GPU.

This part is a little over my head, but I believe, this code also avoids bank switching issues by reading from local memory essentially sequentially. ** Would love confirmation on that from anyone that knows.

Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and the fourth line of code where it's used as part of an array index...

__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
        const size_t globalId = get_global_id(0);
        const size_t localId = get_local_id(0);
        target[localId] = A[globalId+AOffset];

        barrier(CLK_LOCAL_MEM_FENCE);
        size_t blockSize = get_local_size(0);
        size_t halfBlockSize = blockSize / 2;
        while (halfBlockSize>0) {
            if (localId<halfBlockSize) {
                target[localId] += target[localId + halfBlockSize];
                if ((halfBlockSize*2)<blockSize) { // uneven block division
                    if (localId==0) { // when localID==0
                        target[localId] += target[localId + (blockSize-1)];
                    }
                }
            }
            barrier(CLK_LOCAL_MEM_FENCE);
            blockSize = halfBlockSize;
            halfBlockSize = blockSize / 2;
        }
        if (localId==0) {
            output[get_group_id(0)] = target[0];
        }
    }

https://pastebin.com/xN4yQ28N

Timur Magomedov · Answer

You can use new work_group_reduce_add() function for sum reduction inside single work group if you have support for OpenCL C 2.0 features

OpenCL float sum reduction

Tags:

multithreading

parallel-processing

race-condition

opencl

reduction

Kami

3 Answers

Bruce Dean

Craig Chapman

Timur Magomedov

Recent Activity

Donate For Us

OpenCL float sum reduction

Tags:

multithreading

parallel-processing

race-condition

opencl

reduction

Kami

3 Answers

Bruce Dean

Craig Chapman

Timur Magomedov

Related questions

Recent Activity

Donate For Us