Strategy for doing final reduction

I am trying to implement an OpenCL reduction that sums an array of floating-point values.

To do this, I took the following code snippet, which I found on the web:

__kernel void sumGPU(__global const double *input,
                     __global double *partialSums,
                     __local double *localSums)
{
    uint local_id = get_local_id(0);
    uint group_size = get_local_size(0);

    // Copy from global memory to local memory
    localSums[local_id] = input[get_global_id(0)];

    // Tree reduction inside the work group: halve the stride at each step
    for (uint stride = group_size / 2; stride > 0; stride /= 2)
    {
        // Wait until every work-item has finished the previous step
        barrier(CLK_LOCAL_MEM_FENCE);

        // Split the active range in two halves and add the elements
        // pairwise: localSums[local_id] and localSums[local_id + stride]
        if (local_id < stride)
            localSums[local_id] += localSums[local_id + stride];
    }

    // Write the work group's result into partialSums[nWorkGroups]
    if (local_id == 0)
        partialSums[get_group_id(0)] = localSums[0];
}

This kernel works well, but I would like to compute the final sum by adding up the partial sums of all work groups. Currently, I do this final step on the CPU with a simple loop of nWorkGroups iterations.
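For reference, a minimal sketch of such a host-side loop in plain C. The function name sum_partial_sums is an assumption made for illustration; partialSums is assumed to have already been copied back from the device buffer written by the kernel, and nWorkGroups is the number of work groups.

```c
#include <stddef.h>

/* Hypothetical helper: adds up the per-work-group results on the CPU.
   partialSums is assumed to hold nWorkGroups values that were already
   read back from the device buffer filled by the sumGPU kernel. */
static double sum_partial_sums(const double *partialSums, size_t nWorkGroups)
{
    double total = 0.0;
    for (size_t i = 0; i < nWorkGroups; ++i)
        total += partialSums[i];
    return total;
}
```

Called on the four sample values 1.5, 2.25, 3.0 and 4.25, this returns 11.0. The loop runs once per work group, so its cost is usually small compared with the reduction over the full input array.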

I also saw another solution based on atomic functions, but they seem to be implemented for int, not for float. I think that only CUDA provides atomic functions for float.

I also saw that I could write another kernel that performs this summation, but I would like to avoid that in order to keep the source simple and readable. Maybe I cannot do without it...

I should mention that I use OpenCL 1.2 (as reported by clinfo) on a Radeon HD 7970 (Tahiti, 3 GB); I think OpenCL 2.0 is not supported on my card.

More generally, I would like advice on the simplest way to perform this final summation with my graphics card and OpenCL 1.2.

asked Apr 27 '16 02:04

youpilat13

1 Answer

If the magnitude of the sum stays below the exa scale (so that its integer part fits in a 64-bit long), then:

Instead of

if (local_id == 0)
  partialSums[get_group_id(0)] = localSums[0];

You could use

if (local_id == 0)
{
    if (strategy == ATOMIC)
    {
        // atom_add on long needs the cl_khr_int64_base_atomics extension
        long integer_part = getIntegerPart(localSums[0]);
        atom_add(&totalSumIntegerPart[0], integer_part);
        // multiply by 1000000 to keep six fractional digits as an integer
        long float_part = 1000000 * getFloatPart(localSums[0]);
        atom_add(&totalSumFloatPart[0], float_part);
    }
}

The accumulated fractional part will usually grow beyond 1000000, so when you divide it by 1000000 in another kernel, the quotient has an integer part of its own. Take that integer part and add it to the accumulated integer part:

float value = 0;
if (strategy == ATOMIC)
{
    float float_part = getFloatPart_(totalSumFloatPart[0]);
    float integer_part = getIntegerPart_(totalSumFloatPart[0])
                       + totalSumIntegerPart[0];
    value = integer_part + float_part;
}

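Just a few atomic operations (two per work group) should have no noticeable effect on the total kernel time.

Most of these get...Part helpers are easy to write with floor and similar functions. The ones applied to the accumulated fractional part (getFloatPart_ and getIntegerPart_) also need a division by 1000000.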
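The split-and-recombine arithmetic above can be sketched in plain C as follows. This is only a host-side emulation under stated assumptions: the helper bodies, the FixedSum struct and the names addPartial and combine are hypothetical, the accumulators are updated sequentially where the kernel would call atom_add, and double is used as in the question's kernel.

```c
#include <stdint.h>

#define FRACTION_SCALE 1000000  /* same 1000000 factor as in the answer */

/* Assumed helper body: largest integer <= x. Written without <math.h>;
   in OpenCL C the built-in floor() could be used instead. */
static int64_t getIntegerPart(double x)
{
    int64_t i = (int64_t)x;            /* truncates toward zero */
    return ((double)i > x) ? i - 1 : i;
}

/* Assumed helper body: fractional part, always in [0, 1). */
static double getFloatPart(double x)
{
    return x - (double)getIntegerPart(x);
}

/* The two integer accumulators the kernel would update with atom_add. */
typedef struct {
    int64_t integerPart;
    int64_t scaledFraction;
} FixedSum;

/* Emulates what one work group contributes with its partial sum. */
static void addPartial(FixedSum *acc, double partial)
{
    acc->integerPart    += getIntegerPart(partial);
    acc->scaledFraction += (int64_t)(FRACTION_SCALE * getFloatPart(partial));
}

/* Recombine: carry the whole units that piled up in the scaled
   fraction over into the integer part, then add the remainder. */
static double combine(const FixedSum *acc)
{
    int64_t carry     = acc->scaledFraction / FRACTION_SCALE;
    int64_t remainder = acc->scaledFraction % FRACTION_SCALE;
    return (double)(acc->integerPart + carry)
         + (double)remainder / FRACTION_SCALE;
}
```

Adding the partial sums 1.5, 2.75 and 3.25 this way leaves 6 in the integer accumulator and 1500000 in the scaled fraction, which recombine to 7.5. Digits beyond the sixth decimal place of each partial sum are truncated, so the result is an approximation whenever the partial sums carry more precision than that.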

answered Sep 25 '22 13:09

huseyin tugrul buyukisik

Tags: arrays, c, opencl, reduction