
Implementing Max Reduce in Cuda

Tags:

parallel-processing

cuda

I've been learning CUDA and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel:

__global__ void max_reduce(const float* const d_array,
                     float* d_max,
                     const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;

    if (gid < elements)
        shared[tid] = d_array[gid];
    __syncthreads();

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) 
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);
        __syncthreads();
    }

    if (gid == 0)
        *d_max = shared[tid];
}

I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.

To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same in the kernel but only the min reduce matches up.

Is there something obvious I'm missing/doing wrong?


asked Jun 28 '13 18:06

CurtisJC

1 Answer

Your main conclusion in your deleted answer was correct: the kernel you posted doesn't account for the fact that, when it finishes, a good deal of the overall reduction is done but the result is not yet complete. The results of each block must still be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:

#include <cfloat>   // FLT_MAX

__device__ float atomicMaxf(float* address, float val)
{
    int *address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    // retry the compare-and-swap while val is still larger than the stored value
    while (val > __int_as_float(old)) {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val));
    }
    return __int_as_float(old);
}


__global__ void max_reduce(const float* const d_array, float* d_max,
                           const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;
    shared[tid] = -FLT_MAX;  // 1

    if (gid < elements)
        shared[tid] = d_array[gid];
    __syncthreads();

    for (unsigned int s=blockDim.x/2; s>0; s>>=1)
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);  // 2
        __syncthreads();
    }
    // what to do now? (pick ONE of the two options below, not both)
    // option 1: save block result and launch another kernel
    if (tid == 0)
        d_max[blockIdx.x] = shared[tid]; // 3
    // option 2: use atomics
    if (tid == 0)
        atomicMaxf(d_max, shared[0]);
}

1. As Pavan indicated, you need to initialize your shared memory array. The last block launched may not be a "full" block if gridDim.x*blockDim.x is greater than elements.
2. In this line, even though we check that the operating thread's global index (gid) is less than elements, adding s to tid to index into shared memory can still reach beyond the legitimate values copied into shared memory in the last block. Therefore we need the shared memory initialization indicated in note 1.
3. As you already discovered, your last line was not correct. Each block produces its own result, and we must combine them somehow. One method you might consider, if the number of blocks launched is small (more on this later), is to use atomics. Normally we steer people away from atomics since they are "costly" in terms of execution time. However, the other option is to save each block result in global memory, finish the kernel, and then launch another kernel to combine the individual block results. If I launched a large number of blocks initially (say more than 1024), then following this methodology I might end up launching two additional kernels. Thus the consideration of atomics. There is no native atomicMax function for floats, but as indicated in the documentation, you can use atomicCAS to build any arbitrary atomic function, and I have provided an example of that in atomicMaxf, which provides an atomic max for float.
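For comparison, option 1 (store one partial result per block, then launch again) might be sketched as follows. This is a minimal sketch and not code from the answer: the names block_max and two_pass_max are hypothetical, the block size of 256 is an assumed power of two, and error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cfloat>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Each block reduces its slice of d_in and writes one partial max to d_out.
__global__ void block_max(const float* d_in, float* d_out, size_t n)
{
    extern __shared__ float shared[];
    unsigned int tid = threadIdx.x;
    size_t gid = (size_t)blockDim.x * blockIdx.x + tid;

    // Out-of-range slots get -FLT_MAX so a partial last block still reduces correctly.
    shared[tid] = (gid < n) ? d_in[gid] : -FLT_MAX;
    __syncthreads();

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            shared[tid] = fmaxf(shared[tid], shared[tid + s]);
        __syncthreads();
    }
    if (tid == 0)
        d_out[blockIdx.x] = shared[0];
}

// Hypothetical host helper: relaunch block_max on the partial results
// until a single value remains.
float two_pass_max(const std::vector<float>& h)
{
    if (h.empty()) return -FLT_MAX;
    const unsigned int threads = 256;            // assumed power of two
    size_t n = h.size();
    size_t blocks = (n + threads - 1) / threads;

    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, blocks * sizeof(float));
    cudaMemcpy(d_a, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    float* src = d_a;
    float* dst = d_b;
    while (true) {
        blocks = (n + threads - 1) / threads;
        block_max<<<(unsigned int)blocks, threads, threads * sizeof(float)>>>(src, dst, n);
        if (blocks == 1) break;                  // dst[0] now holds the overall max
        n = blocks;                              // next pass reduces the partial results
        std::swap(src, dst);
    }

    float result = -FLT_MAX;
    cudaMemcpy(&result, dst, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

With 256 threads per block, an input of 100,000 elements takes three launches here (391 blocks, then 2, then 1), which is the extra launch overhead the atomic approach tries to avoid.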

But is running 1024 or more atomic functions (one per block) the best way? Probably not.

When launching kernels of threadblocks, we really only need to launch enough threadblocks to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit from a machine utilization standpoint to launching thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this; it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?

#include <cfloat>   // FLT_MAX

__device__ float atomicMaxf(float* address, float val)
{
    int *address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    // retry the compare-and-swap while val is still larger than the stored value
    while (val > __int_as_float(old)) {
        assumed = old;
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val));
    }
    return __int_as_float(old);
}


__global__ void max_reduce(const float* const d_array, float* d_max,
                           const size_t elements)
{
    extern __shared__ float shared[];

    int tid = threadIdx.x;
    int gid = (blockDim.x * blockIdx.x) + tid;
    shared[tid] = -FLT_MAX;

    // grid-stride loop: each thread folds many elements into its shared slot
    while (gid < elements) {
        shared[tid] = max(shared[tid], d_array[gid]);
        gid += gridDim.x*blockDim.x;
    }
    __syncthreads();
    gid = (blockDim.x * blockIdx.x) + tid;  // 1
    for (unsigned int s=blockDim.x/2; s>0; s>>=1)
    {
        if (tid < s && gid < elements)
            shared[tid] = max(shared[tid], shared[tid + s]);
        __syncthreads();
    }

    // one atomic per block combines the block results
    if (tid == 0)
        atomicMaxf(d_max, shared[0]);
}

With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements). Instead we launch a fixed number of blocks (say, 128; you can modify this number to find out what runs fastest) and let each threadblock (and thus the entire grid) loop through memory, accumulating a partial max into each element of shared memory. Then, in the line marked with comment 1, we must reset the gid variable to its initial value. This is actually unnecessary, and the block reduction loop code can be further simplified, if we guarantee that the size of the grid (gridDim.x*blockDim.x) is less than elements, which is not difficult to do at kernel launch.

Note that when using this atomic method, it's necessary to initialize the result (*d_max in this case) to an appropriate value before the launch, like -FLT_MAX (the most negative finite float, from <cfloat>).
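Putting the launch-side pieces together, a host wrapper for the fixed-grid approach might look like the sketch below. It is an illustration under stated assumptions, not the answer's code: fixed_grid_max and max_reduce_grid are hypothetical names, 8 blocks per SM and 256 threads per block are just the rule-of-thumb values discussed above, and error checking is omitted. The kernel is a simplified variant that accumulates in a register and drops the gid reset and the gid check in the reduction loop, which is safe because every shared slot holds either real data or -FLT_MAX.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

// CAS-based atomic max for float, as in the answer.
__device__ float atomicMaxf(float* address, float val)
{
    int* address_as_int = (int*)address;
    int old = *address_as_int, assumed;
    while (val > __int_as_float(old)) {
        assumed = old;
        old = atomicCAS(address_as_int, assumed, __float_as_int(val));
    }
    return __int_as_float(old);
}

__global__ void max_reduce_grid(const float* d_array, float* d_max, size_t elements)
{
    extern __shared__ float shared[];
    unsigned int tid = threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;

    // Grid-stride loop: each thread folds its share of the input into a register.
    float local = -FLT_MAX;
    for (size_t gid = (size_t)blockDim.x * blockIdx.x + tid; gid < elements; gid += stride)
        local = fmaxf(local, d_array[gid]);
    shared[tid] = local;
    __syncthreads();

    // Block-level tree reduction; blockDim.x is assumed to be a power of two.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            shared[tid] = fmaxf(shared[tid], shared[tid + s]);
        __syncthreads();
    }
    if (tid == 0)
        atomicMaxf(d_max, shared[0]);            // one atomic per block
}

// Hypothetical host helper: sizes the grid from the SM count, not from the data size.
float fixed_grid_max(const std::vector<float>& h)
{
    if (h.empty()) return -FLT_MAX;
    const unsigned int threads = 256;
    int device = 0, sms = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, device);
    if (sms <= 0) sms = 16;                      // fallback guess if the query fails
    const unsigned int blocks = 8u * (unsigned int)sms;

    float *d_array = nullptr, *d_max = nullptr;
    cudaMalloc(&d_array, h.size() * sizeof(float));
    cudaMalloc(&d_max, sizeof(float));
    cudaMemcpy(d_array, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    const float init = -FLT_MAX;                 // the result must be initialized before launch
    cudaMemcpy(d_max, &init, sizeof(float), cudaMemcpyHostToDevice);

    max_reduce_grid<<<blocks, threads, threads * sizeof(float)>>>(d_array, d_max, h.size());

    float result = -FLT_MAX;
    cudaMemcpy(&result, d_max, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_array);
    cudaFree(d_max);
    return result;
}
```

The block count is the knob to tune: timing a few multiples of the SM count on the target GPU is a reasonable way to find what runs fastest.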

Again, we normally steer people away from atomic usage, but in this case it's worth considering if we manage it carefully, since it allows us to save the overhead of an additional kernel launch.

For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper, "Optimizing Parallel Reduction in CUDA", which is available with the relevant CUDA sample (reduction).


answered Oct 04 '22 07:10

Robert Crovella
