
CUDA Dot Product

I'm trying to implement the classic dot-product kernel for double-precision arrays, with atomic accumulation of the final sum across the various blocks. I used the atomicAdd for double precision as stated on page 116 of the programming guide. I'm probably doing something wrong. The partial sums within each block are computed correctly, but afterwards the atomic operation doesn't seem to work properly, since every time I run my kernel with the same data I receive different results. I'll be grateful if somebody could spot the mistake or provide an alternative solution! Here is my kernel:

__global__ void cuda_dot_kernel(int *n,double *a, double *b, double *dot_res)
{
    __shared__ double cache[threadsPerBlock]; //thread shared memory
    int global_tid=threadIdx.x + blockIdx.x * blockDim.x;
    int i=0,cacheIndex=0;
    double temp = 0;
    cacheIndex = threadIdx.x;
    while (global_tid < (*n)) {
        temp += a[global_tid] * b[global_tid];
        global_tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;
    __syncthreads();
    for (i=blockDim.x/2; i>0; i>>=1) {
        if (threadIdx.x < i) {
            cache[threadIdx.x] += cache[threadIdx.x + i];
        }
        __syncthreads();
    }
    __syncthreads();
    if (cacheIndex==0) {
        *dot_res=cuda_atomicAdd(dot_res,cache[0]);
    }
}

And here is my device function atomicAdd:

__device__ double cuda_atomicAdd(double *address, double val)
{
    double assumed,old=*address;
    do {
        assumed=old;
        old= __longlong_as_double(atomicCAS((unsigned long long int*)address,
                    __double_as_longlong(assumed),
                    __double_as_longlong(val+assumed)));
    }while (assumed!=old);

    return old;
}
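A note on the code above: the CAS loop itself matches the double-precision atomicAdd pattern from the programming guide, but the kernel's last line both atomically adds to `*dot_res` and then overwrites it with a plain (non-atomic) store of the returned old value, so blocks race with one another. A sketch of a corrected epilogue (this assumes `*dot_res` is zeroed before each launch and `blockDim.x` is a power of two, which the reduction loop already requires):

```cuda
    // Let the atomic operation perform the update; discard the returned
    // old value instead of storing it back non-atomically.
    if (cacheIndex == 0) {
        cuda_atomicAdd(dot_res, cache[0]);
    }
```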
asked Feb 26 '12 by tim_chemeng

1 Answer

Getting a reduction right using ad hoc CUDA code can be tricky, so here's an alternative solution using a Thrust algorithm, which is included with the CUDA Toolkit:

#include <thrust/inner_product.h>
#include <thrust/device_ptr.h>

double do_dot_product(int n, double *a, double *b)
{
  // wrap raw pointers to device memory with device_ptr
  thrust::device_ptr<double> d_a(a), d_b(b);

  // inner_product implements a mathematical dot product
  return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}
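A minimal usage sketch for the function above (hypothetical host code; assumes a CUDA-capable device and that `do_dot_product` is declared in scope):

```cuda
#include <thrust/device_vector.h>
#include <cstdio>

int main()
{
  // Example data: n elements, all ones, so the dot product should equal n.
  int n = 1 << 20;
  thrust::device_vector<double> a(n, 1.0), b(n, 1.0);

  // Pass raw device pointers, as do_dot_product expects.
  double result = do_dot_product(n,
                                 thrust::raw_pointer_cast(a.data()),
                                 thrust::raw_pointer_cast(b.data()));
  printf("dot = %f\n", result);
  return 0;
}
```

Because Thrust handles the reduction internally, this also sidesteps the need for a hand-rolled double-precision atomicAdd entirely.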
answered Sep 17 '22 by Jared Hoberock