The following script was set up for benchmarking purposes. It computes the pairwise distances between N points using the Euclidean (L2) norm. Three different routines are implemented:

1. the scipy.spatial.distance.pdist function
2. a scipy.weave.inline solution
3. a pyCUDA solution

Here are the benchmark results on an i5-3470 (16 GB RAM) using a GTX 660 (2 GB RAM):
------------
Scipy Pdist
Execution time: 3.01975 s
First five elements: [ 0.74968684 0.71457213 0.833188 0.48084545 0.86407363]
Last five elements: [ 0.65717077 0.76850474 0.29652017 0.856179 0.56074625]
------------
Weave Inline
Execution time: 2.48705 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850474 0.29652017 0.856179 0.56074625]
------------
pyCUDA
CUDA clock timing: 0.713028930664
Execution time: 2.04364 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850468 0.29652017 0.856179 0.56074625]
------------
I am a bit disappointed by the pyCUDA performance. Since I am new to CUDA, there is probably something I am missing here. So where is the crux of the matter? Am I hitting the limits of global memory bandwidth? Is it a poor choice of block and grid sizes?
import numpy,time,math
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from scipy.spatial.distance import pdist
from scipy import weave
def weave_solution(x):
    """
    OpenMP powered weave inline.
    """
    N, DIM = numpy.shape(x)
    L = ((N-1)**2 + (N-1))/2          # number of unique (i, j) pairs with i < j
    solution = numpy.zeros(L).astype(numpy.float32)
    ncpu = 4
    weave_omp = {'headers'            : ['<omp.h>'],
                 'extra_compile_args' : ['-fopenmp'],
                 'extra_link_args'    : ['-lgomp']}
    code = \
    r'''
    omp_set_num_threads(ncpu);
    #pragma omp parallel
    {
        int j, d, pos;
        float r = 0.0;
        #pragma omp for
        for (int i=0; i<(N-1); i++){
            for (j=(i+1); j<N; j++){
                r = 0.0;
                for (d=0; d<DIM; d++){
                    r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
                }
                /* condensed index of pair (i, j) in the flat output */
                pos = (i*N+j) - (i*(i+1)/2) - i - 1;
                solution[pos] = sqrt(r);
            }
        }
    }
    '''
    weave.inline(code, ['x', 'N', 'DIM', 'solution', 'ncpu'], **weave_omp)
    return numpy.array(solution)

def scipy_solution(x):
    """
    SciPy high-level function
    """
    return pdist(x).astype(numpy.float32)

def cuda_solution(x):
    """
    pyCUDA
    """
    N, DIM = numpy.shape(x)
    N = numpy.int32(N)
    DIM = numpy.int32(DIM)
    L = ((N-1)**2 + (N-1))/2          # number of unique (i, j) pairs with i < j
    solution = numpy.zeros(L).astype(numpy.float32)
    start = drv.Event()
    end = drv.Event()
    mod = SourceModule("""
    __global__ void distance(float *x, int N, int DIM, float *solution){
        // one thread per point i; each thread loops over all j > i
        const int i = blockDim.x * blockIdx.x + threadIdx.x;
        int j, d, pos;
        float r = 0.0;
        if ( i < (N-1) ){
            for (j=(i+1); j<N; j++){
                r = 0.0;
                for (d=0; d<DIM; d++){
                    r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
                }
                pos = (i*N+j) - (i*(i+1)/2) - i - 1;
                solution[pos] = sqrt(r);
            }
        }
    }
    """)
    func = mod.get_function("distance")
    start.record()
    func(drv.In(x), N, DIM, drv.Out(solution), block=(192,1,1), grid=(192,1))
    end.record()
    end.synchronize()
    secs = start.time_till(end)*1e-3
    print "CUDA clock timing: ", secs
    return solution

if __name__ == '__main__':

    # Set up data points
    N = 25000
    DIM = 3
    x = numpy.random.rand(N, DIM).astype(numpy.float32)

    print "-"*12
    # Scipy solution
    print "Scipy Pdist"
    stime = time.time()
    spsolution = scipy_solution(x)
    stime = time.time() - stime
    print "Execution time: {0:.5f} s".format(stime)
    print "First five elements:", spsolution[:5]
    print "Last five elements:", spsolution[-5:]
    print "-"*12

    # Weave solution
    print "Weave Inline"
    wtime = time.time()
    wsolution = weave_solution(x)
    wtime = time.time() - wtime
    print "Execution time: {0:.5f} s".format(wtime)
    print "First five elements:", wsolution[:5]
    print "Last five elements:", wsolution[-5:]
    print "-"*12

    # pyCUDA solution
    print "pyCUDA"
    ctime = time.time()
    csolution = cuda_solution(x)
    ctime = time.time() - ctime
    print "Execution time: {0:.5f} s".format(ctime)
    print "First five elements:", csolution[:5]
    print "Last five elements:", csolution[-5:]
    print "-"*12
Edit:
I have added the hash-bang line
#!/usr/bin/env python
at the top of the file and made it executable. After commenting out the weave.inline and scipy.spatial.distance.pdist computations, the NVIDIA Visual Profiler reports the following results:
Right now you launch 192 x 192 threads, and each active thread (i < N-1) updates up to N-1 positions of the output; you could easily launch more blocks/threads and give each thread less work. Instead of the loop for (j=(i+1); j<N; j++){...}, launch one thread per (i, j) pair so that each thread runs just the body of that inner loop.
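For example, a minimal sketch of one way to do this (the kernel name distance_pair, the helper name cuda_pair_solution and the 16x16 block shape are my illustrative choices, not from the question): every thread handles exactly one (i, j) pair, so the j loop disappears from the kernel.

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

def cuda_pair_solution(x):
    N, DIM = numpy.shape(x)
    N = numpy.int32(N)
    DIM = numpy.int32(DIM)
    L = ((N-1)**2 + (N-1))/2
    solution = numpy.zeros(L).astype(numpy.float32)
    mod = SourceModule("""
    __global__ void distance_pair(const float *x, int N, int DIM, float *solution){
        // one thread per (i, j) pair; threads with j <= i or out of range simply exit
        const int i = blockDim.x * blockIdx.x + threadIdx.x;
        const int j = blockDim.y * blockIdx.y + threadIdx.y;
        if (i < N-1 && j > i && j < N){
            float r = 0.0f;
            for (int d = 0; d < DIM; d++){
                float diff = x[i*DIM+d] - x[j*DIM+d];
                r += diff*diff;
            }
            int pos = (i*N+j) - (i*(i+1)/2) - i - 1;   // same condensed index as before
            solution[pos] = sqrtf(r);
        }
    }
    """)
    func = mod.get_function("distance_pair")
    bs = 16                                  # 16x16 = 256 threads per block
    gs = (int(N) + bs - 1) // bs             # enough blocks to cover all points per axis
    func(drv.In(x), N, DIM, drv.Out(solution),
         block=(bs, bs, 1), grid=(gs, gs))
    return solution

About half of the launched threads do nothing (j <= i), but that is usually cheaper than having a handful of threads iterate over thousands of j values each.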
If you want to take it further, you could use N-1 * DIM threads, each executing just the single statement of the innermost loop, store the partial results in shared memory, and finally do a reduction over them. See Optimizing Parallel Reduction in CUDA.
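A rough sketch of that idea (illustrative only; with DIM = 3 the reduction is trivial and launch overhead will likely dominate, and about half of the blocks are idle). It assumes the same x, N, DIM and solution as in the question's cuda_solution, with one block per (i, j) pair and one thread per dimension:

mod = SourceModule("""
__global__ void distance_reduce(const float *x, int N, int DIM, float *solution){
    extern __shared__ float sq[];               // DIM floats of dynamic shared memory
    const int i = blockIdx.x;                   // one block per (i, j) pair
    const int j = blockIdx.y;
    const int d = threadIdx.x;                  // one thread per dimension
    if (i >= N-1 || j <= i || j >= N) return;   // whole block exits together
    float diff = x[i*DIM+d] - x[j*DIM+d];
    sq[d] = diff*diff;
    __syncthreads();
    if (d == 0){
        // naive serial sum; see "Optimizing Parallel Reduction in CUDA"
        // for tree-based reductions that pay off at larger DIM
        float r = 0.0f;
        for (int k = 0; k < DIM; k++) r += sq[k];
        solution[(i*N+j) - (i*(i+1)/2) - i - 1] = sqrtf(r);
    }
}
""")
func = mod.get_function("distance_reduce")
func(drv.In(x), N, DIM, drv.Out(solution),
     block=(int(DIM), 1, 1), grid=(int(N), int(N)),
     shared=int(DIM) * 4)                       # 4 bytes of shared memory per float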
Looking at this line:
r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
the memory access pattern is not uniform and not coalesced. I also do not know whether nvcc will optimize the expression down to two memory transactions instead of the four written here, since I do not know whether pyCUDA passes -O3 to nvcc. Put (x[i*DIM+d]-x[j*DIM+d]) into a register variable to be sure, and square it yourself.
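Something along these lines (a hedged rewrite of the inner loop from the question's kernel):

for (d = 0; d < DIM; d++){
    float diff = x[i*DIM+d] - x[j*DIM+d];   // each value is read from global memory once
    r += diff*diff;                         // square the register instead of re-reading
}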
You can also try putting #pragma unroll before each for loop to unroll them where possible.
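For instance (unrolling pays off most when the trip count is known at compile time, so hard-coding DIM = 3 here is an assumption on my side):

#pragma unroll
for (int d = 0; d < 3; d++){                // DIM hard-coded to 3 so nvcc can fully unroll
    float diff = x[i*3+d] - x[j*3+d];
    r += diff*diff;
}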