The following script was set up for benchmarking purposes. It computes the pairwise distances between N points using the Euclidean (L2) norm. Three different routines are implemented:

1. the scipy.spatial.distance.pdist function
2. a scipy.weave.inline solution
3. a pyCUDA solution

Here are the benchmark results on an i5-3470 (16 GB RAM) using a GTX 660 (2 GB RAM):
------------
Scipy Pdist
Execution time: 3.01975 s
First five elements: [ 0.74968684 0.71457213 0.833188 0.48084545 0.86407363]
Last five elements: [ 0.65717077 0.76850474 0.29652017 0.856179 0.56074625]
------------
Weave Inline
Execution time: 2.48705 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850474 0.29652017 0.856179 0.56074625]
------------
pyCUDA
CUDA clock timing: 0.713028930664
Execution time: 2.04364 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850468 0.29652017 0.856179 0.56074625]
------------
I am a bit disappointed by the pyCUDA performance. Since I am new to CUDA, there is probably something I am missing here. So where is the crux of the matter? Am I hitting the limits of global memory bandwidth? Is it a poor choice of block and grid sizes?
import numpy,time,math
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from scipy.spatial.distance import pdist
from scipy import weave
def weave_solution(x):
    """
    OpenMP powered weave inline.
    """
    N, DIM = numpy.shape(x)
    L = ((N-1)**2 + (N-1))/2          # number of unique (i, j) pairs with i < j
    solution = numpy.zeros(L).astype(numpy.float32)
    ncpu = 4
    weave_omp = {'headers'            : ['<omp.h>'],
                 'extra_compile_args' : ['-fopenmp'],
                 'extra_link_args'    : ['-lgomp']}
    code = \
    r'''
    omp_set_num_threads(ncpu);
    #pragma omp parallel
    {
        int j, d, pos;
        float r = 0.0;
        #pragma omp for
        for (int i=0; i<(N-1); i++){
            for (j=(i+1); j<N; j++){
                r = 0.0;
                for (d=0; d<DIM; d++){
                    r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
                }
                /* condensed index of pair (i, j) in the flat output */
                pos = (i*N+j) - (i*(i+1)/2) - i - 1;
                solution[pos] = sqrt(r);
            }
        }
    }
    '''
    weave.inline(code, ['x', 'N', 'DIM', 'solution', 'ncpu'], **weave_omp)
    return numpy.array(solution)

def scipy_solution(x):
    """
    SciPy high-level function
    """
    return pdist(x).astype(numpy.float32)

def cuda_solution(x):
    """
    pyCUDA
    """
    N, DIM = numpy.shape(x)
    N = numpy.int32(N)
    DIM = numpy.int32(DIM)
    L = ((N-1)**2 + (N-1))/2          # number of unique (i, j) pairs with i < j
    solution = numpy.zeros(L).astype(numpy.float32)
    start = drv.Event()
    end = drv.Event()
    mod = SourceModule("""
    __global__ void distance(float *x, int N, int DIM, float *solution){
        // one thread per point i; each thread loops over all j > i
        const int i = blockDim.x * blockIdx.x + threadIdx.x;
        int j, d, pos;
        float r = 0.0;
        if ( i < (N-1) ){
            for (j=(i+1); j<N; j++){
                r = 0.0;
                for (d=0; d<DIM; d++){
                    r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
                }
                pos = (i*N+j) - (i*(i+1)/2) - i - 1;
                solution[pos] = sqrt(r);
            }
        }
    }
    """)
    func = mod.get_function("distance")
    start.record()
    func(drv.In(x), N, DIM, drv.Out(solution), block=(192,1,1), grid=(192,1))
    end.record()
    end.synchronize()
    secs = start.time_till(end)*1e-3
    print "CUDA clock timing: ", secs
    return solution

if __name__ == '__main__':

    # Set up data points
    N = 25000
    DIM = 3
    x = numpy.random.rand(N, DIM).astype(numpy.float32)

    print "-"*12
    # Scipy solution
    print "Scipy Pdist"
    stime = time.time()
    spsolution = scipy_solution(x)
    stime = time.time() - stime
    print "Execution time: {0:.5f} s".format(stime)
    print "First five elements:", spsolution[:5]
    print "Last five elements:", spsolution[-5:]
    print "-"*12

    # Weave solution
    print "Weave Inline"
    wtime = time.time()
    wsolution = weave_solution(x)
    wtime = time.time() - wtime
    print "Execution time: {0:.5f} s".format(wtime)
    print "First five elements:", wsolution[:5]
    print "Last five elements:", wsolution[-5:]
    print "-"*12

    # pyCUDA solution
    print "pyCUDA"
    ctime = time.time()
    csolution = cuda_solution(x)
    ctime = time.time() - ctime
    print "Execution time: {0:.5f} s".format(ctime)
    print "First five elements:", csolution[:5]
    print "Last five elements:", csolution[-5:]
    print "-"*12
Edit:
I have added the hash-bang line
#!/usr/bin/env python
at the top of the file and made it executable. After commenting out the weave.inline and scipy.spatial.distance.pdist computations, the NVIDIA Visual Profiler reports the following results:
Right now you launch 192 x 192 threads, and each active thread (i < N-1) updates up to N-1 positions of the output; you could easily launch more blocks/threads and give each thread less work. Instead of the loop for (j=(i+1); j<N; j++){...}, launch one thread per (i, j) pair so that each thread runs just the body of that inner loop.
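For example, a minimal sketch of one way to do this (the kernel name distance_pair, the helper name cuda_pair_solution and the 16x16 block shape are my illustrative choices, not from the question): every thread handles exactly one (i, j) pair, so the j loop disappears from the kernel.

import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

def cuda_pair_solution(x):
    N, DIM = numpy.shape(x)
    N = numpy.int32(N)
    DIM = numpy.int32(DIM)
    L = ((N-1)**2 + (N-1))/2
    solution = numpy.zeros(L).astype(numpy.float32)
    mod = SourceModule("""
    __global__ void distance_pair(const float *x, int N, int DIM, float *solution){
        // one thread per (i, j) pair; threads with j <= i or out of range simply exit
        const int i = blockDim.x * blockIdx.x + threadIdx.x;
        const int j = blockDim.y * blockIdx.y + threadIdx.y;
        if (i < N-1 && j > i && j < N){
            float r = 0.0f;
            for (int d = 0; d < DIM; d++){
                float diff = x[i*DIM+d] - x[j*DIM+d];
                r += diff*diff;
            }
            int pos = (i*N+j) - (i*(i+1)/2) - i - 1;   // same condensed index as before
            solution[pos] = sqrtf(r);
        }
    }
    """)
    func = mod.get_function("distance_pair")
    bs = 16                                  # 16x16 = 256 threads per block
    gs = (int(N) + bs - 1) // bs             # enough blocks to cover all points per axis
    func(drv.In(x), N, DIM, drv.Out(solution),
         block=(bs, bs, 1), grid=(gs, gs))
    return solution

About half of the launched threads do nothing (j <= i), but that is usually cheaper than having a handful of threads iterate over thousands of j values each.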
If you want to take it further, you could use N-1 * DIM threads, each executing just the single statement of the innermost loop, store the partial results in shared memory, and finally do a reduction over them. See Optimizing Parallel Reduction in CUDA.
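A rough sketch of that idea (illustrative only; with DIM = 3 the reduction is trivial and launch overhead will likely dominate, and about half of the blocks are idle). It assumes the same x, N, DIM and solution as in the question's cuda_solution, with one block per (i, j) pair and one thread per dimension:

mod = SourceModule("""
__global__ void distance_reduce(const float *x, int N, int DIM, float *solution){
    extern __shared__ float sq[];               // DIM floats of dynamic shared memory
    const int i = blockIdx.x;                   // one block per (i, j) pair
    const int j = blockIdx.y;
    const int d = threadIdx.x;                  // one thread per dimension
    if (i >= N-1 || j <= i || j >= N) return;   // whole block exits together
    float diff = x[i*DIM+d] - x[j*DIM+d];
    sq[d] = diff*diff;
    __syncthreads();
    if (d == 0){
        // naive serial sum; see "Optimizing Parallel Reduction in CUDA"
        // for tree-based reductions that pay off at larger DIM
        float r = 0.0f;
        for (int k = 0; k < DIM; k++) r += sq[k];
        solution[(i*N+j) - (i*(i+1)/2) - i - 1] = sqrtf(r);
    }
}
""")
func = mod.get_function("distance_reduce")
func(drv.In(x), N, DIM, drv.Out(solution),
     block=(int(DIM), 1, 1), grid=(int(N), int(N)),
     shared=int(DIM) * 4)                       # 4 bytes of shared memory per float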
Looking at this line:
r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
the memory access pattern is not uniform and not coalesced. I also do not know whether nvcc will optimize the expression down to two memory transactions instead of the four written here, since I do not know whether pyCUDA passes -O3 to nvcc. Put (x[i*DIM+d]-x[j*DIM+d]) into a register variable to be sure, and square it yourself.
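Something along these lines (a hedged rewrite of the inner loop from the question's kernel):

for (d = 0; d < DIM; d++){
    float diff = x[i*DIM+d] - x[j*DIM+d];   // each value is read from global memory once
    r += diff*diff;                         // square the register instead of re-reading
}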
You can also try putting #pragma unroll before each for loop to unroll them where possible.
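For instance (unrolling pays off most when the trip count is known at compile time, so hard-coding DIM = 3 here is an assumption on my side):

#pragma unroll
for (int d = 0; d < 3; d++){                // DIM hard-coded to 3 so nvcc can fully unroll
    float diff = x[i*3+d] - x[j*3+d];
    r += diff*diff;
}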