I'm trying to speed up a simple piece of code written in Cython with OpenMP. It's a double loop that, for each position in the input array, accumulates a quantity at each reference point. Here's the main part of the code:
cimport cython
import numpy as np
cimport numpy as np
cimport openmp
from cython.parallel cimport prange, parallel, threadid

DTYPE = np.double
ctypedef np.double_t DTYPE_t

cdef extern from "math.h" nogil:
    DTYPE_t sqrt(DTYPE_t)

@cython.cdivision(True)
@cython.boundscheck(False)
def summation(np.ndarray[DTYPE_t, ndim=2] pos, np.ndarray[DTYPE_t, ndim=1] weights,
              np.ndarray[DTYPE_t, ndim=2] points, int num_threads=0):
    # clamp the requested thread count to the number of available processors
    if num_threads <= 0:
        num_threads = openmp.omp_get_num_procs()
    if num_threads > openmp.omp_get_num_procs():
        num_threads = openmp.omp_get_num_procs()
    openmp.omp_set_num_threads(num_threads)

    cdef unsigned int nips = len(points)
    cdef np.ndarray[DTYPE_t, ndim=1] sum_array = np.zeros(nips, dtype=np.float64)
    cdef np.ndarray[DTYPE_t, ndim=2] sum_array3d = np.zeros((nips, 3), dtype=np.float64)
    cdef unsigned int n = len(weights)
    cdef unsigned int pi, i
    cdef double dx, dy, dz, dr, weight_i, xi, yi, zi

    print 'num_threads = ', openmp.omp_get_num_threads()

    for i in prange(n, nogil=True, schedule='static'):
        weight_i = weights[i]
        xi = pos[i, 0]
        yi = pos[i, 1]
        zi = pos[i, 2]
        for pi in range(nips):
            dx = points[pi, 0] - xi
            dy = points[pi, 1] - yi
            dz = points[pi, 2] - zi
            dr = 1.0 / sqrt(dx*dx + dy*dy + dz*dz)  # inverse distance
            sum_array[pi] += weight_i * dr
            sum_array3d[pi, 0] += weight_i * dx
            sum_array3d[pi, 1] += weight_i * dy
            sum_array3d[pi, 2] += weight_i * dz

    return sum_array, sum_array3d
I've put it into a gist with the associated test and setup files (https://gist.github.com/rokroskar/6ed1bfc1a5f8f9c183a6)
Two issues arise:
First, in the current configuration I'm not able to get any speedup. The code runs on multiple cores, but the timings show no advantage.
Second, the results differ depending on the number of cores, indicating that there's a race condition. Aren't the in-place sums supposed to end up as reductions? Or is something funny happening because it's a nested for loop? I figured everything inside the prange is executed by each thread individually.
Both of these issues go away if I reverse the order of the loops -- but because my outer loop, as it's structured now, is where all the data reading gets done, reversing them means the arrays are traversed num_threads times, which is wasteful. I've also tried putting the whole nested loop inside a with parallel(): block and using thread-local buffers explicitly, but I wasn't able to get it to work.
Clearly I'm missing something basic about how OpenMP is supposed to work (though maybe this is particular to Cython?) so I would be grateful for tips!
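For what it's worth, here's a serial NumPy equivalent of the routine above that I've been using to check the threaded results against (this is my own sketch, not part of the gist; `summation_numpy` is just a name I picked):

```python
import numpy as np

def summation_numpy(pos, weights, points):
    # d[i, pi, :] = points[pi] - pos[i]; shapes (n, nips, 3)
    d = points[None, :, :] - pos[:, None, :]
    r = np.sqrt((d * d).sum(axis=2))                        # (n, nips)
    sum_array = (weights[:, None] / r).sum(axis=0)          # (nips,)
    sum_array3d = (weights[:, None, None] * d).sum(axis=0)  # (nips, 3)
    return sum_array, sum_array3d
```

It builds the full (n, nips, 3) displacement array in memory, so it's only practical for modest sizes, but it gives a trustworthy reference value.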
Cython allows you to release the GIL. That means you can do multi-threading in at least two ways: directly in Cython, using OpenMP with prange, or using e.g. joblib with a multi-threading backend (the parts of your code that will be parallelized are the parts that release the GIL).
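To illustrate that second approach, here's a rough sketch using the standard library's ThreadPoolExecutor as a stand-in for a joblib threading backend (the function name and chunking scheme are mine, not from your gist). Each worker writes a disjoint slice of the output arrays, so there are no shared writes, and the large NumPy array operations release the GIL internally, so the threads can actually overlap:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def summation_threaded(pos, weights, points, num_threads=2):
    nips = len(points)
    sum_array = np.zeros(nips)
    sum_array3d = np.zeros((nips, 3))

    def work(lo, hi):
        # points[lo:hi] is this worker's private slice of the outputs,
        # so no two threads ever write the same element
        d = points[lo:hi, None, :] - pos[None, :, :]   # (chunk, n, 3)
        r = np.sqrt((d * d).sum(axis=2))               # (chunk, n)
        sum_array[lo:hi] = (weights[None, :] / r).sum(axis=1)
        sum_array3d[lo:hi] = (weights[None, :, None] * d).sum(axis=1)

    # split the points evenly among the threads
    bounds = np.linspace(0, nips, num_threads + 1).astype(int)
    with ThreadPoolExecutor(max_workers=num_threads) as ex:
        futures = [ex.submit(work, lo, hi)
                   for lo, hi in zip(bounds[:-1], bounds[1:])]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return sum_array, sum_array3d
```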
Have you tried switching the two loops so you don't have multiple threads reading from and writing to the same locations? I'm pretty sure these loops are not automatically promoted to an OpenMP reduction, and increments such as "sum_array3d[pi,0] += weight_i * dx" are not atomic.
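In outline, the swapped loop order looks like this (a plain-Python serial sketch of the structure, with names of my choosing): the outer loop runs over points, so each iteration writes only its own pi, and the inner accumulations live in scalars that become thread-private under a prange:

```python
import numpy as np

def summation_swapped(pos, weights, points):
    nips, n = len(points), len(weights)
    sum_array = np.zeros(nips)
    sum_array3d = np.zeros((nips, 3))
    for pi in range(nips):          # this is the loop to parallelize
        total = tx = ty = tz = 0.0  # per-iteration (thread-private) scalars
        for i in range(n):
            dx = points[pi, 0] - pos[i, 0]
            dy = points[pi, 1] - pos[i, 1]
            dz = points[pi, 2] - pos[i, 2]
            dr = 1.0 / np.sqrt(dx*dx + dy*dy + dz*dz)
            total += weights[i] * dr
            tx += weights[i] * dx
            ty += weights[i] * dy
            tz += weights[i] * dz
        # only iteration pi touches element pi: no shared writes
        sum_array[pi] = total
        sum_array3d[pi] = (tx, ty, tz)
    return sum_array, sum_array3d
```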
Also, since the computation is relatively simple, Cython might be overkill and you might get away with using either Parakeet or Numba instead.
Parakeet will, by default, use OpenMP to execute comprehensions in parallel. You would have to rewrite your code to look something like:
@parakeet.jit
def summation(pos, weights, points):
    n_points = len(points)
    n_weights = len(weights)
    sum_array3d = np.zeros((n_points, 3))

    def compute(i):
        pxi = points[i, 0]
        pyi = points[i, 1]
        pzi = points[i, 2]
        total = 0.0
        for j in xrange(n_weights):
            weight_j = weights[j]
            dx = pxi - pos[j, 0]
            dy = pyi - pos[j, 1]
            dz = pzi - pos[j, 2]
            dr = 1.0 / np.sqrt(dx*dx + dy*dy + dz*dz)
            total += weight_j * dr
            sum_array3d[i, 0] += weight_j * dx
            sum_array3d[i, 1] += weight_j * dy
            sum_array3d[i, 2] += weight_j * dz
        return total

    sum_array = np.array([compute(i) for i in xrange(n_points)])
    return sum_array, sum_array3d
As for Numba, I'm not sure if the prange construct has made it into the free version yet.
edit: Forehead slap -- sorry, I missed the part of your question where you'd already considered switching the loops.