lack of speedup and erroneous results with OpenMP and Cython

Tags:

I'm trying to speed up a simple piece of code written in Cython with OpenMP. It's a double loop that for each position in the input array adds a quantity at each reference point. Here's the main part of the code:

cimport cython  
import numpy as np
cimport numpy as np
cimport openmp
DTYPE = np.double
ctypedef np.double_t DTYPE_t

cdef extern from "math.h" nogil :
  DTYPE_t sqrt(DTYPE_t)


@cython.cdivision(True)
@cython.boundscheck(False)
def summation(np.ndarray[DTYPE_t,ndim=2] pos, np.ndarray[DTYPE_t,ndim=1] weights, 
              np.ndarray[DTYPE_t, ndim=2] points, int num_threads = 0):

    from cython.parallel cimport prange, parallel, threadid

    if num_threads <= 0 : 
        num_threads = openmp.omp_get_num_procs()

    if num_threads > openmp.omp_get_num_procs() : 
        num_threads = openmp.omp_get_num_procs()

    openmp.omp_set_num_threads(num_threads)

    cdef unsigned int nips = len(points)
    cdef np.ndarray[DTYPE_t, ndim=1] sum_array = np.zeros(nips, dtype = np.float64)
    cdef np.ndarray[DTYPE_t, ndim=2] sum_array3d = np.zeros((nips,3), dtype = np.float64)

    cdef unsigned int n = len(weights)

    cdef unsigned int pi, i, id
    cdef double dx, dy, dz, dr, weight_i, xi,yi,zi

    print 'num_threads = ', openmp.omp_get_num_threads()

    for i in prange(n,nogil=True,schedule='static'):
        weight_i = weights[i]
        xi = pos[i,0]
        yi = pos[i,1]
        zi = pos[i,2]
        for pi in range(nips) :
            dx = points[pi,0] - xi
            dy = points[pi,1] - yi
            dz = points[pi,2] - zi
            dr = 1.0/sqrt(dx*dx + dy*dy + dz*dz)
            sum_array[pi] += weight_i * dr
            sum_array3d[pi,0] += weight_i * dx
            sum_array3d[pi,1] += weight_i * dy
            sum_array3d[pi,2] += weight_i * dz


    return sum_array, sum_array3d

I've put it into a gist with the associated test and setup files (https://gist.github.com/rokroskar/6ed1bfc1a5f8f9c183a6)

Two issues arise:

First, in the current configuration I'm not able to get any speedup. The code runs on multiple cores, but the timings show no advantage.

Second, the results are different depending on the number of cores indicating that there's a race condition. Aren't the in-place sums supposed to end up being reductions? Or is something funny happening since it's a nested for loop? I figured everything inside the prange is executed in each thread individually.

Both of these go away if I reverse the order of the loops -- but because my outer loop the way it's structured now is where all the data reading gets done, if I reverse them the arrays are traversed num_thread times which is wasteful. I've also tried putting the whole nested loop inside a with parallel(): block and use thread-local buffers explicitly but wasn't able to get it to work.

Clearly I'm missing something basic about how OpenMP is supposed to work (though maybe this is particular to Cython?) so I would be grateful for tips!

674

asked Nov 19 '13 15:11

Rok

1 Answers

Have you tried switching the two loops so you don't have multiple threads reading and writing from the same locations? I'm pretty sure that these loops are not automatically promoted into an OpenMP reduction and increments such as "sum_array3d[pi,0] += weight_i * dx" are not atomic.

Also, since the computation is relatively simple, Cython might be overkill and you might get away with using either Parakeet or Numba instead.

Parakeet will, by default, use OpenMP to execute comprehensions in parallel. You would have to rewrite your code to look something like:

@parakeet.jit
def summation(pos, weights, points):
  n_points = len(points)
  n_weights = len(weights)
  sum_array3d = np.zeros((n_points,3))
  def compute(i):
    pxi = points[i, 0]
    pyi = points[i, 1]
    pzi = points[i, 2]
    total = 0.0
    for j in xrange(n_weights):
      weight_j = weights[j]
      xj = pos[j,0]
      yj = pos[j,1]
      zj = pos[j,2]
      dx = pxi - pos[j, 0]
      dy = pyi - pos[j, 1]
      dz = pzi - pos[j, 2]
      dr = 1.0/np.sqrt(dx*dx + dy*dy + dz*dz)
      total += weight_j * dr
      sum_array3d[i,0] += weight_j * dx
      sum_array3d[i,1] += weight_j * dy
      sum_array3d[i,2] += weight_j * dz
    return total 
  sum_array = np.array([compute(i) for i in xrange(n_points)])
  return sum_array, sum_array3d

As for Numba, I'm not sure if the prange construct has made it into the free version yet.

edit: Forehead slap, sorry I missed the part of your question where you considered switching the loops.

answered Sep 20 '22 11:09

Alex Rubinsteyn

Related questions
                            
                                Mp3 parsing in Python
                            
                                Python - why it doesn't create a new instance of object? [duplicate]
                            
                                How can I find all public repos in github that a user contributes to?
                            
                                Nearest neighbor 1 dimensional data with a specified range
                            
                                Stop cmdloop() from stripping white space from the input line in python
                            
                                How do you get the Titan graph database working with Python?
                            
                                Python 2.7: ImportError: DLL load failed: The specified module could not be found
                            
                                How I make color calibration in opencv using a colorchecker?
                            
                                Check if any tests raise a deprecation warning with pytest
                            
                                Matplotlib legend fontsize
                            
                                Large celery task memory leak
                            
                                regex module with pypy
                            
                                Why does SymPy give me the wrong answer when I row-reduce a symbolic matrix?
                            
                                How can I override standard handler404, handler403, handler500 in Django?
                            
                                Python C++ extension: compile only modified source files
                            
                                How to add/import a Django project into a virtualenv?
                            
                                Routes With Custom Domains Using Flask
                            
                                Need to convert to lambda function
                            
                                Type hinting class not yet imported
                            
                                How to use peewee limit()?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

lack of speedup and erroneous results with OpenMP and Cython

Tags:

python

c

multithreading

cython

openmp

Rok

People also ask

1 Answers

Alex Rubinsteyn

Recent Activity

Donate For Us