How could the application of a function to the elements of a NumPy array through numpy.apply_along_axis()
be parallelized so as to take advantage of multiple cores? This seems to be a natural thing to do, in the common case where all the calls to the function being applied are independent.
In my particular case (if this matters), the axis of application is axis 0: np.apply_along_axis(func, axis=0, arr=param_grid), where np is NumPy.
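For concreteness, here is a minimal serial version of the kind of setup I have in mind (the func and param_grid below are made-up placeholders):

import numpy as np

def func(params):
    # Hypothetical function of a 1D array of parameters:
    return params.sum() ** 2

# Hypothetical grid: 3 parameters for each point of a 100x100 grid:
param_grid = np.random.rand(3, 100, 100)

# Serial version to be parallelized (result has shape (100, 100)):
result = np.apply_along_axis(func, axis=0, arr=param_grid)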
I had a quick look at Numba, but I can't seem to get this parallelization to work, even with a loop like:

@numba.jit(parallel=True)
def apply_func_along_axis0(params):
    result = np.empty(shape=params.shape[1:])
    for index in np.ndindex(*result.shape):  # All the indices of params[0,...]
        result[index] = func(params[(slice(None),) + index])  # Applying func along axis 0
    return result
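For what it is worth, the closest form I can imagine is something like the untested sketch below, using numba.prange; it assumes that func itself can be compiled with @numba.njit, which is the real sticking point in my case:

import numba
import numpy as np

@numba.njit
def func(params):
    # Hypothetical parameter function; it must itself be Numba-compilable:
    return (params ** 2).sum()

@numba.njit(parallel=True)
def apply_func_axis0_flat(params_2d):
    # params_2d has shape (n_params, n_points): one column per grid point.
    n_points = params_2d.shape[1]
    out = np.empty(n_points)
    for i in numba.prange(n_points):  # Parallel loop over the grid points
        out[i] = func(params_2d[:, i])
    return out

# Usage: flatten the grid axes, apply, then restore the grid shape:
# values = apply_func_axis0_flat(
#     param_grid.reshape(param_grid.shape[0], -1)).reshape(param_grid.shape[1:])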
There is also apparently a compilation option in NumPy for parallelization through OpenMP, but it does not seem to be accessible through MacPorts.
One could also think of cutting the array into a few pieces and using threads (so as to avoid copying the data), applying the function to each piece in parallel. This is more complex than what I am looking for, and it might not help much if the Global Interpreter Lock is not released often enough.
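For reference, such a thread-based variant might look roughly like this untested sketch; it can only pay off if func spends most of its time in code that releases the GIL (heavy NumPy operations, for instance):

import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def threaded_apply_along_axis(func1d, axis, arr, n_threads=None):
    # Split along another axis than the one func1d is applied to, so that
    # each thread works on a view of the original data (no copies):
    n_threads = n_threads or os.cpu_count()
    split_axis = 1 if axis == 0 else 0
    pieces = np.array_split(arr, n_threads, axis=split_axis)
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        results = list(executor.map(
            lambda piece: np.apply_along_axis(func1d, axis, piece), pieces))
    # For a scalar-returning func1d, the applied axis disappears from each
    # result, so the concatenation axis must be adjusted accordingly:
    concat_axis = split_axis if split_axis < axis else split_axis - 1
    return np.concatenate(results, axis=concat_axis)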
It would be very nice to be able to use multiple cores in a simple way for simple parallelizable tasks like applying a function to all the elements of an array (which is essentially what is needed here, with the small complication that the function func() takes a 1D array of parameters).
Plain NumPy operations generally run on a single core (linear algebra routines backed by a multithreaded BLAS are the main exception), so numpy.apply_along_axis() by itself will not use multiple CPU cores, let alone the GPU. Numba, on the other hand, can take advantage of the parallel execution capabilities of the machine.
Alright, I worked it out: an idea is to use the standard multiprocessing module and split the original array into just a few chunks (so as to limit communication overhead with the workers). This can be done relatively easily as follows:
import multiprocessing
import numpy as np

def parallel_apply_along_axis(func1d, axis, arr, *args, **kwargs):
    """
    Like numpy.apply_along_axis(), but takes advantage of multiple
    cores.
    """
    # Effective axis where apply_along_axis() will be applied by each
    # worker (any non-zero axis number would work, so as to allow the use
    # of `np.array_split()`, which is only done on axis 0):
    effective_axis = 1 if axis == 0 else axis
    if effective_axis != axis:
        arr = arr.swapaxes(axis, effective_axis)

    # Chunks for the mapping (only a few chunks):
    chunks = [(func1d, effective_axis, sub_arr, args, kwargs)
              for sub_arr in np.array_split(arr, multiprocessing.cpu_count())]

    pool = multiprocessing.Pool()
    individual_results = pool.map(unpacking_apply_along_axis, chunks)
    # Freeing the workers:
    pool.close()
    pool.join()

    return np.concatenate(individual_results)
where the function unpacking_apply_along_axis() being applied in Pool.map() is defined separately, as it should be (so that subprocesses can import it); it is simply a thin wrapper that handles the fact that Pool.map() only accepts functions of a single argument:
def unpacking_apply_along_axis(all_args):
    """
    Like numpy.apply_along_axis(), but with arguments in a tuple
    instead.

    This function is useful with multiprocessing.Pool().map(): (1)
    map() only handles functions that take a single argument, and (2)
    this function can generally be imported from a module, as required
    by map().
    """
    (func1d, axis, arr, args, kwargs) = all_args
    return np.apply_along_axis(func1d, axis, arr, *args, **kwargs)

(In Python 2, the tuple could be unpacked directly in the signature, as in def unpacking_apply_along_axis((func1d, axis, arr, args, kwargs)):, but this form of argument unpacking was removed in Python 3.)
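As a quick check, the two functions above can be exercised as follows (the array and the use of np.sum as the applied function are just examples); the if __name__ == '__main__' guard matters on platforms that spawn rather than fork worker processes:

if __name__ == '__main__':
    param_grid = np.random.rand(3, 500, 500)
    parallel = parallel_apply_along_axis(np.sum, 0, param_grid)
    serial = np.apply_along_axis(np.sum, 0, param_grid)
    # Both versions should give the same (500, 500) result:
    assert np.allclose(parallel, serial)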
In my particular case, this resulted in a 2x speedup on 2 cores with hyper-threading. A factor closer to 4x would have been nicer, but the speedup is already welcome for just a few lines of code, and it should be better on machines with more cores (which are quite common). Maybe there is a way of avoiding data copies and using shared memory (perhaps through the multiprocessing module itself)?