
Parallelize loop over numpy rows

Tags: python, numpy, dask

I need to apply the same function to every row of a numpy array and store the results in a new numpy array.

import numpy as np

# states will hold the result of function applied to each row of array
states = np.empty_like(array)

for i, row in enumerate(array):
    states[i] = function(row, *args)

# do some other stuff on states

function does some non-trivial filtering of my data and returns an array that encodes where the conditions are True and where they are False. function can be either pure Python or compiled Cython. The filtering operations on the rows are complicated and can depend on previous values in the row, which means I can't operate on the whole array element-by-element.
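For illustration (this is a made-up stand-in, not the real filter), a function with this kind of dependence on earlier values in the row might look like:

import numpy as np

def function(row, threshold=0.5):
    # Hypothetical example: each output element depends on the previous
    # output element, so the row has to be processed sequentially.
    out = np.empty_like(row)
    out[0] = row[0]
    for j in range(1, len(row)):
        # keep the new value only if it jumps by more than threshold,
        # otherwise carry the previous value forward
        out[j] = row[j] if abs(row[j] - out[j - 1]) > threshold else out[j - 1]
    return out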

Is there a way to do something like this with dask, for example?

asked Sep 28 '15 by Max Linke



1 Answer

Dask solution

You could do this with dask.array by chunking the array by row, calling map_blocks, and then computing the result:

import dask.array as da

ar = ...  # the input numpy array
x = da.from_array(ar, chunks=(1, ar.shape[1]))  # one chunk per row
# note: each block has shape (1, ncols), so function receives a 2-D one-row block
x = x.map_blocks(function, *args)
states = x.compute()

By default this will use threads; you can use processes in the following way:

from dask.multiprocessing import get
states = x.compute(get=get)
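Note: newer dask releases replaced the get= keyword with a scheduler argument, so on a recent version the equivalent should be:

states = x.compute(scheduler="processes")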

Pool solution

However, dask is probably overkill for embarrassingly parallel computations like this; you could get by with a thread pool:

import numpy as np
from multiprocessing.pool import ThreadPool

pool = ThreadPool()

ar = ...  # the input numpy array
states = np.empty_like(ar)

def f(i):
    states[i] = function(ar[i], *args)

pool.map(f, range(len(ar)))
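Note that threads only give a speedup here if function releases the GIL, e.g. if it spends its time inside NumPy routines or in Cython code compiled with nogil; a pure-Python function will effectively run serially in a thread pool.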

And you could switch to processes with the following change:

from multiprocessing import Pool
pool = Pool()
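One caveat: processes don't share memory with the parent, so the in-place writes to states inside f above would be lost. A minimal sketch of a return-value variant (assuming function, ar, and args are picklable and defined at module level):

import numpy as np
from multiprocessing import Pool

def f(i):
    # return the result instead of mutating a shared array
    return function(ar[i], *args)

if __name__ == "__main__":
    with Pool() as pool:
        states = np.array(pool.map(f, range(len(ar))))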
answered Nov 03 '22 by MRocklin