Vectorizing a Numpy slice operation

Tags:

python

vectorization

numpy

Say I have a Numpy vector,

A = zeros(100)

and I divide it into subvectors by a list of breakpoints which index into A, for instance,

breaks = linspace(0, 100, 11, dtype=int)

So the i-th subvector lies between the indices breaks[i] (inclusive) and breaks[i+1] (exclusive). The breaks are not necessarily equispaced; this is only an example. However, they will always be strictly increasing.

Now I want to operate on these subvectors. For instance, if I want to set all elements of the i-th subvector to i, I might do:

for i in range(len(breaks) - 1):
    A[breaks[i] : breaks[i+1]] = i

Or I might want to compute the subvector means:

b = empty(len(breaks) - 1)
for i in range(len(breaks) - 1):
    b[i] = A[breaks[i] : breaks[i+1]].mean()

And so on.

How can I avoid using for loops and instead vectorize these operations?

asked Apr 27 '15 11:04

cfh

3 Answers

You can use a simple np.cumsum:

import numpy as np

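# Form an integer zeros array of the same length as the input array and
# place ones at the positions where the intervals change.
# (An integer dtype is needed so that the result can be passed to np.bincount,
# even when A itself holds floats.)
A1 = np.zeros(len(A), dtype=int)
A1[breaks[1:-1]] = 1

# Perform cumsum along it to create a staircase-like array as the final output
out = A1.cumsum()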

Sample run -

In [115]: A
Out[115]: array([3, 8, 0, 4, 6, 4, 8, 0, 2, 7, 4, 9, 3, 7, 3, 8, 6, 7, 1, 6])

In [116]: breaks
Out[116]: array([ 0,  4,  9, 11, 18, 20])

In [142]: out
Out[142]: array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4])

If you want the mean values of those subvectors of A, you can use np.bincount:

mean_vals = np.bincount(out, weights=A)/np.bincount(out)
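
As a rough check of the two steps above, the following sketch builds the label array with cumsum, computes the means with np.bincount, and compares them against the explicit loop from the question. The data are the values from the sample run; the names `marks`, `labels` and `loop_means` are illustrative, and the sketch assumes that breaks[0] is 0 and breaks[-1] equals len(A).

```python
import numpy as np

# Sample data from the run above, stored as floats like the question's A.
A = np.array([3, 8, 0, 4, 6, 4, 8, 0, 2, 7,
              4, 9, 3, 7, 3, 8, 6, 7, 1, 6], dtype=float)
breaks = np.array([0, 4, 9, 11, 18, 20])

# Mark the start of every subvector except the first, then cumsum.
# The integer dtype matters: np.bincount only accepts integer labels.
marks = np.zeros(len(A), dtype=np.intp)
marks[breaks[1:-1]] = 1
labels = marks.cumsum()

# Mean per subvector = (sum of values per label) / (count per label).
mean_vals = np.bincount(labels, weights=A) / np.bincount(labels)

# Reference result from the explicit loop in the question.
loop_means = np.array([A[breaks[i]:breaks[i + 1]].mean()
                       for i in range(len(breaks) - 1)])
assert np.allclose(mean_vals, loop_means)
```

Because the breaks are strictly increasing, every label occurs at least once, so the division never hits an empty bin.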

If you want to extend this to a custom function, you might want to look into numpy_groupies, a Python/NumPy equivalent of MATLAB's accumarray whose source code is publicly available.

answered Oct 02 '22 21:10

Divakar

There really isn't a single answer to your question, but several techniques that you can use as building blocks. Another one you may find helpful:

NumPy's binary ufuncs (np.add, np.maximum, and so on) have a .reduceat method, which you can use to your advantage for some of your calculations:

>>> a = np.arange(100)
>>> breaks = np.linspace(0, 100, 11, dtype=np.intp)
>>> counts = np.diff(breaks)
>>> counts
array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
>>> sums = np.add.reduceat(a, breaks[:-1], dtype=float)
>>> sums
array([  45.,  145.,  245.,  345.,  445.,  545.,  645.,  745.,  845.,  945.])
>>> sums / counts  # i.e. the mean
array([  4.5,  14.5,  24.5,  34.5,  44.5,  54.5,  64.5,  74.5,  84.5,  94.5])
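
The same pattern should carry over to other binary ufuncs and to breaks that are not equispaced. The sketch below is one way to do it, assuming breaks[0] is 0 and breaks[-1] equals len(a), since the last reduceat segment always runs to the end of the array. The sample values and the names `starts`, `seg_max`, `seg_min` and `seg_mean` are made up for illustration.

```python
import numpy as np

a = np.array([3, 8, 0, 4, 6, 4, 8, 0, 2, 7,
              4, 9, 3, 7, 3, 8, 6, 7, 1, 6])
breaks = np.array([0, 4, 9, 11, 18, 20])   # strictly increasing, not equispaced

starts = breaks[:-1]                       # start index of each subvector

# Any binary ufunc can be reduced over the segments starting at `starts`.
seg_max = np.maximum.reduceat(a, starts)
seg_min = np.minimum.reduceat(a, starts)

# Mean = segment sum / segment length; dtype=float avoids integer division issues.
seg_mean = np.add.reduceat(a, starts, dtype=float) / np.diff(breaks)

# Compare against the explicit loop over slices.
loop_mean = np.array([a[breaks[i]:breaks[i + 1]].mean()
                      for i in range(len(breaks) - 1)])
assert np.allclose(seg_mean, loop_mean)
```

Note that reduceat relies on the indices being strictly increasing, which the question guarantees; with repeated or decreasing indices it returns single elements instead of segment reductions.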

answered Oct 02 '22 22:10

Jaime

You could use np.repeat:

In [35]: np.repeat(np.arange(0, len(breaks)-1), np.diff(breaks))
Out[35]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9,
       9, 9, 9, 9, 9, 9, 9, 9])
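
For non-equispaced breaks, a small sketch of how the np.repeat result could stand in for the assignment loop from the question. It assumes breaks[0] is 0 and breaks[-1] equals the length of A; the breakpoints and the names `counts` and `A_loop` are only illustrative.

```python
import numpy as np

# Illustrative breakpoints: strictly increasing, not equispaced.
breaks = np.array([0, 4, 9, 11, 18, 20])
counts = np.diff(breaks)                   # length of each subvector

# Vectorized version: repeat each subvector index by that subvector's length.
A = np.repeat(np.arange(len(counts)), counts).astype(float)

# Loop version from the question, for comparison.
A_loop = np.zeros(breaks[-1])
for i in range(len(breaks) - 1):
    A_loop[breaks[i]:breaks[i + 1]] = i

assert np.array_equal(A, A_loop)
```

The same np.repeat(values, counts) call can also expand any per-subvector result, such as an array of means, back to the full length of A.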

To compute arbitrary binned statistics you could use scipy.stats.binned_statistic:

import numpy as np
import scipy.stats as stats

breaks = np.linspace(0, 100, 11, dtype=int)
A = np.random.random(100)

means, bin_edges, binnumber = stats.binned_statistic(
    x=np.arange(len(A)), values=A, statistic='mean', bins=breaks)

stats.binned_statistic can compute means, medians, counts, and sums; to compute an arbitrary statistic for each bin, you can pass a callable to the statistic parameter:

def func(values):
    return values.mean()

funcmeans, bin_edges, binnumber = stats.binned_statistic(
    x=np.arange(len(A)), values=A, statistic=func, bins=breaks)

assert np.allclose(means, funcmeans)

answered Oct 02 '22 23:10

unutbu
