Say I have a Numpy vector,
A = zeros(100)
and I divide it into subvectors by a list of breakpoints which index into A
, for instance,
breaks = linspace(0, 100, 11, dtype=int)
So the i
-th subvector would be lie between the indices breaks[i]
(inclusive) and breaks[i+1]
(exclusive).
The breaks are not necessarily equispaced, this is only an example.
However, they will always be strictly increasing.
Now I want to operate on these subvectors. For instance, if I want to set all elements of the i
-th subvector to i
, I might do:
for i in range(len(breaks) - 1):
A[breaks[i] : breaks[i+1]] = i
Or I might want to compute the subvector means:
b = empty(len(breaks) - 1)
for i in range(len(breaks) - 1):
b = A[breaks[i] : breaks[i+1]].mean()
And so on.
How can I avoid using for
loops and instead vectorize these operations?
The concept of vectorized operations on NumPy allows the use of more optimal and pre-compiled functions and mathematical operations on NumPy array objects and data sequences. The Output and Operations will speed up when compared to simple non-vectorized operations. Example 1: Using vectorized sum method on NumPy array.
Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).
Again, some have observed vectorize to be faster than normal for loops, but even the NumPy documentation states: “The vectorize function is provided primarily for convenience, not for performance.
Definition: In the context of high-level languages like Python, Matlab, and R, the term vectorization describes the use of optimized, pre-compiled code written in a low-level language (e.g. C) to perform mathematical operations over a sequence of data.
You can use simple np.cumsum
-
import numpy as np
# Form zeros array of same size as input array and
# place ones at positions where intervals change
A1 = np.zeros_like(A)
A1[breaks[1:-1]] = 1
# Perform cumsum along it to create a staircase like array, as the final output
out = A1.cumsum()
Sample run -
In [115]: A
Out[115]: array([3, 8, 0, 4, 6, 4, 8, 0, 2, 7, 4, 9, 3, 7, 3, 8, 6, 7, 1, 6])
In [116]: breaks
Out[116]: array([ 0, 4, 9, 11, 18, 20])
In [142]: out
Out[142]: array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4]..)
If you want to have mean values of those subvectors from A
, you can use np.bincount
-
mean_vals = np.bincount(out, weights=A)/np.bincount(out)
If you are looking to extend this functionality and use a custom function instead, you might want to look into MATLAB's accumarray
equivalent for Python/Numpy
: numpy_groupies
whose source code is available here.
There really isn't a single answer to your question, but several techniques that you can use as building blocks. Another one you may find helpful:
All numpy ufuncs have a .reduceat
method, which you can use to your advantage for some of your calculations:
>>> a = np.arange(100)
>>> breaks = np.linspace(0, 100, 11, dtype=np.intp)
>>> counts = np.diff(breaks)
>>> counts
array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
>>> sums = np.add.reduceat(a, breaks[:-1], dtype=np.float)
>>> sums
array([ 45., 145., 245., 345., 445., 545., 645., 745., 845., 945.])
>>> sums / counts # i.e. the mean
array([ 4.5, 14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5])
You could use np.repeat
:
In [35]: np.repeat(np.arange(0, len(breaks)-1), np.diff(breaks))
Out[35]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9,
9, 9, 9, 9, 9, 9, 9, 9])
To compute arbitrary binned statistics you could use scipy.stats.binned_statistic
:
import numpy as np
import scipy.stats as stats
breaks = np.linspace(0, 100, 11, dtype=int)
A = np.random.random(100)
means, bin_edges, binnumber = stats.binned_statistic(
x=np.arange(len(A)), values=A, statistic='mean', bins=breaks)
stats.binned_statistic
can compute means, medians, counts, sums; or,
to compute an arbitrary statistics for each bin, you can pass a callable to the statistic
parameter:
def func(values):
return values.mean()
funcmeans, bin_edges, binnumber = stats.binned_statistic(
x=np.arange(len(A)), values=A, statistic=func, bins=breaks)
assert np.allclose(means, funcmeans)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With