Given a numpy array M of shape (r, c), e.g.
M = np.array([[ 1,  2,  3],
              [ 4,  5,  6],
              [ 7,  8,  9],
              [10, 11, 12],
              [13, 14, 15]])  # r = 5; c = 3
and a one-dimensional array a of length r, containing integers that can vary from 0 to k-1, e.g.

a = np.array([0, 0, 2, 1, 0])  # k = 4
I want to use the values in a to select rows from M to get an intermediate result like this:
array([
    [[1, 2, 3], [4, 5, 6], [13, 14, 15]],  # rows of M where a == 0
    [[10, 11, 12]],                        # rows of M where a == 1
    [[7, 8, 9]],                           # rows of M where a == 2
    []                                     # rows of M where a == 3 (there are none)
])
(I don't need this intermediate array, but am just showing it for illustration.) The returned result would be a (k, c) array with the column-wise means from this array:

array([[ 6.,  7.,  8.],  # means where a == 0
       [10., 11., 12.],  # means where a == 1
       [ 7.,  8.,  9.],  # etc.
       [nan, nan, nan]])
I can do this with

np.array([M[a == i].mean(axis=0) for i in range(k)])

but is there a way (hopefully faster for large r and k) that purely uses numpy methods instead of using a for loop to make a list (which will then have to be converted back to an array)?
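For completeness, here is a self-contained version of this baseline (the imports and example data spelled out; the mean of an empty group comes out as nan, with a RuntimeWarning):

import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12],
              [13, 14, 15]])
a = np.array([0, 0, 2, 1, 0])
k = 4

# One masked mean per group; empty groups yield a nan row.
result = np.array([M[a == i].mean(axis=0) for i in range(k)])
print(result)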
Approach #1
NumPy solution leveraging BLAS-powered matrix multiplication -
In [46]: m = a == np.arange(k)[:,None]  # or np.equal.outer(np.arange(k), a)

In [47]: m.dot(M)/m.sum(1,keepdims=True)
Out[47]:
array([[ 6.,  7.,  8.],
       [10., 11., 12.],
       [ 7.,  8.,  9.],
       [nan, nan, nan]])
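To see why this works: m is a (k, r) boolean mask whose row i flags the rows of M labeled i, so the matrix product accumulates per-group row sums while m.sum(1) counts the members of each group. A minimal sketch with the example data:

import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
a = np.array([0, 0, 2, 1, 0])
k = 4

m = a == np.arange(k)[:, None]   # shape (k, r): m[i, j] is True iff a[j] == i
# True/False act as 1/0 in the product, so m.dot(M)[i] sums the rows of M
# in group i; dividing by the group sizes gives the means (0/0 -> nan).
print(m.astype(int))
print(m.dot(M) / m.sum(1, keepdims=True))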
For a performance boost on larger arrays, convert the mask to float and count the group sizes with np.bincount instead -
In [108]: m.astype(float).dot(M)/np.bincount(a,minlength=k)[:,None]
Out[108]:
array([[ 6.,  7.,  8.],
       [10., 11., 12.],
       [ 7.,  8.,  9.],
       [nan, nan, nan]])
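Wrapped as a reusable function (a sketch; the name grouped_mean and the errstate guard are my additions, not part of the original snippets):

import numpy as np

def grouped_mean(M, a, k):
    # Boolean membership mask: row i flags the rows of M where a == i.
    m = a == np.arange(k)[:, None]          # shape (k, r)
    counts = np.bincount(a, minlength=k)    # group sizes; 0 for empty groups
    with np.errstate(invalid='ignore'):     # let empty groups become nan quietly
        return m.astype(float).dot(M) / counts[:, None]

With the example data, grouped_mean(M, a, 4) reproduces the (k, c) result shown above.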
Approach #2
Another with np.bincount applied across columns -
n = M.shape[1]
out = np.empty((k, n))
for i in range(n):
    out[:, i] = np.bincount(a, M[:, i], minlength=k)
out /= np.bincount(a, minlength=k)[:, None]
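The loop runs only ncols times, so it stays cheap when the column count is small. If you want to drop it entirely, the same idea works with a single flat bincount over combined (group, column) bins; this variant is my sketch, not part of the original answer:

import numpy as np

def grouped_mean_flat(M, a, k):
    # Element M[j, i] goes into bin a[j]*n + i, so one np.bincount call
    # produces all k*n per-group column sums at once.
    n = M.shape[1]
    idx = a[:, None] * n + np.arange(n)      # (r, n) flat bin ids
    sums = np.bincount(idx.ravel(), weights=M.ravel(), minlength=k * n)
    counts = np.bincount(a, minlength=k)
    with np.errstate(invalid='ignore'):      # empty groups -> nan
        return sums.reshape(k, n) / counts[:, None]

Note that this allocates k * n bins, so it trades memory for the removed Python-level loop.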
Benchmarking all approaches listed in @Ehsan's post, using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions:
import benchit
funcs = [m1, m2, m3, m4, m5, m6]
in_ = {(nr,nc,k): [np.random.randint(100, size=(nr,nc)), np.random.randint(0, k, size=nr), k]
       for nr in [100, 1000, 10000, 100000] for nc in [10, 100, 200] for k in [10, 100, 1000]}
t = benchit.timings(funcs, in_, multivar=True, input_name=['nrows','ncols','k'])
t.plot(logx=True, save='timings.png')
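For context, m1 through m5 are the solutions proposed here and in @Ehsan's post, each taking (M, a, k) as arguments; m6 is the question's original list comprehension, which would be reconstructed as:

import numpy as np

def m6(M, a, k):
    # Original baseline: one masked mean per group (nan rows for empty groups).
    return np.array([M[a == i].mean(axis=0) for i in range(k)])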
Since m6 is the original solution, let's get the speedups with respect to it:
t.speedups(m6).plot(logx=True, logy=True, save='speedups.png')
There's no clear winner across all datasets. Nevertheless, m5 seems to be good with small ncols, while m1 seems better on bigger ncols.