Given a numpy array M of shape (r, c), e.g.
M = np.array([[ 1,  2,  3],
              [ 4,  5,  6],
              [ 7,  8,  9],
              [10, 11, 12],
              [13, 14, 15]])  # r = 5; c = 3
and a one-dimensional array a of length r, containing integers that can vary from 0 to k-1, e.g.

a = np.array([0, 0, 2, 1, 0])  # k = 4
I want to use the values in a to select rows from M to get an intermediate result like this:
array([
    [[1, 2, 3], [4, 5, 6], [13, 14, 15]],  # rows of M where a == 0
    [[10, 11, 12]],                        # rows of M where a == 1
    [[7, 8, 9]],                           # rows of M where a == 2
    []                                     # rows of M where a == 3 (there are none)
])
(I don't need this intermediate array, but am just showing it for illustration.) The returned result would be a (k, c) array with the column-wise means from this array:

array([[ 6.,  7.,  8.],  # means where a == 0
       [10., 11., 12.],  # means where a == 1
       [ 7.,  8.,  9.],  # etc.
       [nan, nan, nan]])
I can do this with

np.array([M[a == i].mean(axis=0) for i in range(k)])

but is there a way (hopefully faster for large r and k) that purely uses numpy methods instead of using a for loop to make a list (which will then have to be converted back to an array)?
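For completeness, here is a self-contained version of this baseline (the imports and example data spelled out; the mean of an empty group comes out as nan, with a RuntimeWarning):

import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12],
              [13, 14, 15]])
a = np.array([0, 0, 2, 1, 0])
k = 4

# One masked mean per group; empty groups yield a nan row.
result = np.array([M[a == i].mean(axis=0) for i in range(k)])
print(result)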
Approach #1
NumPy solution leveraging BLAS-powered matrix multiplication -
In [46]: m = a == np.arange(k)[:,None]  # or np.equal.outer(np.arange(k), a)

In [47]: m.dot(M)/m.sum(1,keepdims=True)
Out[47]:
array([[ 6.,  7.,  8.],
       [10., 11., 12.],
       [ 7.,  8.,  9.],
       [nan, nan, nan]])
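To see why this works: m is a (k, r) boolean mask whose row i flags the rows of M labeled i, so the matrix product accumulates per-group row sums while m.sum(1) counts the members of each group. A minimal sketch with the example data:

import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])
a = np.array([0, 0, 2, 1, 0])
k = 4

m = a == np.arange(k)[:, None]   # shape (k, r): m[i, j] is True iff a[j] == i
# True/False act as 1/0 in the product, so m.dot(M)[i] sums the rows of M
# in group i; dividing by the group sizes gives the means (0/0 -> nan).
print(m.astype(int))
print(m.dot(M) / m.sum(1, keepdims=True))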
For a performance boost on larger arrays, convert the mask to float and count the group sizes with np.bincount instead -
In [108]: m.astype(float).dot(M)/np.bincount(a,minlength=k)[:,None]
Out[108]:
array([[ 6.,  7.,  8.],
       [10., 11., 12.],
       [ 7.,  8.,  9.],
       [nan, nan, nan]])
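Wrapped as a reusable function (a sketch; the name grouped_mean and the errstate guard are my additions, not part of the original snippets):

import numpy as np

def grouped_mean(M, a, k):
    # Boolean membership mask: row i flags the rows of M where a == i.
    m = a == np.arange(k)[:, None]          # shape (k, r)
    counts = np.bincount(a, minlength=k)    # group sizes; 0 for empty groups
    with np.errstate(invalid='ignore'):     # let empty groups become nan quietly
        return m.astype(float).dot(M) / counts[:, None]

With the example data, grouped_mean(M, a, 4) reproduces the (k, c) result shown above.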
Approach #2
Another with np.bincount applied across columns -
n = M.shape[1]
out = np.empty((k, n))
for i in range(n):
    out[:, i] = np.bincount(a, M[:, i], minlength=k)
out /= np.bincount(a, minlength=k)[:, None]
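The loop runs only ncols times, so it stays cheap when the column count is small. If you want to drop it entirely, the same idea works with a single flat bincount over combined (group, column) bins; this variant is my sketch, not part of the original answer:

import numpy as np

def grouped_mean_flat(M, a, k):
    # Element M[j, i] goes into bin a[j]*n + i, so one np.bincount call
    # produces all k*n per-group column sums at once.
    n = M.shape[1]
    idx = a[:, None] * n + np.arange(n)      # (r, n) flat bin ids
    sums = np.bincount(idx.ravel(), weights=M.ravel(), minlength=k * n)
    counts = np.bincount(a, minlength=k)
    with np.errstate(invalid='ignore'):      # empty groups -> nan
        return sums.reshape(k, n) / counts[:, None]

Note that this allocates k * n bins, so it trades memory for the removed Python-level loop.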
Benchmarking all approaches listed in @Ehsan's post, using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions:
import benchit
funcs = [m1, m2, m3, m4, m5, m6]
in_ = {(nr,nc,k): [np.random.randint(100, size=(nr,nc)), np.random.randint(0, k, size=nr), k]
       for nr in [100, 1000, 10000, 100000] for nc in [10, 100, 200] for k in [10, 100, 1000]}
t = benchit.timings(funcs, in_, multivar=True, input_name=['nrows','ncols','k'])
t.plot(logx=True, save='timings.png')
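For context, m1 through m5 are the solutions proposed here and in @Ehsan's post, each taking (M, a, k) as arguments; m6 is the question's original list comprehension, which would be reconstructed as:

import numpy as np

def m6(M, a, k):
    # Original baseline: one masked mean per group (nan rows for empty groups).
    return np.array([M[a == i].mean(axis=0) for i in range(k)])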
Since m6 is the original solution, let's get the speedups with respect to it:
t.speedups(m6).plot(logx=True, logy=True, save='speedups.png')
There's no clear winner across all datasets. Nevertheless, m5 seems to be good with small ncols, while m1 seems better on bigger ncols.