
Numpy split array based on condition without for loop

Say I have a NumPy array that holds points in 2D space, like the following:

np.array([[3, 2], [4, 4], [5, 4], [4, 2], [4, 6], [9, 5]]) 

I also have a 1D NumPy array that assigns a label (an integer) to each point; its length equals the number of points in the points array.

np.array([0, 1, 1, 0, 2, 1])

Now I want the mean of the points that share each label. For example, for all points with label 0, take the mean of those points. My current solution is the following:

return np.array([points[labels == i].mean(axis=0) for i in range(k)])

where k is the number of distinct labels, i.e. the largest number in the labels array plus one.

I would like a way to do this without a for loop. Is there some NumPy functionality I haven't discovered yet?

asked Feb 26 '19 16:02 by Shadesfear



1 Answer

Approach #1 : We can leverage matrix-multiplication with some help from broadcasting -

mask = labels == np.arange(labels.max()+1)[:,None]
out = mask.dot(points)/np.bincount(labels).astype(float)[:,None]

Sample run -

In [36]: points = np.array([[3, 2], [4, 4], [5, 4], [4, 2], [4, 6], [9, 5]]) 
    ...: labels = np.array([0, 1, 1, 0, 2, 1])

# Original soln
In [37]: L = labels.max()+1

In [38]: np.array([points[labels==k].mean(axis=0) for k in range(L)])
Out[38]: 
array([[3.5       , 2.        ],
       [6.        , 4.33333333],
       [4.        , 6.        ]])

# Proposed soln
In [39]: mask = labels == np.arange(labels.max()+1)[:,None]
    ...: out = mask.dot(points)/np.bincount(labels).astype(float)[:,None]

In [40]: out
Out[40]: 
array([[3.5       , 2.        ],
       [6.        , 4.33333333],
       [4.        , 6.        ]])
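
To see why the matrix-multiplication works, here is a minimal sketch that splits Approach #1 into its intermediate steps, using the same sample data. The names sums, counts and means are only illustrative and not part of the original snippet.

```python
import numpy as np

points = np.array([[3, 2], [4, 4], [5, 4], [4, 2], [4, 6], [9, 5]])
labels = np.array([0, 1, 1, 0, 2, 1])

# Broadcasting a column of label ids against the 1D labels array gives a
# boolean mask of shape (n_labels, n_points); row k is True where labels == k.
mask = labels == np.arange(labels.max() + 1)[:, None]

# Multiplying the mask by the points adds up, for each row, exactly the
# points that carry that label: these are the per-label sums.
sums = mask.dot(points)

# The row sums of the mask are the per-label counts (same as np.bincount).
counts = mask.sum(axis=1)

means = sums / counts[:, None]
```

Note that the mask holds n_labels * n_points entries, so memory use grows with both the number of labels and the number of points.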

Approach #2 : With np.add.at -

sums = np.zeros((labels.max()+1,points.shape[1]),dtype=float)
np.add.at(sums,labels,points)
out = sums/np.bincount(labels).astype(float)[:,None]
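
As a self-contained sketch, Approach #2 can be wrapped in a function. The name label_means_add_at is hypothetical, chosen here only for illustration. np.add.at is used rather than `sums[labels] += points` because the latter is buffered and applies only one update per repeated index, so points sharing a label would not accumulate.

```python
import numpy as np

def label_means_add_at(points, labels):
    """Per-label means via unbuffered accumulation (hypothetical helper)."""
    n_labels = labels.max() + 1
    sums = np.zeros((n_labels, points.shape[1]), dtype=float)
    # Unbuffered in-place add: every row of points is added to the row of
    # sums selected by its label, even when labels repeat.
    np.add.at(sums, labels, points)
    counts = np.bincount(labels, minlength=n_labels)
    return sums / counts[:, None]

points = np.array([[3, 2], [4, 4], [5, 4], [4, 2], [4, 6], [9, 5]])
labels = np.array([0, 1, 1, 0, 2, 1])
out = label_means_add_at(points, labels)
```

On the sample data this should give the same three rows as the sample run under Approach #1.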

Approach #3 : If every integer from 0 to the maximum label occurs at least once in labels, we can also use np.add.reduceat -

sidx = labels.argsort()
sorted_points = points[sidx]
sums = np.add.reduceat(sorted_points,np.r_[0,np.bincount(labels)[:-1].cumsum()])
out = sums/np.bincount(labels).astype(float)[:,None]
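
A runnable sketch of Approach #3 with the precondition made explicit. The function name label_means_reduceat and the ValueError check are additions for illustration, not part of the original answer. The check matters because a missing label would repeat a start index, and np.add.reduceat then returns the single row at that index instead of an empty sum.

```python
import numpy as np

def label_means_reduceat(points, labels):
    """Per-label means via sorting and np.add.reduceat (hypothetical helper)."""
    counts = np.bincount(labels)
    if (counts == 0).any():
        raise ValueError("every label in 0..labels.max() must occur at least once")
    # Sort the points so that rows with the same label are contiguous.
    order = np.argsort(labels, kind="stable")
    sorted_points = points[order]
    # Start index of each label's block within the sorted array.
    starts = np.r_[0, counts[:-1].cumsum()]
    sums = np.add.reduceat(sorted_points, starts, axis=0)
    return sums / counts[:, None]

points = np.array([[3, 2], [4, 4], [5, 4], [4, 2], [4, 6], [9, 5]])
labels = np.array([0, 1, 1, 0, 2, 1])
out = label_means_reduceat(points, labels)
```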

answered Oct 05 '22 23:10 by Divakar