
NumPy sum one array based on values in another array for each matching element in 3rd array

I have two NumPy arrays, one containing values and one containing each value's category.

import numpy as np
values=np.array([1,2,3,4,5,6,7,8,9,10])
valcats=np.array([101,301,201,201,102,302,302,202,102,301])

I have another array containing the unique categories I'd like to sum across.

categories=np.array([101,102,201,202,301,302])

My issue is that I will be running this same summing process a few billion times and every microsecond matters.

My current implementation is as follows.

catsums=[]
for x in categories:
    catsums.append(np.sum(values[np.where(valcats==x)]))

The resulting catsums should be:

[1, 14, 7, 8, 12, 13]

My current run time is about 5 µs. I am still fairly new to Python and was hoping to find a faster solution, potentially by combining the first two arrays, using a lambda, or something cool I don't even know about.

Thanks for reading!

asked Jul 23 '17 16:07 by hrschbck



2 Answers

You can use searchsorted and bincount -

np.bincount(np.searchsorted(categories, valcats), values)
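
As a runnable sketch of the one-liner above, using the arrays from the question: this assumes categories is sorted in ascending order and that every entry of valcats appears in categories, since np.searchsorted only returns insertion positions and does not check for an exact match. The minlength argument and the final cast are additions for illustration, not part of the original answer.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])  # assumed sorted ascending

# Map each entry of valcats to its position in the sorted categories array.
idx = np.searchsorted(categories, valcats)

# Sum values per position. minlength keeps one slot per category even when
# the last categories never occur in valcats.
catsums = np.bincount(idx, weights=values, minlength=len(categories))

# bincount with weights returns float64, so cast back if integers are wanted.
catsums = catsums.astype(values.dtype)
print(catsums)  # [ 1 14  7  8 12 13]
```

If valcats might contain a value that is missing from categories, the positions would silently point at a neighbouring category, so a check such as np.all(categories[idx] == valcats) may be worth adding outside the hot loop.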

answered Oct 22 '22 08:10 by Divakar


@Divakar just posted a very good answer. If you already have the array of categories defined, I'd use @Divakar's answer. If you don't have the unique values already defined, I'd use mine.


I'd use pd.factorize to factorize the categories, then use np.bincount with the weights parameter set to the values array.

import pandas as pd
f, u = pd.factorize(valcats)
np.bincount(f, values).astype(values.dtype)

array([ 1, 12,  7, 14, 13,  8])

pd.factorize also produces the unique values in the u variable. We can line up the results with u to see that we've arrived at the correct solution.

np.column_stack([u, np.bincount(f, values).astype(values.dtype)])

array([[101,   1],
       [301,  12],
       [201,   7],
       [102,  14],
       [302,  13],
       [202,   8]])

You can make this more obvious using a pd.Series:

f, u = pd.factorize(valcats)
pd.Series(np.bincount(f, values).astype(values.dtype), u)

101     1
301    12
201     7
102    14
302    13
202     8
dtype: int64
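
Note that pd.factorize returns categories in order of first appearance, so the sums above are not in the same order as the categories array from the question. One possible way to line them up, sketched here with the question's arrays, is to reindex the Series. The variable names and the fill_value=0 choice for categories that never occur are assumptions for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])

# codes: integer label per element; uniques: labels in order of first appearance.
codes, uniques = pd.factorize(valcats)

# Per-label sums, indexed by the label itself.
sums = pd.Series(np.bincount(codes, weights=values).astype(values.dtype), index=uniques)

# Reorder to match the question's categories array; absent categories get 0.
catsums = sums.reindex(categories, fill_value=0).to_numpy()
print(catsums)  # [ 1 14  7  8 12 13]
```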

Why pd.factorize and not np.unique?

We could have done this equivalently with

u, f = np.unique(valcats, return_inverse=True)

But np.unique sorts the values, which takes O(n log n) time. pd.factorize, on the other hand, does not sort and runs in linear time. For larger data sets, pd.factorize will come out ahead on performance.
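
For comparison, here is a minimal sketch of the np.unique variant on the question's arrays. Because np.unique sorts, the sums come out in ascending category order rather than in order of first appearance. The variable names are illustrative only.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

# uniques: sorted distinct categories; inverse: index into uniques per element.
uniques, inverse = np.unique(valcats, return_inverse=True)

# Sum values per distinct category, cast back from float64 to the input dtype.
sums = np.bincount(inverse, weights=values).astype(values.dtype)

print(uniques)  # [101 102 201 202 301 302]
print(sums)     # [ 1 14  7  8 12 13]
```

Which of the two is faster on a given workload is best checked with timeit on representative data, since the arrays in the question are very small.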

answered Oct 22 '22 07:10 by piRSquared