I have two numpy arrays, one containing values and one containing each value's category.
values=np.array([1,2,3,4,5,6,7,8,9,10])
valcats=np.array([101,301,201,201,102,302,302,202,102,301])
I have another array containing the unique categories I'd like to sum across.
categories=np.array([101,102,201,202,301,302])
My issue is that I will be running this same summing process a few billion times and every microsecond matters.
My current implementation is as follows.
catsums = []
for x in categories:
    catsums.append(np.sum(values[np.where(valcats == x)]))
The resulting catsums should be:
[1, 14, 7, 8, 12, 13]
My current run time is about 5 µs. I am still somewhat new to Python and was hoping to find a faster solution, potentially by combining the first two arrays, or a lambda, or something cool I don't even know about.
Thanks for reading!
You can use searchsorted and bincount:

np.bincount(np.searchsorted(categories, valcats), values)
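A self-contained sketch of that one-liner using the arrays from the question (the trailing astype is my addition, since np.bincount returns floats whenever weights are supplied):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])  # must be sorted for searchsorted

# Map each category code to its position in the sorted `categories` array,
# then sum `values` into those positions in a single pass.
idx = np.searchsorted(categories, valcats)
catsums = np.bincount(idx, weights=values).astype(values.dtype)
# catsums -> [1, 14, 7, 8, 12, 13], matching the question's expected output
```

Note this relies on `categories` being sorted and covering every code in `valcats`; otherwise searchsorted will misassign bins.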
@Divakar just posted a very good answer. If you already have the array of categories defined, I'd use @Divakar's answer. If you don't have the unique values already defined, I'd use mine.
I'd use pd.factorize to factorize the categories, then use np.bincount with the weights parameter set to the values array.
f, u = pd.factorize(valcats)
np.bincount(f, values).astype(values.dtype)
array([ 1, 12, 7, 14, 13, 8])
pd.factorize also produces the unique values in the u variable. We can line up the results with u to see that we've arrived at the correct solution.
np.column_stack([u, np.bincount(f, values).astype(values.dtype)])
array([[101, 1],
[301, 12],
[201, 7],
[102, 14],
[302, 13],
[202, 8]])
You can make this more obvious using a pd.Series
f, u = pd.factorize(valcats)
pd.Series(np.bincount(f, values).astype(values.dtype), u)
101 1
301 12
201 7
102 14
302 13
202 8
dtype: int64
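One caveat: factorize returns the sums in order of first appearance, not in the order of the question's categories array. If you need that order, one way (my addition, not part of the answer above) is to reindex the Series:

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])

f, u = pd.factorize(valcats)
s = pd.Series(np.bincount(f, weights=values).astype(values.dtype), index=u)

# Reorder the sums to match the question's `categories` array.
catsums = s.reindex(categories).to_numpy()
# catsums -> [1, 14, 7, 8, 12, 13]
```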
Why pd.factorize and not np.unique?

We could have done this equivalently with

u, f = np.unique(valcats, return_inverse=True)

But np.unique sorts the values, which runs in O(n log n) time. pd.factorize, on the other hand, does not sort and runs in linear time. For larger data sets, pd.factorize will dominate performance.
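A small sketch of the ordering difference (order only; no timings here): np.unique returns the labels sorted, while pd.factorize keeps them in first-appearance order:

```python
import numpy as np
import pandas as pd

valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

# np.unique sorts the unique labels (O(n log n)).
u_sorted, f_sorted = np.unique(valcats, return_inverse=True)
# u_sorted -> [101, 102, 201, 202, 301, 302]

# pd.factorize keeps first-appearance order (hash-based, linear time).
f, u = pd.factorize(valcats)
# u -> [101, 301, 201, 102, 302, 202]
```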