NumPy sum one array based on values in another array for each matching element in 3rd array

Tags:

numpy

python

arrays

pandas

I have two numpy arrays, one containing values and one containing each value's category.

import numpy as np
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

I have another array containing the unique categories I'd like to sum across.

categories=np.array([101,102,201,202,301,302])

My issue is that I will be running this same summing process a few billion times and every microsecond matters.

My current implementation is as follows.

catsums=[]
for x in categories:
    catsums.append(np.sum(values[np.where(valcats==x)]))

The resulting catsums should be:

[1, 14, 7, 8, 12, 13]

My current run time is about 5 µs. I am still somewhat new to Python and was hoping to find a faster solution, perhaps by combining the first two arrays, using a lambda, or something cool I don't even know about.

Thanks for reading!

asked Jul 23 '17 16:07

hrschbck

2 Answers

You can use np.searchsorted and np.bincount. This assumes categories is sorted and every entry of valcats appears in it:

np.bincount(np.searchsorted(categories, valcats), values)
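As a rough end-to-end sketch of the one-liner above, using the arrays from the question: np.bincount returns floats when weights are passed, so the cast back to the input dtype and the minlength argument are additions of mine, not part of the original answer.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])
categories = np.array([101, 102, 201, 202, 301, 302])  # must be sorted

# Position of each value's category within the sorted categories array.
idx = np.searchsorted(categories, valcats)

# Sum the values that share a position. With weights, bincount returns floats.
# minlength (my addition) keeps one slot per category even when the
# last categories do not occur in valcats.
catsums = np.bincount(idx, weights=values, minlength=len(categories))

# Cast back to the dtype of the inputs (my addition).
catsums_int = catsums.astype(values.dtype)
print(catsums_int)  # [ 1 14  7  8 12 13]
```

If categories never changes between calls, idx could presumably be computed once and reused, leaving only the bincount call in the hot loop.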

answered Oct 22 '22 08:10

Divakar

@Divakar just posted a very good answer. If you already have the array of categories defined, I'd use @Divakar's answer. If you don't already have the unique values defined, I'd use mine.

I'd use pd.factorize to factorize the categories, then use np.bincount with the weights parameter set to the values array.

import pandas as pd
f, u = pd.factorize(valcats)
np.bincount(f, values).astype(values.dtype)

array([ 1, 12,  7, 14, 13,  8])

pd.factorize also produces the unique values in the u variable. We can line up the results with u to see that we've arrived at the correct solution.

np.column_stack([u, np.bincount(f, values).astype(values.dtype)])

array([[101,   1],
       [301,  12],
       [201,   7],
       [102,  14],
       [302,  13],
       [202,   8]])

You can make this more obvious using a pd.Series:

f, u = pd.factorize(valcats)
pd.Series(np.bincount(f, values).astype(values.dtype), u)

101     1
301    12
201     7
102    14
302    13
202     8
dtype: int64

Why pd.factorize and not np.unique?

We could have done this equivalently with:

u, f = np.unique(valcats, return_inverse=True)

However, np.unique sorts the values, which takes O(n log n) time, and it returns the unique values in sorted order rather than in order of first appearance. pd.factorize, on the other hand, does not sort and runs in linear time. For larger data sets, pd.factorize should be the faster of the two.
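For comparison, here is a minimal sketch of the np.unique route on the arrays from the question. Because np.unique returns the categories sorted, the sums come out in the order the question expects rather than in order of first appearance.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
valcats = np.array([101, 301, 201, 201, 102, 302, 302, 202, 102, 301])

# u: sorted unique categories; f: index into u for each element of valcats.
u, f = np.unique(valcats, return_inverse=True)

# Weighted bincount sums the values per category index (float result),
# then cast back to the dtype of the inputs.
sums = np.bincount(f, weights=values).astype(values.dtype)

print(u)     # [101 102 201 202 301 302]
print(sums)  # [ 1 14  7  8 12 13]
```

Which route is faster on real data depends on the array sizes, so it is worth timing both on your own inputs.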

answered Oct 22 '22 07:10

piRSquared
