Sum rows where value equal in column

Tags: python, numpy, sum, row

How can I sum across rows that have equal values in the first column of a numpy array? For example:

In: np.array([[1,2,3],
             [1,4,6], 
             [2,3,5],
             [2,6,2],
             [3,4,8]])

Out: [[1,6,9], [2,9,7], [3,4,8]]

Any help would be greatly appreciated.

asked May 04 '15 22:05

user998476

2 Answers

Pandas has a powerful groupby function that makes this simple.

import numpy as np
import pandas as pd

n = np.array([[1,2,3],
             [1,4,6], 
             [2,3,5],
             [2,6,2],
             [3,4,8]])

df = pd.DataFrame(n, columns = ["First Col", "Second Col", "Third Col"])

df.groupby("First Col").sum()
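Note that groupby moves the grouping column into the index of the result. If the output should be a plain NumPy array shaped like the one in the question, a minimal sketch (assuming a pandas version that provides `to_numpy`, and reusing the column names from above) could look like this:

```python
import numpy as np
import pandas as pd

n = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [3, 4, 8]])

df = pd.DataFrame(n, columns=["First Col", "Second Col", "Third Col"])

# groupby puts "First Col" into the index; reset_index restores it as a column
summed = df.groupby("First Col").sum().reset_index()

# Convert the summed frame back to a plain NumPy array
out = summed.to_numpy()
print(out)
```

For the question's input this gives one row per distinct value of the first column, with the remaining columns summed.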

answered Sep 20 '22 04:09

canyon289

Approach #1

Here's a vectorized NumPy approach based on np.bincount:

# Initial setup: N value columns, plus a group index for every row
# (named ids rather than id to avoid shadowing the built-in id())
N = A.shape[1]-1
unqA1, ids = np.unique(A[:, 0], return_inverse=True)

# Create subscripts and accumulate with bincount for tagged summations
subs = np.arange(N)*(ids.max()+1) + ids[:,None]
sums = np.bincount( subs.ravel(), weights=A[:,1:].ravel() )

# Append the unique elements from first column to get final output
out = np.append(unqA1[:,None],sums.reshape(N,-1).T,1)

Sample input, output -

In [66]: A
Out[66]: 
array([[1, 2, 3],
       [1, 4, 6],
       [2, 3, 5],
       [2, 6, 2],
       [7, 2, 1],
       [2, 0, 3]])

In [67]: out
Out[67]: 
array([[  1.,   6.,   9.],
       [  2.,   9.,  10.],
       [  7.,   2.,   1.]])
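To make the subscript trick easier to follow, here is a self-contained walk-through on the sample input above. It is only a sketch of the same idea: the names `keys`, `ids`, `n_cols` and `n_groups` are illustrative, and `np.column_stack` is used in place of `np.append` for readability.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [7, 2, 1],
              [2, 0, 3]])

n_cols = A.shape[1] - 1                      # number of value columns
keys, ids = np.unique(A[:, 0], return_inverse=True)
n_groups = len(keys)                         # number of distinct keys

# ids maps each row to its group index.
# Every (group, column) pair gets its own bin: bin = column * n_groups + group
subs = np.arange(n_cols) * n_groups + ids[:, None]

# bincount adds up the weights that fall into each bin
sums = np.bincount(subs.ravel(), weights=A[:, 1:].ravel())

# Bins are laid out column by column, so reshape and transpose
out = np.column_stack((keys, sums.reshape(n_cols, n_groups).T))

print(ids)
print(subs)
print(out)
```

Because np.bincount returns floats when weights are given, the result is a float array even for integer input.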

Approach #2

Here's another approach, based on np.cumsum and np.diff:

# Sort A based on first column
sA = A[np.argsort(A[:,0]),:]

# Row mask of where each group ends
row_mask = np.append(np.diff(sA[:,0],axis=0)!=0,[True])

# Get cumulative summations at each group end, then DIFF to get per-group sums
cumsum_grps = sA.cumsum(0)[row_mask,1:]
sum_grps = np.diff(cumsum_grps,axis=0)

# The first group's sums equal its cumulative sums; prepend them to the rest
group_sums = np.concatenate((cumsum_grps[0,:][None],sum_grps),axis=0)

# Concatenate the unique first-column values with the sums for final output
out = np.concatenate((sA[row_mask,0][:,None],group_sums),axis=1)
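As a check, the steps of this approach can be run end to end on the sample input from Approach #1. This is a sketch with illustrative variable names; a stable sort is used only so that the intermediate arrays are reproducible.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [7, 2, 1],
              [2, 0, 3]])

# Sort rows by the first column so that equal keys are adjacent
sA = A[np.argsort(A[:, 0], kind="stable")]

# True at the last row of every group
row_mask = np.append(np.diff(sA[:, 0]) != 0, True)

# Running totals of the value columns, sampled at each group end
cumsum_at_ends = sA.cumsum(axis=0)[row_mask, 1:]

# Differences between consecutive group ends give the per-group sums;
# the first group keeps its running total as is
group_sums = np.concatenate((cumsum_at_ends[:1], np.diff(cumsum_at_ends, axis=0)), axis=0)

out = np.column_stack((sA[row_mask, 0], group_sums))

print(row_mask)
print(cumsum_at_ends)
print(out)
```

Unlike the bincount version, this keeps the integer dtype of the input.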

Benchmarking

Here are some runtime tests for the NumPy-based approaches presented so far for the question (cumsum_diff is Approach #2 and bincount is Approach #1):

In [319]: A = np.random.randint(0,1000,(100000,10))

In [320]: %timeit cumsum_diff(A)
100 loops, best of 3: 12.1 ms per loop

In [321]: %timeit bincount(A)
10 loops, best of 3: 21.4 ms per loop

In [322]: %timeit add_at(A)
10 loops, best of 3: 60.4 ms per loop

In [323]: A = np.random.randint(0,1000,(100000,20))

In [324]: %timeit cumsum_diff(A)
10 loops, best of 3: 32.1 ms per loop

In [325]: %timeit bincount(A)
10 loops, best of 3: 32.3 ms per loop

In [326]: %timeit add_at(A)
10 loops, best of 3: 113 ms per loop

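In these tests, Approach #2 (cumsum + diff) is the fastest with 10 columns and roughly ties with bincount with 20 columns; both are considerably faster than add_at.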
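The add_at function timed above is not defined in this answer. It presumably refers to a solution built on np.add.at; the following is only a guess at what such a function might look like, not the code that was actually benchmarked.

```python
import numpy as np

def add_at(A):
    """Group-sum rows by first column using np.add.at (assumed implementation)."""
    keys, ids = np.unique(A[:, 0], return_inverse=True)
    sums = np.zeros((len(keys), A.shape[1] - 1), dtype=A.dtype)
    # Unbuffered in-place add: rows that share a group index all accumulate
    np.add.at(sums, ids, A[:, 1:])
    return np.column_stack((keys, sums))

A = np.array([[1, 2, 3],
              [1, 4, 6],
              [2, 3, 5],
              [2, 6, 2],
              [3, 4, 8]])

print(add_at(A))
```

In an IPython session it could then be timed the same way as above, for example with `%timeit add_at(A)` on a larger random array.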

answered Sep 20 '22 04:09

Divakar
