Numpy array: group by one column, sum another

Question

I have an array that looks like this:

 array([[ 0,  1,  2],
        [ 1,  1,  6],
        [ 2,  2, 10],
        [ 3,  2, 14]])

I want to sum the values of the third column over rows that share the same value in the second column, so the result looks something like this:

 array([[ 0,  1,  8],
        [ 1,  2, 24]])

I started coding this but I'm stuck with this sum:

import numpy as np
import sys

inFile = sys.argv[1]

with open(inFile, 'r') as t:
    f = np.genfromtxt(t, delimiter=None, names =["1","2","3"])

f.sort(order=["1","2"])
# Stuck here. Pseudocode for what I want:
# if value == previous.value:
#     sum(f["3"])

Mad Physicist · Accepted Answer

If your data is sorted by the second column, you can use something centered around np.add.reduceat for a pure numpy solution. A combination of np.nonzero (or np.where) applied to np.diff will give you the locations where the second column switches values. You can use those indices to do the sum-reduction. The other columns are pretty formulaic, so you can concatenate them back in fairly easily:

import numpy as np

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])
# Find the split indices
i = np.nonzero(np.diff(A[:, 1]))[0] + 1
i = np.insert(i, 0, 0)
# Compute the result columns
c0 = np.arange(i.size)
c1 = A[i, 1]
c2 = np.add.reduceat(A[:, 2], i)
# Concatenate the columns
result = np.c_[c0, c1, c2]

Notice the +1 in the indices. That is because you always want the location after the switch, not before, given how reduceat works. The insertion of zero as the first index could also be accomplished with np.r_, np.concatenate, etc.
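To make the +1 concrete, here is a small sketch that prints the intermediate values for the example array. The names keys, change and starts are only illustrative, and np.r_ is used in place of np.insert as one of the alternatives mentioned above:

```python
import numpy as np

A = np.array([[0, 1, 2],
              [1, 1, 6],
              [2, 2, 10],
              [3, 2, 14]])

keys = A[:, 1]

# np.diff is nonzero at position k when keys[k] != keys[k + 1],
# so these are the indices just BEFORE each switch.
change = np.nonzero(np.diff(keys))[0]

# Shift by one to get the first row of each new group, and prepend 0
# for the first group.
starts = np.r_[0, change + 1]

# reduceat sums values[starts[j]:starts[j + 1]]; the last slice runs to the end.
sums = np.add.reduceat(A[:, 2], starts)

print(change)  # [1]
print(starts)  # [0 2]
print(sums)    # [ 8 24]
```

Without the +1, the second slice would start at row 1 and the 6 would be counted in the wrong group.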

That being said, I still think you are looking for the pandas version in @jpp's answer.

jpp · Answer

You can use pandas to vectorize your algorithm:

import numpy as np
import pandas as pd

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

df = pd.DataFrame(A)\
       .groupby(1, as_index=False)\
       .sum()\
       .reset_index()

res = df[['index', 1, 2]].values

Result

array([[ 0,  1,  8],
       [ 1,  2, 24]], dtype=int64)
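As a cross-check without pandas, the same grouping can be sketched with np.unique and np.bincount. This is a different technique from either answer above, and it does not require the second column to be sorted. The rows below are the question's rows in shuffled order, which is an assumption made only to show that ordering does not matter:

```python
import numpy as np

# Same rows as in the question, deliberately shuffled.
A = np.array([[3, 2, 14],
              [0, 1, 2],
              [2, 2, 10],
              [1, 1, 6]])

# keys holds the sorted distinct values of column 1;
# inverse maps each row to the position of its key in keys.
keys, inverse = np.unique(A[:, 1], return_inverse=True)

# bincount with weights adds up column 2 per group. It returns floats,
# so cast back to the integer dtype of the input.
sums = np.bincount(inverse, weights=A[:, 2]).astype(A.dtype)

result = np.c_[np.arange(keys.size), keys, sums]
print(result)
# [[ 0  1  8]
#  [ 1  2 24]]
```

Because np.bincount accumulates in floating point, very large integer values could lose precision; in that case np.add.at on an integer array of zeros is a possible alternative.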
