
Most efficient way to sum huge 2D NumPy array, grouped by ID column?

Tags: python, numpy

I have a massive data array (500k rows) that looks like:

id  value  score
1   20     20
1   10     30
1   15     0
2   12     4
2   3      8
2   56     9
3   6      18
...

As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.

I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id

With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.

table_trunc = table[(table == id).any(1)]
score       = sum(table_trunc[:,2])

Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?

thegreatt asked Aug 17 '11 07:08



2 Answers

You can use np.bincount():

import numpy as np

ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]

print(np.bincount(ids, weights=data))

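The output is [ 0. 50. 21. 18.]. Element i holds the sum of the weights for id == i, so the sum for id 0 is 0 (no such rows), the sum for id 1 is 50, for id 2 it is 21, and for id 3 it is 18. Note that bincount() requires the ids to be non-negative integers.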
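To connect this back to the 2D table in the question, here is a minimal sketch that assumes the ids are in column 0 and the scores in column 2. The helper name sum_scores_by_id is made up for illustration. It first maps the ids to dense indices with np.unique(..., return_inverse=True), which is one way to cope with ids that are sparse, very large, or not starting at 0, since plain bincount() would otherwise allocate one slot per integer up to the largest id.

```python
import numpy as np

# Sample rows from the question: columns are id, value, score.
table = np.array([
    [1, 20, 20],
    [1, 10, 30],
    [1, 15, 0],
    [2, 12, 4],
    [2, 3, 8],
    [2, 56, 9],
    [3, 6, 18],
])


def sum_scores_by_id(table):
    """Hypothetical helper: return (unique_ids, sums) for the score column."""
    # Map every id to a dense index 0..k-1 so bincount's output has exactly
    # one slot per distinct id, regardless of how large the ids are.
    unique_ids, inverse = np.unique(table[:, 0], return_inverse=True)
    # Sum the score column (column 2) into the slot of each row's id.
    sums = np.bincount(inverse, weights=table[:, 2])
    return unique_ids, sums


unique_ids, sums = sum_scores_by_id(table)
for id_, total in zip(unique_ids, sums):
    print(id_, total)
```

np.unique sorts the ids, so this costs a sort on top of the bincount pass. If the ids are already small, dense, non-negative integers, calling np.bincount(table[:, 0], weights=table[:, 2]) directly should be enough.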

HYRY answered Nov 06 '22 04:11



I noticed the numpy tag, but if you don't mind using pandas (or if you already read the data in with it), this task becomes a one-liner:

import pandas as pd

df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})

So your dataframe would look like this:

   id  score
0   1     20
1   1     30
2   1      0
3   2      4
4   2      8
5   2      9
6   3     18

Now you can use the functions groupby() and sum():

df.groupby(['id'], sort=False).sum()

which gives you the desired output:

    score
id       
1      50
2      21
3      18

By default, groupby() sorts the group keys, so I pass sort=False, which might improve speed for huge dataframes.
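If the data already lives in a 2D NumPy array like the one in the question, a rough sketch of the same approach could look like this. The column names id, value and score are assumed from the question's header row, and selecting the score column before summing avoids also summing the value column.

```python
import numpy as np
import pandas as pd

# Sample rows from the question: columns are id, value, score.
table = np.array([
    [1, 20, 20],
    [1, 10, 30],
    [1, 15, 0],
    [2, 12, 4],
    [2, 3, 8],
    [2, 56, 9],
    [3, 6, 18],
])

# Wrap the array in a DataFrame; the column names are assumptions.
df = pd.DataFrame(table, columns=["id", "value", "score"])

# Group by id and sum only the score column; sort=False keeps the groups
# in order of first appearance instead of sorting the keys.
totals = df.groupby("id", sort=False)["score"].sum()
print(totals)
```

The result is a Series indexed by id, so totals.loc[2] looks up the sum for id 2.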

Cleb answered Nov 06 '22 02:11
