
Group by max or min in a numpy array

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data. For example:

id  data
1     2
1     7
1     3
2     8
2     9
2    10
3     1
3   -10

I would like to aggregate data by grouping on id and taking either the max or the min.

In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.

Is there a way I can avoid Python loops and do this in a vectorized manner?

Abiel asked Dec 24 '11 06:12



2 Answers

With only numpy and without loops:

import numpy as np
import pandas as pd

id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])

# sort by id so each group is contiguous, then find where each group starts
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)

# max
g_max = np.maximum.reduceat(data[_ndx], _pos)

# min
g_min = np.minimum.reduceat(data[_ndx], _pos)

# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])

(pd_group.values == np_group.values).all()  # True
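
The sort, unique, reduceat steps can also be wrapped in one helper so that max and min share the same code. This is only a sketch: the name group_reduce and the shuffled sample input are illustrative assumptions, not part of the original answer, and it relies on the ufunc having a reduceat method (as np.maximum and np.minimum do).

```python
import numpy as np

def group_reduce(ids, values, ufunc):
    """Hypothetical helper: reduce `values` per group of equal `ids`.

    Returns (unique_ids, reduced_values), ordered by id.
    """
    ids = np.asarray(ids)
    values = np.asarray(values)
    # Sort so that equal ids sit next to each other.
    order = np.argsort(ids, kind="stable")
    # return_index gives the first position of each id in the sorted array,
    # which is exactly the set of segment starts that reduceat expects.
    unique_ids, starts = np.unique(ids[order], return_index=True)
    return unique_ids, ufunc.reduceat(values[order], starts)

# Same values as in the question, deliberately shuffled to show that
# the ids do not need to arrive in order.
ids = np.array([3, 1, 2, 1, 3, 2, 1, 2])
data = np.array([1, 2, 8, 7, -10, 9, 3, 10])

group_ids, group_max = group_reduce(ids, data, np.maximum)
_, group_min = group_reduce(ids, data, np.minimum)
```

If the ids are already sorted, as in the question, the argsort step does no harm but could be skipped.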
Marco Cerliani answered Nov 08 '22 00:11


I've packaged a version of my previous answer in the numpy_indexed package. It's nice to have this all wrapped up and tested in a neat interface, and the package offers a lot more functionality as well:

import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)

And so on; for example, npi.group_by(id).min(data) gives the per-group minimum.

Eelco Hoogendoorn answered Nov 07 '22 23:11