I have two equal-length 1D numpy arrays, id
and data
, where id
is a sequence of repeating, ordered integers that define sub-windows on data
. For example:
id data
1 2
1 7
1 3
2 8
2 9
2 10
3 1
3 -10
I would like to aggregate data
by grouping on id
and taking either the max or the min.
In SQL, this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id
.
Is there a way I can avoid Python loops and do this in a vectorized manner?
amax() will find the max value in an array, and numpy. amin() does the same for the min value.
NumPy performs better than Pandas for 50K rows or less. But, Pandas' performance is better than NumPy's for 500K rows or more. Thus, performance varies between 50K and 500K rows depending on the type of operation.
First, numpy supports multithreading, and this can give you a speed boost in multicore environments!
with only numpy and without loops:
id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])
# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)
# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)
# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])
(pd_group.values == np_group.values).all() # TRUE
Ive packaged a version of my previous answer in the numpy_indexed package; its nice to have this all wrapped up and tested in a neat interface; plus it has a lot more functionality as well:
import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)
And so on
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With