
Is there an alternative, faster approach than idxmax? [duplicate]

Tags: python, pandas

import time
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'gr': np.random.choice(7000, 500000),
                   'col': np.random.choice(1000, 500000)})
groups = df.groupby('gr')
t1 = time.time()
idx = groups.col.idxmax()  # row label of the maximum 'col' within each group
print(round(time.time() - t1, 1))
0.7

Is there a way to get these indices significantly faster than with idxmax()?

Note: I am only interested in idx.values; I don't mind losing idx.index (the group labels) of the idx Series.

asked Mar 06 '23 22:03 by Tony


1 Answer

On my machine, sorting and then using drop_duplicates is faster than groupby idxmax, roughly 7 times faster in the timings below:

%timeit df.sort_values(['gr','col']).drop_duplicates('gr',keep='last').index
10 loops, best of 3: 67.3 ms per loop
%timeit df.groupby('gr').col.idxmax()
1 loop, best of 3: 491 ms per loop
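
One thing worth checking before swapping the two: they should pick a row holding the group maximum in every group, but the row labels are not guaranteed to be identical. idxmax returns the first maximum in a group, while sort_values followed by drop_duplicates(keep='last') keeps whichever maximal row ends up last after sorting, so tied maxima may give different labels. Below is a minimal sketch of that check on the data from the question; the names idx_groupby, idx_sorted, same_max and n_different are only illustrative.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the question.
np.random.seed(0)
df = pd.DataFrame({'gr': np.random.choice(7000, 500000),
                   'col': np.random.choice(1000, 500000)})

# Approach from the question: label of the first maximal row in each group.
idx_groupby = df.groupby('gr').col.idxmax()

# Approach from the answer: sort by group, then value, and keep the last row per group.
idx_sorted = df.sort_values(['gr', 'col']).drop_duplicates('gr', keep='last').index

# Both results are ordered by 'gr', so they can be compared position by position.
group_max = df.groupby('gr').col.max().to_numpy()
same_max = bool((df.loc[idx_sorted, 'col'].to_numpy() == group_max).all())

# Count the groups where the two approaches return different row labels (ties).
n_different = int((idx_groupby.to_numpy() != idx_sorted.to_numpy()).sum())

print(same_max, n_different)
```

If same_max is True and only the maximum value per group matters, the two are interchangeable; if the exact row label matters when maxima are tied, compare the outputs on your own data first.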
answered Apr 01 '23 15:04 (2 revs, 2 users 94%)