import time
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'gr': np.random.choice(7000, 500000),
                   'col': np.random.choice(1000, 500000)})
groups = df.groupby('gr')
t1 = time.time()
idx = groups.col.idxmax()
print(round(time.time() - t1,1))
0.7
Is there a way to get these indices significantly faster than with idxmax()?
Note that I am only interested in idx.values; I don't mind losing idx.index of the idx series.
On my side, sort_values followed by drop_duplicates is faster than groupby with idxmax, around 7 times faster in the timings below:
%timeit df.sort_values(['gr','col']).drop_duplicates('gr',keep='last').index
10 loops, best of 3: 67.3 ms per loop
%timeit df.groupby('gr').col.idxmax()
1 loop, best of 3: 491 ms per loop
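One thing worth checking before swapping the two approaches: idxmax returns the first row label at which each group reaches its maximum, while keep='last' after sorting can pick a different row when the maximum is tied within a group. The sketch below is one possible tie-safe variant, not the answer's original code. It assumes a stable sort (kind='mergesort') on col in descending order, and the names expected, fast, fast_idx and n_diff are only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'gr': np.random.choice(7000, 500000),
                   'col': np.random.choice(1000, 500000)})

# Reference: first row label of the maximum of col within each group.
expected = df.groupby('gr').col.idxmax()

# The variant timed above; on tied maxima it may keep a different row.
last_idx = (df.sort_values(['gr', 'col'])
              .drop_duplicates('gr', keep='last')
              .index.to_numpy())
n_diff = int((last_idx != expected.to_numpy()).sum())
print('groups where keep="last" differs from idxmax:', n_diff)

# Tie-safe sketch (assumption: a stable descending sort keeps the original
# row order among equal col values, so keep='first' selects the earliest
# row holding each group's maximum).
fast = (df.sort_values('col', ascending=False, kind='mergesort')
          .drop_duplicates('gr', keep='first'))

# Order by group key so the result lines up with the groupby output.
fast_idx = fast.sort_values('gr').index.to_numpy()

print('matches idxmax:', bool((fast_idx == expected.to_numpy()).all()))
```

If only the set of selected rows matters and ties are acceptable, the shorter keep='last' version is enough. Timings for the tie-safe variant are not measured here, so it is worth running %timeit on your own data.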